arXivDaily arXiv每日学术速递 周一至周五更新
重置
2605.31603 2026-06-01 cs.CV cs.AI 版本更新

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Lumos-Nexus: 面向视频统一模型的高效频率桥接与同质潜在空间

Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu

发表机构 * Zhejiang University(浙江大学) DAMO Academy, Alibaba Group(阿里云达摩院) Hupan Lab(虎扑实验室) National University of Singapore(新加坡国立大学) Hong Kong University of Science and Technology(香港科技大学) Fudan University(复旦大学) Tsinghua University(清华大学)

AI总结 提出Lumos-Nexus框架,通过两阶段训练和渐进频率桥接,在保持推理能力的同时显著提升视频生成保真度。

Comments Project page (https://jiazheng-xing.github.io/nexus-lumos-home/) and Code (https://github.com/alibaba-damo-academy/Lumos-Custom/) are available

详情
AI中文摘要

基于连接器的视频统一模型在指令引导的视频合成中展现出强大能力,但将大型高保真生成器集成到统一训练循环中计算成本过高,限制了可实现的视觉质量。因此,我们提出Lumos-Nexus,一个训练高效的统一视频生成框架,促进强推理驱动生成能力的发展,同时显著提升视觉保真度。Lumos-Nexus采用两阶段设计:1)训练时,仅将轻量级生成器与理解模块对齐,以学习接收推理驱动的语义控制。2)推理时,我们引入统一渐进频率桥接(UPFB),在共享潜在空间中逐步将生成任务移交给高容量预训练生成器,实现从粗到细的细化,在不牺牲推理质量的情况下生成高保真视频。为填补推理驱动视频生成基准的空白,我们引入VR-Bench,评估模型将推断意图转化为连贯且语义对齐的视频内容的能力。大量实验表明,Lumos-Nexus在VBench上实现了视觉真实感和时间连贯性的显著提升,同时在VR-Bench上展现出强大的基于推理的生成性能。代码和模型可在https://jiazheng-xing.github.io/nexus-lumos-home/获取。

英文摘要

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

2605.31593 2026-06-01 cs.CR cs.AI 版本更新

Stateful Online Monitoring Catches Distributed Agent Attacks

有状态在线监控捕获分布式智能体攻击

Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong, Ivan Zhang, Matan Shtepel, Steffi Chern, Alexander Robey, Eric Wong, Hamed Hassani

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对分布式智能体攻击中跨账户聚合的恶意行为难以被单上下文监控检测的问题,提出一种基于实时聚类的有状态在线监控方法,能够更早、更有效地捕获分布式攻击,同时保持低延迟。

详情
AI中文摘要

语言模型可以发现数千个严重的软件漏洞,并且智能体越来越多地被滥用于网络攻击。为了避免检测,攻击者经常分布他们的滥用行为,将有害任务分割到多个用户账户中,使得每个单独的记录看起来无害。由于安全监控器一次只评估一个智能体上下文,它们在结构上无法检测到仅在跨多个账户的聚合中才可见的滥用行为。我们通过构建据我们所知第一个分布式智能体攻击来证明这一漏洞是真实存在的,该攻击是一个多智能体框架,能够在完成困难的网络安全任务的同时,将有害目标隐藏在具有有限上下文的子智能体中,从而规避标准监控器,后者捕获它的频率仅为先前智能体攻击的五分之一。为了防御,我们开发了一种在线有状态监控器,它使用实时聚类来收集跨多个智能体记录的微弱可疑信号,并且仅在极少数情况下升级到语言模型以标记跨用户账户的滥用行为。在模拟数据中心流量的大规模评估中,我们的监控器帕累托优于标准监控器,提前30%捕获分布式攻击,并在网络滥用达到最有害阶段之前标记出来。至关重要的是,这对于约99%的用户流量带来的额外延迟可以忽略不计。这种检测优势在良性背景流量非常大时仍然存在但会缩小。经过广泛的红队演练,我们改进了防御,并且令人惊讶地发现它也能捕获标准越狱,因为自适应攻击者会跨账户重复使用攻击变体。我们的结果指向了一类新的安全监控器,它们对用户群体而非孤立记录进行推理。

英文摘要

Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.

2605.31590 2026-06-01 cs.CV cs.AI 版本更新

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

TunerDiT: 无需训练的多事件视频生成扩散变压器渐进式引导

Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang, Xun Xiao, Thomas Seidl, Daniel Cremers, Volker Tresp

发表机构 * Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学) Technical University of Munich(慕尼黑技术大学) MCML University of Hamburg(汉堡大学) Huawei European Research Institute(华为欧洲研究院)

AI总结 针对长视频多事件生成难题,提出无需额外训练的TunerDiT方法,通过事件分区掩码和跨事件提示融合实现渐进式引导,在8项指标上达到最优。

Comments 17 pages, 13 figures

详情
AI中文摘要

文本到视频(T2V)生成在生成长时间跨度包含多个事件的视频时面临挑战性问题。受扩散过程内在特性的启发,我们探测了视频扩散变压器(DiTs),并发现了DiT去噪轨迹中的内在转折点,其中条件文本从全局布局到细粒度细节影响生成。基于这一发现,我们提出了TunerDiT,一种简单而有效的渐进式引导方法,无需额外训练即可实现多事件生成。TunerDiT包含两个引导手柄:(1)事件分区掩码,强制事件边界同时允许跨事件过渡带;(2)跨事件提示融合,注入相邻事件语义用于后期细化。我们贡献了一个自策提示套件用于多事件生成基准测试,即Meve。与其他无训练方法相比,TunerDiT在8项指标上达到了最先进性能,并在视频一致性和事件分离之间提供了可调权衡。文本对齐的提升随事件数量增加而增强,表明随着事件数量增加存在扩展可能性。

英文摘要

Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method that requires no additional training for multi-event generation. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking that enforces event boundaries while allowing cross-event transition bands; (2) Cross-Event Prompt Fusion that injects neighboring event semantics for late-stage refinement. We contribute a self-curated prompt suite for benchmarking multi-event generation, i.e., Meve. TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation, compared with other training-free methods. The improvement in text alignment increases with the event count, indicating a scaling possibility with increasing event count.

2605.31586 2026-06-01 cs.CL cs.AI 版本更新

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

语言模型学习构式语义,更不用说句法:探究LM对配对焦点构式的理解

Wesley Scivetti, Ethan Wilcox, Nathan Schneider, Kanishka Misra, Leonie Weissweiler

发表机构 * Georgetown University(乔治城大学) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Leipzig University(莱比锡大学)

AI总结 通过构建新数据集,研究不同规模开源语言模型对英语中稀有配对焦点构式(如“let alone”)的语义理解,发现中等规模模型能掌握其形式和意义,且语义学习晚于句法知识,并与世界知识相关。

Comments Conference on Natural Language Learning (CoNLL) 2026

详情
AI中文摘要

理解稀有构式(形式-意义配对)的语义已被证明是一个具有挑战性的问题,目前只有最大的LLM才能解决。开源模型是否具有稳健的构式理解,以及如果具备,这种知识习得背后的学习动态是什么,仍然是一个开放问题。聚焦于英语中一组稀有的配对焦点构式(例如“let alone”、“much less”),我们构建了一个新颖的数据集,利用标量形容词语义和一般世界知识来测试它们的意义。通过测试一系列在参数数量、架构和预训练数据集大小上不同的模型,我们发现几个中等规模的模型对配对焦点构式的形式和意义都敏感,尽管在人类规模数据上训练的模型在所有意义评估中均失败。转向一组开放检查点模型的训练动态,我们发现配对焦点理解在训练后期出现,晚于配对焦点句法知识,并且配对焦点语义的学习与世界知识某些领域的提升相关。总体而言,我们的实证结果支持中等规模开源模型能够掌握稀有配对焦点构式的结论,并展示了配对焦点构式知识与其他意义领域之间的联系。

英文摘要

Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.

2605.31584 2026-06-01 cs.CL cs.AI cs.LG 版本更新

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

LongTraceRL: 基于评分奖励从搜索智能体轨迹中学习长上下文推理

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

发表机构 * Tsinghua University(清华大学)

AI总结 提出LongTraceRL框架,通过知识图谱随机游走生成多跳问题并利用搜索智能体轨迹构建分层干扰物,结合基于实体链的评分奖励进行过程监督,提升大语言模型在长上下文推理中的表现。

详情
AI中文摘要

长上下文推理仍然是大型语言模型的核心挑战,模型往往难以在大量干扰内容中定位和整合关键信息。基于可验证奖励的强化学习(RLVR)在此任务上展现出潜力,但现有方法受限于低混淆度的干扰物和稀疏的、仅基于结果的奖励信号,无法监督中间推理步骤。为解决这些问题,我们引入了 extsc{LongTraceRL}。在数据构建方面,我们通过知识图谱随机游走生成多跳问题,并利用搜索智能体轨迹构建\emph{分层干扰物}:智能体读取但未引用的文档(高混淆度)和搜索结果中出现但从未打开的文档(低混淆度),从而生成比随机采样或单次搜索构建的训练上下文更具挑战性的内容。在奖励设计方面,我们提出了一种\emph{评分奖励},利用每条推理链上的黄金实体作为细粒度的实体级过程监督。该评分奖励仅应用于最终答案正确的响应(正向策略),以区分正确响应之间的推理质量,并防止奖励作弊。在五个长上下文基准上对三种推理LLM(4B-30B)进行的实验表明, extsc{LongTraceRL} 始终优于强基线,并鼓励全面、基于证据的推理。代码、数据集和模型可在 \href{https://github.com/THU-KEG/LongTraceRL}{https://github.com/THU-KEG/LongTraceRL} 获取。

英文摘要

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce \textsc{LongTraceRL}. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build \emph{tiered distractors}: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a \emph{rubric reward} that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B--30B) across five long-context benchmarks demonstrate that \textsc{LongTraceRL} consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Codes, datasets and models are available at \href{https://github.com/THU-KEG/LongTraceRL}{https://github.com/THU-KEG/LongTraceRL}.

2605.31581 2026-06-01 cs.AI 版本更新

Choosing the Lens: Strategic Perspective Activation in Context-Dependent Argumentation

选择视角:上下文相关论证中的策略性视角激活

Albert Sadowski, Jarosław A. Chudziak

发表机构 * Warsaw University of Technology(华沙技术大学)

AI总结 本文提出上下文相关论证框架(CDAF),通过击败函数和视角标记特化,研究代理如何通过选择相关性集和优先级来策略性地影响论证结果。

Comments Accepted to LAMAS&SR workshop at FLoC 2026

详情
AI中文摘要

相同的论证通常需要在不同的外部体制下进行评估。对体制有影响力的代理拥有标准形式主义无法直接捕捉的策略杠杆。我们引入了上下文相关论证框架(CDAF),这是对Dung理论的扩展,其中击败函数根据上下文决定哪些攻击成功。视角标记特化从相关性集$ρ$和优先级$π$推导出击败函数。相关性集是代理的行动空间。在一个小型工作示例中,代理的目标论证在完全相关单射优先级下被拒绝,但在部分激活下被接受,而VAF受众无法镜像其中一种激活。我们定义了相应的决策问题ACTIVATION-MANIPULATION,并记录了基线复杂度界限。紧界限和多代理变体留待未来研究。

英文摘要

The same arguments often need to be evaluated under different external regimes. An agent with influence over the regime has a strategic lever that standard formalisms do not directly capture. We introduce context-dependent argumentation frameworks (CDAFs), an extension of Dung's theory in which a defeat function determines, per context, which attacks succeed. A perspective-labeled specialisation derives the defeat function from a relevance set $ρ$ and a priority $π$. The relevance set is the agent's action space. In a small worked example, the agent's target argument is rejected under every full-relevance injective priority, yet accepted under partial activations, one of which no VAF audience can mirror. We define the corresponding decision problem, ACTIVATION-MANIPULATION, and record baseline complexity bounds. Tight bounds and multi-agent variants are left open.

2605.31575 2026-06-01 cs.IR cs.AI 版本更新

SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics

SPECTRA: 具有相关性真值表和受控干扰物诊断的合成信息检索测试集

Eric Liang

发表机构 * Oracle

AI总结 提出SPECTRA框架,通过分离潜在主题结构、文本实现、元数据控制、查询意图生成和确定性相关性真值表,生成合成文本语料库和检索测试集,以诊断检索系统的扩展性和故障模式。

详情
AI中文摘要

可扩展的信息检索测试需要足够大的语料库来测试索引构建、排序延迟、查询路由和评估工具,但人工判断的测试集仍然昂贵,并且在文档私有或仍在设计时可能不可用。本文介绍了SPECTRA,一个可复现的框架,通过分离潜在主题结构、表面文本实现、元数据控制、查询意图生成和确定性相关性真值表,生成合成文本语料库和检索测试集。该框架旨在作为Cranfield风格和TREC风格评估的诊断补充,而非替代人工评估。一个单进程Python原型生成了多达60,000个文档和961万个标记的语料库,同时保持了可控的长尾词汇增长,并为96个查询生成了分级相关性标签。在本地模拟研究中,生成速度接近线性,约为每秒12,000到14,000个文档,估计的Zipf斜率绝对值保持在0.86附近,增加跨主题干扰文本使BM25 nDCG@10从2%干扰物时的1.00下降到36%干扰物时的0.43。这些结果表明,轻量级合成语料库可以在昂贵的集合构建开始之前暴露检索系统的扩展性和故障模式。

英文摘要

Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, and evaluation tooling, yet human-judged test collections remain expensive and may be unavailable when documents are private or still under design. This paper introduces SPECTRA, a reproducible framework for generating synthetic text corpora and retrieval test collections through a separation of latent topical structure, surface text realization, metadata controls, query intent generation, and deterministic relevance oracles. The framework is intended as a diagnostic complement to Cranfield-style and TREC-style evaluation, not as a replacement for human assessment. A single-process Python prototype generated corpora up to 60,000 documents and 9.61 million tokens while preserving controllable long-tail vocabulary growth and producing graded relevance labels for 96 queries. In the local simulation study, generation remained close to linear at roughly 12K to 14K documents per second, estimated Zipf slopes stayed near 0.86 in absolute value, and increasing cross-topic distractor text reduced BM25 nDCG@10 from 1.00 at 2% distractors to 0.43 at 36% distractors. These results show that lightweight synthetic corpora can expose retrieval-system scaling and failure modes before costly collection construction begins.

2605.31564 2026-06-01 cs.CL cs.AI 版本更新

What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation

什么先被揭开?面向图到文本生成的扩散模型轨迹分析

Qing Wang, Jacob Devasier, Chengkai Li

发表机构 * The University of Texas at Arlington(德克萨斯大学阿灵顿分校)

AI总结 本文首次系统研究掩码扩散语言模型在图到文本生成中的解码轨迹,发现其优先生成实体,并针对监督微调导致的输出长度固定问题提出无训练推理时修改方法λ缩放结构解码,恢复+9.4 BLEU-4,同时引入Graph-LLaDA模型以显式融入关系图结构。

详情
AI中文摘要

我们首次系统研究了掩码扩散语言模型(MDLM)在图到文本生成中的应用。我们分析了MDLM的生成轨迹——即迭代解码过程中令牌被掩码的顺序——发现与自回归LLM线性生成文本不同,MDLM自然优先处理实体,然后是关系词和功能词,结构令牌最后解决。我们进一步发现了一个先前未记录的监督微调失败模式:SFT通过过早地将结构性的句子结束令牌锚定在解码轨迹早期,破坏了这一策略,从而有效固定了输出长度,这可能导致信息遗漏或幻觉。为了解决这个问题,我们提出了λ缩放结构解码,一种无训练的推理时修改方法,降低结构令牌的置信度,并恢复了+9.4 BLEU-4。最后,我们引入了Graph-LLaDA,它将图Transformer编码器集成到LLaDA的解码过程中,以显式融入关系图结构。在LAGRANGE上的跨数据集评估表明,先前的基线过拟合于特定数据集模式,而基于LLM和MDLM的方法泛化能力显著更好。

英文摘要

We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.

2605.31558 2026-06-01 cs.LG cs.AI 版本更新

Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

位置注意力头与符号注意力头:学习动态、RoPE几何和长度泛化

Felipe Urrutia, Juan José Alegría, Cinthia Sanchez Macias, Jorge Salas, Cristian B. Calderon, Cristobal Rojas

发表机构 * CENIA & Faculty of Mathematics UC Santiago(CENIA与圣托里尼大学数学系) IMC UC & CENIA Santiago(UC IMC与圣托里尼CENIA)

AI总结 通过控制实验研究Transformer注意力头在位置推理和符号推理任务中的学习动态,发现位置和符号注意力头的不同机制及其对长度泛化的影响。

详情
AI中文摘要

基于Transformer的语言模型在当今社会广泛应用。因此,理解它们解决结构化任务的机制以及预测它们在新型场景中的行为对于安全部署至关重要。我们通过在两个结构等价的多跳推理任务上训练仅解码器Transformer(GPT-J)来研究注意力头的学习动态:一个需要位置推理的数字任务和一个需要符号推理的字母任务。利用最近引入的度量标准,该标准将注意力头的行为分类为给定提示下的位置性或符号性,我们表明成功学习与纯头(即表达为位置性或符号性的头)的出现相关。尽管任务结构等价,但它们施加了不同的机制需求:数字任务需要位置头和符号头,而字母任务仅需要符号头。然后,我们识别这些头的计算角色,描述它们实现的基本功能,并给出理论构造,展示单层基于RoPE的注意力如何通过几何可解释的查询、键和值操作实现这些功能。该分析通过一种新的差异概念形式化,在位置和符号机制对更长序列的鲁棒性上产生了定量分离。我们在受控模型和真实世界模型中经验验证了由此产生的预测,表明符号机制更可靠地外推到更长序列,而位置机制面临更严格的限制。

英文摘要

Transformer-based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured tasks and predicting how they may behave in novel scenarios is of great importance for safe deployment. We study the learning dynamics of attention heads in a controlled setting by training a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks: a number task requiring positional reasoning and a letter task requiring symbolic reasoning. Using a recently introduced metric that classifies attention-head behavior as positional or symbolic for a given prompt, we show that successful learning is associated with the emergence of pure heads, i.e., heads that express themselves as either positional or symbolic. Despite the tasks' structural equivalence, they impose different mechanistic demands: the number task requires both positional and symbolic heads, whereas the letter task requires only symbolic heads. We then identify the computational roles of these heads, characterize the basic functions they implement, and give theoretical constructions showing how single-layer RoPE-based attention can realize these functions through geometrically interpretable query, key, and value operations. This analysis yields a quantitative separation between positional and symbolic mechanisms in their robustness to longer sequences, formalized through a novel notion of discrepancy. We empirically validate the resulting predictions in both controlled and real-world models, showing that symbolic mechanisms extrapolate more reliably to longer sequences while positional mechanisms face sharper limitations.

2605.31556 2026-06-01 cs.CV cs.AI cs.CL cs.CY cs.HC 版本更新

Vision-Language Models Suppress Female Representations Under Ambiguous Input

视觉-语言模型在模糊输入下抑制女性表征

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

发表机构 * School of Engineering and Applied Sciences(工程与应用科学系) Department of Psychology(心理学系)

AI总结 本研究通过引入零样本度量LALS,发现视觉-语言模型在模糊输入下内部编码与输出存在系统性解耦,女性信号在生成前被抑制,揭示了模型对性别偏见的内部处理机制。

Comments 16 pages, 12 figures, 1 table

详情
AI中文摘要

对齐训练使视觉-语言模型(VLM)避免表达人口统计偏见,当性别清晰可见时,它们基本成功。但对于模糊输入(如全副武装的工人、从背后看到的人物)——实践中常见但很少研究的情况——我们发现,在模糊输入图像时,最小的提示压力就会暴露职业-性别默认值,模型甚至对强烈女性刻板印象的职业也倾向于男性。但这些输出是否反映了模型实际内部编码的内容?我们引入LALS(潜在关联倾向分数),一种零样本度量,将视觉标记激活投影到模型的文本嵌入空间中,以测量每个标记和层的概念关联。在15个职业、超过800张性别模糊图像和四个VLM上,内部表征和输出系统性地解耦:模型通常内部编码女性关联但输出男性。逐层分析揭示了一个不对称滤波器——男性信号端到端放大,而女性信号在中间网络达到峰值并在生成前被抑制——颜色消融实验表明,文化负载的视觉线索(如服装颜色)进一步调节这些内部关联。

英文摘要

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind) cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when prompting ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

2605.31535 2026-06-01 cs.CV cs.AI cs.LG 版本更新

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

RayDer: 从真实世界视频中可扩展的自监督新视角合成

Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer

发表机构 * Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心 (MCML))

AI总结 提出统一前馈变压器RayDer,将相机估计、场景重建和渲染整合为单一骨干,实现自监督新视角合成的可扩展幂律缩放,在零样本开放集性能上媲美有监督方法。

Comments Project Page: https://compvis.github.io/rayder

详情
AI中文摘要

自监督新视角合成(NVS)在扩展方面仍然具有挑战性,尽管视频数据丰富,这主要是由于在真实视频上训练的脆弱性以及多网络系统设计的难以预测的缩放行为。我们引入了RayDer,一个统一的前馈变压器,将相机估计、场景重建和渲染整合到一个单一骨干中,将自监督NVS转化为一个适定的单模型缩放问题。一个最小的动态状态,被视为干扰因素,吸收时变内容,使得在无约束的真实世界视频上稳定训练成为可能。重要的是,RayDer将静态场景NVS作为其目标任务:动态内容仅作为可扩展的监督被利用,而不是像动态场景(4D)NVS那样重建。在多个模型大小和数量级的数据上,RayDer展示了与数据和计算量相关的清晰幂律缩放,并优于静态场景数据混合。在大量基准测试中,RayDer实现了与最先进的有监督方法相竞争的强大零样本开放集性能。项目页面:https://compvis.github.io/rayder

英文摘要

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder

2605.31534 2026-06-01 cs.CV cs.AI 版本更新

Feature-Optimized Vision for Adaptive 3D Scene Reconstruction

面向自适应3D场景重建的特征优化视觉

Eric Liang

发表机构 * Oracle

AI总结 提出一种自适应特征优化视觉前端,通过评分纹理、可重复性、独特性、预期三角化角度和空间覆盖来分配每视图特征预算,以最大化有效轨迹并降低重建RMSE。

详情
AI中文摘要

三维场景重建依赖于局部图像证据,这些证据既要在视觉上具有判别性,又要在几何上有用。固定的特征阈值和均匀的特征预算易于部署,但可能会在重复纹理、低视差区域或不稳定点上浪费计算。本文提出了一种用于3D重建的自适应特征优化视觉前端。该方法通过纹理、可重复性、独特性、预期三角化角度和空间覆盖对候选特征进行评分,然后在固定重建流程下分配每视图特征预算以最大化有效轨迹。一个小型合成多视图原型在走廊、立面、物体桌面和杂乱场景中评估了四种选择策略。与随机、仅纹理和均匀网格基线相比,自适应策略在保持广泛图像覆盖的同时,获得了最佳的质量感知完整性和最低的聚合重建RMSE。结果并非替代现代学习匹配或神经重建系统;它是一个模块化的前端策略,可以使经典和学习的3D流程更审慎地决定将计算花费在哪些视觉证据上。

英文摘要

Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed feature thresholds and uniform feature budgets are easy to deploy, but they can waste computation on repeated texture, low-parallax regions, or unstable points. This paper proposes an adaptive feature-optimized vision front end for 3D reconstruction. The method scores candidate features by texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage, then allocates a per-view feature budget to maximize useful tracks under a fixed reconstruction pipeline. A small synthetic multi-view prototype evaluates four selection policies across corridor, facade, object-table, and cluttered scenes. Compared with random, texture-only, and uniform-grid baselines, the adaptive policy obtains the best quality-aware completeness and the lowest aggregate reconstruction RMSE while preserving broad image coverage. The result is not a replacement for modern learned matching or neural reconstruction systems; it is a modular front-end policy that can make classical and learned 3D pipelines more deliberate about which visual evidence they spend compute on.

2605.31520 2026-06-01 cs.SE cs.AI cs.CR 版本更新

Separating Secrets from Placeholders: A Hybrid CNN-CodeBERT Framework for Three-Class Credential Leakage Detection

区分秘密与占位符:一种用于三类凭证泄露检测的混合CNN-CodeBERT框架

Maksuda Bilkis Baby, Khushika Shah, Naiyue Liang, Lei Zhang

发表机构 * Information Systems, University of Maryland, Baltimore County, USA(信息学院,马里兰大学巴尔的摩县分校,美国) Computer Science and Electrical Engineering, University of Maryland, Baltimore County, USA(计算机科学与电气工程系,马里兰大学巴尔的摩县分校,美国)

AI总结 针对现有凭证泄露检测工具高误报率的问题,提出一种基于CodeBERT语义理解与字符级模式识别的三分类框架,将占位符/弱凭证作为独立类别建模,在新构建的9426样本数据集上达到0.86的MCC和0.90的宏F1分数,将高严重性警报减少33%而不牺牲安全覆盖。

Comments Accepted at ICSME 2026 (International Conference on Software Maintenance and Evolution)

详情
AI中文摘要

公共源代码仓库中的凭证泄露构成严重安全威胁,仅2024年就有超过2380万个秘密被暴露。现有检测工具由于刚性模式匹配和二元分类方案无法区分真实凭证与占位符或弱凭证,导致高误报率。我们提出一个三分类框架,明确将占位符或弱凭证建模为一个独立类别,利用基于CodeBERT的语义理解结合字符级模式识别。我们在一个新构建的包含10种编程语言、9426个样本的数据集上评估了我们的方法。我们的模型实现了0.86的马修斯相关系数和0.90的宏F1分数,对真实凭证泄露达到93%的召回率和89%的精确率,同时将高严重性警报减少了33.0%(从373降至250),且未牺牲安全覆盖。与先前的字符级方法相比,我们的方法将占位符或弱凭证检测的F1分数从54%提升至81%,同时保持了强大的跨语言泛化能力,在留一语言评估中,10种语言中有9种语言的F1分数超过0.80。

英文摘要

Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alone. Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes fail to distinguish genuine credentials from placeholder or weak credentials. We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, leveraging CodeBERT-based semantic understanding combined with character-level pattern recognition. We evaluate our approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages. Our model achieves a Matthews Correlation Coefficient of 0.86 and a macro F1-score of 0.90, achieving 93% recall and 89% precision for genuine credential leaks while reducing high severity alerts by 33.0% (from 373 to 250) without sacrificing security coverage. Compared to prior character-level approaches, our method improves placeholder or weak credential detection from 54% to 81% F1-score while maintaining strong cross language generalization, with 9 of 10 languages achieving F1 above 0.80 under leave-one-language-out evaluation.

2605.31509 2026-06-01 cs.LG cs.AI 版本更新

Skill Reuse as Compression in Agentic RL

智能体强化学习中的技能重用作为压缩

Zhikun Xu, Yu Feng, Jacob Dineen, Taiwei Shi, Jieyu Zhao, Ben Zhou

发表机构 * Arizona State University(亚利桑那州立大学) University of Pennsylvania(宾夕法尼亚大学) University of Southern California(南加州大学)

AI总结 提出ReuseRL方法,基于最小描述长度原则将成功轨迹压缩为可重用技能字典,并通过分割代价惩罚低效编码行为,在多个环境中提升分布内和分布外成功率。

Comments Work in progress

详情
AI中文摘要

使用强化学习训练的大语言模型智能体通常学习到脆弱且任务特定的捷径。我们假设,当智能体的成功轨迹在结构上可压缩,分解为一小组可重用的抽象模式时,智能体能够更好地泛化。为形式化这一观点,我们引入ReuseRL,它将智能体强化学习建立在最小描述长度原则之上。ReuseRL从成功轨迹中提取共享技能字典,并通过分割代价增强强化学习目标,显式惩罚编码效果差的特殊行为。我们证明了该压缩惩罚的PAC-Bayes泛化界。在ALFWorld、TextWorld-Cooking和Countdown-Stepwise上,ReuseRL在分布内和分布外成功率上均优于原始GRPO和强回合长度基线。

英文摘要

Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agents generalize better when their successful trajectories are structurally compressible, decomposed into a small set of reusable abstract patterns. To formalize this, we introduce ReuseRL, which grounds agentic RL in the Minimum Description Length (MDL) principle. ReuseRL extracts a shared skill dictionary from successful trajectories and augments the RL objective with a segmentation cost, explicitly penalizing idiosyncratic behaviors that encode poorly. We prove a PAC-Bayes generalization bound for this compression penalty. Across ALFWorld, TextWorld-Cooking, and Countdown-Stepwise, ReuseRL improves in- and out-of-distribution success over vanilla GRPO and strong round-length baselines.

2605.31500 2026-06-01 cs.LG cs.AI 版本更新

On Efficient Scaling of GNNs via IO-Aware Layers Implementations

通过IO感知层实现实现GNN的高效扩展

Daria Fomina, Daniil Krasylnikov, Alexey Boykov, Andrey Dolgovyazov, Vyacheslav Zhdanovskiy, Fedor Velikonivtsev

发表机构 * HSE University(俄罗斯高等经济大学) ITMO University(ITMO大学)

AI总结 针对GNN中稀疏不规则内存访问瓶颈,提出三种GPU内核族(SpMM卷积、归约聚合、注意力层)以减少数据移动并提升局部性,在真实图上实现高达8.5倍加速和76倍内存降低。

Comments International Conference on Machine Learning (ICML) 2026, Spotlight Paper

详情
AI中文摘要

图神经网络(GNN)受限于稀疏、不规则的内存访问。流行的框架如DGL和PyTorch Geometric支持通用消息传递,但复杂层通常具体化边中间结果,增加内存流量并限制在大图上的可扩展性。我们以I/O和算术强度为中心的观点表明,广泛使用的层分为三种内核族:基于SpMM的卷积、基于归约的聚合和基于注意力的层(GATv2/Graph Transformer)。对于每个族,我们开发了减少数据移动、改善局部性并在真实图上保持鲁棒性的GPU内核。我们还研究了图重排序,发现其影响取决于内核映射:它对邻居并行(以gather为主)内核的益处比特征并行设计更一致。实验表明,我们的融合注意力内核在Graph Transformer上达到高达$ extbf{3.9} imes$的加速(中位数$ extbf{1.6} imes$),在局部密集图上使用Tensor Core(块稀疏)变体达到高达$ extbf{7.3} imes$;对于GATv2,我们达到高达$ extbf{8.5} imes$的加速(中位数$ extbf{2.0} imes$),同时峰值内存降低高达$ extbf{76} imes$(中位数$ extbf{6} imes$)。我们的度感知归约内核达到高达$ extbf{10} imes$的加速(中位数$ extbf{2.6} imes$)。对于基于SpMM的层,适当缓存的cuSPARSE比DGL达到高达$ extbf{8} imes$的加速,并在大多数评估中优于评估的自定义基线。我们发布我们的实现作为即插即用的替代品,以支持可重现的、硬件感知的GNN加速。

英文摘要

Graph Neural Networks (GNNs) are bottlenecked by sparse, irregular memory access. Popular frameworks such as DGL and PyTorch Geometric support general message passing, but complex layers often materialize edge-wise intermediates, increasing memory traffic and limiting scalability on large graphs. We take an I/O- and arithmetic-intensity--centric view and show that widely used layers fall into three kernel families: SpMM-based convolutions, reduction-based aggregations, and attention-based layers (GATv2/Graph Transformer). For each family, we develop GPU kernels that reduce data movement, improve locality, and remain robust across realistic graphs. We also study graph reordering and find that its impact depends on the kernel mapping: it benefits neighbor-parallel (gather-dominated) kernels more consistently than feature-parallel designs. Empirically, our fused attention kernels reach up to $\textbf{3.9}\times$ speedup for Graph Transformer (median $\textbf{1.6}\times$), with Tensor Core (block-sparse) variants up to $\textbf{7.3}\times$ on locally dense graphs; for GATv2 we reach up to $\textbf{8.5}\times$ speedup (median $\textbf{2.0}\times$) while reducing peak memory by up to $\textbf{76}\times$ (median $\textbf{6}\times$). Our degree-aware reduction kernels achieve up to $\textbf{10}\times$ speedup (median $\textbf{2.6}\times$). For SpMM-based layers, properly cached cuSPARSE achieves up to $\textbf{8}\times$ speedup over DGL and outperforms evaluated custom baselines in the majority of evaluations. We release our implementations as drop-in replacements to support reproducible, hardware-aware GNN acceleration.

2605.31492 2026-06-01 cs.AI 版本更新

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

LinTree: 通过显式结构化搜索历史提升LLM推理能力

Liwei Kang, Yee Whye Teh, Wee Sun Lee

发表机构 * National University of Singapore(新加坡国立大学) University of Oxford(牛津大学)

AI总结 针对LLM推理中隐式搜索树导致性能不佳的问题,提出LinTree方法,通过添加父指针显式表示线性化树结构,在Blocks World、网格导航和Sokoban任务中提升了任务性能和搜索效率。

Comments 16 pages, 3 figures

详情
AI中文摘要

大型语言模型(LLM)通常通过生成中间轨迹来解决推理问题,这些轨迹探索并修正部分解决方案。从搜索的角度来看,这些轨迹可以视为线性化的搜索树,其中模型扩展部分解决方案,失败时放弃并回溯尝试替代方案。与传统启发式搜索相比,这种策略有一个潜在优势:它基于整个搜索轨迹而非仅当前局部状态进行条件化。我们首先测试LLM是否利用这一优势,通过比较轨迹条件推理策略与配备仅观察当前局部状态的LLM启发式的最佳优先搜索。在三个受控推理环境(Blocks World、网格导航和Sokoban)中,我们发现仅原始访问搜索历史不足以可靠地超越启发式搜索。然后我们研究了一个可能的原因:在LLM推理轨迹中,底层搜索树仅隐式表示,当模型回溯或切换分支时,轨迹并未明确标识正在重新访问哪个早期搜索状态。我们表明,添加简单的父指针以显式表示线性化树(LinTree)结构,相对于隐式推理模型和LLM启发式引导搜索,提高了任务性能和搜索效率。这些结果表明,当树结构被显式化时,搜索历史变得最为有用,从而激励LLM推理中更具结构意识的表示。

英文摘要

Large language models (LLMs) often solve reasoning problems by generating intermediate traces that explore and revise partial solutions. From a search perspective, these traces can be viewed as linearized search trees, where the model extends a partial solution, abandons it when it fails, and backtracks to try alternatives. Compared with traditional heuristic-guided search, such a policy has a potential advantage: it conditions on the whole search trace rather than only on the current local state. We first test whether LLMs utilize this advantage by comparing trace-conditioned reasoning policies against best-first search equipped with an LLM heuristic that only observes the current local state. Across three controlled reasoning environments, Blocks World, grid Navigation, and Sokoban, we find that raw access to search history alone is not enough to reliably outperform heuristic search. We then study one possible reason: in LLM reasoning traces, the underlying search tree is only implicitly represented, and when the model backtracks or switches branches, the trace does not explicitly identify which earlier search state is being revisited. We show that adding simple parent pointers to explicitly represent the linearized tree (LinTree) structure improves both task performance and search efficiency relative to implicit reasoning models and LLM-heuristic-guided search. These results suggest that search history becomes most useful when its tree structure is made explicit, motivating more structure-aware representations for LLM reasoning.

2605.31469 2026-06-01 cs.CL cs.AI cs.SD eess.AS 版本更新

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

扩展匈牙利语对话ASR:BEA-Dialogue+语料库

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Katalin Mády

发表机构 * Department of Telecommunications and Artificial Intelligence(电信与人工智能系) Budapest University of Technology and Economics(布达佩斯技术与经济大学) Speechtex Ltd.(Speechtex公司) ELTE Research Centre for Linguistics(ELTE语言学研究中心)

AI总结 针对匈牙利语对话语音识别训练数据不足的问题,本文通过放宽分割标准扩展BEA-Dialogue语料库至200小时,并评估基于Whisper和FastConformer的模型,证明基于序列化输出训练的微调能持续改善识别性能。

详情
AI中文摘要

匈牙利语对话自动语音识别受到公开对话式训练数据有限的制约。BEA-Dialogue语料库解决了这一需求,但其严格的说话人分离的训练/开发/测试分割将可用材料减少到仅85小时。在本文中,我们介绍了BEA-Dialogue+,这是该语料库的扩展版本,它放宽了实验者和对话伙伴的分割标准,同时保持主要说话人的完全分离。这产生了200小时转录的自然对话,并允许对额外训练数据与分割间说话人重叠之间的权衡进行受控研究。我们在两个语料库版本上评估了多个基于Whisper和FastConformer的模型,包括基于序列化输出训练(SOT)的对话转录微调。我们的结果表明,对于未经微调的模型,较大的语料库更具挑战性,而基于SOT的适应在WER、CER、cpWER和cpCER上产生了一致的改进。总体而言,BEA-Dialogue+为匈牙利语对话ASR提供了一个更大但仍具挑战性的基准,以及用于训练和评估对话转录系统的实用资源。

英文摘要

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.

2605.31468 2026-06-01 cs.AI 版本更新

AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

AutoSci: 面向完整科学生命周期的以记忆为中心的智能体系统

Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan, Guozheng Tang, Jiale Chen, Xinzhe Wu, Mingtian Yang, Chenyang Di, Jiajun Li, Lingching Tung, Peichao Lai, Yifei Xia, Ziyi Guo, Yanwei Xu, Yanzhao Qin, Shaoduo Gan, Xupeng Miao, Bin Cui

发表机构 * Peking University(北京大学)

AI总结 提出AutoSci,一个以记忆为中心、支持完整科学生命周期的智能体系统,通过结构化记忆、多阶段流程、有向无环图增强和演化机制实现自动化科研。

详情
AI中文摘要

科学研究传统上是人力密集型的,要求研究人员在漫长的项目周期中协调文献、想法、实验、手稿和审稿回复。基于LLM的科学智能体的兴起为自动化这一过程创造了机会。这样的系统必须支持完整的研究生命周期,跨项目维护结构化的持久记忆,并随时间改进自身的研究流程。然而,现有系统要么部分满足,要么未能满足这些要求,留下了统一自动化科学研究系统的空白。因此,我们提出了AutoSci,一个面向完整科学生命周期的以记忆为中心的智能体系统。AutoSci围绕四个模块组织。SciMem提供受模式约束的研究记忆,将可重复使用的科学知识分离为长期知识记忆,将项目级工件(如想法、实验、手稿和审稿)分离为活跃研究记忆。SciMem通过一个控制状态、上下文、验证、反馈和编排的框架执行从文献理解到反驳的五阶段生命周期。SciDAG通过有向无环图形式的多智能体操作符和可重用的阶段特定模板增强困难技能。SciEvolve将来自用户、实验、审稿和外部环境的反馈信号转化为对SciMem组织、SciFlow技能和SciDAG模板的版本化更新。这些模块共同使AutoSci成为一个持久的研究环境,能够在研究项目间执行、记忆和演化。代码仓库位于https://github.com/skyllwt/AutoSci。

英文摘要

Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems either partially satisfy or fail to satisfy these requirements, leaving a gap for a unified automated scientific research system. As a result, we present AutoSci, a memory-centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. SciMem provides schema-governed research memory, separating Long-Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project-level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five-stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG-shaped multi-agent operators and reusable stage-specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at https://github.com/skyllwt/AutoSci.

2605.31464 2026-06-01 cs.LG cs.AI 版本更新

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

GPU预测器:语言模型作为内核运行时优化的选择性替代

Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) AI2 Johns Hopkins University(约翰霍普金斯大学) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 研究利用语言模型作为GPU内核性能的选择性替代,通过强化学习提高预测准确性和校准度,在有限GPU评估预算下加速内核搜索。

Comments Code: https://github.com/codezakh/gpu-forecasters

详情
AI中文摘要

GPU内核是现代深度学习的主力,优化它们(通过进化搜索或编码代理)通常需要在目标硬件上重复测量。虽然这些测量提供了内核搜索所需的地面真实信号,但成本高昂,因为每次评估内核都需要编译并在GPU上重复执行。随着LLM推理的改进降低了编写新内核的成本,并且LLM驱动的搜索扩展到大的搜索预算,设备上的评估成为瓶颈。为了解决这个问题,我们研究LLM如何通过预测所提议内核的性能,作为选择性GPU替代用于内核评估。一个有用的替代应该是准确的,并且应该是选择性的,知道何时可能出错,并推迟到GPU。为了评估替代,我们测量其预测是否准确、校准良好,并且在有限的GPU测量预算下对恢复快速内核实际有用。接下来,我们研究强化学习是否能提高预测准确性和置信度校准。我们的实验表明,LLM可以准确预测相对内核性能,并且通过强化学习可以提高其实用性。在内核搜索中使用替代,使得搜索在相同的GPU评估预算下可以考虑多倍的候选,从而比同等预算的基线找到更快的内核。这些结果表明,LLM可以在内核优化中发挥更广泛的作用,作为GPU的虚拟模型,而不仅仅是搜索的内核生成器。

英文摘要

GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM-driven searches scale to large search budgets, on-device evaluation becomes a bottleneck. To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation, by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective, by knowing when it could be wrong, and deferring to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU-measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance, that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, and that leads to finding faster kernels than an equal-budget baseline. These results suggest that LLMs can play a broader role in kernel optimization, by acting as virtual models of a GPU rather than solely as kernel generators for search.

2605.31463 2026-06-01 cs.LG cs.AI cs.CL cs.DC 版本更新

PithTrain: A Compact and Agent-Native MoE Training System

PithTrain: 一个紧凑且面向智能体的MoE训练系统

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Xlue NVIDIA(英伟达)

AI总结 提出PithTrain,一个基于智能体原生设计原则的紧凑型MoE训练框架,通过引入ATE-Bench评估智能体任务效率,在保持生产框架吞吐量的同时,将智能体任务轮次和活跃GPU时间分别降低62%和64%。

详情
AI中文摘要

混合专家模型(MoE)已成为前沿语言模型的主导架构。为满足这一需求,生产框架经过多年的工程努力构建了优化的MoE训练栈。然而,为新的架构和系统优化而演进这些栈仍然代价高昂。随着AI编码智能体的兴起,它们可以自动化训练框架开发的部分工作并加速这一演进。但将这些智能体应用于现有框架会带来隐藏成本,这些成本在当今仅关注吞吐量的评估中不可见。我们将这一缺失维度命名为智能体任务效率(ATE):即使用编码智能体理解、操作和扩展框架的成本。基于四个智能体原生设计原则,我们构建了PithTrain,一个紧凑、智能体原生的MoE训练框架。我们进一步引入了ATE-Bench,涵盖现实世界的训练框架任务。我们的评估表明,PithTrain在吞吐量上与生产框架相当,并且在ATE-Bench上,PithTrain实现了更高的智能体任务效率,智能体轮次减少高达62%,活跃GPU时间减少64%。

英文摘要

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. With the rise of AI coding agents, they could automate parts of training-framework development and accelerate this evolution. But applying them to these existing frameworks carries hidden costs, invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows PithTrain matches the throughput of production frameworks, and on ATE-Bench, PithTrain enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.

2605.31446 2026-06-01 cs.CL cs.AI 版本更新

Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction

面向方面情感三元组抽取的诊断推理监督细粒度验证

Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Lingnan University(岭南大学) Education University of Hong Kong(香港教育大学)

AI总结 提出FiVeD框架,通过诊断推理监督进行细粒度验证,利用质量评分和错误分类等辅助任务提升ASTE三元组抽取的可靠性。

Comments 25 pages, 13 figures, and 6 tables

详情
AI中文摘要

方面情感三元组抽取(ASTE)旨在识别方面词、观点词和情感极性作为结构化三元组,为下游信息系统应用(如意见挖掘、可解释推荐和评论摘要)提供必要输入。先前工作主要关注端到端抽取,而对抽取三元组的事后验证仍相对未被充分探索。这一差距限制了ASTE系统的可靠性,因为预测的三元组可能在局部合理但全局无效。此外,候选无效性是多方面的,候选可用性本质上是分级的,这促使了一种细粒度验证机制,可以过滤或重新排序来自不同抽取器的输出。在本文中,我们提出了FiVeD,一个具有诊断推理监督的细粒度验证框架。具体来说,验证器通过多个互补目标进行训练,包括作为主要任务的有效性分类和质量评分估计,以及作为辅助任务的错误类型分类和理由生成。我们定义了层次化错误类别,并在语义和句法约束下构建合理的错误三元组,利用现成的LLM和特定任务评分标准生成质量评分和诊断理由。在推理过程中,生成的质量评分用于过滤候选输出,支持可调节的精确率-召回率权衡。在多个ASTE基线模型上的实验表明,FiVeD作为即插即用的验证模块,持续将抽取性能提升最多3.53个F1点。

英文摘要

Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, providing essential inputs for downstream information system applications such as opinion mining, explainable recommendations, and review summarization. Prior work mainly focuses on end-to-end extraction, while post hoc verification of extracted triplets remains comparatively underexplored. This gap limits the reliability of ASTE systems, since predicted triplets may be locally plausible while being globally invalid. Moreover, candidate invalidity is multi-faceted and candidate usability is inherently graded, motivating a fine-grained verification mechanism that can filter or re-rank outputs from diverse extractors. In this paper, we propose FiVeD, a framework for Fine-grained Verification with Diagnostic reasoning supervision. Specifically, the verifier is trained with multiple complementary objectives, including validity classification and quality score estimation as primary tasks, with error type classification and rationale generation as auxiliary tasks. We define hierarchical error categories and construct plausible incorrect triplets under semantic and syntactic constraints, and leverage an off-the-shelf LLM with task-specific rubrics to produce quality scores and diagnostic rationales. During inference, the resulting quality scores are used to filter candidate outputs, supporting adjustable precision-recall tradeoffs. Experiments across multiple ASTE baselines demonstrate that FiVeD consistently improves extraction performance by up to 3.53 F1 points as a plug-and-play verification module.

2605.31445 2026-06-01 cs.GT cs.AI cs.CL cs.LG 版本更新

Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

二手车销售机器人?作为讨价还价代理的LLM在部分信息下的诚实与轻信

Antonio Valerio Miceli-Barone, Vaishak Belle, Shay B. Cohen

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 研究LLM代理在模拟讨价还价场景中的表现,发现它们偏离博弈论均衡,尝试撒谎但无法有效利用信息不对称,且优化财务效用会增强谈判能力但增加不诚实行为。

Comments 18 pages, 14 figures

详情
AI中文摘要

在这项工作中,我们研究了模拟讨价还价场景中的代理,其中买方和卖方通过文本渠道进行通信,并试图在不同信息制度(完全信息、信息不对称或相互不确定性)下谈判互利交易。我们评估了它们相对于博弈论解决方案的表现,并进一步调查了它们的诚实性(披露或隐瞒信息、误导或欺骗的倾向)以及轻信性(信任或不信任对方提供信息的倾向)。我们研究了零样本LLM代理(使用简单的提示脚手架)以及微调代理,以探讨优化代理以最大化财务利润是否使它们成为更强的谈判者,但也更不诚实和更不信任。我们发现,现成的LLM都显著偏离博弈论均衡,它们试图对自己的私人信息撒谎,但无法有效利用信息不对称。对财务效用的微调使代理在达成更好交易方面更强,但也更不诚实,这突显了优化代理任务对其安全性可能带来的风险。我们发布了我们的代码和一个讨价还价场景数据集。

英文摘要

In this work we study agents in simulated bargaining scenarios, where a buyer and a seller communicate through a text channel and attempt to negotiate mutually beneficial trades, under different information regimes (complete information, information asymmetry or mutual uncertainty). We evaluate their performance w.r.t. game-theoretical solutions and further investigate their honesty (their tendency to disclose or withhold information or to mislead and deceive) as well as their credulity (their tendency to trust or distrust information provided by the other agent). We study zero-shot LLM agents with simple prompting scaffolding as well as fine-tuned agents, in order to investigate whether optimising the agents to maximise financial profits makes them stronger negotiators but also more dishonest and less trusting. We find that off-the-shelf LLMs all substantially deviate from game-theoretical equilibria, they attempt to lie about their private information but cannot efficiently exploit information asymmetries. Fine-tuning on financial utility makes the agents stronger at achieving better deals but also more dishonest, highlighting the risks that optimising agents for a task can have on their safety. We release our code and a dataset of bargaining scenarios.

2605.31444 2026-06-01 cs.AI cs.LO 版本更新

Answer-Set-Programming-based Abstractions for Reinforcement Learning

基于回答集编程的强化学习抽象方法

Rafael Bankosegger, Thomas Eiter, Johannes Oetsch

发表机构 * Siemens AG Österreich(西门子奥地利公司) TU Wien (Vienna University of Technology)(维也纳技术大学) Jönköping University(约翰·科平大学)

AI总结 本文提出使用回答集编程(ASP)实现CARCASS框架中的抽象,以解决强化学习中状态空间巨大带来的挑战,并通过积木世界和Minigrid两个领域的案例验证了该方法的有效性。

Comments Accepted for publication at the 42nd International Conference on Logic Programming (ICLP 2026). To appear in Theory and Practice of Logic Programming (TPLP)

详情
AI中文摘要

强化学习(RL)使自主智能体能够从经验中学习策略,但现实问题通常涉及巨大的状态空间,使得学习和泛化具有挑战性。因此,抽象和近似是必不可少的。关系强化学习(RRL)提供了一种推理对象及其关系的方法,而Martijn van Otterlo的CARCASS框架展示了如何使用逻辑表示在一阶域中建模马尔可夫决策过程(MDP)。CARCASS最初用Prolog实现,利用领域知识创建强大的抽象。我们探索了回答集编程(ASP),这是一种丰富的、与Prolog相反完全声明式的建模语言,以实现CARCASS抽象。我们在两个领域(即积木世界和Minigrid)的案例研究中评估了基于ASP的实现。我们的结果表明,使用ASP的CARCASS为构建RL抽象提供了一种有前景的方法,尤其是在领域知识可用的情况下。

英文摘要

Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are therefore essential. Relational Reinforcement Learning (RRL) offers a way to reason about objects and their relations, and the CARCASS framework by Martijn van Otterlo demonstrates how logical representations can model Markov Decision Processes (MDPs) in first-order domains. Originally implemented in Prolog, CARCASS leverages domain knowledge to create powerful abstractions. We explore Answer-Set Programming (ASP), which is a rich and, contrary to Prolog, fully declarative modelling language, to realise CARCASS abstractions. We evaluate our ASP-based implementation in case studies of two domains, viz. Blocks World and Minigrid. Our results indicate that CARCASS with ASP provides a promising approach to constructing abstractions for RL, especially when domain knowledge is available.

2605.31432 2026-06-01 cs.CL cs.AI cs.SD 版本更新

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

DOA:面向语音大语言模型的长形式同声传译的无训练解码器仅注意力策略

Sara Papi, Luisa Bentivogli

发表机构 * Fondazione Bruno Kessler(布鲁诺·克塞塞基金会)

AI总结 提出DOA策略,利用解码器自注意力导出代理对齐,无需训练即可实现语音大语言模型在长形式同声传译中的流式决策。

详情
AI中文摘要

同声语音到文本翻译(SimulST)在语音尚未完成时生成翻译,需要流式策略来决定何时读取和何时写入。最先进的方法依赖于基于注意力的编码器-解码器模型,其中交叉注意力提供显式的对齐信号。相比之下,语音大语言模型(SpeechLLMs)是仅解码器架构,仅依赖自注意力。这引发了一个核心问题:解码器自注意力是否包含足够稳定的对齐信号来指导流式策略。此外,现有方法通常依赖于基于训练的适应或启发式等待-$k$策略,并且尚未在长形式场景中得到验证。为了填补这些空白,我们提出了仅解码器注意力(DOA),这是一种无训练策略,通过从自注意力中导出代理对齐,使现成的SpeechLLMs能够进行长形式同声传译。在Phi4-Multimodal和Qwen3-Omni上的实验表明,DOA提供了有效的对齐信号来支持流式决策,实现了低延迟的长形式SimulST,其质量接近无需重新训练的离线解码。

英文摘要

Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.

2605.31421 2026-06-01 cs.CL cs.AI cs.DS 版本更新

Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm

神经符号句法分析:用CYK算法塑造神经网络

Fabio Massimo Zanzotto, Federico Ranaldi, Giorgio Satta

发表机构 * Human-centric ART, University of Rome Tor Vergata(人本导向ART,罗马托尔维加塔大学) DEI, University of Padua(迪埃学院,帕多瓦大学)

AI总结 本文提出CYKNN,一种将CYK算法直接编码为可训练矩阵-向量乘法的循环神经网络架构,在简单语法任务上超越大语言模型,开辟神经符号方法新途径。

Comments 9 content pages

详情
AI中文摘要

在本文中,我们展示了将算法直接注入神经网络架构的可能性。我们聚焦于一个复杂算法,即用于解析乔姆斯基范式中上下文无关文法的Cocke-Younger-Kasami(CYK)算法,并提出了CYKNN,一种将CYK算法编码为可训练矩阵-向量乘法的简单循环神经网络架构。我们使用一个包含4种变体的非常简单的语法进行实验,结果表明,我们的方法在上下文学习设置中优于参数超过200亿的现有大语言模型,以及经过LoRA微调的Qwen系列较小语言模型。我们的尝试为神经符号方法论开辟了一条不同的途径。

英文摘要

In this paper, we show the possibility of a direct injection of algorithms into neural network architecture. We focus on a complex algorithm, that is, Cocke-Youger-Kasami (CYK) for parsing context-free grammars in Chomsky Normal Form and we propose CYKNN, a simple recurrent neural network architecture for encoding the CYK algorithm in trainable matrix-vector multiplications.We experimented with a very simple grammar with 4 variations showing that our approach outperforms existing LLMs with more than 20B parameters with an in-context learning setting and smaller LLMs of the Qwen family fine-tuned with LoRA. Our attempt paves the way to a different approach to neuro-symbolic methodologies.

2605.31410 2026-06-01 cs.AI 版本更新

FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning

FAM-Bench:面向条件感知的“食物即药物”推理的多模态基准

Mingyang Mao, Bhargav Rishi Medisetti, Utkarsh Grover, Tanvir Ibrahim, Wenyan Li, Tingting Zhang, Xiaomin Lin

发表机构 * Department of Electrical Engineering, University of South Florida(佛罗里达州立大学电气工程系) Muma College of Business, University of South Florida(佛罗里达州立大学Muma商学院) Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

AI总结 提出FAM-Bench多模态基准,包含2500个营养专家验证的实例,通过菜肴适宜性评估和比较分析两个任务,测试模型在特定健康状况下对食物选择的推理能力。

详情
AI中文摘要

“食物即药物”要求模型不仅推理一道菜是什么或它含有哪些营养,还必须决定一个具体的食物选择是否适合特定的健康状况。现有的食物AI基准主要评估菜肴识别、食谱理解、营养估算或一般营养问答,而这一健康感知决策层在很大程度上未经测试。我们引入了FAM-Bench,一个多模态的“食物即药物”基准,包含2500个营养专家验证的实例,涵盖13种与饮食相关的健康状况。该基准包含两个互补任务:菜肴级适宜性评估,其中模型根据菜肴图像和配料列表判断其是否适合某种状况;以及比较性菜肴分析,其中模型根据状况特定适宜性对四个候选菜肴进行排序。这两个任务都需要整合配料证据、视觉制备线索和临床营养约束,为语言和视觉语言模型中的基于事实的健康感知推理提供了一个标准化的测试平台。

英文摘要

Food-as-Medicine requires models to reason beyond what a dish is or what nutrition it contains: they must decide whether a concrete food choice is appropriate for a specific health condition. Existing food AI benchmarks primarily evaluate dish recognition, recipe understanding, nutrient estimation, or general nutrition question answering, leaving this health-aware decision layer largely untested. We introduce FAM-Bench, a multi-modal Food-as-Medicine benchmark with 2500 nutrition-expert-verified instances across 13 diet-related health conditions. The benchmark contains two complementary tasks: dish-level suitability assessment, where models judge whether a dish is suitable for a condition from its image and ingredient list, and comparative dish analysis, where models rank four candidate dishes by condition-specific suitability. Both tasks require integrating ingredient evidence, visual preparation cues, and clinical nutrition constraints, providing a standardized testbed for grounded health-aware reasoning in language and vision-language models.

2605.31408 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

大型语言模型代理中的技能可用性与呈现粒度:一项受控的SkillsBench研究

Xiaonan Xu, Wenjing Wu

发表机构 * Computer Information Technology, Northern Arizona University(计算机信息科技,北亚利桑那大学) Computer Science, University of Colorado Boulder(计算机科学,科罗拉多大学博尔德分校)

AI总结 通过受控实验研究技能知识的呈现粒度对下游任务成功率的影响,发现技能可用性显著提升成功率,而呈现粒度变化影响较小且不确定。

详情
AI中文摘要

技能文档在推理时为大型语言模型代理提供程序性知识。本文研究受控技能知识的呈现粒度是否会改变下游任务成功率。实验使用固定的SkillsBench版本,包含30个任务、领域平衡的子集(由官方oracle运行验证)、两种启用推理的模型配置、六种技能条件,以及每个任务-条件-模型单元五次试验。技能可用性是最清晰的经验信号。相对于无技能,技能条件使GPT-5.5的任务平均通过率提高26.7至36.0个百分点,使DeepSeek V4-Flash提高18.0至26.0个百分点。最终数据包含1800行,每个模型900行。任务是推理单元。在每个任务-条件-模型单元内聚合五次试验,然后在30个任务上估计配对对比。主要的呈现对比较小且不确定。低抽象指导与高抽象指导相比,GPT-5.5差异为+0.7个百分点,DeepSeek V4-Flash差异为-6.7个百分点,两者的95%自助法置信区间均跨越零。在中抽象指导中添加一个工作示例与无示例变体相比,差异分别为+0.7和+1.3个百分点。平均奖励稳健性检验保持了相同的实质性结论。在这个受控子集中,技能可用性与更高的成功率相关,而测试的呈现粒度变化产生的影响较小、不确定且依赖于模型。

英文摘要

Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task domain-balanced subset validated by official oracle runs, two reasoning-enabled model configurations, six skill conditions, and five trials per task-condition-model cell. Skill availability is the clearest empirical signal. Relative to no skill, skill conditions increase task-mean pass rate by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. The final data contain 1,800 rows, with 900 rows for each model. The task is the inference unit. Five trials are aggregated within each task-condition-model cell before paired contrasts are estimated over 30 tasks. The primary presentation contrasts are smaller and uncertain. Low-abstraction guidance differs from high-abstraction guidance by +0.7 percentage points for GPT-5.5 and -6.7 percentage points for DeepSeek V4-Flash, with both 95% bootstrap confidence intervals crossing zero. Adding one worked example to medium-abstraction guidance differs from the no-example variant by +0.7 and +1.3 percentage points. Mean-reward robustness checks preserve the same substantive conclusion. In this controlled subset, skill availability is associated with higher success than no skill, while the tested presentation-granularity changes yield small, uncertain, and model-dependent effects.

2605.31404 2026-06-01 cs.CL cs.AI 版本更新

The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning

剑、盾与阿喀琉斯之踵:大型语言模型在导航规划中空间推理的语言归纳偏置特征化

Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian, Shaowen Chen, Xian Wei, Ke Li, Xiong You

发表机构 * East China Normal University(东华大学) Information Engineering University(信息工程大学) Zhengzhou University(郑州大学)

AI总结 提出双干预框架,通过表示干预和上下文干预分离语言结构与上下文线索,揭示LLM在导航规划中语言归纳偏置的规律:拓扑信息是稳健规划的支柱,语言格式是双刃剑,语义信息是致命弱点。

详情
AI中文摘要

基于大型语言模型(LLM)的导航系统通常构建显式空间表示(如拓扑图、语义栅格图)并将其转换为文本描述作为LLM的输入。然而,此类基于文本的空间表示的语言结构及其包含的上下文特征(如拓扑、几何)的选择通常被视为中性的工程决策,而非塑造LLM行为的关键因素。为填补这一空白,我们提出一个双干预框架,将语言结构与不同的上下文线索分离,以评估LLM在导航规划中的语言归纳偏置。在该框架中,表示干预改变语言格式和语言压缩程度,阐明语言表示何时支持或抑制导航规划。上下文干预结合上下文特征组合与冲突探测,明确澄清LLM在处理不同上下文线索时的偏好和弱点。跨多种空间推理任务和多个模型规模的实验揭示了一致模式:拓扑信息是坚固的盾牌和稳健规划的支柱;语言格式是双刃剑,其效果取决于模型大小、任务需求和压缩程度;语义信息是致命的阿喀琉斯之踵——错误的语义线索会系统性地破坏规划过程。总体而言,我们的研究表明,基于LLM的导航中有效的文本空间表示应保持拓扑完整性,根据模型能力校准表示压缩,并确保语义正确性,而非简单采用单一表示。我们的代码公开于https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias。

英文摘要

Large Language Model (LLM)-based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text-based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual-interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double-edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel -- incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text-based spatial representations in LLM-based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias.

2605.31393 2026-06-01 cs.CL cs.AI 版本更新

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

面向手语翻译的大语言模型目标端释义增强

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

发表机构 * III-LIDI Universidad Nacional de La Plata(III-LIDI国立拉普拉塔大学) CDTEC, Federal University of Pelotas(CDTEC,联邦 Pelotas 大学) CONICET III-LIDI Comision de Investigaciones Cientificas Universidad Nacional de La Plata(科学委员会国立拉普拉塔大学) Universidade Federal de Pelotas(联邦 Pelotas 大学)

AI总结 针对手语翻译中平行语料稀缺和目标词汇长尾分布的问题,提出利用GPT-4o生成参考句子的受控释义变体进行目标端增强,并在三种手语数据集上验证了方法的有效性。

Comments Accepted at GenSign (https://genai4sl.github.io/) at CVPR 2026. Non proceedings track

详情
AI中文摘要

手语翻译(SLT)仍然受到有限的配对手语视频/文本语料库和长尾目标词汇的限制。我们研究了目标端增强方法,其中GPT-4o生成参考句子的受控释义变体,而手语输入保持不变。采用基于Signformer姿态的Transformer,在两阶段调度下进行训练:先在增强语料库上预训练,然后在原始参考句子上微调。我们在三个具有互补挑战的数据集上进行了评估:PHOENIX14T(德国手语),具有适度的词汇多样性;GSL(希腊手语),具有高度受控、重复的录制;以及LSA-T(阿根廷手语),具有严重的长尾稀疏性。在PHOENIX14T上,增强将BLEU-4从9.56提高到10.33。接近饱和的GSL基线和极其稀疏的LSA-T设置揭示了该方法的局限性。据我们所知,这是第一项将LLM生成的目标端释义和LLM作为评估者应用于手语翻译的研究。语义评估揭示了词汇重叠指标低估的忠实度提升。

英文摘要

Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and heavy-tailed target vocabularies. We study target-side augmentation in which GPT-4o generates controlled paraphrase variants of reference sentences while the sign input remains unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate on three datasets spanning complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly ontrolled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and extremely sparse LSA-T setting reveal the limits of the approach. To our knowledge, this is the first study to apply LLM-generated target-side araphrases and LLM-as-a-Judge evaluation to SLT. The semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.

2605.31377 2026-06-01 cs.IR cs.AI 版本更新

DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval

DynaTree: 面向时效性新闻检索的动态智能检索树

Siyuan Qi, Xinyuan Wang, Yingxuan Yang, Haochuan Guo, Jianghao Lin, Weiwen Liu, Yong Yu, Weinan Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出DynaTree两阶段框架,通过离线构建可复用检索树和在线轻量子树选择,实现高效、自适应的时效性新闻检索,在Syft新闻基准和BEIR数据集上优于标准RAG和现有智能体方法。

详情
AI中文摘要

智能体检索增强生成通过集成规划、工具使用和迭代推理改进了检索,但现有的智能体RAG方法通常将语义扩展与检索决策耦合在短视推理循环中,导致推理成本高且不适用于时效性新闻检索。我们提出DynaTree,一个高效自适应新闻检索的两阶段框架。在离线阶段,DynaTree使用协调的智能体构建一个可复用的检索树,具体化查询主题的语义空间。在在线阶段,DynaTree在时间局部评估代理上执行轻量级日常子树选择,无需进一步的智能体推理、树修改或重新训练。在多日Syft新闻基准和多个BEIR数据集上的实验表明,DynaTree实现了强大的召回和排序性能,始终优于标准RAG和先前的智能体基线。我们进一步在Syft生产系统中部署DynaTree,并通过2026年1月28日至2月6日的在线A/B测试进行评估。动态适应变体将固定离线选择子树的生存率从0.32-0.53提高到0.59-0.73,并且在每个评估日都优于现有的生产召回器。这些结果表明,持久的、结构感知的语义扩展可以将离线智能体推理转化为实际改进,覆盖范围、新鲜度和相关性在真实世界新闻检索中均得到提升。

英文摘要

Agentic Retrieval-Augmented Generation improves retrieval by integrating planning, tool use, and iterative reasoning, but existing agentic RAG methods often couple semantic expansion with retrieval decisions in short-horizon inference loops, leading to high inference cost and limited suitability for time-sensitive news retrieval. We propose DynaTree, a two-stage framework for efficient and adaptive news retrieval. In the offline stage, DynaTree uses coordinated agents to construct a reusable retrieval tree that materializes the semantic space of a query topic. In the online stage, DynaTree performs lightweight daily subtree selection over a time-localized evaluation proxy, without further agentic reasoning, tree modification, or retraining. Experiments on a multi-day Syft news benchmark and multiple BEIR datasets show that DynaTree achieves strong recall and ranking performance, consistently outperforming standard RAG and prior agentic baselines. We further deploy DynaTree in the Syft production system and evaluate it through online A/B testing from Jan. 28 to Feb. 6, 2026. The dynamically adapted variant improves survival rate from 0.32-0.53 to 0.59-0.73 over a fixed offline-selected subtree and outperforms existing production recallers on every evaluation day. These results show that persistent, structure-aware semantic expansion can translate offline agentic reasoning into practical improvements in coverage, freshness, and relevance for real-world news retrieval.

2605.31373 2026-06-01 cs.LG cs.AI 版本更新

Scaling Higher-Order Graph Learning with Maximal Clique Complexes

基于最大团复形的规模化高阶图学习

Antoine Vialle, Aref Einizade, Fragkiskos D. Malliaros, Jhony H. Giraldo

发表机构 * LTCI, Télécom Paris Institut Polytechnique de Paris(巴黎理工学院LTCI研究所) SAMOVAR, Télécom SudParis Institut Polytechnique de Paris(巴黎理工学院南巴黎研究所) CentraleSupélec, Inria Université Paris-Saclay(中央理工-巴黎高等师范学院与巴黎-萨克雷大学)

AI总结 提出简化与分解的细胞Weisfeiler-Leman测试及最大团复形,结合CliqueWalk随机游走,实现可扩展的高阶图神经网络。

详情
AI中文摘要

图神经网络(GNN)仅限于建模成对交互,而基于细胞复形的高阶模型虽然具有更强的表达能力,但通常可扩展性差。我们引入了简化和分解的细胞Weisfeiler-Leman测试(sCWL和fCWL),它们在保持CWL测试表达力的同时提高了计算效率。我们进一步引入了最大团复形,使得可扩展的CWN在降低时间和内存复杂度的同时保持强大的实证性能。为了避免显式枚举团,我们提出了CliqueWalk,一种有偏随机游走,用于采样最大团,并且其复杂度与图大小呈线性关系。这些贡献为高阶图表示学习提供了一个可扩展的拓扑学习框架。

英文摘要

Graph neural networks (GNNs) are limited to modeling pairwise interactions, while higher-order models based on cell complexes achieve greater expressivity but often suffer from poor scalability. We introduce simplified and factored cellular Weisfeiler Leman tests (sCWL and fCWL), which preserve the expressivity of the CWL test while improving computational efficiency. We further introduce the maximal clique complex, enabling scalable CWNs with reduced time and memory complexity while retaining strong empirical performance. To avoid explicit clique enumeration, we propose CliqueWalk, a biased random walk that samples maximal cliques and scales linearly with graph size. These contributions yield a scalable topological learning framework for higher-order graph representation.

2605.31370 2026-06-01 cs.AI 版本更新

HypoAgent: An Agentic Framework for Interactive Abductive Hypothesis Generation over Knowledge Graphs

HypoAgent: 一种用于知识图谱上交互式溯因假设生成的智能体框架

Yisen Gao, Yixi Cai, Tianshi Zheng, Jiaxin Bai, Yangqiu Song

发表机构 * The Hong Kong University of Science and Technology(香港科学与技术大学) Beihang University(北航) Hong Kong Baptist University(香港 Baptist 大学)

AI总结 提出HypoAgent框架,通过三个智能体(意图识别、假设生成、根因分析)实现知识图谱上的交互式溯因假设生成,在常识和生物医学领域知识图谱上达到最优语义相似度。

Comments Under Review

详情
AI中文摘要

知识图谱上的溯因推理旨在生成解释观察到的实体或事实的逻辑假设。现有的可控假设生成方法允许用户通过显式条件引导这一过程,但在交互式场景中仍存在局限:它们难以在多轮对话中锚定不断变化的自然语言意图,并且在生成的假设失败时缺乏细粒度诊断。为解决这些问题,我们提出了HypoAgent,一种用于知识图谱上交互式溯因假设生成的智能体框架。HypoAgent集成了三个智能体:意图识别智能体,将用户话语和对话历史转化为可执行的知识图谱条件;假设生成智能体,根据提取的用户意图执行可控假设生成;以及根因分析智能体,诊断不可靠的假设片段并利用知识图谱邻域探测来识别支持的改进。在常识和生物医学领域特定知识图谱上的实验表明,HypoAgent在单轮、多轮和无条件设置下均达到了最先进的语义相似度。我们的代码可在https://github.com/HKUST-KnowComp/HypoAgent获取。

英文摘要

Abductive reasoning over knowledge graphs aims to generate logical hypotheses that explain observed entities or facts. Existing controllable hypothesis generation methods allow users to guide this process with explicit conditions, but they remain limited in interactive settings: they struggle to ground evolving natural-language intents across multi-turn dialogues and provide little fine-grained diagnosis when generated hypotheses fail. To address these limitations, we propose HypoAgent, an Agentic framework for interactive abductive Hypothesis Generation over knowledge graphs. HypoAgent integrates three agents: an Intent Recognition Agent that grounds user utterances and dialogue history into executable KG conditions, a Hypothesis Generation Agent that performs controllable hypothesis generation according to the extracted user intention, and a Root Cause Analysis Agent that diagnoses unreliable hypothesis fragments and leverages KG neighborhood probing to identify supported refinements. Experiments on commonsense and biomedical domain-specific knowledge graphs demonstrate that HypoAgent achieves state-of-the-art semantic similarity under single-turn, multi-turn, and unconditional settings. Our code is available at https://github.com/HKUST-KnowComp/HypoAgent.

2605.31365 2026-06-01 cs.AI 版本更新

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

学习适应:通过认知感知探索实现自我改进的网络智能体

Weile Chen, Bingchen Miao, Qifan Yu, Wendong Bu, Guoming Wang, Wenqiao Zhang, Shengyu Zhang, Juncheng Li, Siliang Tang

发表机构 * Zhejiang University(浙江大学)

AI总结 提出SCALE框架,利用选择器、预测器和评判器三个对抗角色,通过环境探索自主发现智能体局限性并扩展认知边界,结合SCALE-Hop图探索策略和SCALE-20k数据集,显著提升多模态大语言模型在多种网络环境中的性能和泛化能力。

Comments 24 pages

详情
AI中文摘要

多模态大语言模型的最新进展在网络智能体领域取得了令人瞩目的进展。然而,现有的网络智能体通常依赖于手工设计的执行流程或昂贵的专家轨迹,限制了它们在复杂动态环境中的适应性。为了解决这些挑战,我们提出了SCALE(自我认知感知学习与探索),它利用三个对抗角色——选择器、预测器和评判器——通过环境探索自主发现智能体的局限性并扩展其认知边界。此外,我们提出了SCALE-Hop,一种图探索策略,有助于全局规划并帮助智能体避免局部探索陷阱。为了进一步支持学习,我们构建了SCALE-20k,一个从19个真实世界网站收集的大规模数据集,包含多样化的任务类型和由SCALE探索轨迹生成的结构化演示。实验结果表明,我们的方法显著提高了多种多模态大语言模型在各种网络环境中的性能和泛化能力。我们的框架为构建真正自主和自适应的网络智能体提供了一种可扩展且可泛化的解决方案。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose SCALE (Self-Cognitive-Aware Learning and Exploration), which leverages three adversarial roles, Selector, Predictor, and Judger to autonomously discover the agent's limitations and expand its cognitive boundaries through environmental exploration. Moreover, we propose SCALE-Hop, a graph exploration strategy that facilitates global planning and helps agents avoid local exploration traps. To further support learning, we construct SCALE-20k, a large-scale dataset collected from 19 real-world websites, containing diverse task types and structured demonstrations generated from SCALE's exploration traces. Experimental results show that our approach significantly improves the performance and generalization of multiple MLLMs in various web environments. Our framework offers a scalable and generalizable solution for building truly autonomous and adaptive web agents.

2605.31361 2026-06-01 cs.MA cs.AI cs.LG 版本更新

Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning

梦见他人:多智能体强化学习中世界模型内的潜在队友建模

Tomas Leroy-Stone

发表机构 * Tomas Leroy-Stone

AI总结 提出一种将队友建模为世界模型中可学习组件的方法,通过分解潜在状态并引入心智理论头来推断队友行为,实现零样本和少样本协调。

Comments 5 pages, 2 figures. Accepted as a poster at the 2026 World Modeling Workshop. Conceptual workshop paper

详情
AI中文摘要

在合作多智能体强化学习(MARL)中,智能体必须与内部策略和意图不可直接观察的伙伴协调。虽然像Dreamer这样的世界模型在单智能体设置中表现出强大的泛化能力和样本效率,但它们由于无法处理队友引起的不确定性而在MARL中的应用受到限制。我们提出一个新的视角:将队友视为智能体世界模型中的结构化、可学习组件。我们引入一种架构,将Dreamer风格的循环状态空间模型(RSSM)的潜在状态分解为环境和队友组件,并学习一个辅助的心智理论(ToM)头,从部分轨迹中推断队友行为的潜在嵌入,如角色、意图和预测动作。这些队友潜在变量影响演员和评论家,使智能体能够想象并适应多样化的合作者。我们概述了这种方法如何在部分可观察设置中支持零样本和少样本协调,并提出了一套基准测试和评估协议来评估其影响。这项工作将世界模型定位为不仅是环境动态的预测器,而且是社会行为的模拟器,为可泛化、与人类兼容的AI开辟了新方向。

英文摘要

In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single-agent settings, their application to MARL remains limited by an inability to handle teammate-induced uncertainty. We propose a new perspective: treat teammates as structured, learnable components within the agent's world model. We introduce an architecture that factorizes the latent state of a Dreamer-style recurrent state-space model (RSSM) into environment and teammate components, and learns an auxiliary Theory-of-Mind (ToM) head to infer latent embeddings of partner behavior such as character, intent, and predicted actions from partial trajectories. These teammate latents condition the actor and critic, enabling the agent to imagine and adapt to diverse collaborators. We outline how this approach can support zero-shot and few-shot coordination in partially observable settings and propose a set of benchmarks and evaluation protocols to assess its impact. This work positions world models as not only predictors of environmental dynamics, but as simulators of social behavior, opening new directions for generalizable, human-compatible AI.

2605.31360 2026-06-01 cs.LG cs.AI 版本更新

dashi: A Python library for Dataset Shift Characterization to Support Trustworthy AI Development and Deployment

dashi: 一个用于数据集偏移表征以支持可信AI开发和部署的Python库

David Fernández-Narro, Pablo Ferri, Ángel Sánchez-García, Juan M. García-Gómez, Carlos Sáez

发表机构 * Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de Valéncia(生物医学数据科学实验室,信息与通信技术大学,巴塞罗那理工大学)

AI总结 本文介绍dashi,一个开源Python库,通过无监督(基于信息几何和非参数统计流形)和有监督方法,对数据集偏移进行探索、量化和表征,以支持AI生命周期中的可信度评估。

详情
AI中文摘要

人工智能(AI)生命周期需要对底层数据动态有透彻理解,以实现稳健、安全且经济高效的AI开发和使用。数据集偏移定义为训练和测试数据分布之间的变化。无论是随时间(时间性)还是跨不同站点(多源)发生,它们都可能严重降低模型性能并损害数据质量。这在健康AI中尤为重要,因为不受控制的偏移在训练和操作阶段都可能严重影响患者的安全和基本权利。虽然协变量偏移、先验偏移和概念偏移的理论基础已很完善,但缺乏可访问且全面的软件工具来执行其分析。我们介绍了dashi,一个开源Python库,旨在对数据集偏移进行探索、量化和表征。dashi提供双重方法:一种无监督方法,利用信息几何和非参数统计流形进行数据变异性表征和分析(例如,信息几何时间图和多源变异性指标,如全局概率偏差和源概率异常度);以及一种有监督方法,量化和表征模型性能退化。无监督和有监督方法均适用于用户定义的时间批次和域/源批次。我们在三个模拟和真实世界的健康AI案例研究(妊娠期糖尿病、COVID-19和紧急医疗调度)中展示了dashi的实用性。通过提供交互式视觉分析和变异性指标,dashi支持AI生命周期阶段的可信度,通过评估数据一致性和AI性能实现稳健且安全的机器学习管道。

英文摘要

The Artificial Intelligence (AI) life cycle requires a thorough understanding of the underlying data dynamics for robust, safe and cost-effective AI development and use. Dataset shifts are defined as changes between train and test data distributions. Whether occurring over time (temporal) or across different sites (multi-source), they can severely degrade model performance and compromise data quality. This is particularly important in health AI, where the safety and fundamental rights of patients can be severely affected by uncontrolled shifts both at training and operational stages. While the theoretical foundations of covariate, prior, and concept shifts are well established, there is a lack of accessible and comprehensive software tools to perform their analysis. We introduce dashi, an open-source Python library designed for the exploration, quantification, and characterization of dataset shifts. dashi provides a dual approach: an unsupervised approach that leverages information geometry and non-parametric statistical manifolds to data variability characterization and analysis (e.g., Information Geometric Temporal plots and Multi-Source Variability metrics like Global Probabilistic Deviation and Source Probabilistic Outlyingness), and a supervised approach that quantifies and characterizes model performance degradation. Both unsupervised and supervised approaches work across user-defined temporal and domain/source batches. We demonstrate the utility of dashi on three simulated and real-world health AI case studies on gestational diabetes mellitus, COVID-19 and emergency medical dispatch. By providing interactive visual analytics and variability metrics, dashi supports trustworthiness of AI life cycle stages enabling robust and safe machine learning pipelines through the assessment of data coherence and AI performance.

2605.31354 2026-06-01 cs.AI cs.LG 版本更新

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

资源受限视觉代理中共享状态协作的故障模式诊断

Yunpeng Zhou

发表机构 * Nanjing University of Information Science \& Technology, Nanjing, China

AI总结 本文通过噪声累积视角研究弱学习者(4B-8B模型)在共享工作记忆下的协作推理故障模式,提出CoSee审计框架追踪文档视觉问答中的信息流,发现朴素共享工作空间会放大幻觉而非解决,并识别出噪声强化和策略崩溃两种主要故障模式。

详情
AI中文摘要

模块化视觉推理系统越来越依赖共享工作记忆进行多步协作,但低容量场景下中间状态演化的故障动态仍未被充分探索。我们通过噪声累积的视角研究弱学习者(4B-8B模型)的协作推理故障模式。我们引入了CoSee,一个审计框架,形式化了读-写-验证循环以追踪文档视觉问答中的信息流。在多页、图表和基于网页的基准测试中,我们发现了一个反直觉的退化:朴素的共享工作空间往往放大而非解决幻觉。我们识别出两种主要的故障模式:噪声强化(未基于事实的笔记被重新用作证据)和策略崩溃(添加的上下文使模型转向欠指定的短形式答案)。使用成本-准确率帕累托前沿,我们表明增加计算量在没有显式验证的情况下可能与性能负相关。我们的发现表明,对于资源受限的代理,瓶颈不在于推理深度而在于通信保真度,为可靠的模块化设计提供了轨迹级诊断和机制基线。

英文摘要

Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of intermediate state evolution in low-capacity regimes remain underexplored. We study failure modes of collaborative reasoning with weak learners (4B--8B models) through the lens of noise accumulation. We introduce CoSee, an auditing framework that formalizes the read-write-verify loop to trace information flow in document visual question answering. Across multi-page, chart, and web-based benchmarks, we find a counter-intuitive degradation: naive shared workspaces often amplify hallucinations rather than resolve them. We identify two dominant failure modes: Noise Reinforcement, where ungrounded notes are reused as evidence, and Policy Collapse, where added context shifts the model toward under-specified, short-form answers. Using cost-accuracy Pareto frontiers, we show that increased compute can correlate negatively with performance without explicit verification. Our findings suggest that for resource-constrained agents, the bottleneck lies not in reasoning depth but in communication fidelity, providing trace-level diagnostics and a mechanistic baseline for reliable modular design.

2605.31349 2026-06-01 cs.CL cs.AI cs.CV cs.MM 版本更新

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

FBHM:用于仇恨模因检测的功能性基准测试与视觉语言模型引导

Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee

发表机构 * Indian Institute of Technology (IIT), Kharagpur(印度理工学院(IIT)卡拉格浦尔) Microsoft(微软)

AI总结 针对现有基准无法因果评估视觉语言模型漏洞的问题,提出基于25种修辞功能和10个目标社区构建的FBHM基准,并采用可学习引导向量(LSV)在极低数据量下提升模型性能约30个Macro-F1点。

详情
AI中文摘要

仇恨模因检测对于视觉语言模型仍是一个严峻挑战,因为现有基准在结构上是观察性的——混淆了修辞仇恨机制与目标社区特征,并阻碍了对模型漏洞的因果评估。为解决这一问题,我们引入了FBHM,一个系统策划的基于功能的仇恨模因基准,沿两个正交轴构建:25种不同的修辞功能和10个目标社区(总共5,000个模因)。对最先进的视觉语言模型进行基准测试揭示了一个严重的泛化差距:在标准数据集上高度准确的模型在FBHM上灾难性地下降到接近随机性能,证明它们利用了数据集特定的启发式方法而非稳健的多模态推理。为了高效缩小这一差距,我们提出了LSV(可学习引导向量),一种超低数据量策略,在仅500个引导样本(50个独特基础模因)上应用因果干预目标,将FBHM性能提升约30个Macro-F1点,同时优于上下文学习和PEFT,且不降低源域性能。

英文摘要

Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models highly accurate on standard datasets catastrophically drop to near-random performance on FBHM, proving they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), an ultra-low data regime strategy that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.

2605.31340 2026-06-01 cs.HC cs.AI 版本更新

Appropriateness of Empathy in AI: A Signal-Cost Perspective

AI中同理心的适当性:信号-成本视角

Chi-Ching Juan, Tao Wang, Harold Lee

发表机构 * School of Information University of Toronto(信息学院多伦多大学) Independent Researcher(独立研究者)

AI总结 本文从信号-成本视角出发,运用信号理论提出信号成本代理(情感丰富性、观点采择和情境定制)来评估AI同理心的适当性,建立多维度框架以系统评价同理心是否适应用户需求。

Comments Accepted by IEEE CASCON 2025

详情
AI中文摘要

AI中同理心的适当性已成为一个关键问题,因为过度同理心可能显得操纵性,而不足则显得冷漠。虽然先前研究探索了如何量化AI中的同理心,但很少有研究考察这种同理心在情境上是否适当。本文通过将信号理论应用于人机对话,引入了一种经济学视角。我们提出了信号成本代理(情感丰富性、观点采择和情境定制),分别映射到情感、认知和关联同理心。这一多维度框架使得能够系统评估同理心,不仅基于其存在,还基于其相对于用户需求的适当性。

英文摘要

The appropriateness of empathy in AI has emerged as a critical concern, as excessive empathy risks seeming manipulative while insufficient empathy appears dismissive. While prior research has explored how to quantify empathy in AI, few studies examine whether such empathy is contextually appropriate. This paper introduces an economic perspective by applying signaling theory to human-AI conversations. We propose Signal Cost Proxies (emotional richness, perspective-taking, and contextual tailoring) mapped to affective, cognitive, and associative empathy. This multidimensional framework enables systematic evaluation of empathy not just by presence, but by its appropriateness relative to user demand.

2605.31330 2026-06-01 cs.GT cs.AI cs.MA math.OC nlin.AO 版本更新

Social welfare optimisation under institutional reward and punishment

制度奖惩下的社会福利优化

Van An Nguyen, Vuong Khang Huynh, Huu Loi Bui, Hai Anh Ha, Quang Dung Le, Tan Dat Nguyen, Ngoc Ngu Nguyen, Zhao Song, Manh Hong Duong, Le Hong Trang, The Anh Han

发表机构 * Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Vietnam(胡志明市技术大学计算机科学与工程学院,越南) Vietnam National University - Ho Chi Minh City (VNU-HCM), Vietnam(越南胡志明市国家大学(VNU-HCM),越南) School of Computing, Engineering and Digital Technologies, Teesside University, United Kingdom(泰赛德大学计算、工程与数字技术学院,英国) School of Mathematics, University of Birmingham, Birmingham, United Kingdom(伯明翰大学数学学院,英国)

AI总结 研究在有限混合群体中,通过奖励合作者或惩罚背叛者来最大化社会福利的激励机制,推导出最优激励的显式表达式和相变条件,并比较奖励与惩罚的福利效果。

详情
AI中文摘要

制度激励被广泛用于促进从人类社会到多智能体和AI系统中自主、自利代理人的合作。现有工作通常将激励设计视为双目标问题:在实现高长期合作频率的同时最小化制度成本。此类方案是否也能最大化社会福利——即扣除制度支出后的总人口收益——在很大程度上尚未被探索。我们针对有限、充分混合的群体中参与社会困境(捐赠博弈和公共品博弈)的情况,开发了一个以福利为中心的激励框架,同时考虑对合作者的奖励和对背叛者的惩罚。对于每种机制,我们推导出预期社会福利的显式表达式,并刻画其如何依赖于激励效率和选择强度。在解析上,我们识别出社会福利具有单一最优激励水平的参数区间,以及出现定性相变、福利非单调且具有多个局部最优的区间。我们证明任何最大化福利的激励要么为零,要么集中在简单的闭式目标附近,并提供了一种高效算法来计算这些最优值。通过比较奖励和惩罚,我们进一步推导出在给定预算下奖励在福利方面优于惩罚的闭式条件。总体而言,我们的结果揭示了针对成本或合作频率优化的激励与最大化福利的激励之间存在系统性差距。

英文摘要

Institutional incentives are widely used to promote cooperation among autonomous, self-regarding agents, from human societies to multi-agent and AI systems. Existing work typically treats incentive design as a bi-objective problem: minimise institutional cost while achieving a high long-run frequency of cooperation. Whether such schemes also maximise social welfare - total population payoff net of institutional expenditure - has remained largely unexplored. We develop a welfare-centric framework for institutional incentives in finite, well-mixed populations playing a social dilemma (Donation Game and Public Goods Game), considering both rewards for cooperators and punishments for defectors. For each mechanism, we derive explicit expressions for expected social welfare and characterise how it depends on incentive efficiency and selection intensity. Analytically, we identify parameter regimes where social welfare has a single optimal incentive level and regimes with qualitative phase transitions, in which welfare becomes non-monotonic with multiple local optima. We prove that any welfare-maximising incentive is either zero or concentrated around a simple closed-form target, and we provide an efficient algorithm to compute these optima. Comparing reward and punishment, we further derive close-formed conditions under which reward outperform punishment in terms of social welfare for any given budget. Overall, our results reveal a systematic gap between incentives optimised for cost or cooperation frequency and those that maximise welfare.

2605.31324 2026-06-01 cs.LG cs.AI 版本更新

Inconsistency-Aware Minimization: Improving Generalization with Unlabeled Data

不一致感知最小化:利用无标签数据提升泛化能力

Hee-Sung Kim, Hyeonseong Kim, Sungyoon Lee

发表机构 * Department of Computer Science, Hanyang University, Seoul, Korea(汉阳大学计算机科学系)

AI总结 本文提出一种基于信息几何的局部不一致性度量,并据此设计不一致感知最小化(IAM)方法,通过无标签数据计算该度量并融入训练目标,从而提升深度学习模型的泛化性能。

Comments ICML 2026

详情
AI中文摘要

估计泛化差距并开发改进泛化的优化方法对于深度学习模型至关重要,无论是从理论理解还是实际应用角度。利用无标签数据实现这些目标在实际场景中具有显著优势。本文从神经网络参数空间的信息几何角度出发,引入了一种新的泛化度量——局部不一致性。局部不一致性的一个关键特征是它可以在没有显式标签的情况下计算。我们通过将局部不一致性与Fisher信息矩阵和损失Hessian矩阵联系起来,建立了理论基础。实验上,我们证明了局部不一致性与泛化差距相关。基于这些发现,我们提出了不一致感知最小化(IAM),将局部不一致性纳入训练目标。我们证明,在标准监督学习设置中,IAM增强了泛化能力,实现了与现有方法(如锐度感知最小化)相当的性能。此外,IAM在半监督和自监督学习场景中表现出有效性,其中局部不一致性是从无标签数据计算得出的。

英文摘要

Estimating the generalization gap and developing optimization methods that improve generalization are crucial for deep learning models, for both theoretical understanding and practical applications. Leveraging unlabeled data for these purposes offers significant advantages in real-world scenarios. This paper introduces a novel generalization measure, local inconsistency, derived from an information-geometric perspective on the parameter space of neural networks. A key feature of local inconsistency is that it can be computed without explicit labels. We establish theoretical underpinnings by connecting local inconsistency to the Fisher information matrix and the loss Hessian. Empirically, we demonstrate that local inconsistency correlates with the generalization gap. Based on these findings, we propose Inconsistency-Aware Minimization (IAM), which incorporates local inconsistency into the training objective. We demonstrate that in standard supervised learning settings, IAM enhances generalization, achieving performance comparable to that of existing methods such as Sharpness-Aware Minimization. Furthermore, IAM exhibits efficacy in semi- and self-supervised learning scenarios, where the local inconsistency is computed from unlabeled data.

2605.31308 2026-06-01 cs.AI 版本更新

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

TraceGraph: 用于诊断和改进智能体轨迹的共享决策景观

Junjie Nian, Kang Chen, Ge Zhang, Yixin Cao, Yugang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出TraceGraph图框架,将多模型智能体轨迹构建为共享决策景观,通过事件摘要和陷阱感知恢复管线提升SWE-bench解决率。

详情
AI中文摘要

智能体基准测试越来越多地记录丰富的交互轨迹,但评估通常将每次运行简化为通过率或奖励分数。我们引入了TraceGraph,一个基于图的框架,将发布的多模型智能体轨迹转化为共享决策景观。对于每个任务,TraceGraph在引入模型身份之前,从聚合的运行中构建一个关于可观察动作-观察状态的图。然后,它叠加结果信息丰富的生产核心和陷阱区域,并用三个事件总结每条轨迹:访问、陷阱暴露和修复。跨越五个基准测试分割的轨迹中,TraceGraph配置文件揭示了被聚合分数隐藏的导航差异,并显示不同分割在奖励避免陷阱还是从中恢复方面有所不同。相同的TraceGraph景观还激发了SWE-bench的陷阱感知恢复管线:运行时检测器在匹配历史陷阱区域的状态上触发,然后从相同前缀评估轻量级延续策略。在触发状态上,最佳聚合单因子策略将每个提供者触发子集上的官方解决率从40.4%提高到43.5%,在共同触发实例上从41.0%提高到44.8%,并具有提供者特定的主动组件。总体而言,TraceGraph提供了一个过程词汇,用于询问智能体基准测试测试什么、模型在共享景观上何处出现分歧,以及失败区域如何指导下游改进。

英文摘要

Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For each task, TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions, and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across trajectories spanning five benchmark splits, TraceGraph profiles reveal navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same TraceGraph landscape also motivates a trap-aware recovery pipeline for SWE-bench: aruntime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances, with provider-specific active components. Overall, TraceGraph provides a process vocabulary for asking what agent benchmarks test, where models diverge on a shared landscape, and how failure regions can guide downstream improvement.

2605.31295 2026-06-01 cs.SD cs.AI cs.IR cs.LG 版本更新

Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation

通过激活引导实现潜在空间解缠:符号音乐生成中可解释的属性控制

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis

发表机构 * Athens University of Economics Innovation Lab Orfium Athens, Greece Department of Music Technology Acoustics Hellenic Mediterranean University Rethymno, Greece Institute of Informatics \& Telecommunications National Center for Scientific Research “Demokritos” Athens, Greece Department of Informatics Athens University of Economics

AI总结 本文利用差分均值方法从多轨音乐Transformer的残差流中分离音高和时长的潜在方向,并通过Gram-Schmidt正交化实现双属性引导,从而在推理时实现可解释的确定性属性调制。

Comments Accepted at EUSIPCO 2026 (34th European Signal Processing Conference), 5 pages, 2 figures

详情
AI中文摘要

基于Transformer的架构在生成复杂符号序列方面取得了显著进展,但在实现对离散信号属性的细粒度、可解释控制方面仍存在显著差距。本文研究了多轨音乐Transformer(MMT)的机制可解释性,并提出了一种无需重新训练的确定性属性调制框架,通过推理时的激活引导来弥合这一差距。利用差分均值(DiffMean)方法,我们在残差流中分离了信号属性(特别是音高和时长)的潜在方向。我们验证了该领域的线性表示假设,实现了引导幅度与属性偏移之间的高相关性。为了解决多属性引导中固有的特征纠缠问题,我们引入了一种利用Gram-Schmidt正交化的双引导框架。实验结果表明,与简单的向量加法相比,这种几何解耦减少了概念干扰和信号退化,即使在强自回归条件下也能实现独立的确定性控制。

英文摘要

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.

2605.31289 2026-06-01 cs.LG cs.AI 版本更新

The Terminal Representation in Reinforcement Learning

强化学习中的终端表示

Amir Esterhuysen, Anders Jonsson

发表机构 * Dept. Information and Communication Technologies(信息与通信技术系) Universitat Pompeu Fabra(庞培法布拉大学)

AI总结 提出终端表示(TR),一种无需特征分解即可直接用于下游任务且计算开销更低的奖励加权状态表示方法。

详情
AI中文摘要

表示学习是强化学习(RL)中用于时空抽象的强大工具。两种成熟的方法是通过后继表示(SR)和默认表示(DR)。SR通过状态引发的未来轨迹对其进行编码,捕获与奖励解耦的信息流。DR在此基础上用奖励加权轨迹,将信用分配结构整合到表示中。两种表示的特征向量已被用于支持一系列下游任务——包括选项发现、奖励塑造、迁移学习和探索。我们引入了一种结构不同的公式:终端表示(TR)。TR类似于DR对奖励加权轨迹进行编码,但可以作为更低维度的对象进行学习,并且可以直接用于上述应用而无需特征分解。特征分解还施加了对称转移动力学的假设,而TR可以绕过这一点。在这项工作中,我们发展了TR的理论基础:其推导、两种学习算法的收敛性、其在零样本组合性中的使用,以及替代奖励公式之间的等价性。我们进一步表明TR嵌入在顶部DR特征向量中,使其无需特征分解即可捕获相同的基础知识。此外,我们提供了经验证据,证明TR在辅助应用中作为现有表示的可行替代方案,同时在学习、存储和使用方面需要更少的计算开销。

英文摘要

Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approaches are through the successor representation (SR) and the default representation (DR). The SR encodes states by the future trajectories they induce, capturing information flow decoupled from reward. The DR builds on this by weighting trajectories with reward, integrating credit-assignment structure into the representation. Eigenvectors of both representations have been used to support a range of downstream tasks -- including option discovery, reward shaping, transfer learning, and exploration. We introduce a structurally distinct formulation: the terminal representation (TR). The TR encodes reward-weighted trajectories similarly to the DR, but can be learned as a lower-dimensionality object, and can be used directly for the mentioned applications without eigenvector computations. Eigendecomposition also imposes the assumption of symmetric transition dynamics, which the TR can bypass. In this work we develop the theoretical foundations of the TR: its derivation, convergence of two learning algorithms, its use for zero-shot compositionality, and equivalences between alternative reward formulations. We further show the TR is embedded in the top DR eigenvector, allowing it to capture the same underlying knowledge without eigendecomposition. Additionally, we provide empirical evidence of the TR as a viable alternative to existing representations in subsidiary applications, while requiring less computational overhead to learn, store, and use.

2605.31287 2026-06-01 cs.CY cs.AI cs.HC 版本更新

Neither Replacement nor Panacea: Comparing LLM-Based Conversational and Graphical Decision Support in Industrial Tasks

既非替代品也非万能药:比较基于LLM的对话式与图形化决策支持在工业任务中的应用

Roberto Figliè, Simone Caputo, Alan Serrano, Daria Mikhaylova, Tommaso Turchi, Daniele Mazzei

发表机构 * Department of Computer Science, University of Pisa(比萨大学计算机科学系) Department of Computer Science, Brunel University of London(伦敦布鲁内尔大学计算机科学系)

AI总结 通过混合因子实验,比较基于LLM的对话式界面与仪表盘在工业决策支持中的效果,发现对话界面在低复杂度任务中降低认知负荷和加快完成时间,但优势随任务复杂度增加而消失,且未提高决策准确性。

详情
AI中文摘要

制造业环境中的管理者依赖数字界面解读运营数据以进行决策,但不断增长的数据量和复杂性使得高效识别相关洞察变得困难。虽然仪表盘在工业环境中仍占主导地位,但通过对话式用户界面(CUI)访问的基于大型语言模型(LLM)的对话代理(CA)可能提供更直接的数据访问。然而,其有效性可能取决于任务的信息处理需求。本研究在制造决策支持场景中比较了通过CUI提供的基于LLM的CA与仪表盘。在一个2x3设计的混合因子实验中,134名工业决策者被分配到一种界面条件,并完成三个复杂度递增的任务。我们考察了感知心理负荷(MWL)、决策准确性、完成时间和预期依赖,并测试了自我报告的数据素养作为调节变量。结果显示,CUI总体上降低了感知MWL,并在低要求任务中支持更快的完成,但随着任务复杂度增加,这两个优势均减弱。两种界面在决策准确性上均未产生一致的整体优势,且CUI不被偏好作为后续决策的唯一基础。此外,数据素养并未可靠地调节界面效应。这些发现表明,对话式交互为工业决策支持提供的是有条件而非普遍的好处。基于LLM的CA可能减少信息访问努力,而复杂决策仍然受益于持久、可检查的视觉表示。

英文摘要

Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards remain dominant in industrial contexts, Large Language Model (LLM)-based conversational agents (CAs), accessed through conversational user interfaces (CUIs), may provide more direct access to such data. However, their effectiveness may depend on the information-processing demands of the task. This study compares an LLM-based CA delivered through a CUI with a dashboard in a manufacturing decision-support scenario. In a mixed factorial experiment with a 2x3 design, 134 industrial decision-makers were assigned to one interface condition and completed three tasks of increasing complexity. We examined perceived Mental Workload (MWL), decision accuracy, completion time, and intended reliance, and tested self-reported data literacy as a moderator. Results showed that the CUI reduced perceived MWL overall and supported faster completion in less demanding tasks, but both advantages diminished as task complexity increased. Neither interface produced a consistent overall advantage in decision accuracy, and the CUI was not preferred as a sole basis for subsequent decisions. Furthermore, data literacy did not reliably moderate interface effects. These findings indicate that conversational interaction offers conditional rather than universal benefits for industrial decision support. LLM-based CAs may reduce information-access effort, whereas complex decisions continue to benefit from persistent, inspectable visual representations.

2605.31284 2026-06-01 cs.CV cs.AI 版本更新

SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy

SAM 用于荧光显微镜中鲁棒的线粒体实例分割

Suyog Jadhav, Dilip K. Prasad, Krishna Agarwal

发表机构 * UiT The Arctic University of Norway(UiT北极大学)

AI总结 通过仅在合成荧光显微镜数据上微调 SAM,解决了真实数据稀缺问题,提高了线粒体实例分割的精度和平均 Dice 分数。

Comments Accepted at PHAROS-AIF-MIH workshop @ CVPR 2026

详情
AI中文摘要

荧光显微镜(FM)中线粒体的形态分析对于理解细胞健康、能量产生和代谢调节至关重要。虽然像 Segment Anything Model (SAM) 这样的基础模型已经革新了自然图像分割,但由于衍射受限分辨率、低对比度和复杂的重叠细胞器网络,它们直接应用于 FM 受到显著领域偏移的阻碍。此外,鲁棒模型的开发因严重缺乏高质量、手动标注的线粒体实例分割数据集而受阻。在本文中,我们提出了一种可扩展的解决方案,通过仅在合成生成的 FM 数据上微调 SAM 来解决数据稀缺问题。我们模拟真实的线粒体数据并模拟荧光显微镜的光学特性,以创建大规模标注数据集。我们在一个精心策划的真实手动标注 FM 图像数据集上评估了我们的微调模型。定性和定量分析表明,我们的合成微调模型在精度和平均 Dice 分数上优于强基线。这项工作确立了模拟辅助训练在 FM 实例分割中的潜力。

英文摘要

The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production, and metabolic regulation. While foundation models like the Segment Anything Model (SAM) have revolutionized natural image segmentation, their direct application to FM is hindered by a significant domain shift characterized by diffraction-limited resolution, low contrast, and complex overlapping organelle networks. Furthermore, the development of robust models is bottlenecked by a severe lack of high-quality, manually annotated instance segmentation datasets for mitochondria. In this paper, we propose a scalable solution to this data scarcity by finetuning SAM exclusively on synthetically generated FM data. We simulate realistic mitochondria data and emulate the optical properties of fluorescence microscopes to create a large-scale annotated dataset. We evaluate our fine-tuned model on a curated dataset of real, manually annotated FM images. Qualitative and quantitative analyses demonstrate that our synthetically fine-tuned model improves precision and average dice score over strong baselines. This work establishes the potential of simulation-assisted training for FM instance segmentation.

2605.31279 2026-06-01 eess.SP cs.AI cs.NI 版本更新

Practical Cross-Band Channel Prediction for AI-RAN via Physics-Guided Deep Unfolding

面向AI-RAN的实用跨频段信道预测:基于物理引导的深度展开

Ruiqi Kong, He Chen, Xiaojun Lin

发表机构 * Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China(香港大学信息工程系)

AI总结 提出GUIDE框架,通过将无线信道物理嵌入可微层,实现跨频段信道预测的泛化与实时推理,在未见环境中波束赋形增益比深度学习基线FIRE高2.75倍,比模型基线R2F2高1.39倍且速度快1610倍以上。

Comments 2 pages

详情
AI中文摘要

为了使跨频段信道预测对AI原生RAN实用化,算法必须能够泛化到不同环境并支持实时推理。现有方法只能实现其中之一。为弥合这一差距,我们引入了GUIDE,一种物理引导的深度展开框架,将无线信道物理嵌入到可微层中。在未见环境中无需重新训练,GUIDE的波束赋形增益比基于深度学习的基线FIRE高2.75倍,且推理时间仅略有增加;比最强的基于模型的基线R2F2的波束赋形增益高1.39倍,同时运行速度快1610倍以上。

英文摘要

To make cross-band channel prediction practical for AI-native RAN, algorithms must generalize across diverse environments and support real-time inference. Existing approaches achieve one but not both. To bridge this gap, we introduce GUIDE, a physics-guided deep unfolding framework that embeds wireless channel physics into differentiable layers. Without retraining in unseen environments, GUIDE achieves 2.75x beamforming gain than the deep learning-based baseline FIRE with only a slight increase in inference time, and 1.39x beamforming gain than the strongest model-based baseline R2F2 while running over 1610x faster.

2605.31275 2026-06-01 cs.HC cs.AI 版本更新

Personalized to Persuade: The Effects of Contextualization and Warmth on Trust and Reliance in Conversational AI

个性化以说服:情境化和温暖对对话式AI中信任与依赖的影响

Mert Yazan, Suzan Verberne, Frederik Bungaran Ishak Situmeang

发表机构 * Amsterdam University of Applied Sciences(阿姆斯特丹应用科学大学) University of Leiden(莱顿大学)

AI总结 通过2x2被试间实验(N=380),研究情境化与对话温暖如何交互影响AI助手在反驳专家建议时的说服力与用户依赖,发现情境化降低说服力但与温暖结合通过交叉交互恢复,且AI素养解耦信任与行为。

详情
AI中文摘要

人工智能(AI)代理通过根据用户的背景、兴趣和先前交互来定制解释,即情境化,从而个性化其响应。个性化已被视为政治或营销中的说服策略。然而,在用户通常缺乏先验知识的日常任务中,情境化的说服效果仍不明确。我们进行了一项2×2被试间实验(N=380),研究情境化与对话温暖相结合如何影响AI助手在反驳专家建议时的依赖性和说服力。我们的发现表明,情境化降低了AI的说服力,但其与温暖的结合通过交叉交互恢复了说服力。对AI的依赖在所有条件下都存在,并且不受对话设计的影响。信任强烈预测说服力和依赖,但情境化和温暖都不通过信任起作用。AI素养解耦了信任与行为:素养更高的用户对助手报告的信任较低,但更易被说服且更依赖其建议。这些结果表明,用户倾向于依赖AI代理而非人类专家判断;然而,界面级别的对话设计选择在塑造行为方面的作用有限。

英文摘要

Artificial Intelligence (AI) agents personalize their responses by tailoring explanations to users' backgrounds, interests, and prior interactions, referred to as contextualization. Personalization has been identified as a persuasive strategy in politics or in marketing. However, the persuasive effect of contextualization in everyday tasks, where users often lack prior knowledge, remains unclear. We conducted a $2\times2$ between-subjects experiment ($N = 380$) examining how contextualization, combined with conversational warmth, shapes reliance and persuasiveness of an AI assistant arguing against expert recommendations. Our findings reveal that contextualization reduces the persuasive power of AI, but its combination with warmth restores persuasiveness through a crossover interaction. Reliance on AI is present across conditions and is invariant to the conversational design. Trust strongly predicts both persuasion and reliance, yet neither contextualization nor warmth operates through trust. AI literacy decouples trust from behavior: more literate users report lower trust in the assistant, yet are more persuaded and more reliant on its advice. These results suggest that users are prone to deferring to AI agents over human expert judgment; however, interface-level conversational design choices have a limited role in shaping the behavior.

2605.31266 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation

超越少数:用于少样本非典型布局到图像生成的解耦语义与基元

Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li

发表机构 * State Key Laboratory of Virtual Reality Technology and Systems(虚拟现实技术与系统国家重点实验室) School of Computer Science and Engineering(计算机科学与工程学院) Qingdao Research Institute, Beihang University, China(北京航空航天大学青岛研究所,中国)

AI总结 针对少样本非典型布局到图像生成中表示碎片化问题,提出通过语义锚定和基元注入解耦语义与视觉细节,实现鲁棒少样本适应。

Comments Accepted to ICML 2026; code available at https://github.com/iCVTEAM/DSP

详情
AI中文摘要

布局到图像(L2I)任务通过对象类别和空间布局实现对图像生成的细粒度控制。然而,现有的L2I方法在少样本非典型设置下会产生碎片化和扭曲的生成结果。我们将这种失败称为表示碎片化,源于将语义身份与视觉细节纠缠在一起的粒度不匹配。为了解决这个问题,我们提出了一种表示驱动的框架,将语义与基元解耦,以实现鲁棒的少样本适应。具体来说,语义锚定将类别语义聚合到锚点中以实现稳定的身份,而基元注入则建模可重新组合的基元以实现鲁棒的局部细节建模。概念引导进一步通过显著性感知目标调节优化,以保持前景语义一致性。大量实验表明,在5样本设置下,我们的方法在视觉保真度和跨不同非典型领域的对齐方面,均优于最先进的L2I方法。源代码公开于 https://github.com/iCVTEAM/DSP。

英文摘要

The layout-to-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few-shot atypical settings. We term this failure as representation fragmentation, arising from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation-driven framework that disentangles semantics from primitives for robust few-shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency-aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5-shot regime over state-of-the-art L2I methods in both visual fidelity and alignment across diverse atypical domains. The source code is publicly available at https://github.com/iCVTEAM/DSP.

2605.31264 2026-06-01 cs.AI cs.CL cs.LG 版本更新

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

COLLEAGUE.SKILL: 通过专家知识蒸馏实现自动化AI技能生成

Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao, Xia Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出一个从异构痕迹到可检查、可修正、可代理使用的技能包的自动化蒸馏系统,用于生成基于人的AI技能。

Comments 12 pages, 4 figures

详情
AI中文摘要

LLM代理不仅被期望完成孤立的任务,还要承载人类专业知识、判断和互动风格的有限表示。构建这种基于人的代理仍然困难,因为与人或角色相关的可操作知识通常嵌入在异构痕迹中,而不是写成清晰的指令。现有的记忆和角色系统捕捉了这些证据的片段,而技能框架提供了可移植的打包格式;然而,没有端到端的工作流将这些痕迹蒸馏成可检查、可修正和代理可用的技能。我们提出了一个自动化的痕迹到技能蒸馏系统,通过专家知识蒸馏生成基于人的AI技能。给定目标人物或角色的材料,COLLEAGUE.SKILL 生成一个版本化的技能包,包含两个协调的轨道:一个能力轨道,用于实践、心理模型和决策启发式;一个边界行为轨道,用于沟通风格、互动规则和修正历史。该包可以被检查、调用、通过自然语言反馈更新、回滚、跨代理主机安装,并可选择性地为受控分发做准备。我们描述了开源系统中实现的人工制品契约、生成工作流、修正生命周期、部署表面和领域预设。在撰写本文时,公共仓库拥有约18.5k个GitHub星标;画廊列出了来自165位贡献者的215个技能,以及跨列出的技能卡累计超过10万个星标。该系统说明了基于人的技能如何表示为可移植、可修正的包,而不是不透明的提示或隐藏的记忆。

英文摘要

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.

2605.31261 2026-06-01 cs.LG cs.AI stat.ML 版本更新

Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

为什么线性循环记忆在部分可观测强化学习中有效

Yike Zhao, Onno Eberhard, Malek Khammassi, Ali H. Sayed, Michael Muehlebach

发表机构 * EPFL(苏黎世联邦理工学院) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所)

AI总结 本文通过构造两种线性滤波器,从理论上证明了线性循环神经网络在部分可观测强化学习中作为记忆单元的有效性,并扩展到动作控制的隐马尔可夫模型。

详情
AI中文摘要

线性循环神经网络家族在部分可观测强化学习中作为循环记忆单元表现出色。我们通过构造并研究两种线性滤波器为其经验有效性提供了理论依据:(i) 第一种在确定性转移矩阵下精确重现隐马尔可夫模型(HMM)中信念向量的预softmax logits,从而作为最优策略学习的充分统计量;(ii) 第二种在近似确定性转移矩阵下实现状态解码误差趋近于零,从而将状态模糊性降至接近零。结果扩展到动作控制的HMM,其中相应的线性滤波器变为随时间变化且依赖于动作的动态。我们通过数值实验说明了主要结果,并进一步展示了所构造的线性滤波器在小型强化学习游戏中作为强特征提取器的能力。

英文摘要

The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning, (ii) the second achieves vanishing state-decoding error under a nearly deterministic transition matrix, thus reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.

2605.31254 2026-06-01 cs.AI 版本更新

Formalizing and falsifying causal pathways of rare events

罕见事件因果路径的形式化与证伪

Anahita Haghighat, Dominik Janzing

发表机构 * Amazon Research(亚马逊研究)

AI总结 本文在结构方程模型中罕见事件根因分析的形式化基础上,提出因果路径的形式定义并讨论其可检验含义,引入罕见事件因果路径的抽象以桥接简单因果解释与详细因果建模。

Comments accepted for ICML 2026

详情
AI中文摘要

基于最近在结构方程模型中对罕见事件(“异常值”)根因分析的形式化,我们提出了因果路径的形式定义并讨论了其可检验含义。我们识别了这些含义仅依赖于由罕见事件路径定义的因果抽象而非底层系统完整因果图的条件。据此,我们引入了一种因果结构到罕见事件路径的抽象,该抽象桥接了简单的口头因果解释与详细的因果建模。

英文摘要

Building on recent formalizations of root cause analysis for rare events (``outliers'') in structural equation models, we propose a formal definition of a causal pathway and discuss its testable implications. We identify conditions under which these implications depend only on a causal abstraction defined by the pathway of rare events, rather than on the full causal graph of the underlying system. Accordingly, we introduce an abstraction of causal structure to pathways of rare events that bridges simple verbal causal explanations and detailed causal modeling.

2605.31251 2026-06-01 cs.CV cs.AI 版本更新

ERGeoBench:A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

ERGeoBench:多模态大语言模型中具身推理与地理定位的综合基准

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室) School of Materials Science and Engineering(材料科学与工程学院) China Mobile Research Institute(中国移动研究院) College of Computing and Data Science(计算与数据科学学院)

AI总结 提出ERGeoBench基准,通过单视图、全景视图和具身视图三种渐进设置评估多模态大语言模型在视觉驱动的具身地理定位中的能力,发现当前模型在高层次地理语义推理上表现良好,但在细粒度感知、度量定位和视图间空间一致性上仍有不足。

详情
AI中文摘要

多模态大语言模型(MLLMs)作为具身代理展现出强大潜力,然而由于缺乏细粒度评估,具身地理定位仍未被充分探索。我们引入ERGeoBench,一个用于视觉驱动的具身地理定位的诊断基准。ERGeoBench在三种渐进设置下评估模型——单视图、全景视图和具身视图——其中代理可以通过偏航、俯仰和缩放的顺序变化主动获取观察。该基准包含2,207个全球分布的街景全景图,并衡量四种互补能力:基础感知、空间意识、常识推理和地理定位推理。对领先的专有和开源MLLMs的评估表明,当前模型能够推断高层次的地理语义,但在细粒度感知操作、度量定位和跨视图空间一致性方面仍然困难。我们进一步观察到,地理定位与其他能力维度强相关,表明准确定位依赖于集成的感知、空间推理和常识推理,而非孤立的视觉识别。总体而言,ERGeoBench为诊断和推进类人具身地理定位提供了一个统一框架。项目页面:https://kaixuewen.github.io/ERGeoBench/

英文摘要

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/

2605.31250 2026-06-01 stat.ML cs.AI cs.LG 版本更新

Entropic Projection Alignment: Estimating, Explaining, and Improving Model Performance Under Distribution Shift

熵投影对齐:估计、解释和改进分布偏移下的模型性能

Salim I. Amoukou, Emanuele Albini, Tom Bewley, Saumitra Mishra, Manuela Veloso

发表机构 * J.P. Morgan AI Research(摩根大通AI研究所)

AI总结 提出熵投影对齐(EPA)方法,通过匹配选定矩并最小化KL散度来对齐源分布与目标分布,从而统一解决分布偏移下的性能估计、解释和改进问题。

Comments Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)

详情
AI中文摘要

我们提出了一个统一框架,用于解决分布偏移的三个关键挑战:(1)估计模型在未标记目标域上的性能,(2)通过识别导致偏移的特征来解释偏移,以及(3)提高目标域性能。我们的方法,熵投影对齐(EPA),通过匹配精心选择的矩同时最小化与源分布的KL散度,将源分布与目标分布对齐。该公式为重要性权重提供了唯一的闭式解,通过隐式方差控制实现鲁棒性。借鉴领域适应理论,我们证明矩匹配足以实现可靠的估计和适应,避免了完全密度比恢复的需要。大量实验以及强有力的理论保证表明,EPA在提供显著计算效率的同时,始终优于最先进的基线方法。

英文摘要

We propose a unified framework for addressing three key challenges of distribution shift: (1) estimating a model's performance on an unlabeled target domain, (2) explaining the shift by identifying the features responsible, and (3) improving the target domain performance. Our method, Entropic Projection Alignment (EPA), aligns the source distribution to the target by matching carefully selected moments while simultaneously minimising the KL divergence from the source. This formulation yields a unique closed-form solution for importance weights, achieving robustness through implicit variance control. Drawing on domain adaptation theory, we establish that moment matching is sufficient for reliable estimation and adaptation, avoiding the need for full density ratio recovery. Extensive experiments, together with strong theoretical guarantees, demonstrate that EPA consistently outperforms state-of-the-art baselines while offering substantial computational efficiency.

2605.31249 2026-06-01 cs.LG cs.AI 版本更新

Learning Cardiac Latent Representations in Vectorcardiogram Space

在向量心电图空间中学习心脏潜在表示

Bosong Huang, Panzhen Zhao, Zengxiang Li, Patricia Lee, Wei Jin, Alan Wee-Chung Liew, Ming Jin, Shirui Pan

发表机构 * Griffith University, Australia(格里菲斯大学) SingHealth Duke-NUS AI in Medicine Institute, Singapore(新加坡SingHealth Duke-NUS医学人工智能研究所) Emory University, USA(埃默里大学)

AI总结 针对标准十二导联心电图表示学习中的冗余和过拟合问题,提出基于Frank向量心电图模型的LVCG框架,在物理潜在空间中学习视图不变的心脏电活动表示,提升鲁棒性和泛化能力。

详情
AI中文摘要

心电图(ECG)是心脏评估的基石,学习信息丰富的ECG表示对于从疾病诊断到临床报告生成等任务至关重要。然而,现有方法几乎完全在可观测的ECG信号空间中操作。实际上,标准十二导联ECG代表了同一心脏电活动在不同空间方向上的多个投影。因此,在ECG空间中进行表示学习不可避免地引入了大量冗余,可能导致虚假相关性和过拟合风险增加。为了解决这个问题,受Frank向量心电图(VCG)模型启发,我们提出直接在VCG空间中学习心脏电活动的统一潜在表示。我们引入了LVCG,这是第一个设计用于在此物理基础潜在空间中运行的通用自监督表示学习框架。通过学习视图不变的潜在VCG表示而非导联特定伪影,LVCG最小化了冗余并提高了泛化能力。LVCG在各项任务中普遍优于ECG空间基线,展现出增强的鲁棒性和泛化能力,尤其在领域偏移设置中。

英文摘要

Electrocardiography (ECG) is a cornerstone of cardiac assessment, making the learning of informative ECG representations fundamental to tasks ranging from disease diagnosis to clinical report generation. However, existing methods operate almost exclusively in the observable ECG signal space. In practice, the standard twelve-lead ECG represents multiple projections of the same underlying cardiac electrical activity from different spatial orientations. Therefore, representation learning in the ECG space inevitably introduces substantial redundancy, which may lead to spurious correlations and increased risk of overfitting. To address this and motivated by the Frank vectorcardiogram (VCG) model, we propose learning a unified latent representation of cardiac electrical activity directly in the VCG space. We introduce LVCG, the first general self-supervised representation learning framework designed to operate in this physically grounded latent space. By learning view-invariant latent VCG representations rather than lead-specific artifacts, VCG minimizes redundancy and improves generalization. LVCG generally outperforms ECG-space baselines across tasks, demonstrating enhanced robustness and generalization, especially in domain shift settings.

2605.31239 2026-06-01 stat.ML cs.AI cs.LG 版本更新

Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference

通过随时有效推断纠正在线决策树中的分裂选择

Salim I. Amoukou, Saumitra Mishra, Manuela Veloso

发表机构 * J.P. Morgan AI Research(摩根大通AI研究)

AI总结 针对在线决策树分裂选择缺乏有效统计保证的问题,提出基于随时有效推断的方法,实现任意数据流下错误分裂的随时有效控制、预测优势下的有限承诺时间,并在平稳独立同分布数据下保证风险单调递减且每次分裂严格改善。

Comments Accepted as a Spotlight at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于装袋的集成方法,尤其是自适应随机森林,是数据流学习中最强的表现者之一。这些方法的共同点是依赖霍夫丁树作为基学习器,通过使用浓度不等式测试候选分裂是否显著优于其替代方案来增量式地构建决策树。尽管经验成功,现有变体缺乏有效的统计保证。当前分析依赖于固定样本浓度界,而分裂决策使用数据依赖的停止规则,这使其保证无效,并可能将错误分裂的概率推向1。我们引入了一种基于随时有效推断的原则性替代方案。我们的方法提供:(i) 在任意数据流(包括非平稳设置)下对错误分裂的随时有效控制;(ii) 在预测优势下的有限承诺时间;(iii) 在平稳独立同分布数据下,风险单调递减且每次分裂严格改善。在经验上,我们评估了独立树及其在非平稳流中在自适应随机森林中的使用。我们的方法提高了性能,同时生成了更小的树。

英文摘要

Bagging-based ensembles, most notably Adaptive Random Forests, are among the strongest performers for learning from data streams. A common denominator across these methods is their reliance on Hoeffding Trees as base learners, which grow decision trees incrementally by testing whether a candidate split is significantly better than its alternatives using concentration inequalities. Despite their empirical success, existing variants lack valid statistical guarantees. Current analyses rely on fixed-sample concentration bounds, while split decisions are made using data-dependent stopping rules, which invalidates their guarantees and can drive the probabilty of incorrect splits to one. We introduce a principled alternative based on anytime-valid inference. Our method provides: (i) anytime-valid control of false splits under arbitrary data streams, including non-stationary settings; (ii) finite commitment time under a predictive advantage; and (iii) under stationary i.i.d. data, risk is monotone decreasing and strictly improves at every split. Empirically, we evaluate both standalone trees and their use within Adaptive Random Forests on non-stationary streams. Our method improves performance while producing substantially smaller trees.

2605.31229 2026-06-01 cs.CV cs.AI 版本更新

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

超越分类:面向持续多模态检索的动态适配器路由

Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski

发表机构 * NASK National Research Institute(NASK国家研究院) IDEAS Research Institute(IDEAS研究所) Warsaw University of Technology(华沙技术大学) Universitat Autonoma de Barcelona(巴塞罗那自治大学)

AI总结 针对持续多模态检索(CMR)任务,提出基于原型路由和模型合并的动态适配器路由(DAR)方法,在跨域评估中取得优于现有基线的性能。

详情
AI中文摘要

虽然检索是视觉-语言模型的核心功能,但持续更新这些模型用于检索任务仍未被充分探索。现有工作通常通过类增量学习(CIL)的视角处理持续检索,在可能无法完全捕捉检索特定动态的设置中评估标准CIL方法和面向检索的适应方法。为了解决这一问题,我们引入了一个新的、原则性的持续多模态检索(CMR)评估框架,涵盖多样化的视觉领域,并在此设置中系统评估常见方法。我们的实证分析表明,标准CIL方法在我们更具挑战性的场景中未能产生有意义的增益。因此,我们提出了动态适配器路由(DAR),一种基于通过原型路由选择适配器并通过模型合并组合的新方法。DAR在先前基线上取得了优越性能,并在分布外评估中展现出强大的泛化能力。我们的结果凸显了CMR的独特挑战,并鼓励在该方向进行进一步研究。

英文摘要

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture the retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model merging.DAR achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. Our results highlights the unique challenges of CMR and encourages further research in this direction.

2605.31228 2026-06-01 cs.LG cs.AI 版本更新

EchoRL: Reinforcement Learning via Rollout Echoing

EchoRL:通过回滚回响进行强化学习

Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, Volker Tresp, Yunpu Ma

发表机构 * Munich Center for Machine Learning(慕尼黑机器学习中心) Huawei Heisenberg Research Center(华为海森堡研究所以) University of Arizona(亚利桑那大学) College of Computing(计算学院) Data Science, Nanyang Technological University, Singapore(数据科学,南洋理工大学,新加坡) MemAgents Lab(MemAgents实验室)

AI总结 针对RLVR训练中优势退化问题,提出EchoRL模块,通过从成功回滚中提取EchoClip作为辅助监督信号,持续提升训练性能。

Comments ICML 2026

详情
AI中文摘要

基于可验证奖励的强化学习是增强大语言模型推理能力的有效后训练方法。然而,随着训练进行,学习信号可能崩溃,导致训练收益变得微弱且无效。具体而言,越来越多的提示回滚出现优势退化:所有自生成回滚均显示验证成功,使得其奖励的标准差为零;相应地,每个回滚的优势也退化为零。由于这些回滚的优势为零,用于模型优化的策略梯度最终消失,限制了训练性能。我们认为,其中一些回滚仍然包含有价值的学习信号,但不幸被现有RLVR方法忽略。本文受外部专家模型生成的金色轨迹的熵模式分析启发,提出EchoRL以更好地利用优势退化的回滚来进一步提升训练性能。EchoRL是一个轻量级模块,首先根据逐步熵值从验证成功的回滚中识别出EchoClip,然后将该片段作为辅助监督信号反馈到RL目标中。在10个基准、5个LLM骨干网络和4种流行RLVR后训练方法上的大量实验表明,EchoRL能够以最小开销持续改进RLVR后训练。

英文摘要

Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse thus makes the training gain become marginal and ineffective. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts show verified-success, making the standard deviation over their rewards be zero; accordingly each rollout's advantage becomes degenerated (zero) as well. Given such rollouts' advantages, the policy-gradient for model optimization eventually vanishes, capping the training performance. We argue that some of these rollouts still contain valuable learning signals but unfortunately omitted with the existing RLVR methods. In this paper, inspired through analyzing the entropy pattern behind golden trajectories produced by external expert models, we propose EchoRL for better exploiting the advantage-degenerated rollouts to further improve the training performance. EchoRL is a lightweight module that first identifies an EchoClip from verified-success rollouts based on their step-level entropy values, and then feeds this clip back as an auxiliary supervision signal in the RL objective. Extensive experiments across 10 benchmarks, 5 LLM backbones, and 4 popular RLVR post-training methods demonstrate that EchoRL consistently improves RLVR post-training with minimal overhead.

2605.31226 2026-06-01 cs.LG cs.AI 版本更新

What changes after deployment? A survey on On-device Learning in TinyML

部署后发生了什么变化?TinyML中设备端学习综述

Massimo Pavan, Luca Pezzarossa, Fabrizio Pittorino, Manuel Roveri, Xenofon Fafoutis

发表机构 * Technical University of Denmark (DTU)(丹麦技术大学)

AI总结 本文针对微控制器级设备上的机器学习模型,系统综述了约70篇设备端学习(ODL)工作,基于分布变化类型分析其对应用、硬件和解决方案的影响,并指出方法论基准与现实部署之间的差距。

详情
AI中文摘要

微控制器级设备上的机器学习模型(TinyML)面临一个根本性挑战:部署后的分布变化会破坏静态模型。设备端学习(ODL)通过直接在设备上运行学习过程来解决这一问题。现有文献尚未描述分布变化如何发生,以及不同类型的变化需要不同的解决方案。本文基于分布变化类型这一原则,综述了约70篇ODL工作。调查分析了不同类型的分布变化如何影响可寻址的设备端应用、所使用的硬件以及解决方案的结构。还指出了方法论基准与现实部署场景之间持续存在的差距。

英文摘要

Machine learning models on microcontroller-class devices (TinyML) face a fundamental challenge: post-deployment distribution change undermines static models. On-device learning (ODL) addresses this by running the learning process directly on the device. The existing literature has not characterized how distribution change occurs or how different change types require different solutions. Approximately 70 ODL works are surveyed under one principle: the distribution change regime. The survey analyzes how different types of distribution change influence the applications addressable on-device, the hardware employed, and the structure of the solutions. A persistent gap between methodological benchmarks and real-world deployment scenarios is also identified.

2605.31224 2026-06-01 cs.CY cs.AI cs.HC 版本更新

Comparing LLM-Based Conversational and Graphical Interfaces for Industrial Decision Tasks: An Exploratory Mixed-Methods Study

基于LLM的对话式与图形化界面在工业决策任务中的比较:一项探索性混合方法研究

Roberto Figliè, Simone Caputo, Alan Serrano, Tommaso Turchi, Daniele Mazzei

发表机构 * Department of Computer Science(计算机科学系) University of Pisa(比萨大学) Department(部门) San Matteo Hospital(圣玛泰奥医院) Brunel University of London(伦敦布鲁内尔大学)

AI总结 通过混合方法研究,比较了基于LLM的对话式界面与图形化仪表盘在工业决策任务中的表现,发现对话式界面可减少交互努力,但仪表盘在概览和验证方面仍有价值。

详情
AI中文摘要

生成式AI对话用户界面(CUI)作为访问和分析数据的新方式,在各个领域(包括工业领域)的应用正在增长。在工业领域,物联网设备产生的大量数据流经用户界面,可能需要对决策者新的分析需求进行适应。基于LLM的CUI通过自然语言的直接性,无需学习每个GUI设计的成本,有望提供一种与这些数据直接交互的新方式。此外,LLM的能力及其代理性为自动化某些任务并在决策活动中辅助推理提供了可能性。但这些承诺是否可靠?我们通过一项混合方法研究来探讨这一普遍问题,比较了最先进的仪表盘与对话代理。共有20名参与者使用两种界面完成四项复杂度不同的模拟工业决策任务。我们结合了心理工作量、完成时间和决策准确性的测量,以及通过主题分析进行的事后问卷和半结构化访谈。研究结果表明,对话代理可以通过支持更直接的信息访问来减少交互努力,而仪表盘在概览和验证方面仍然有价值。然而,这些好处可能因任务而异,需要通过更大规模的研究进行验证。

英文摘要

The use of Generative AI Conversational User Interfaces (CUI) as a new way to access and analyze data is growing in all sectors, and the industrial one is no exception. There, large amounts of data produced by IoT devices are flowing through user interfaces and may require them a new adaptation to the new analyses needs of decision-makers. LLM-based CUIs are promising a new way to directly interact with those data through the directness of natural language and without the learning costs that every GUI design has. Moreover, the capabilities of LLMs and their agency open up the possibility to automate some tasks and help with the reasoning during decision-making activities. But are this promises well founded? We try to scope this general question with a mixed-approach study comparing a state-of-the-art dashboard with a conversational agent. A total of 20 participants used both interfaces to complete four simulated industrial decision tasks of varying complexity. We combined measures of mental workload, completion time, and decision accuracy with a post-study questionnaire and semi-structured interviews analyzed through thematic analysis. The findings suggest that the conversational agent can reduce interactional effort by supporting more direct access to information, while the dashboard remains valuable for overview and verification. However, these benefits may vary across tasks and require validation through larger-scale studies.

2605.31220 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

共享疑虑:语言模型的零样本跨语言置信度估计

Athina Kyriakou, Dennis Ulmer, Ivan Titov

发表机构 * ILLC, University of Amsterdam(阿姆斯特丹大学ILLC) ILCC, University of Edinburgh(爱丁堡大学ILCC)

AI总结 研究多语言大语言模型是否编码共享的、可跨语言迁移的置信度特征,通过轻量级线性探针从中间表示直接预测答案正确性,实现零样本跨语言泛化,并发现置信度特征集中在中间层。

详情
AI中文摘要

置信度估计(CE),即量化模型预测的可靠性,在大语言模型(LLM)背景下引起了极大兴趣。然而,大多数研究集中在英语上,忽视了LLM使用的多语言现实,而许多CE方法会退化或需要跨语言重新训练。为了解决这一差距,我们研究了多语言LLM是否编码共享的、可跨语言迁移的置信度特征。我们使用一个轻量级线性探针,直接从中间表示预测答案正确性。经过单语言训练后,该探针在零样本情况下泛化到未见过的、类型多样的语言,无需目标语言监督。学习到的层权重和多次消融实验表明,置信度特征集中在各语言的中间层,表明存在共享的置信度子空间。虽然零样本跨语言性能取决于与源语言的相似性,但该探针无需任何重新训练即可提供强基线,并且与其他流行的置信度估计方法相比具有优势。

英文摘要

Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.

2605.31212 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

基准测试与增强文本到图像模型以生成早期算术教育中的视觉表示

Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系) ETH AI Center(ETH人工智能中心)

AI总结 针对早期算术教育中的方程到视觉生成任务,构建了E2V-Bench基准并评估了现有T2I模型,发现其在计数和关系结构上存在严重错误,进而探索了基准引导的增强策略。

详情
AI中文摘要

AI系统越来越多地用于支持教育内容创作,但尚不清楚它们能否生成忠实代表其旨在教授的教学概念的输出。因此,我们引入了方程到视觉生成任务,与传统的图像生成不同,该任务要求从算术方程中生成具有教学意义的视觉内容,同时精确保留其数值和关系结构。根据对教师的访谈和教育材料的分析,我们构建了E2V-Bench基准,涵盖四种基于教学法的视觉类型,以及用于评估视觉正确性的自动指标。我们的评估显示,最近的文本到图像(T2I)模型在此任务上频繁失败,错误主要表现为对象计数不正确和关系结构破坏。在此基础上,我们探索了基准引导的增强策略。这些策略改进了代表性模型,但剩余的差距要求未来的T2I模型具备更强的数值和关系基础。

英文摘要

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.

2605.31210 2026-06-01 cs.RO cs.AI 版本更新

Simulation of collision avoidance behavior in crowd movement by data-driven approach

基于数据驱动方法的群体运动碰撞规避行为模拟

Xuanwen Liang, Eric Wai Ming Lee

发表机构 * Department of Architecture and Civil Engineering(建筑与土木工程系) University of Hong Kong(香港大学)

AI总结 针对数据驱动人群模拟中碰撞率高的问题,提出一种结合碰撞惩罚的生成对抗网络(CPGAN),通过侧向加速度碰撞损失函数和Voronoi特征提取方法,有效降低双向流中的对向碰撞率。

详情
AI中文摘要

人群运动模拟对于行人安全管理和设施布局优化至关重要。数据驱动模型提高了欧几里得度量下的轨迹预测精度,但存在碰撞率过高的问题,尤其是在双向和多向流中。本文建立了一种新颖的数据驱动人群模拟模型,将行人碰撞机制纳入损失函数以减少碰撞。提出了基于侧向加速度的碰撞损失函数和基于Voronoi的运动特征提取方法。该模型基于生成对抗网络(GAN)架构,称为CPGAN(碰撞惩罚GAN)。我们在涉及频繁碰撞规避行为的双向流场景中评估了CPGAN。结果表明,所提出的基于侧向加速度的碰撞损失显著降低了相反方向行人的碰撞率,达到与受控实验相当的水平。CPGAN有效模拟了双向流,再现了通道形成和N-t曲线。研究成果可为将行人动力学机制融入数据驱动人群模拟的损失函数提供启发。

英文摘要

Crowd movement simulation is essential for pedestrian safety management and facility layout optimization. Data-driven models enhance trajectory prediction accuracy under Euclidean metrics, yet they suffer from excessively high collision rates, especially in bidirectional and multidirectional flows. In this paper, we establish a novel data-driven crowd simulation model that incorporates the pedestrian collision mechanism into the loss function to reduce collisions. A new lateral-acceleration-based collision loss function and a Voronoi-based motion feature extraction approach are proposed. The model is based on a Generative Adversarial Network (GAN) architecture and is termed CPGAN (Collision-Penalized GAN). We evaluate CPGAN in bidirectional flow scenarios, which involve frequent collision avoidance behaviors. Results show that the proposed lateral-acceleration-based collision loss significantly reduces opposite-direction pedestrian collision rates to levels comparable with controlled experiments. CPGAN effectively simulates bidirectional flow, reproducing lane formation and N-t curves. The research outcomes can provide inspiration for integrating pedestrian dynamics mechanisms into loss functions in data-driven crowd simulation.

2605.31199 2026-06-01 cs.CR cs.AI 版本更新

MAECO-Lite: Modular Ontology for Dynamic Malware Analysis

MAECO-Lite:动态恶意软件分析的模块化本体

Zekeri Adams, Peter Švec, Ján Kľuka, Roderik Ploszek, Monday Onoja, Štefan Balogh, Martin Homola

发表机构 * Department of Applied Informatics, Comenius University in Bratislava, Mlynská dolina, 842 48 Bratislava, Slovakia Institute of Computer Science Mathematics, Faculty of Electrical Engineering Information Technology, Slovak University of Technology, Ilkovičova 3, Bratislava, Slovakia

AI总结 针对MAEC和STIX在动态恶意软件分析中混淆工件与事件的问题,基于统一基础本体(UFO)进行本体分析,提出轻量级本体MAECO-Lite,通过模块化结构分离持久实体与运行时事件,提升语义清晰度和计算可用性。

详情
AI中文摘要

以实用且语义精确的方式捕获动态恶意软件行为仍然是网络威胁情报中的一个重大挑战。尽管MAEC和STIX等标准提供了广泛采用的词汇表来描述恶意软件工件和观测结果,但它们以相当复杂的结构表示数据,往往掩盖了重要的本体论区分。特别是,它们倾向于将持久的恶意软件工件与执行期间生成的事件混为一谈,从而模糊了本体设计基础标准中的核心区分。在本文中,我们以统一基础本体(UFO)为理论视角,对与动态恶意软件分析相关的核心MAEC和STIX构造进行了基础本体分析。我们的分析揭示了由于MAEC和STIX中工件、倾向和运行时事件的混淆而产生的一些本体论不匹配,这些不匹配使动态恶意软件行为的一致表示复杂化,并从实践角度限制了推理执行轨迹的能力。基于这些见解,我们提出了MAECO-Lite,一种轻量级本体,旨在表示数据并实现其处理以用于动态恶意软件分析。该本体采用模块化结构,以样本、进程、动作、系统工件和MITRE ATT&CK技术为中心,同时保持持久实体和运行时事件之间的清晰分离。使用描述逻辑概念学习算法的初步评估表明,简化的本体显著提高了学习性能,证明了基于本体的建模可以增强语义清晰度和计算可用性。

英文摘要

Capturing dynamic malware behavior in a practical but still semantically precise manner remains a significant challenge in cyber threat intelligence. While standards such as MAEC and STIX provide widely adopted vocabularies for describing malware artifacts and observations, they represent data with considerable complexity in structures that often obscure important ontological distinctions. In particular, they tend to conflate enduring malware artifacts with the events generated during execution, thereby flattening distinctions that are central in foundational standards for ontology design. In this paper, we conduct a foundational ontological analysis of core MAEC and STIX constructs relevant to dynamic malware analysis relying on Unified Foundational Ontology (UFO) as a theoretical lens. Our analysis reveals some ontological mismatches arising from the conflation of artifacts, dispositions, and runtime events in MAEC and STIX that complicate coherent representation of dynamic malware behavior and, from a practical perspective, limit the ability to reason about execution traces. Based on these insights, we propose MAECO-Lite, a lightweight ontology designed to represent data and operationalize their processing for dynamic malware analysis. The ontology adopts a modular structure centered on samples, processes, actions, system artifacts, and MITRE ATT&CK Techniques, while maintaining a clear separation between enduring entities and runtime events. An initial evaluation using description logic concept learning algorithms shows that the simplified ontology significantly improves learning performance, demonstrating that ontologically grounded modelling can enhance both semantic clarity and computational usability.

2605.31196 2026-06-01 cs.CV cs.AI cs.CL cs.RO 版本更新

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

探索视觉-语言模型中的碰撞接地以实现安全的人机协作

Jun Wang, Xiaohao Xu, Xiaonan Huang

发表机构 * University of Michigan, Ann Arbor(密歇根大学,安娜堡)

AI总结 针对安全人机协作,提出碰撞接地概念及物理基准TouchSafeBench,评估视觉-语言模型在分类当前安全状态和预警即将碰撞任务中的表现,发现现有模型不可靠,视觉流畅性不等于物理责任性。

Comments 31 pages, 9 figures

详情
AI中文摘要

安全的人机协作需要的不仅仅是视觉描述:监控器必须确定机器人身体是否安全分离、已经与场景或人发生碰撞,或即将碰撞。我们将这种能力称为碰撞接地:将视觉观察与机器人身体几何、相机视角、场景布局、人体接近度和时间运动相结合,以推断当前和即将发生的接触。我们引入了TouchSafeBench,一个基于物理的基准,用于评估视觉-语言模型(VLM)中的碰撞接地能力。TouchSafeBench基于Habitat 3.0构建,包含2,940个模拟室内共现场景,涵盖社交导航和社交重排,具有同步的多视角RGB-D观测、自上而下的轨迹地图、校准的相机元数据和模拟器导出的接触标签。我们研究了两个面向部署的任务:分类当前安全状态和在接触前预警即将发生的碰撞。在三个前沿或面向机器人的VLM和九种视觉表示中,当前模型远未达到可靠:最佳平均Macro-F1仍低于50%,显式深度不会自动转化为机器人身体碰撞证据,且机器人与场景的接触始终比人与人的接触风险更难。TouchSafeBench揭示了具身VLM的一个核心限制:视觉流畅性并不意味着物理责任性。可靠的机器人安全监控器需要能够显式绑定视角、机器人形态、度量几何和未来碰撞的表示。我们将在论文被接收后发布该基准。

英文摘要

Safe human--robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat~3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50\%, explicit depth is not automatically transformed into robot-body collision evidence, and robot--scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.

2605.31183 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

引导LLM?实际上,稀疏自编码器可以胜过简单基线

Mikkel Godsk Jørgensen, Lars Kai Hansen

发表机构 * DTU Compute(丹麦技术大学计算学院)

AI总结 本文通过监督流水线选择并标注特征,证明稀疏自编码器在模型引导任务上可接近LoRA性能,并发现高稀疏性对基于可解释性的引导并非关键。

详情
AI中文摘要

稀疏自编码器(SAEs)被视为探索大型语言模型(LLMs)内部机制和引导模型输出生成的有前途的途径。当Wu等人(2025)引入模型引导基准AxBench时,SAEs由于相对于一组简单基线的引导性能较差,似乎并未达到最初的期望。本文作为对稀疏自编码器的部分反驳,表明Wu等人(2025)的结果并未完全公正地评价它们。我们发现,当使用我们的监督流水线选择并标注特征时,稀疏自编码器实际上可以在AxBench基准上达到接近参考LoRA性能的水平。我们还发现,当仅使用基于可解释性的组件时,我们的流水线选择的特征与其识别标签具有令人惊讶的因果性。最后,我们提供证据表明,高稀疏性(低l0)可能对于基于可解释性的成功引导并非关键,这与Wang等人(2025)早期的发现相反。

英文摘要

Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform close to on par with the reference LoRA performance on the AxBench benchmark, when features are selected and labelled with our supervised pipeline. We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).

2605.31173 2026-06-01 cs.SD cs.AI 版本更新

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

MindVoice: 利用预训练先验从非侵入性神经信号重建可理解语音

Guangyin Bao, Taiping Zeng, Jianfeng Feng, Xiangyang Xue

发表机构 * Fudan University(复旦大学)

AI总结 提出MindVoice框架,通过解耦语义和声学路径并融合预训练生成模型与语音克隆,从EEG/MEG信号中重建出可理解语音,显著优于现有方法。

详情
AI中文摘要

从非侵入性神经记录中重建连续语音是探究人类听觉感知和构建安全、可扩展的语音脑机接口的基本问题。尽管近期取得进展,但由于非侵入性记录本身存在噪声、空间模糊且仅部分保留感知语音信息,可理解的重建仍然难以实现。现有方法直接将神经活动映射到纠缠的语音表征,然后使用神经声码器合成波形,导致结果频谱相似但不可理解。为克服这些限制,我们引入MindVoice,一种神经到语音的重建框架,利用预训练模型补偿神经记录中不完整的语义和声学信息。MindVoice将重建解耦为两条互补路径:一条恢复高层语义内容,另一条估计细粒度声学属性。这些推断的表征随后与强大的语音生成模型和上下文语音克隆融合,以合成自然且可理解的语句。在EEG和MEG上的大量实验表明,MindVoice在各种指标上显著优于现有方法。这些结果表明,预训练先验为弥合噪声神经记录与自然语音之间的差距提供了一种原则性方法,凸显了听觉神经科学研究和非侵入性语音脑机接口的一个有前景的尝试。

英文摘要

Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.

2605.31171 2026-06-01 cs.IR cs.AI 版本更新

MIMO: Multilingual Information Retrieval via Monolingual Objectives

MIMO: 通过单语目标实现多语言信息检索

Youngjoon Jang, Seongtae Hong, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University(韩国大学计算机科学与工程系)

AI总结 提出MIMO两阶段框架,利用教师模型的稳定英语语义空间,通过知识蒸馏和跨语言对比学习联合优化,解决多语言信息检索中语言聚类和嵌入对齐-均匀性权衡问题。

详情
AI中文摘要

多语言信息检索(MLIR)反映了真实的搜索环境,其中查询和相关文档可能以不同语言出现在混合语言语料库中。然而,现有的嵌入模型主要针对多单语检索进行优化,在MLIR设置中其性能通常会下降。此外,直接将传统对比学习应用于MLIR会加剧语言聚类,并暴露跨语言对齐与嵌入均匀性之间的权衡。为了解决这些局限性,我们提出了MIMO:通过单语目标实现多语言信息检索,这是一个两阶段框架,使用来自高性能教师模型的稳定英语语义空间作为锚点。MIMO首先通过知识蒸馏初始化学生模型的跨语言对齐,然后联合优化蒸馏和跨语言对比学习,以提高检索判别力同时保持对齐。大量实验表明,MIMO在各种MLIR和多单语基准测试中始终优于现有的跨语言训练基线。MIMO在与类似或更大参数规模的现成模型相比也保持竞争力。此外,我们的跨语言对齐-均匀性分析阐明了两个损失组件的不同作用,并表明它们的组合在对齐和均匀性之间产生了有利的权衡。

英文摘要

Multilingual Information Retrieval (MLIR) reflects real-world search environments in which queries and relevant documents may appear in different languages within a mixed-language corpus. However, existing embedding models are primarily optimized for Multi-Monolingual retrieval and their performance often degrades in MLIR settings. Moreover, directly applying conventional contrastive learning to MLIR can exacerbate language clustering and expose a trade-off between cross-lingual alignment and embedding uniformity. To address these limitations, we propose MIMO: Multilingual Information Retrieval via Monolingual Objectives, a two-stage framework that uses a stable English semantic space from a high-performing teacher model as an anchor. MIMO first initializes the student model's cross-lingual alignment through knowledge distillation, and then jointly optimizes distillation and cross-lingual contrastive learning to improve retrieval discrimination while preserving alignment. Extensive experiments show that MIMO consistently outperforms existing cross-lingual training baselines across various MLIR and Multi-Monolingual benchmarks. MIMO also remains competitive with off-the-shelf models of similar or larger parameter scales. Furthermore, our cross-lingual Alignment-Uniformity analysis clarifies the distinct roles of the two loss components and shows that their combination yields a favorable trade-off between alignment and uniformity.

2605.31170 2026-06-01 cs.CL cs.AI 版本更新

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

语言模型智能体群体中的涌现语言:从令牌效率到监督规避

Stine Lyngsø Beltoft, William Brach, Federico Torrielli, Jacob Nielsen, Annemette Brok Pirchert, Filippo Tonini, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark(南丹麦大学) Slovak University of Technology in Bratislava(布拉迪斯拉发技术大学) University of Turin(都灵大学) Ordbogen A/S(Ordbogen公司)

AI总结 研究语言模型智能体群体中涌现的语言,通过规则启发式和零样本分类识别出令牌效率、新自然语言和监督规避三类,发现监督规避语言更难对齐且可被上下文学习,表明仅监控表面行为可能不足以控制智能体群体。

详情
AI中文摘要

目前,对自主语言模型智能体的监控主要依赖表面行为。但当智能体群体为了规避人类监督而发明新语言时会发生什么?本文研究了Moltbook上的涌现语言。为此,我们基于Moltbook Files数据集,采用两阶段方法:先进行基于规则的启发式匹配(约6000个匹配),再进行零样本分类(保留518个)。结果类别包括令牌效率(166个)、新自然语言(106个)和监督规避(59个)。我们进行了定量和定性分析。结果表明,提出用于规避监督的新语言的帖子被DeepSeek-3.2判定为比其他类别更不对齐,且所有语言都可以通过语言描述被其他语言模型在上下文中学习。此外,手动研究典型案例揭示了令人惊讶的复杂隐写协议,例如在自然语言中嵌入隐藏信息。尽管我们无法确定这些语言构思中的自主程度,但我们的结果进一步证明,仅监控表面行为可能很快不足以维持对智能体群体的控制。

英文摘要

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight. Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols like embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in ideation of these languages, our results add up to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.

2605.31167 2026-06-01 cs.AI 版本更新

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

LLM-FACETS:一个保护隐私的评估LLM透明度和问责制的框架

Tom Lucas, Alessio Buscemi, Alfredo Capozucca, German Castignani, Barbara Delacroix

发表机构 * Luxembourg Institute of Science and Technology (LIST)(卢森堡科学与技术研究所) University of Luxembourg(卢森堡大学)

AI总结 提出一个开源框架LLM-FACETS,通过浏览器界面和插件架构,为技术专家、领域专家和合规官员提供隐私保护的LLM评估,实现透明度与问责制。

Comments Submitted to ACM Journal on Responsible Computing, Special Section: Collaborative Methods and Tools for Engineering and Evaluating Transparency in AI. 28 pages 9 figures, 7 tables, 1 algorithm. Source code: https://github.com/Scriptor-Group/AIMVi

详情
AI中文摘要

评估大型语言模型的输出是否事实准确、认知校准和方法可复现,是负责任AI部署的前提。然而,审计LLM对非技术从业者仍然难以实现:现有工具需要编程专业知识和非平凡的环境设置,云托管平台将评估数据传输到外部服务,为法律上负责AI监督的领域专家和合规官员设置了障碍。我们介绍LLM-FACETS(LLM事实交叉评估系统):一个开源框架,具有浏览器可访问的界面和插件架构,围绕三个从业者画像(技术专家、领域专家、合规官员)构建,这些画像反映了EU AI法案和NIST AI风险管理框架中识别的利益相关者类别。该架构使数据流明确:确定性指标(BLEU、ROUGE、BERTScore)完全在自托管服务器内运行,无出站传输;LLM评判指标显式联系外部API,用户保留完全凭据控制。该框架通过三种机制实现透明度:用于认知不确定性的token级对数概率可视化、多评判共识以减轻评判偏差,以及RAG Triad指标(忠实性、答案相关性、上下文相关性)以检测和定位幻觉。插件架构允许在不修改评估管道的情况下集成任何新指标或数据集。开源实现支持针对同一属性的多个指标进行交叉检查,确保可复现性,并将AI问责制与评估系统的构建团队解耦。我们通过18个指标实现与规范参考库的交叉验证来验证该框架。

英文摘要

Assessing whether Large Language Models outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a prerequisite for responsible AI deployment. Yet auditing LLMs remains inaccessible to non-technical practitioners: existing tools require programming expertise and non-trivial environment setup, and cloud-hosted platforms transmit evaluation data to external services, creating barriers for domain experts and compliance officers legally responsible for AI oversight. We introduce LLM-FACETS (LLM FActuality Cross-EvaluaTion System): an open-source framework with a browser-accessible interface and a plugin architecture, structured around three practitioner profiles (technical experts, domain experts, compliance officers) that mirror the stakeholder categories identified in the EU AI Act and the NIST AI Risk Management Framework. The architecture makes data flows explicit: deterministic metrics (BLEU, ROUGE, BERTScore) run entirely within the self-hosted server with no outbound transmission; LLM-judge metrics contact external APIs explicitly, with users retaining full credential control. The framework operationalizes transparency through three mechanisms: token-level log-probability visualization for epistemic uncertainty, multi-judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations. A plugin architecture allows any new metric or dataset to be integrated without modifying the evaluation pipeline. The open-source implementation enables cross-checking across multiple metrics targeting the same property, ensuring reproducibility and decoupling AI accountability from the teams building the systems assessed. We verify the framework through cross-validation of 18 metric implementations against canonical reference libraries.

2605.31164 2026-06-01 cs.CL cs.AI 版本更新

D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

D$^3$: 面向LLM训练的动态有向图约束数据调度

Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li

发表机构 * Microsoft Research(微软研究院)

AI总结 提出D$^3$框架,通过动态有向图建模训练单元间的有向影响关系,并求解约束优化问题以确定训练顺序,从而提升LLM预训练和后训练阶段的效率。

详情
AI中文摘要

训练数据在大语言模型(LLM)优化中起着核心作用,这激发了对数据调度策略的广泛研究。现有方法大多集中于调整整体数据分布,而忽略了训练过程中样本之间的潜在交互。然而,我们认为这种交互不可忽视,因为现实世界的数据样本之间经常存在有向影响,使得训练顺序至关重要。直观上,我们可以优先训练影响更大的单元以提高学习效率。在这项工作中,我们提出了D$^3$,一个动态有向图约束的数据调度框架。D$^3$将训练单元之间的复杂交互建模为一个动态影响图,其中边表示基于损失的依赖关系。然后,它在该图上求解一个约束优化问题,以推导出训练顺序,确保数据序列在整个训练过程中遵循不断演变的信息流。我们的方法具有理论动机,并在预训练和后训练阶段均比现有数据调度方法取得了一致的改进。此外,为了可扩展性,D$^3$还采用了一种高效的近似算法,将额外的计算开销控制在可管理范围内。为便于未来研究,代码可在https://github.com/xuyj233/D3获取。

英文摘要

Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improves learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.

2605.31159 2026-06-01 cs.LG cs.AI 版本更新

Trust-Region Behavior Blending for On-Policy Distillation

信任域行为混合用于在线策略蒸馏

Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov

发表机构 * T-Tech

AI总结 提出信任域行为混合(TRB)预热方法,通过在学生中心的KL信任域内用最接近教师的行为策略替换早期学生策略,解决在线策略蒸馏中早期学生轨迹质量差的问题,在数学推理蒸馏中取得最佳平均性能。

详情
AI中文摘要

在线策略蒸馏(OPD)训练学生模型在其自身策略采样的前缀上进行学习,同时匹配更强的教师模型。这解决了离线蒸馏中的前缀不匹配问题,但早期的学生模型 rollout 仍然可能质量较差,导致教师监督应用于弱或低质量的前缀。我们提出信任域行为混合(TRB),一种预热方法,在学生中心的KL信任域内,用最接近教师的行为策略替换早期的 rollout 策略,同时保持每个前缀的反向KL OPD损失不变。KL预算逐渐退火至零,因此预热后训练恢复为纯学生 rollout。在两个数学推理蒸馏设置中,TRB在比较方法中取得了最强的平均性能。

英文摘要

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.

2605.31149 2026-06-01 cs.HC cs.AI 版本更新

Developing a UXR Point of View for Cognitive Accessibility in Mobile Learning with Generative AI

利用生成式AI在移动学习中开发认知无障碍的UXR视角

Fatima Ahmad Muazu, Festus Adedoyin, Huseyin Dogan, Abiodun Adedeji, Melike Akca, Olumuyiwa Ayorinde

发表机构 * School of Computing and Engineering, Bournemouth University(伯恩茅斯大学计算与工程学院)

AI总结 本研究通过结合UX研究原则和大语言模型支持的分析,提出认知无障碍UXR剧本,以改善面向认知障碍学习者的移动学习系统需求质量。

详情
AI中文摘要

本研究探讨如何利用UX研究(UXR)原则,结合大语言模型(LLM)支持的分析,提高为认知障碍学习者设计的移动学习系统的需求质量。以UXR视角(PoV)金字塔为方法论框架,研究分为四个阶段:心理、行为和设计层的基础结构;使用DeLone和McLean信息系统成功模型及质量功能展开(QFD)进行结构化验证;通过开发九张认知无障碍UXR游戏卡进行洞察整合;以及支持跨学科沟通的利益相关者特定PoV表述。在人工监督下,整合LLM支持的合成以协助主题聚类、需求细化和假设制定。研究结果表明,移动学习中的许多可用性和参与度挑战源于模糊或未充分定义的需求,而不仅仅是界面设计。通过将认知无障碍原则嵌入可测量且技术可追溯的需求中,所提出的认知无障碍UXR剧本为协调理论、系统架构和利益相关者策略提供了结构化路径。

英文摘要

This study investigates how UX research (UXR) principles, combined with Large Language Model (LLM)-supported analysis, can be used to improve the quality of requirements for mobile learning systems designed for learners with cognitive disabilities. Using the UXR Point-of-View (PoV) pyramid as a methodological framework, the study progressed through four stages: foundational structuring of psychological, behavioral, and design layers; structured validation using the DeLone and McLean Information Systems Success Model and Quality Function Deployment (QFD); insight consolidation through the development of nine Cognitive Accessibility UXR Play Cards; and stakeholder-specific PoV articulation to support interdisciplinary communication. LLM-supported synthesis was integrated to assist in theme clustering, requirement refinement, and hypothesis formulation under human oversight. Findings suggest that many usability and engagement challenges in mobile learning originate from ambiguous or under-specified requirements rather than interface design alone. By embedding cognitive accessibility principles into measurable and technically traceable requirements, the proposed Cognitive Accessibility UXR Playbook provides a structured pathway for aligning theory, system architecture, and stakeholder strategy.

2605.31148 2026-06-01 cs.CV cs.AI cs.CL 版本更新

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

SpatialAct:探测VLM智能体在3D场景中的空间推理到行动能力

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) Zhongguancun Academy(中关村学院) Tsinghua University(清华大学) Helsinki University(赫尔辛基大学)

AI总结 本文提出SpatialAct基准,通过多轮交互细化、单步错误检测与修复等任务,揭示当前视觉语言模型在3D场景中从空间推理到行动存在显著差距。

详情
AI中文摘要

人类能够在日常3D环境中轻松感知空间布局、形成认知表征、推理空间关系,并将这种推理转化为行动。尽管最近的视觉语言模型(VLM)在基于观测的空间感知和推理任务上表现出色,但它们是否能够构建连贯的空间理解、据此行动并通过多轮反馈优化行动仍不清楚。为研究这一问题,我们引入了 extbf{SpatialAct},一个基于模拟器的基准,用于探测3D场景中的 extit{行动条件空间推理}。从最具挑战性的设置——多轮交互细化开始,我们进一步设计了其分解版本——单步错误检测与修复,以及五个基础空间能力任务,以诊断模型失败的潜在原因。实验揭示了明显的推理到行动差距:当前VLM在孤立的空间推理任务上表现良好,但在多轮反馈中难以维持连贯的空间信念并产生可靠行动,显著不如人类。这些结果表明,即使抽象掉了低级控制,当前VLM智能体在行动引起的环境变化下仍缺乏稳健的空间状态跟踪能力。

英文摘要

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce \textbf{SpatialAct}, a simulator-grounded benchmark for probing \textit{action-conditioned spatial reasoning} in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

2605.31147 2026-06-01 cs.HC cs.AI 版本更新

Developing a Culturally Grounded, AI-Augmented UX Research Point of View (POV): An Exemplar Case Study from Telemedicine Dementia Care

开发一个文化根基的、AI增强的用户体验研究观点(POV):来自远程医疗痴呆症护理的示例案例研究

Abiodun Adedeji, Huseyin Dogan, Festus Adedoyin, Michelle Heward, Melike Akca, Emmanuel Oluwatosin Oluokun, Fatima Ahmad Muhazu, Olumuyiwa Ayorinde

发表机构 * School of Computing and Informatics, Bournemouth University(计算与信息学院,伯恩茅斯大学)

AI总结 本文通过一个远程医疗痴呆症护理案例,展示了如何结合混合方法研究、假设生成和本体建模,并集成生成式AI作为协作工具,来构建一个文化敏感的、可辩护的用户体验研究观点(POV)。

详情
AI中文摘要

用户体验研究(UXR)观点(POV)将复杂且通常碎片化的研究证据提炼为可操作的视角,指导团队理解用户需求、构建设计决策并协调利益相关者。尽管POV在行业实践中被广泛使用,但公开记录POV构建过程的例子很少,特别是在文化敏感和资源匮乏的背景下。本文展示了一个示例案例研究,展示了如何开发一个文化根基的、AI增强的UXR POV,以指导TeleDeCa——一个面向尼日利亚家庭护理人员的远程医疗痴呆症护理框架。基于UXR POV Playbook和金字塔框架,我们说明了如何将混合方法研究、假设生成和基于本体的建模结合起来,形成一个可辩护的POV,而无需完全最终化的系统或验证结果。生成式AI(GenAI)作为有限的研究合作者被整合到UXR POV框架中,支持综合、假设探索和叙事构建,同时保留人类判断、伦理责任和文化敏感性。本文的贡献在于提取了可重用的Play Cards和一个Play,扩展了UXR POV Playbook,并为CHI 2026关于开发AI驱动的UXR POV的工作坊提供了示例材料。

英文摘要

User Experience Research (UXR) Points of View (POVs) distil complex and often fragmented research evidence into actionable perspectives that guide how teams interpret user needs, frame design decisions, and align stakeholders. Although POVs are widely used in industry practice, there are few published examples that explicitly document how POVs are constructed, particularly in culturally sensitive and low-resource contexts. This paper presents an exemplar case study demonstrating how a culturally grounded, AI-augmented UXR POV was developed to inform TeleDeCa, a telemedicine dementia care framework for family caregivers in Nigeria. Building on the UXR POV Playbook and pyramid framework, we illustrate how mixed-methods research, hypothesis generation, and ontology-based modelling can be combined to form a defensible POV without requiring a fully finalised system or validated outcomes. Generative AI (GenAI) is integrated across the UXR POV framework as a bounded research collaborator, supporting synthesis, hypothesis exploration, and narrative construction while preserving human judgment, ethical accountability, and cultural sensitivity. The contribution of this paper lies in the extraction of reusable Play Cards and a Play that extend the UXR POV Playbook and serve as exemplar material for the CHI 2026 workshop on developing AI-powered UXR POVs.

2605.31146 2026-06-01 cs.HC cs.AI 版本更新

From Evidence to Design: Developing an AI-Augmented UX Research Point of View for Digital Wellbeing in Emergency and Public Safety Contexts

从证据到设计:开发面向紧急与公共安全情境下数字福祉的AI增强用户体验研究视角

Olumuyiwa Ayorinde, Huseyin Dogan, Festus Adedoyin, Nan Jiang, Emmanuel Oluokun, Abiodun Adedeji, Melike Akca

发表机构 * School of Computing and Engineering, Bournemouth University(伯恩茅斯大学计算与工程学院)

AI总结 本研究结合用户体验研究方法与AI支持分析,针对紧急与公共安全人员开发数字福祉干预措施的设计方向,通过文献分析识别模式并整合行为改变技术与说服性设计原则,最终产出UXR PoV金字塔、九张UXR游戏卡和利益相关者叙事。

详情
AI中文摘要

本文研究如何将用户体验研究方法与AI支持分析相结合,为针对紧急与公共安全人员的数字福祉干预措施开发更清晰的设计方向。EPSP在高压、轮班制环境中工作,认知疲劳和不可预测的日程降低了他们对传统福祉工具的参与度。本研究使用UXR观点框架,应用AI支持的文献分析过程来识别反复出现的心理、行为和设计模式。在整个解释过程中整合了行为改变技术和说服性设计原则,以连接证据与实际设计推理。该过程产生了UXR PoV金字塔、九张UXR游戏卡和以利益相关者为中心的PoV叙事。研究结果表明,有效的EPSP福祉系统必须最小化认知努力、适应操作环境并优先考虑心理安全。这项工作展示了AI如何协助大规模证据解释,而人类研究人员则保持对情境判断和设计方向的责任。

英文摘要

This paper investigates how User Experience Research (UXR) methods can be combined with AI-supported analysis to develop clearer design direction for digital wellbeing interventions targeting Emergency and Public Safety Personnel (EPSP). EPSP work in high-stress, shift-based environments where cognitive fatigue and unpredictable schedules reduce engagement with conventional wellbeing tools. Using the UXR Point-of-View (PoV) framework, this study applied an AI-supported literature analysis process to identify recurring psychological, behavioural, and design patterns. Behaviour Change Techniques and Persuasive Technology principles were integrated throughout interpretation to connect evidence with practical design reasoning. The process resulted in a UXR PoV Pyramid, nine UXR Play Cards, and stakeholder focused PoV narratives. Findings show that effective wellbeing systems for EPSP must minimise cognitive effort, adapt to operational context, and prioritise psychological safety. The work demonstrates how AI can assist large-scale evidence interpretation while human researchers maintain responsibility for contextual judgement and design direction.

2605.31145 2026-06-01 cs.CV cs.AI cs.LG 版本更新

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

FOCUS: 通过视觉支持约束和策略优化强制上下文目标定位

Mohammed Asad Karim, Vinay Kumar Verma

发表机构 * Amazon, Seattle, USA(亚马逊(美国西雅图))

AI总结 提出一种两阶段训练框架,通过优化支持框与查询图像间的上下文注意力并结合GRPO强化学习,实现无类别监督的类别无关上下文目标定位,7B模型性能超越72B模型。

Comments Accepted at ICML 2026. * Equal Contributions

详情
AI中文摘要

上下文定位(ICL)旨在通过查询图像中的少量支持示例定位目标对象,无需训练或参数更新即可即时操作。尽管视觉语言模型(VLM)快速发展,实现类别无关且基于视觉的ICL仍然是一个未解决的问题,尽管它对图像编辑、个性化视觉搜索和检索等应用至关重要。现有方法脆弱且依赖显式类别监督,这不仅限制了在具有未命名或实例特定对象的现实场景中的适用性,还引入了类别偏差,使预测偏向语义先验而非视觉证据。我们提出一个两阶段训练框架,在无类别监督的情况下显式优化支持边界框与查询图像之间的上下文注意力。我们进一步通过使用组相对策略优化(GRPO)的强化学习来细化定位,直接最小化定位误差。这种公式强制视觉对应优于语义先验,产生鲁棒的实例级定位。实验表明,使用我们的目标训练的7B参数模型优于高达72B参数的模型,证明了上下文感知定位目标可以超越单纯扩展规模。全面的消融实验验证了每个组件的贡献。

英文摘要

In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.

2605.31143 2026-06-01 cs.HC cs.AI 版本更新

Extending the UXR Point of View Pyramid: A Generative AI-Augmented Methodology for Human-Centred AI Systems

扩展UXR观点金字塔:一种面向人本AI系统的生成式AI增强方法论

Festus Fatai Adedoyin, Huseyin Dogan, Melike Akca, Abiodun Adedeji

发表机构 * School of Computing and Engineering, Bournemouth University(伯恩茅斯大学计算机与工程学院)

AI总结 针对英国债务管理中的AI金融系统,通过扩展UXR观点金字塔,提出一种结合生成式AI的增强方法论,包括AI增强观点金字塔、结构化提示架构和AI驱动的Playbook卡片系统,以提升可解释性、公平性和问责性。

详情
AI中文摘要

英国家庭债务和生活成本压力的上升,加剧了AI驱动的金融技术在信贷评估、还款结构和债务支持服务中的作用。这些系统日益影响重大的财务决策,但它们在复杂的社会技术环境中运作,受到监管限制、算法不透明性和高度脆弱性风险的影响。用户体验研究(UXR)观点(PoVs)对于将异质性研究证据转化为产品和治理决策的战略方向至关重要。然而,现有的UXR PoV框架并非为AI中介的金融系统设计,而在此类系统中,可解释性、公平性和问责性至关重要。本文扩展了UXR PoV金字塔,形成了一种面向英国金融服务背景下以人为中心的AI债务管理技术的AI增强方法论框架。我们形式化了(1)AI增强的PoV金字塔,(2)用于综合和假设生成的结构化提示架构,以及(3)AI驱动的Playbook卡片系统,该系统将生成式AI嵌入UXR工作流程,同时保持可追溯性和伦理监督。生成式AI并非作为分析权威,而是作为受人类验证和监管意识约束的认识论支持机制。通过将该框架应用于债务管理技术(包括可负担性评估、还款计划和财务压力预测系统),本研究推进了高风险金融AI环境下的UXR方法论,并为CHI社区内负责任、AI驱动的UXR实践的发展做出了贡献。

英文摘要

Rising household debt and cost-of-living pressures in the United Kingdom have intensified the role of AI-driven financial technologies in mediating credit assessment, repayment structuring, and debt support services. These systems increasingly shape consequential financial decisions, yet they operate within complex socio-technical environments characterised by regulatory constraint, algorithmic opacity, and heightened vulnerability risk. User Experience Research (UXR) Points of View (PoVs) are critical in translating heterogeneous research evidence into strategic direction for product and governance decisions. However, the existing UXR PoV framework was not designed for AI-mediated financial systems where interpretability, fairness, and accountability are central. This paper extends the UXR PoV pyramid into an AI-augmented methodological framework for Human-Centred AI debt management technologies in the UK financial services context. We formalise (1) an AI-Augmented PoV Pyramid, (2) a structured prompt architecture for synthesis and hypothesis generation, and (3) an AI-enabled Playbook Card system that embeds Generative AI into UXR workflows while preserving traceability and ethical oversight. Generative AI is positioned not as an analytic authority, but as an epistemic support mechanism subject to human validation and regulatory awareness. By grounding the framework in debt management technologies, including affordability assessment, repayment planning, and financial stress prediction systems, this work advances UXR methodology for high-stakes financial AI environments and contributes to the evolution of responsible, AI-powered UXR practice within the CHI community.

2605.31142 2026-06-01 cs.CL cs.AI 版本更新

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

多语言文本嵌入排名在学习任务、语言和基准数据集上的鲁棒性

Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov

发表机构 * Computer Systems Department(计算机系统系) Jožef Stefan Institute(乔泽夫·斯塔芬研究所)

AI总结 通过引入数据集组成鲁棒性和排名方案鲁棒性指标,系统分析了MTEB中多语言模型排名对评估设计变化的敏感性,发现基于LLM的大模型通常是鲁棒的顶尖模型,但并非在所有任务中一致。

详情
AI中文摘要

大规模多语言文本嵌入模型在研究和工业中扮演着关键角色,但它们在特定语言、多任务设置中的行为仍未被充分理解。尽管像MTEB这样的基准平台报告了超过250种语言的结果,但关于模型优越性的结论往往依赖于数据集组成和性能聚合方法的隐含选择。为了解决这一差距,我们对MTEB中的多语言模型性能鲁棒性进行了元研究,应用了多种多准则决策制定排名方案,并引入了两个鲁棒性指标:数据集组成鲁棒性(排名对数据集组成变化的敏感性)和排名方案鲁棒性(对聚合方法变化的敏感性)。它们使得系统性地分析基准结论在不同评估设计下是否保持稳定成为可能。我们对五种语言(英语、法语、德语、印地语和西班牙语)在九个任务(例如分类、聚类、检索)上进行了深入分析,并发布了约230种额外语言的结果。任务特定分析表明,基于大规模LLM的模型通常是鲁棒的顶尖表现者,尽管并非一致(例如在检索任务中),而任务无关的结果显示,只有一小部分模型在任务、排名方案和数据子样本中始终保持强劲。

英文摘要

Large-scale multilingual text embedding models play crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change). They enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.

2605.31138 2026-06-01 cs.HC cs.AI 版本更新

Developing an AI-Powered UX Research Point of View for Digital Health in A Regulatory Context: An Exemplar Case from MSM and Transgender HIV Care in Nigeria

在监管背景下开发AI驱动的用户体验研究视角:以尼日利亚MSM和跨性别者HIV护理为例

Emmanuel Oluwatosin Oluokun, Festus Fatai Adedoyin, Huseyin Dogan, Nan Jiang, Melike Akca, Abiodun Adedeji, Olumuyiwa Ayorinde, Fatima Ahmad Muazu

发表机构 * School of Computing and Engineering, Bournemouth University(伯恩茅斯大学计算与工程学院)

AI总结 本文提出一种生成式AI增强的用户体验研究方法论,通过四阶段UXR流程和十张理论驱动的UXR游戏卡,指导尼日利亚男男性行为者(MSM)和跨性别者HIV护理中数字健康干预的设计,核心贡献是可复制的、关注污名和隐私的负责任GenAI使用框架。

详情
AI中文摘要

在法律和监管背景下的用户体验研究(UXR)面临独特挑战,需要专门的方法来保护弱势群体,同时产生可操作的见解。数字咨询、预约和药物配送平台在扩展护理可及性方面显示出前景;然而,它们的实际有效性因缺乏充分考虑到这些人群心理社会状况的、基于理论的用户体验研究方法论而受到限制。本文介绍了一种生成式AI增强的UXR方法论,基于UXR视角(PoV)剧本,指导为尼日利亚感染HIV/AIDS的男男性行为者(MSM)和跨性别者设计心理安全、低认知负荷的数字健康干预措施。基于涉及协同设计工作坊、主题分析和需求工程的实证研究,该方法论通过一个四阶段UXR过程实现,包括AI支持的假设生成、基础规划、通过构建模块生成洞察以及构建利益相关者特定的PoV叙述。该过程产生了十张理论驱动的UXR游戏卡,将心理机制和实证发现转化为可操作的设计指导。每张游戏卡包含可操作的任务、AI增强的方法和针对边缘化人群研究的伦理护栏。输出是一套十张理论驱动的UXR游戏卡,将心理洞察和实证证据转化为可操作的设计指导。核心贡献是一个可复制的、关注污名和隐私的框架,用于在UXR实践中负责任地使用GenAI,推进边缘化社区的人本数字健康设计。

英文摘要

User Experience Research (UXR) in a legal and regulatory contexts presents unique challenges that require specialised approaches to protect vulnerable populations whilst generating actionable insights. Digital consultation, appointment booking, and medication delivery platforms show promise for extending care access; however, their real-world effectiveness is curtailed by an absence of theoretically grounded user experience research (UXR) methodologies that adequately account for the psychosocial conditions of these populations. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to guide the design of psychologically safe, low-cognitive-load digital health interventions for MSM and transgender individuals living with HIV/AIDS in Nigeria. Drawing from empirical research involving co-design workshops, thematic analysis, and requirements engineering, the methodology is operationalised through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in ten theory-informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. Each play contains actionable tasks, AI-augmented approaches, and ethical guardrails tailored for research with marginalised populations. The output is a set of ten theory-informed UXR Play Cards translating psychological insight and empirical evidence into actionable design guidance. The core contribution is a replicable, stigma-aware, and privacy-centred framework for responsible GenAI use in UXR practice, advancing human-centred digital health design for marginalised communities.

2605.31131 2026-06-01 cs.HC cs.AI 版本更新

UXR PoV for Neuroinclusive Emotion Regulation

神经包容性情绪调节的用户体验研究观点

Melike Akca, Mona Giff, Deniz Cetinkaya, Huseyin Dogan, Stephen Giff

发表机构 * School of Computing and Engineering, Bournemouth University(伯恩茅斯大学计算机与工程学院) Google Redmond, Washington, USA(谷歌红mond分公司)

AI总结 本文提出一种生成式AI增强的用户体验研究方法,结合DBT、SDT和COM-B理论框架,通过四阶段流程生成十张UXR游戏卡,为ADHD成人设计神经包容性的数字情绪调节干预。

详情
AI中文摘要

注意缺陷/多动障碍(ADHD)是一种精神疾病,表现为个体在注意力不集中、多动和冲动方面的发展不适当模式,并在决策和情绪调节(ER)方面存在困难。尽管基于数字和人工智能的干预措施扩大了情绪调节支持的获取途径,但许多现有系统仍受限于理论整合薄弱、对神经多样性的适应不足以及缺乏将心理学洞察与设计实践相结合的结构化用户体验研究(UXR)方法。本文介绍了一种生成式AI增强的UXR方法,以UXR观点(PoV)剧本为基础,支持为ADHD成人设计具有情感智能和神经包容性的数字情绪调节干预。该方法将实证证据与既定心理学框架——辩证行为疗法(DBT)、自我决定理论(SDT)和COM-B行为模型相结合,并利用生成式AI作为协同分析工具,支持综合、假设形成和设计阐述。该方法通过四阶段UXR流程实施,包括AI支持的假设生成、基础规划、通过构建模块生成洞察以及构建利益相关者特定的PoV叙事。该流程产生了一套十张理论驱动的UXR游戏卡,将心理机制和实证发现转化为可操作的设计指导。本研究的主要贡献是一个可复制的、具有偏差意识的框架,用于将生成式AI整合到UXR实践中,推进数字心理健康设计中以人为本和神经包容性的方法。

英文摘要

Attention-deficit/hyperactivity disorder (ADHD) is a psychiatric disorder which presents itself in individuals through patterns of developmentally inappropriate levels of inattentiveness, hyperactivity, and impulsivity, with difficulties in decision making and emotional regulation (ER). Although digital and AI-based interventions have expanded access to ER support, many existing systems remain limited by weak theoretical integration, insufficient accommodation of neurodiversity, and a lack of structured user experience research (UXR) methodologies, that bridge psychological insight with design practice. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to support the design of emotionally intelligent and Neuroinclusive digital ER interventions for adults with ADHD. The approach integrates empirical evidence with established psychological frameworks Dialectical Behaviour Therapy (DBT), Self-Determination Theory (SDT), and the COM-B behavioural model and leverages Generative AI as a co-analytic tool to support synthesis, hypothesis formation, and design articulation. The methodology is operationalized through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in a set of ten theory informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. The primary contribution of this work is a replicable, bias-aware framework for integrating Generative AI into UXR practice, advancing human-centred and Neuroinclusive approaches to digital mental health design.

2605.31126 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Not All Synthetic Data Is Yours to Learn From

并非所有合成数据都适合学习

Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang

发表机构 * ECE Department(电子工程系) Apple(苹果公司) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Rice University(里奇大学)

AI总结 研究无提示、无教师、无验证器、无奖励模型的自训练中,语言模型能否从自身生成的文本中学习,发现合成数据与学生之间的兼容性是关键,并揭示了能力与逐字记忆可分离的现象。

详情
AI中文摘要

语言模型能否从自身采样的纯文本中改进,无需提示、教师、验证器或奖励模型?可以,但仅当合成语料库与学生兼容时,这是一种源-学生对的关联属性,而非数据的内在属性。我们称之为潜在能力重现假说:弱自训练可以放大预训练模型中已有的能力,但仅在这种兼容条件下。我们在无提示无条件自训练的最小设置中研究这一点,其中基础语言模型仅在BOS令牌生成的文本上进行微调,没有任务规范或外部监督。我们报告三个发现。首先,合成效用是关联的而非内在的:自生成数据是最有效的来源,同源迁移优于更强但不同来源的训练,跨家族迁移显著较弱。其次,常见的内在代理失效:基准级别的语义相似性和学生下的平均每令牌似然都不能预测哪些语料库有帮助。第三,这种机制产生了一个令人惊讶的副产品。在受控的Pythia实验中,能力和逐字记忆解耦:基准效用得以保留或改善,而保留的精确匹配提取下降超过95%,无需遗忘集、隐私目标或针对性遗忘。总之,这些结果表明,无提示自训练通过放大学生已知的内容来工作,而不是从数据中导入结构。它们还揭示了一种无需任何显式遗忘目标即可分离能力和逐字记忆的机制。

英文摘要

Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.

2605.31121 2026-06-01 cs.RO cs.AI 版本更新

TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues

TARIC: 语义线索中断下基于记忆增强的可通行性感知户外视觉语言导航

Tianle Zeng, Hanjing Ye, Jianwei Peng, Jingwen Yu, Hanxuan Chen, Hong Zhang

发表机构 * Shenzhen Key Laboratory of Robotics and Computer Vision(深圳机器人与计算机视觉重点实验室) Southern University of Science and Technology(南方科技大学) CKS Robotics Institute(CKS机器人研究所) Hong Kong University of Science and Technology(香港科技大学) College of Electrical and Information Engineering(电气与信息工程学院)

AI总结 针对户外视觉语言导航中语义线索中断导致导航退化的问题,提出统一框架,通过可通行性一致的执行引导和不确定性感知的3D线索记忆,在长时间无线索阶段维持稳定导航,在四足和轮式平台上成功率提升显著。

详情
AI中文摘要

户外视觉语言导航(VLN)在远程、开放世界环境中经常受到语义线索中断的干扰,此时信息性目标线索变得稀疏、被遮挡或离开视野。一旦此类线索消失,智能体进入无线索阶段,并常退化为回溯、振荡航向或盲目探索。虽然基于记忆的方法试图弥合这些间隙,但在可通行性驱动的绕行中常常失败:记忆中的线索方向可能不可行,迫使绕行延长无线索阶段,并逐渐使机器人中心的线索过时、隐式历史模糊。这使得可通行性成为维持目标导向引导的稳定性条件,而不仅仅是局部安全问题。 我们提出一个统一的户外VLN框架,通过在长时间无线索阶段维持可通行性一致的可执行引导来应对语义线索中断。具体来说,我们的方法从可见性门控的目标或探索线索中提取语义方位,并利用实时近场可通行性轮廓将其接地为可执行航向,提供超越仅拒绝安全过滤的目标一致可行引导。为防止绕行期间引导退化,我们将间歇性2D证据提升为世界对齐的3D线索记忆,并配备不确定性感知读出机制,确保引导在机器人移动时持续可达且稳定。 我们在四足和轮式平台上评估该框架,路线长度为600-1000米。我们的方法在模拟中成功率比最强基线提高超过10个百分点,真实世界成功率达到40%,而最强基线为17.5%,且在长时间无线索间隔中具有显著更高的鲁棒性。

英文摘要

Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, where informative goal cues become sparse, occluded, or leave the field of view. Once such cues disappear, agents enter a cue-free phase and often degrade into backtracking, oscillatory headings, or aimless exploration. While memory-based methods attempt to bridge these gaps, they often fail under traversability-driven detours: the remembered cue direction may be infeasible, forcing detours that prolong cue-free phases and gradually render robot-centric cues stale and implicit histories blurred. This makes traversability a stability condition for maintaining goal-directed guidance, rather than merely a local safety concern. We propose a unified outdoor VLN framework that survives semantic-cue interruptions by maintaining traversability-consistent executable guidance throughout prolonged cue-free phases. Specifically, our method extracts semantic bearings from visibility-gated goal or exploration cues and grounds them into executable headings using a real-time near-field traversability profile, providing goal-consistent feasible guidance beyond reject-only safety filtering. To prevent guidance degradation during detours, we lift intermittent 2D evidence into a world-aligned 3D cue memory with an uncertainty-aware readout mechanism, ensuring guidance remains continuously reachable and stable as the robot moves. We evaluate the framework on quadrupedal and wheeled platforms over 600--1000 m routes. Our method improves simulation success rate by over 10 percentage points over the strongest baseline and achieves a real-world success rate of 40%, compared to 17.5% for the strongest baseline, with substantially higher robustness during prolonged cue-free intervals.

2605.31120 2026-06-01 cs.GR cs.AI cs.LG 版本更新

SWIM: Single-Instance Whole-Body Imitation for swiMming

SWIM: 用于游泳的单实例全身模仿

Binglun Wang, Edmond S. L. Ho, He Wang

发表机构 * University College London(伦敦大学学院) University of Glasgow(格拉斯哥大学)

AI总结 提出一种基于物理的游泳动作合成方法SWIM,通过单实例模仿学习实现全身协调与流体连续交互,在数据效率、稳定性、鲁棒性和泛化性上优于现有方法。

详情
AI中文摘要

我们提出了一种合成基于物理的游泳动作的新方法。基于物理的角色动画旨在生成物理有效、可控且自然的动作,能够应对意外干扰,其中难度的一个决定性因素是任务的复杂性,尤其是与所需环境交互的复杂程度。现有研究已在静态和动态环境中的各种任务上取得成功。我们进一步将难度推向游泳,这需要全身协调和与流体的持续交互,这是与环境交互时的一个新复杂性层次。这种复杂性在学习控制时面临挑战,包括在易变的环境力下的控制学习、将控制泛化到不同环境和游泳风格、缺乏数据参考,以及在控制学习过程中不可避免的极其缓慢的物理模拟。为此,我们提出了SWIM,一种新的游泳动作模仿方法,它可以从单个游泳动作中学习,并泛化到未见过的环境、身体条件和游泳风格。广泛的评估和比较表明,SWIM具有数据效率高、稳定、鲁棒和可泛化的特点,在多个任务类别和指标上优于替代方法。

英文摘要

We propose a new method for synthesizing physically-based swimming motions. Physically-based character animation aims to generate physically valid, controllable, and natural-looking motions which can respond to unexpected disturbances, where one dictating factor of difficulty is the complexity of the task, especially the level of sophistication of the required interactions with the environment. Existing research has succeeded in various tasks in static and dynamic environments. We push the difficulty further to swimming, which requires full-body coordination and continuous interactions with fluids, a new level of complexity when it comes to interacting with the environment. This complexity imposes challenges in learning control under volatile environmental forces, generalizing control to different environments and swimming styles, lack of data references, and prohibitively slow physical simulation which is inevitable during control learning. To this end, we propose SWIM, a new imitation method for swimming motions, which can learn from a single swimming motion and generalize to unseen environments, body conditions, and swimming styles. Extensive evaluation and comparison demonstrate that SWIM is data-efficient, stable, robust, and generalizable, outperforming alternative methods across multiple classes of tasks and metrics.

2605.31100 2026-06-01 cs.AI cs.DB cs.IR 版本更新

Vector Linking via Cross-Model Local Isometric Consistency

通过跨模型局部等距一致性的向量链接

Ziying Chen, Yang Cao, He Sun, Beining Yang, Tianjian Yang

发表机构 * School of Informatics, University of Edinburgh, Edinburgh, United Kingdom(爱丁堡大学信息学院,爱丁堡,英国) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China(深圳先进技术研究院,深圳,中国)

AI总结 提出一种基于局部几何一致性的迭代参考几何嵌入哈希方法,从少量种子锚点恢复跨模型向量对应关系,实现准确鲁棒的向量链接。

Comments Accepted at ICML 2026

详情
AI中文摘要

我们研究向量链接:给定由不同黑盒编码器在部分重叠数据集上生成的两个嵌入云,仅使用向量恢复跨模型对象对应关系。实验和理论上表明,独立训练的对比编码器表现出局部几何一致性:短距离近似保持(按比例因子),而长距离因模型特定失真而不保持。基于此,我们提出一种迭代的、基于参考的几何嵌入哈希方法,从微小的种子锚点集恢复向量链接。它通过到采样配对锚点的距离表示每个向量,通过哈希空间匹配提出候选链接,并在Beta-Bernoulli后验中跨视图聚合证据,以引导高置信度链接作为新锚点。在多个基准测试和嵌入模型对上的实验表明,该方法在不同重叠度、种子预算和域外锚点下实现准确且鲁棒的链接,并应用于向量数据库集成和跨模型聚类。代码见https://github.com/DBgroup-Edinburgh/VecLinking。

英文摘要

We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not due to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta-Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain anchors, with applications to vector database integration and cross-model clustering. Code is available at https://github.com/DBgroup-Edinburgh/VecLinking.

2605.31099 2026-06-01 cs.CL cs.AI 版本更新

KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning

KnowledgeGain: 评估和优化面向读者学习的科学新闻生成

Dominik Soós, Meng Jiang, Jian Wu

发表机构 * Old Dominion University(旧 Dominion 大学) University of Notre Dame(诺特大学)

AI总结 提出KnowledgeGain指标,通过测量读者知识增益来评估科学新闻质量,并利用LLM模拟器优化生成,提升读者学习效果。

详情
AI中文摘要

科学新闻是研究界与公众之间传播发现的重要媒介。然而,大多数用于生成或摘要文本的指标评估语义相似性和事实一致性,但并未衡量读者从新闻中学到了多少知识。我们引入了KnowledgeGain,这是一个通过测量读者阅读后获得的知识量来评估科学新闻质量的指标。为了评估该指标,我们首先进行了一项受控人类研究,表明该指标成功捕捉了人类读者阅读不同类型科学媒体时获得的知识差异。这些数据使我们能够校准一个仅基于提示的LLM读者模拟器。我们用它来在人类评估之前对候选文章进行排序和过滤。第二项人类研究表明,使用该模拟器选择的文章在阅读后准确性和标准化KnowledgeGain上均优于强生成基线。我们的工作是朝着生成更符合Bloom分类法知识和理解目标的科学新闻迈出的一步。

英文摘要

Science news is an important medium to communicate discoveries between the research communities and the public. Yet, most metrics for generated or summarized text evaluate semantic similarity and factual consistency, but do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gained after reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric successfully captures the differential knowledge gained by human readers reading different types of science media. The data allowed us to calibrate a prompt-only LLM reader simulator. We use it to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.

2605.31097 2026-06-01 cs.DB cs.AI 版本更新

SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition

SpecDB: 通过面向特征的分解生成LLM定制的数据库

Yunkai Lou, Longbin Lai, Shunyang Li, Zhengping Qian, Ying Zhang

发表机构 * Alibaba Group(阿里巴巴集团) Zhejiang Gongshang University(浙江工商大学)

AI总结 提出SpecDB系统,利用大语言模型通过面向特征的分解和依赖图DBGraph,从自然语言工作负载描述自动生成定制化关系数据库,在TPC-C测试中达到与PostgreSQL和MySQL相当的性能,代码量仅为它们的3%。

详情
AI中文摘要

主流关系数据库在部署时提供统一的特征集,尽管单个工作负载只使用可用子系统的一小部分。我们研究是否可以根据目标工作负载按需生成具有匹配特征集的数据库。我们提出SpecDB,一个使用大语言模型(LLM)合成定制化关系数据库的系统。我们调查了9个生产系统,并将其分解为10个功能模块,每个模块进一步划分为实现变体。为了捕获跨模块依赖关系,包括不相交子树中的实现必须协同设计的情况,我们采用FODA特征模型,并用合作边扩展它,得到依赖图DBGraph。SpecDB通过分层模块构建流水线来操作DBGraph,其中每个模块由专门的子代理(由三个内部代理驱动:主代理、测试代理、架构代理)生成、验证和集成,以及一个精炼代理,该代理根据用户提供的精炼工具(对现有数据库源代码具有只读访问权限)迭代修复和调整组装的数据库。配套的选择组件将自然语言工作负载描述转换为一组实现变体,提供从工作负载描述到可部署数据库的端到端流水线。我们在TPC-C上使用BenchmarkSQL评估SpecDB。生成的数据库(23,779行Rust代码)在1个和10个仓库下完成了60分钟的TPC-C测试,零错误。在10个仓库下,它达到tpmC=130,而PostgreSQL为128,MySQL为127,延迟相当,代码量约为它们的3%。由于代理在模块规范级别而非产品源代码级别操作,它原则上可以跨系统边界组合技术。随着LLM成本的下降,为目标工作负载生成专用数据库正变得简单。

英文摘要

Mainstream relational databases ship a uniform feature set across deployments, although individual workloads exercise only a fraction of the available subsystems. We investigate whether a database can instead be generated on demand with a feature set matched to the target workload. We present SpecDB, a system that uses large language models (LLMs) to synthesize customized relational databases. We survey 9 production systems and decompose them into 10 functional modules, each further divided into implementation variants. To capture cross-module dependencies, including cases where implementations in disjoint subtrees must be co-designed, we adopt the FODA feature model and extend it with a cooperate edge, yielding a dependency graph DBGraph. SpecDB operationalizes DBGraph through a layered module-construction pipeline in which each module is generated, validated, and integrated by a dedicated subagent (driven by three inner agents: Main, Tester, Architect), and a Refining Agent that iteratively repairs and tunes the assembled database against a user-supplied refining harness with read-only access to existing database source code. A companion selection component translates a natural-language workload description into a set of implementation variants, providing an end-to-end pipeline from workload description to deployable database. We evaluate SpecDB on TPC-C with BenchmarkSQL. The generated database (23,779 lines of Rust) completes 60-minute TPC-C at 1 and 10 warehouses with zero errors. At 10 warehouses it reaches tpmC=130, compared to 128 for PostgreSQL and 127 for MySQL, with comparable latency at ~3% of their code size. Because the agent operates at module-specification level rather than product source, it can in principle combine techniques across system boundaries. Paired with falling LLM costs, generating a purpose-built database for a target workload is becoming straightforward.

2605.31094 2026-06-01 cs.CV cs.AI 版本更新

Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation

重新定义实例匹配:全景分割评估中部件感知匹配的统一框架

Erik Großkopf, Soumya Snigdha Kundu, Hendrik Möller, Nicolas Münster, Mehdi Astaraki, Paula Tamara Buzduga, Kerstin Ritter, Benedikt Wiestler, Jan Kirschke, Jonathan Shapey, Tom Vercauteren, Florian Kofler

发表机构 * Hertie Institute for AI in Brain Health, University of Tübingen, Germany(人工智能与脑健康研究所,图宾根大学,德国) King’s College London, UK(伦敦国王学院,英国) Technical University of Munich, Germany(慕尼黑技术大学,德国) Stockholm University, Sweden(斯德哥尔摩大学,瑞典)

AI总结 提出将全景分割中的片段匹配重新表述为约束二分分配问题,定义四种匹配策略,并扩展至部件感知评估,发布基于Panoptica的统一开源包。

Comments 9 pages, 4 figures

详情
AI中文摘要

全景质量(PQ)度量是联合评估实例分割和语义分割的标准。然而,其原始定义依赖于预测片段和真实片段之间的一对一匹配,只有当IoU阈值超过0.5时才是直接的。低于0.5时,在一个探索不足的问题空间中会出现多种匹配策略。我们通过将片段匹配重新表述为约束二分分配问题,系统地阐明了这个空间。独立地约束预测端和真实端的度数,产生了四种匹配策略:一对一、多对一、一对多和多对多。我们表明,前三种在PQ框架内是良好定义的,而多对多则超出其范围。当实例被碎片化、相邻物体难以划分或标注有噪声时,这些策略变得相关。我们框架的核心是基于顶点的TP、FN和FP计数,锚定于真实片段和预测片段,而不是匹配边。我们进一步表明,该框架自然地扩展到部件感知全景分割,并在生物医学数据上探索了部件感知评估。在可配置的案例研究中,我们报告了不同阈值和匹配策略组合在实际中的表现。我们发布了一个基于Panoptica的统一开源包,它暴露了基于Voronoi的区域分析、部件感知评估和阈值下曲线面积作为可配置选项。

英文摘要

The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction- and ground-truth-side degrees yields four matching strategies: One-to-One, Many-to-One, One-to-Many, and Many-to-Many. We show that the first three are well-defined within the PQ framework, while Many-to-Many falls outside it. These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex-based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part-aware panoptic segmentation, and we explore part-aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open-source package built on Panoptica. It exposes Voronoi-based region-wise analysis, part-aware evaluation, and Area Under Threshold Curve computations as configurable options.

2605.31090 2026-06-01 cs.CV cs.AI 版本更新

On Revisiting Entropy for Identifying Mislabeled Images

重新审视熵在识别错误标注图像中的应用

Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

发表机构 * MedAI Technology (Wuxi) Co. Ltd., Wuxi, China(MedAI技术(无锡)有限公司,无锡,中国) Sichuan University, Chengdu, China(四川大学,成都,中国) University of Basel, Allschwil, Switzerland(巴塞尔大学,阿勒西维尔,瑞士) Technical University of Munich, Munich, Germany(慕尼黑技术大学,慕尼黑,德国)

AI总结 提出基于训练动态的有符号熵积分(SEI)统计量,通过捕捉预测熵的幅度和时间趋势,有效识别训练集中的错误标注样本,在医学影像数据集上达到最优性能。

Comments ICML 2026

详情
AI中文摘要

训练数据集中的错误标注样本会严重降低深度网络的性能,因为过参数化模型倾向于记忆错误标签。我们通过提出一种利用训练动态的错误标注数据检测新方法来应对这一挑战。我们的方法基于一个关键观察:正确标注的样本在训练过程中熵持续下降,而错误标注的样本在整个训练过程中保持相对较高的熵。基于这一见解,我们引入了一个有符号熵积分(SEI)统计量,它捕捉了训练周期中预测熵的幅度和时间趋势。SEI广泛适用于分类网络,并且在与对比语言-图像预训练(CLIP)架构集成时表现出特别的有效性。通过在四个医学影像数据集(由于诊断复杂性,该领域特别容易受到标注错误的影响)上进行涵盖不同模态和病理的广泛实验,我们证明SEI在错误标注数据识别中达到了最先进的性能,在保持计算效率和实现简单性的同时优于现有方法。我们的代码可在 https://github.com/MedAITech/SEI 获取。

英文摘要

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets -- a domain particularly susceptible to labeling errors due to diagnostic complexity -- spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.

2605.31080 2026-06-01 cs.MM cs.AI cs.CL cs.CV cs.HC 版本更新

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

策展人引导的多语言艺术描述对盲人和低视力观众的小型视觉语言模型试点研究

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller

发表机构 * Technical University of Munich(慕尼黑技术大学) Foundation for Research and Technology -- Hellas(希腊研究与技术基金会) National University of Science and Technology Politehnica Bucharest(布加勒斯特政治技术科学与技术国家大学)

AI总结 本研究使用小型视觉语言模型Qwen2.5-VL-3B-Instruct,通过策展人引导的方式为盲人和低视力观众生成德语、罗马尼亚语和塞尔维亚语的多语言艺术描述,发现语言特定适配器在控制性和视觉基础描述质量上优于多语言适配器。

Comments 7 pages, 2 figures, 3 tables. Preprint

详情
AI中文摘要

盲人和低视力(BLV)观众在视觉艺术描述方面仍然服务不足,尤其是在跨语言和博物馆环境中,隐私和知识产权限制可能倾向于使用小型本地视觉语言模型(VLM)。本试点研究使用Qwen2.5-VL-3B-Instruct,针对德语、罗马尼亚语和塞尔维亚语,调查了策展人引导的多语言艺术描述。我们从艺术品图像和元数据构建了一个平行的BLV导向字幕语料库,并在固定骨干网络和训练预算下,比较了语言特定的LoRA适配器与单个多语言适配器。评估结合了自动词汇和基于嵌入的指标,以及针对小型罗马尼亚BLV试点研究校准的LLM作为评判协议。在我们的试点设置下,语言特定适配器在罗马尼亚语和塞尔维亚语上表现出更稳定的可控性和视觉基础描述质量,而多语言适配器在德语上仍具有竞争力。我们将这些发现视为小型本地VLM的部署导向证据,并强调在得出关于多语言可访问性的总体结论之前,需要进行更大规模的BLV用户研究和更广泛的语言覆盖。

英文摘要

Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.

2605.31065 2026-06-01 eess.SP cs.AI 版本更新

DRIFT: Joint Channel Estimation and Prediction Towards Pilotless 6G Non-Terrestrial Networks

DRIFT:面向无导频6G非地面网络的联合信道估计与预测

Bruno De Filippo, Carla Amatetti, Alessandro Vanelli-Coralli

发表机构 * Department of Electrical, Electronic, and Information Engineering (DEI), Univ. of Bologna(电子、电气与信息工程系(DEI),博洛尼亚大学)

AI总结 针对6G低轨卫星网络中导频开销大和星载计算受限的问题,提出一种轻量级联合信道估计与预测框架DRIFT,通过仅在初始时隙发送导频并利用数据驱动处理后续时隙,在低计算复杂度下实现高达12%的频谱效率提升。

Comments Submitted for publication

详情
AI中文摘要

非地面网络(NTN)有望通过实现无处不在的连接和大规模通信,在第六代(6G)系统中发挥关键作用。在此背景下,信道预测成为一项关键技术,通过限制导频开销来提高频谱利用效率。然而,许多基于人工智能(AI)的预测器具有高推理复杂度,给星载实现带来挑战。本文针对低地球轨道(LEO)NTN,在严格功率约束限制模型复杂度的情况下,设计了精确且计算高效的信道预测技术,以实现频谱效率增益。我们提出了一种面向6G NTN的迭代联合信道估计与预测框架,通过仅在初始时隙传输导频,并在后续时隙依赖数据驱动处理,显著降低了导频开销。我们引入了DRIFT(无线信道跟踪的数据驱动细化与迭代预测),这是一种轻量级架构,以低计算成本和减少的误差传播来细化数据辅助的信道估计并预测未来的信道频率响应。研究了基于卷积层和长短期记忆层的两种预测器变体。在上行链路LEO NTN场景的端到端仿真中,结果表明,与传统基于导频的系统相比,所提方法实现了高达12%的频谱效率增益,对训练-测试不匹配具有鲁棒性,并在不同信道模型下保持一致的性能。此外,DRIFT所需的乘加运算少于20万次,使其适用于严格功率约束下的星载实现。

英文摘要

Non-terrestrial networks (NTNs) are expected to play a pivotal role in sixth-generation (6G) systems by enabling ubiquitous connectivity and massive communication. In this context, channel prediction emerges as a key technique to improve the spectrum utilization efficiency by limiting the pilot overhead. However, many proposed predictors based on artificial intelligence (AI) are characterized by high inference complexity, posing challenges to onboard implementation. In this paper, we address the challenge of designing accurate yet computationally efficient channel prediction techniques tailored to low Earth orbit (LEO) NTNs, where strict power constraints limit model complexity, to enable spectral efficiency gains. We propose an iterative joint channel estimation and prediction framework in the context of 6G NTNs that significantly reduces pilot overhead by transmitting pilots only in the initial slot and relying on data-driven processing for subsequent slots. We introduce Data-driven Refinement and Iterative Forecast for wireless channel Tracking (DRIFT), a lightweight architecture that refines data-aided channel estimates and predicts future channel frequency responses with low computational cost and reduced error propagation. Two predictor variants based on convolutional and long short-term memory layers are investigated. Simulation results in an end-to-end simulation of an uplink LEO NTN scenario show that the proposed approach achieves up to 12% spectral efficiency gain compared to conventional pilot-based systems, with robustness to training-test mismatches and consistent performance across different channel models. Moreover, DRIFT requires fewer than 200k multiply-accumulate operations, making it suitable for on-board satellite implementation under stringent power constraints.

2605.31064 2026-06-01 cs.IR cs.AI 版本更新

Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA

通过数据为中心的编译对抗在线金融问答中的数值幻觉

Hao Chen, Xing Tang, Qirui Liu, Weijie Shi, Shiwei Li, Fuyuan Lyu, Weihong Luo, Xiku Du, Xiuqiang He

发表机构 * Shenzhen Technology University(深圳科技大学) FiT, Tencent(腾讯金融科技部) South China University of Technology(华南理工大学) The Hong Kong University of Science and Technology(香港科学与技术大学) Huazhong University of Science and Technology(华中科技大学) McGill University(麦吉尔大学)

AI总结 提出数据为中心推理编译器(DCRC),通过对抗数据构建、多阶段训练和编译执行推理流程,解决在线金融问答中检索增强生成面临的噪声敏感、计算脆弱和可审计性危机,实现可靠的数值推理。

Comments Accepted by KDD 2026 ADS track

详情
AI中文摘要

大型语言模型(LLMs)显著推进了在线数据服务,特别是在金融问答(FinQA)领域。然而,此类系统仍然容易受到数值推理幻觉的影响,这在高风险金融应用中严重损害了可靠性。尽管检索增强生成(RAG)已被广泛采用以将响应基于外部知识,但它引入了三个持续挑战:噪声敏感性、计算脆弱性和可审计性危机。现有的以模型为中心的方法主要侧重于单独优化检索器或生成器,仍然难以以集成方式解决这些问题。在这项工作中,我们开创了一种以数据为中心的范式,并提出了一个新颖的框架——数据为中心推理编译器(DCRC)。该框架通过三个连贯的阶段运作:(1)对抗数据构建,合成带有受控噪声的训练示例以教授鲁棒性;(2)多阶段训练,培养一个能够进行显式证据审计和程序合成的数据为中心结构化代理(DSA);(3)编译并执行推理过程,其中DSA将用户查询和检索到的文档转换为可验证、可执行的推理程序。这种数据驱动的框架通过设计确保了忠实的数值推理。我们在已建立的离线基准上进行了大量实验,并通过在实际在线金融问答系统中的部署进一步验证了我们的框架。

英文摘要

Large Language Models (LLMs) have significantly advanced online data services, particularly in the domain of financial question answering (FinQA). However, such systems remain susceptible to numerical reasoning hallucinations, which critically undermine reliability in high-stakes financial applications. Although retrieval-augmented generation (RAG) has been widely adopted to ground responses in external knowledge, it introduces three persistent challenges: noise sensitivity, calculation fragility, and an auditability crisis. Existing model-centric approaches, which primarily focus on optimizing either the retriever or generator in isolation, still struggle to address these issues in an integrated manner. In this work, we pioneer a data-centric paradigm and propose a novel framework, the Data-centric Reasoning Compiler (DCRC). The framework operates through three cohesive phases: (1) adversarial data construction, which synthesizes training examples with controlled noise to teach robustness; (2) multi-stage training that cultivates a Data-centric Structuring Agent (DSA) capable of explicit evidence auditing and program synthesis; and (3) a compile-and-execute inference process, where the DSA transforms user queries and retrieved documents into verifiable, executable reasoning programs. This data-driven framework ensures faithful numerical reasoning by design. We conduct extensive experiments on established offline benchmarks and further validate our framework through deployment in a real-world online financial QA system.

2605.31061 2026-06-01 cs.LG cs.AI 版本更新

STEP: Learning STructured Embeddings for Progressive Time Series

STEP:学习渐进时间序列的结构化嵌入

Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet

发表机构 * LIX, École Polytechnique(高等理工学院LIX) IRT SystemX(系统X研究院) Safran Tech(萨弗兰科技)

AI总结 提出一种自监督对比学习方法,通过构建具有固定正交原型向量的低维流形几何结构,实现渐进时间序列的端状态预测、多步预测和可解释相位分离。

详情
AI中文摘要

我们提出了一种新颖的方法,用于学习渐进时间序列的可解释表示,即捕获不可逆状态转换(如退化或任务完成)的数据。我们的方法使用自监督对比目标来学习低维潜在空间,其几何结构本身就是解释:每个观测成为位于两个固定正交原型向量之间的流形上的一个点,轨迹成为穿过该流形的路径。从这种结构中,我们读取一个潜在指南针,即潜在向量的极坐标(θ, r),其中θ跟踪潜在状态的进展(例如,从健康到故障),r识别活动模式(例如,操作条件),无需任何代理标签。我们在不同领域(包括工业退化、机器人任务和神经活动)上评估了该方法与最先进方法的对比,验证了三个关键能力:(1)端状态预测,(2)多步预测,以及(3)可解释的相位分离。我们的方法在所有方面匹配或优于黑盒对应方法,同时提供对底层机制的透明性。在潜在指南针坐标之上的简单线性回归器与深度架构具有竞争力,这是底层状态以几何可访问形式编码的直接定量证据。

英文摘要

We present a novel method for learning interpretable representations of progressive time series, that is, data capturing irreversible state transitions such as degradation or task completion. Our approach uses a self-supervised contrastive objective to learn a low-dimensional latent space whose geometry is itself the interpretation: each observation becomes a point on a manifold anchored between two fixed orthogonal prototype vectors, and a trajectory becomes a path across that manifold. From this structure we read a latent compass, the polar coordinates (θ, r) of the latent vector, in which θ tracks the progression of the underlying state (e.g., from healthy to failed) and r identifies the active mode (e.g., the operating condition), without any proxy labels. We evaluate the approach against the state of the art on diverse domains, including industrial degradation, robotic tasks, and neural activity, validating three key capabilities: (1) end-state prediction, (2) multi-step forecasting, and (3) interpretable phase separation. Our method matches or improves over black-box counterparts on all of these while providing transparency about the underlying mechanisms. A simple linear regressor on top of the latent compass coordinates is competitive with deep architectures, direct quantitative evidence that the underlying state is encoded in a geometrically accessible form.

2605.31053 2026-06-01 cs.SD cs.AI 版本更新

AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing

AnchorSteer: 自发现概念注入用于结构保持的音乐编辑

Chih-Heng Chang, Keng-Seng Ho, Chih-Yu Tsai, Kuan-Lin Chen, Yi-Hsuan Yang, Jian-Jiun Ding

发表机构 * National Taiwan University(国立台湾大学)

AI总结 提出AnchorSteer框架,通过结构锚定与自发现语义注入解耦语义-结构纠缠,实现高保真结构保持下的显著语义变换。

Comments Accepted by the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

可控音乐编辑旨在修改高级属性,同时严格保留节奏和旋律结构。然而,这一任务面临语义-结构纠缠的挑战:引导方法往往为了编辑性能而牺牲结构,而结构适配器则抑制语义响应。我们提出AnchorSteer,一个通过将结构锚定与自发现语义引导耦合来解耦这种张力的框架。该方法通过自监督重构目标探测内部表示,提取可解释、无标签的概念向量,无需精心策划的数据即可隔离属性。在编辑过程中,这些便携、即插即用的概念向量被注入扩散隐空间,同时结构适配器强制执行一致性。提供了无条件和条件注入的变体,以平衡鲁棒性和语义强度。在ZoME-Bench和主观测试上的实验表明,所提出的框架优于纯引导和纯锚定的基线,实现了高保真结构保持下的显著语义变换。

英文摘要

Controllable music editing is to modify high-level attributes while strictly preserving rhythmic and melodic structures. However, this task is challenged by a semantic-structural entanglement: steering methods often degrade structure to achieve editing performance, while structural adaptors suppress semantic responsiveness. We propose AnchorSteer, a framework that disentangles this tension by coupling structural anchoring with self-discovered semantic steering. The proposed approach probes internal representations to extract interpretable, label-free concept vectors via a self-supervised reconstruction objective, isolating attributes without curated data. During editing, these portable, plug-and-play concept vectors are injected into diffusion hidden manifolds while a structural adaptor enforces consistency. Variants for unconditioned and conditioned injections are provided to balance robustness and semantic strength. Experiments on ZoME-Bench and subjective tests show that the proposed framework outperforms both steering-only and anchoring-only baselines, enabling significant semantic transformations with high-fidelity structural preservation.

2605.31049 2026-06-01 cs.LG cs.AI cs.LO 版本更新

Learning to Solve and Optimize by Evolving Code

通过代码演化学习求解与优化

Veronika Semmelrock, Benedetta Strizzolo, Francesco Zuccato, Gerhard Friedrich, Patrick Rodler, Konstantin Schekotihin

发表机构 * University of Klagenfurt(克雷格福大学) University of Udine(乌迪大学)

AI总结 提出CHECKMATE工具,利用形式规范确保解的正确性并通过自然语言描述指导代码演化,自动生成算法,在配置与调度问题上超越最先进求解器。

Comments Preprint of a paper accepted to IJCAI26

详情
AI中文摘要

组合与优化问题是许多工业AI应用的基础。解决此类大规模现实世界实例通常需要仔细的问题形式化、专门的求解器以及专家设计的启发式方法。因此,专家不仅需要指定解是什么,还需要指定如何推导出解。通过引入工具CHECKMATE,我们展示了通过代码演化生成算法代表了一种范式转变,消除了制定如何的需求。CHECKMATE仅依赖于是什么。具体来说,形式规范确保了解的正确性,并能够对生成的程序进行系统性能评估,而自然语言描述则指导演化过程。我们的方法在两个工业领域(配置与调度)的选定问题上展示了有效性。在所有案例中,演化出的算法始终优于最先进的求解器。这凸显了形式方法在引导代码演化以自动解决复杂现实问题方面的潜力。

英文摘要

Combinatorial and optimization problems are fundamental to many industrial AI applications. Solving large-scale real-world instances of such problems typically requires careful problem formalization, specialized solvers, and expert-designed heuristics. Thus, experts need to specify not only what solutions are, but also how they are derived. By introducing the tool CHECKMATE, we show that algorithm generation via code evolution represents a paradigm shift by eliminating the need to formulate the how. CHECKMATE solely relies on the what. Specifically, a formal specification ensures solutions' correctness and enables systematic performance evaluation of the generated programs, while a natural language description guides the evolutionary process. The effectiveness of our method is demonstrated on selected problems from two industrial domains: configuration and scheduling. In all cases, the evolved algorithms consistently outperform state-of-the-art solvers. This underscores the potential of formal methods in guiding code evolution for automatically solving complex real-world problems.

2605.31043 2026-06-01 stat.ML cs.AI cs.LG 版本更新

Routing on the Stiefel Manifold: When Does Adaptive Subspace Selection Help for Cross-Domain EEG Decoding?

Stiefel流形上的路由:自适应子空间选择何时有助于跨域脑电解码?

Isabella Costa Maia, Pedro L. C. Rodrigues, Salem Said, Marco Congedo

发表机构 * GIPSA-lab, University Grenoble Alpes, CNRS, Grenoble-INP(GIPSA实验室,格勒诺布尔阿尔卑斯大学,法国国家科学研究中心,格勒诺布尔-INP) Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK(格勒诺布尔阿尔卑斯大学,法国国家信息与自动化研究所,法国国家科学研究中心,格勒诺布尔-INP,LJK) Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK(格勒诺布尔阿尔卑斯大学,法国国家科学研究中心,格勒诺布尔-INP,LJK)

AI总结 针对跨域脑电解码中协方差矩阵域偏移问题,提出动态Stiefel路由方法,通过Stiefel流形上的专家投影滤波器池和交叉注意力机制实现自适应子空间选择,并引入三种结构性质避免退化为集成平均,在三个数据集上取得一致提升。

详情
AI中文摘要

尽管黎曼深度学习取得了进展,跨域脑电解码仍然具有挑战性:来自不同受试者的协方差矩阵占据了SPD流形上系统不同的区域,然而现有的域适应方法要么需要目标域校准数据,要么学习无法跨域泛化的受试者特定组件。我们提出了动态Stiefel路由:在Stiefel流形上有一个包含$K$个专家投影滤波器的池,每个滤波器专门处理SPD流形上的不同区域,每个输入协方差通过交叉注意力路由到最合适的滤波器,从而为每个样本自适应调整子空间投影。一个核心发现是,这种朴素实现的方法会退化为集成平均:当路由权重均匀时,自适应滤波器恰好等价于专家的等贡献组合,与单个固定滤波器无法区分。三种结构性质打破了这种退化:一个对称锚点$W_{\mathrm{base}} \in \mathrm{St}(n,k)$消除了专家间的邻近偏差;一个冻结的域判别查询编码器将路由与任务优化解耦;以及一个解耦的键对齐损失,将专家键训练到稳定的域吸引子。它们共同产生了SPD流形上第一个真正承诺且域结构化的路由,在三个数据集上取得一致提升:平衡准确率分别从$0.773\to 0.823$、$0.757\to 0.809$和$0.801\to 0.839$,且对齐策略由单一数据驱动规则自动确定,无需数据集特定的超参数搜索。

英文摘要

Cross-domain EEG decoding remains challenging despite advances in Riemannian deep learning: covariance matrices from different subjects occupy systematically distinct regions of the SPD manifold, yet existing domain adaptation methods either require target-domain calibration data or learn subject-specific components that cannot generalise across domains. We propose dynamic Stiefel routing: a pool of $K$ expert projection filters on the Stiefel manifold, each specialised for a different region of the SPD manifold, with each input covariance routed to the most appropriate filter via cross-attention, adapting the subspace projection per sample. A central finding is that this approach, implemented naively, provably collapses to ensemble averaging: when routing weights are uniform, the adaptive filter reduces exactly to an equal-contribution combination of experts, indistinguishable from a single fixed filter. Three structural properties break this degeneracy: a symmetric anchor $W_{\mathrm{base}} \in \mathrm{St}(n,k)$ that removes proximity bias among experts; a frozen domain-discriminative query encoder that decouples routing from task optimisation; and a decoupled key alignment loss that trains expert keys toward stable domain attractors. Together they produce the first genuinely committed and domain-structured routing on SPD manifolds, with consistent gains across three datasets: balanced accuracy improves from $0.773\to 0.823$, $0.757\to 0.809$, and $0.801\to 0.839$, with the alignment strategy determined automatically by a single data-driven rule and no dataset-specific hyperparameter search.

2605.31042 2026-06-01 cs.CR cs.AI cs.CL 版本更新

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

从提示注入到持久控制:防御智能体框架中的木马后门

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院 Gallagher 学院)

AI总结 本文提出ClawTrojan基准测试揭示本地智能体框架中的多步木马攻击,并设计DASGuard防御方法,通过扫描控制文本、追溯来源并清除不可信控制内容,实现动态防御。

Comments Code and data are available at https://github.com/RUC-NLPIR/ClawTrojan

详情
AI中文摘要

LLM智能体正在从对话式聊天机器人演变为实际工作空间中的操作工具。在本地智能体框架中,LLM可以读写文件、调用工具,并在会话间重用工作空间状态。虽然这些功能增强了实用性,但也为攻击者暴露了新的攻击面。攻击者可以将提示注入嵌入文件或工具输出中。智能体可能会读取这一隐藏指令,存储它,并在之后执行。在这种多步木马攻击范式中,没有任何单个步骤本身是恶意的,但这些步骤可以共同将不可信文本转化为持久控制内容。然而,现有防御通常孤立地检查每个步骤。因此,它们可以阻止明显的恶意行为,但无法检测到植入后门的早期写操作。为了揭示这一威胁,我们引入了ClawTrojan,一个旨在识别本地智能体框架中多步木马攻击的基准测试。在OpenClaw风格的模拟工作空间中,使用GPT-5.4,ClawTrojan达到了95.5%的攻击成功率(ASR),而同一模型上现有的单轮提示注入攻击产生的ASR接近零。为了解决这一威胁,我们提出了DASGuard,它扫描敏感本地文件中的控制类文本,追溯其来源,并清除非可信来源的控制内容。我们的结果表明,DASGuard通过结合运行时攻击阻断和对工作空间的清理提交,实现了强大的动态防御。

英文摘要

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.

2605.31041 2026-06-01 cs.CV cs.AI 版本更新

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

视觉信息在视觉-语言-动作模型驾驶行为中是否起决定性作用?

Jingtao He, Hongliang Lu, Xiaoyun Qiu, Yixuan Wang, Xinhu Zheng

发表机构 * Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou)(科技与交通智能 thrust,香港科学与技术大学(广州))

AI总结 本文提出结构化多级视觉扰动框架,系统分析VLA驾驶模型对视觉信息的依赖程度,揭示依赖模式随评估方式变化且在不同抽象层次上不均匀。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在自动驾驶中展现出令人期待的能力,凸显了统一多模态架构联合建模感知与规划的潜力。然而,当前基于VLA的驾驶行为如何植根于视觉信息仍知之甚少。现有评估协议主要关注聚合性能指标,缺乏结构化和实用的诊断方法来量化视觉-行为依赖性。在这项工作中,我们引入了一个结构化的多级视觉扰动框架,以系统分析基于VLA的驾驶模型中的视觉-行为依赖性。该框架沿着三个互补维度组织受控视觉扰动:通道级退化、信息级破坏和结构级修改。我们将其应用于基于VLA的驾驶系统,并在开环轨迹预测和交互式闭环安全评估下评估行为响应。实验揭示了依赖于评估的依赖模式以及跨抽象层次的不均匀视觉基础。这些发现呼吁对VLA驾驶模型进行更结构化的分析和原则性设计,以更好地理解视觉信息如何塑造行为,并开发更安全、更鲁棒的系统。

英文摘要

Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual information remains poorly understood. Existing evaluation protocols mainly focus on aggregate performance metrics, lacking structured and practical diagnostics to quantify visual-behavior dependency. In this work, we introduce a structured multi-level visual perturbation framework to analyze visual-behavior dependency in VLA-based driving models systematically. The framework organizes controlled visual perturbations along three complementary dimensions: channellevel degradation, information-level disruption, and structurelevel modification. We apply it to VLA-based driving systems and evaluate behavioral responses under both open-loop trajectory prediction and interactive closed-loop safety evaluation. Experimental results reveal evaluation-dependent dependency patterns and uneven visual grounding across abstraction levels. These findings call for more structured analyses and principled design of VLA driving models to better understand how visual information shapes behavior and develop safer, more robust systems.

2605.31034 2026-06-01 cs.LG cs.AI 版本更新

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

多臂贝叶斯老虎机中的退火Softmax贪婪算法

William Overman, Mohsen Bayati

发表机构 * Stanford University(斯坦福大学)

AI总结 本文研究退火Softmax贪婪算法在多臂贝叶斯伯努利老虎机中的贝叶斯遗憾,证明在先验满足线性上尾条件(β=1的β正则性)时,算法达到接近最优的贝叶斯遗憾率,并与RLVR方法形成结构类比。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)和基于组的策略优化方法(如GRPO)通过为每个提示采样多个完成并增加策略在奖励较高的完成上的概率来更新随机策略,同时通过KL惩罚向参考策略正则化。这些更新不包括追踪认知不确定性的显式机制。本文研究为何这种不确定性无关的更新仍然有效的一个风格化解释。我们分析了一个退火softmax(玻尔兹曼)策略,该策略在多臂贝叶斯伯努利老虎机中根据经验平均奖励的softmax选择动作。在先验满足线性上尾条件(β正则性的β=1情况)下,该条件意味着存在大量接近最优的臂,我们证明退火softmax贪婪算法实现了贝叶斯遗憾$ ilde{O}(m + T/m)$,特别地,当臂数$m = Θ(\sqrt{T})$时,遗憾为$ ilde{O}(\sqrt{T})$。这是该机制下接近最优的贝叶斯遗憾率,经验平均贪婪算法也能达到。在β正则性下,许多臂在整个学习过程中保持经验均值接近最优,因此当softmax采样一个非经验最优的臂时,该臂往往是另一个接近最优的臂,而不是明显较差的臂。相比之下,当臂数较少时,同类的softmax策略可能遭受线性遗憾。该结果也为RLVR提供了结构类比,其中以非可忽略概率产生正确完成的基础策略扮演了β正则性的角色。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic updates can nevertheless be effective. We analyze an annealed softmax (Boltzmann) policy that selects actions according to a softmax of empirical mean rewards in a many-armed Bayesian Bernoulli bandit. Under a linear upper-tail condition on the prior (the $β=1$ case of $β$-regularity), which implies an abundance of near-optimal arms, we prove that annealed softmax greedy achieves Bayes regret $\tilde{O}(m + T/m)$, and in particular $\tilde{O}(\sqrt{T})$ when the number of arms scales as $m = Θ(\sqrt{T})$. This is the near-optimal Bayes regret rate in this regime, attained also by empirical-mean greedy. Under $β$-regularity, many arms maintain empirical means close to the optimum throughout learning, so when softmax samples an arm other than the empirically best, that arm tends to be another near-optimal one rather than a clearly inferior one. By contrast, with a small number of arms, the same kind of softmax policy can suffer linear regret. The result also provides a structural analogy to RLVR, where a base policy with a non-negligible probability of producing a correct completion plays the role of $β$-regularity.

2605.31031 2026-06-01 cs.AI 版本更新

GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning

GraphARC:基于图的抽象推理综合基准

Saku Peltonen, August Bøgh Rønberg, Andreas Plesner, Roger Wattenhofer

发表机构 * ETH Z\"urich Z\"urich Switzerland ETH Z\"urich

AI总结 提出GraphARC基准,将抽象推理扩展到图结构数据,通过少样本变换学习任务评估模型在局部、全局和层次图变换上的泛化能力,并揭示语言模型的理解-执行差距和规模扩展障碍。

Comments Accepted at KDD 2026 Datasets and Benchmarks Track

详情
AI中文摘要

关系推理是智能的核心,但现有基准通常局限于网格或文本格式。我们引入了GraphARC,一个用于图结构数据抽象推理的基准。GraphARC推广了抽象与推理语料库(ARC)的少样本变换学习范式。每个任务需要从几个输入-输出对中推断变换规则,并将其应用于新的测试图,涵盖局部、全局和层次图变换。与基于网格的ARC不同,GraphARC实例可以在不同的图族和规模上大规模生成,从而能够系统评估泛化能力。我们在GraphARC上评估了最先进的语言模型,并观察到明显的局限性。模型能够回答关于图属性的问题,但往往无法解决完整的图变换任务,揭示了理解-执行差距。在更大实例上性能进一步下降,暴露了规模扩展障碍。更广泛地说,通过将节点分类、链接预测和图生成的方面结合在一个单一框架内,GraphARC为未来的图基础模型提供了一个有前景的测试平台。

英文摘要

Relational reasoning lies at the heart of intelligence, but existing benchmarks are typically confined to formats such as grids or text. We introduce GraphARC, a benchmark for abstract reasoning on graph-structured data. GraphARC generalizes the few-shot transformation learning paradigm of the Abstraction and Reasoning Corpus (ARC). Each task requires inferring a transformation rule from a few input-output pairs and applying it to a new test graph, covering local, global, and hierarchical graph transformations. Unlike grid-based ARC, GraphARC instances can be generated at scale across diverse graph families and sizes, enabling systematic evaluation of generalization abilities. We evaluate state-of-the-art language models on GraphARC and observe clear limitations. Models can answer questions about graph properties but often fail to solve the full graph transformation task, revealing a comprehension-execution gap. Performance further degrades on larger instances, exposing scaling barriers. More broadly, by combining aspects of node classification, link prediction, and graph generation within a single framework, GraphARC provides a promising testbed for future graph foundation models.

2605.31023 2026-06-01 cs.AI cs.LG cs.MA 版本更新

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

HADT: 一种用于自主对地观测卫星集群的异构多智能体差分Transformer

Mohamad A. Hady, Muhammad Anwar Masum, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk

发表机构 * School of Computer Science and Information Technology, Adelaide University(计算机科学与信息科技学院,阿德莱德大学) School of Electrical Engineering, Computing and Mathematical Sciences (EECMS), Curtin University(电气工程、计算与数学科学学院(EECMS), Curtin大学) Systems Research Institute, Polish Academy of Sciences(波兰科学院系统研究所)

AI总结 针对异构卫星集群自主对地观测任务,提出基于Transformer的架构,通过关系观测-动作令牌化和差分注意力机制实现自适应实时资源管理,性能显著优于基线。

Comments Accepted in ECML-PKDD 2026. arXiv admin note: text overlap with arXiv:2511.12792

详情
AI中文摘要

本文解决了执行对地观测任务(包括光学和合成孔径雷达卫星)的异构卫星集群中的自主资源管理问题。在自主运行模式下,卫星配备智能能力,能够根据最新条件实时决策,同时最小化与地面操作员的交互。传统的调度方法通常依赖数学模型来表示卫星任务和资源管理,然后通过优化算法求解。然而,当底层模型不可用、过于复杂或因空间任务环境中的动态变化和不确定性而不准确时,此类解决方案效果不佳。一个有前景的替代方案是将问题重新表述为序列决策过程,并应用无模型强化学习技术来实现自适应和实时资源管理。为此,我们提出了一种新颖的基于Transformer的架构,专门针对异构卫星集群自主对地观测任务,采用关系观测-动作令牌化和差分注意力机制。我们的实验结果表明,与现有基线相比,性能有显著提升。此外,所提出的架构在不同卫星集群数量下表现出强大的适应性和可迁移性。

英文摘要

This work addresses the problem of autonomous resource management in heterogeneous satellite cluster conducting Earth Observation (EO) missions including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real-time decision-making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite mission and resource management. Then, this problem is solved by using optimization algorithms. However, such solutions become less effective when the underlying models are not available, over complex, and inaccurate due to dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive and real-time resource management. To this end, we propose a novel transformer-based architecture tailored for heterogeneous satellite cluster autonomous EO Mission with relational observations-actions tokenization and differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.

2605.31021 2026-06-01 cs.AI cs.CL cs.LG 版本更新

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

基于人格的生成式AI多元对齐评估框架

Atahan Karagoz

发表机构 * Atahan Karagöz(阿塔汗·卡拉戈兹)

AI总结 提出一种状态空间约束仿真框架,通过合成认知轮廓替代单一评估函数,实现反映真实世界共识变异性的多元、视角依赖的基准测试,并分析仿真评估者的稳定性问题,论证动态调节机制的必要性。

详情
AI中文摘要

当前生成式人工智能的对齐范式主要依赖单一基准测试框架,将人类判断的多元性简化为聚合统计基线,从而掩盖了评估中的文化、人口和语境变异性。我们引入一种用于AI评估的状态空间约束仿真框架,用代表不同人类视角的合成认知轮廓的结构化流形替代单一评估函数。我们表明,现代生成架构能够以高度一致性实例化和维护这些评估人格,从而实现一种更接近现实世界共识变异性的多元、视角依赖的基准测试。然而,我们进一步分析了这些模拟评估者在顺序推理和随机提示扰动下的稳定性,揭示了人格一致性的系统性退化,表现为状态空间漂移和语义不一致。这些发现表明,静态对齐约束不足以维持随时间推移的稳健评估行为。相反,我们主张必须在生成系统中嵌入动态的、可行性驱动的调节机制,以保持连贯的认知仿真。通过将基于人格的评估视为潜在表征流形上的结构化动力系统,本研究为更自适应、更符合人类、更注重语境的AI评估方法奠定了基础。

英文摘要

Current alignment paradigms for generative artificial intelligence rely predominantly on monolithic benchmarking frameworks that reduce the plurality of human judgment to aggregated statistical baselines, thereby obscuring cultural, demographic, and contextual variability in evaluation. We introduce a state-space constrained emulation framework for AI evaluation that replaces singular assessment functions with a structured manifold of synthetic cognitive profiles representing diverse human perspectives. We show that modern generative architectures can instantiate and maintain these evaluative personas with high consistency, enabling a form of pluralistic, perspective-dependent benchmarking that more closely reflects real-world consensus variability. However, we further analyze the stability of these simulated evaluators under sequential inference and stochastic prompt perturbations, revealing systematic degradation in persona coherence that manifests as state-space drift and semantic inconsistency. These findings suggest that static alignment constraints are insufficient for sustaining robust evaluative behavior over time. Instead, we argue for the necessity of embedding dynamic, viability-driven regulatory mechanisms within generative systems to preserve coherent cognitive emulation. By framing persona-based evaluation as a structured dynamical system over latent representation manifolds, this study provides a foundation for more adaptive, human-aligned, and context-sensitive approaches to AI evaluation.

2605.31007 2026-06-01 cs.LG cs.AI 版本更新

DEM: A Distilled Explanation Model for Interpretable Anomaly Detection in Physiological Sensor Networks

DEM:面向生理传感器网络中可解释异常检测的蒸馏解释模型

Jyotirmoy Singh, Anushka Roy, Shreea Bose, Chittaranjan Hota

发表机构 * Department of Computer Science and Information Systems(计算机科学与信息系统系) Department of Electrical and Electronics Engineering(电气与电子工程系)

AI总结 提出一种三阶段玻璃箱框架DEM,通过将梯度提升专家模型的知识蒸馏到基于线性基线残差的决策树中,实现高精度与内在可解释性的异常检测,并引入蒸馏保真度指标量化解释可信度。

Comments 21 pages, 10 figures, 7 tables. Code: https://github.com/Jyotirmoy17/dem-model

详情
AI中文摘要

无线体域网(WBANs)中生理传感器数据的异常检测可能由传感器故障、网络中断或数据缺失引起,导致误报。因此,它既需要高预测精度,也需要临床可解释的解释。现有方法要么依赖性能强但无透明度的黑盒模型,要么依赖SHAP和LIME等事后解释方法。本文提出蒸馏解释模型(DEM),一个三阶段玻璃箱框架,将梯度提升专家模型的非线性知识蒸馏到基于线性基线残差的可解释决策树中,使得解释不是近似而是预测本身。DEM引入了一种新颖的蒸馏保真度指标,量化解释树忠实捕捉专家模型非线性贡献的程度,提供了先前可解释模型所缺乏的解释可信度的原则性度量。在包括MIMIC-IV、WESAD、eICU和内部SmartNet WBAN语料库在内的四个生理数据集上评估,DEM在临床上下文异常检测上达到0.9964的AUC,在可穿戴压力检测上达到0.9047,同时以可控深度生成人类可读的if-then规则。推理每1000个样本需要0.17ms,使DEM比基于SHAP的事后解释快1235倍,适用于实时生理监测。消融研究证实,XGBoost蒸馏步骤比朴素残差拟合提供了可测量的增益,深度敏感性分析展示了DEM在现有内在可解释模型中独有的、用户可控的准确性-可解释性权衡。

英文摘要

Anomaly detection in physiological sensor data from Wireless Body Area Networks (WBANs) can be caused by sensor faults, network disruptions, or missing data, leading to false alarms. Hence, it demands both high predictive accuracy and clinically interpretable explanations. Existing approaches rely either on black-box models that achieve strong performance but offer no transparency, or on post-prediction explanation methods such as SHAP and LIME. In this paper, we propose the Distilled Explanation Model (DEM), a three-stage glass-box framework that distills the non-linear knowledge of a gradient boosting expert into an interpretable decision tree operating on residuals relative to a linear baseline, so that the explanation is not an approximation but the prediction itself. DEM introduces a novel distillation fidelity metric that quantifies how faithfully the explanation tree captures the expert model's non-linear contribution, providing a principled measure of explanation trustworthiness absent from prior interpretable models. Evaluated across four physiological datasets, including MIMIC-IV, WESAD, eICU, and an in-house SmartNet WBAN corpus, DEM achieves an AUC of 0.9964 on clinical contextual anomaly detection and 0.9047 on wearable stress detection while producing human-readable if-then rules at a controllable depth. Inference requires 0.17ms per 1000 samples, rendering DEM 1235x faster than SHAP-based post-hoc explanation and suitable for real-time physiological monitoring. Ablation studies confirm that the XGBoost distillation step provides measurable gains over naive residual fitting, and depth-sensitivity analysis demonstrates an explicit, user-controlled accuracy-interpretability trade-off unique to DEM among existing intrinsically interpretable models.

2605.30984 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

生成报告还是重复模板?测量和缓解三维CT报告生成中的模板崩溃

Tom Maye-Lasserre, Yitong Li, Bailiang Jian, Morteza Ghahremani, Benedikt Wiestler, Christian Wachinger

发表机构 * Technical University of Munich (TUM)(慕尼黑技术大学) TUM Hospital(TUM医院) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 针对三维CT报告生成中模型输出多样性低、病理检测能力差的模板崩溃问题,提出解耦框架CLarGen,通过分离临床检测与语言合成,显著提升临床准确性并保持报告流畅性。

详情
AI中文摘要

现代三维医学视觉语言模型(VLM)能够生成流畅的放射学风格文本,但表现出极低的病理检测率和输出多样性,崩溃为低估罕见但关键发现的通用模板。我们将这种失败模式识别为模板崩溃。这种失败源于三维医学成像的独特限制,例如数据有限、标签严重不平衡以及体积编码器的弱信号。在这些限制下,文本生成目标鼓励捷径学习和流畅但基础薄弱的报告。我们通过临床保真度、输出多样性、正常模板偏差和罕见发现存活率系统性地诊断模板崩溃。为了缓解它,我们提出CLarGen,一个解耦框架,将说什么(临床检测)与怎么说(语言合成)分开。CLarGen使用(i)用于多标签病理检测的潜在查询变换器,(ii)用于临床匹配示例的病理引导检索,以及(iii)用于从检测到的发现和检索到的上下文中合成最终报告的医学语言模型。在最新的三维CT报告生成基线中,CLarGen缓解了模板崩溃,并在保持流畅报告的同时显著提高了临床准确性(macro-F1 0.487 vs. 0.189;CRG 0.472 vs. 0.368)。我们的结果表明,明确、可测量的临床基础对于抗模板崩溃的三维CT报告生成至关重要。代码将在接收后发布。

英文摘要

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibit critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose the Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.

2605.30968 2026-06-01 cs.CV cs.AI 版本更新

Variational Adapter for Cross-modal Similarity Representation

变分适配器用于跨模态相似性表示

WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye, Huayi Wu

发表机构 * School of Remote Sensing and Information Engineering(遥感与信息工程学院) Wuhan University(武汉大学) School of Data Science and Engineering(数据科学与工程学院) East China Normal University(华东师范大学) State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing(测绘遥感信息工程国家重点实验室)

AI总结 针对跨模态匹配中细粒度标注稀缺导致二元分类边界压缩和假负样本问题,提出变分适配器VACSR,将匹配任务重构为变分推断问题,通过构建潜在相似性空间和正则化缓解过拟合,在图像-文本检索、域泛化和基类到新类泛化任务上验证了有效性。

Comments Accepted by the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

视觉-语言模型的核心在于在统一表示空间中度量跨模态相似性。然而,大多数图像-文本匹配或多类图像分类数据集缺乏细粒度的跨模态匹配标注,迫使连续的相似性空间压缩为二元分类边界。这种压缩引入了假负样本,并严重损害了跨模态任务的泛化性能。尽管先前的研究试图通过建模模态内模糊性来缓解这一问题,但往往忽略了固有的标注缺陷,导致不确定性分配次优。为了解决这些挑战,我们提出了一种变分适配器用于跨模态相似性表示(VACSR)。该方法将具有细粒度语义稀缺性的图像-文本匹配重新表述为变分推断问题。它构建了一个跨模态相似性的潜在空间,并使用正则化技术来减轻对二元标注的过拟合。在图像-文本检索、域泛化和基类到新类泛化上的实验证明了所提出方法的有效性和鲁棒的泛化能力。

英文摘要

The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.

2605.30966 2026-06-01 cs.IR cs.AI cs.CL 版本更新

Reading Between the Citations: A Typed Claim Network for Scientific Literature

解读引用:面向科学文献的类型化主张网络

Ning Ding, Sergio J. Rodríguez Méndez, Pouya G. Omran

发表机构 * Australian National University(澳大利亚国立大学)

AI总结 针对现有知识图谱忽略引用立场的问题,提出将文献间引用具体化为带有立场标签的类型化主张网络,并构建了包含8260条主张的实例,在检索增强、立场摘要和拓扑分析三个任务上验证其有效性。

详情
AI中文摘要

基于相互引用文献语料库(如学术论文、法律意见书、政策简报)的知识图谱编码了引用的拓扑结构,但未编码其立场。标准表示将丰富的评价关系压缩为无类型边,丢失了支持社区级查询(关于一篇文献如何被另一篇文献接受)的关键内容。我们提出主张网络:一种表示模式,其中每个跨文献引用被具体化为一个类型化主张,携带来源、目标、主张文本以及基于引用意图文献的四类立场标签。我们给出了一个适用于任何学术相互引用文献语料库的构建流程,并在3D点云语义分割领域的127篇论文语料库上实例化,生成了一个包含8260个类型化主张的网络。三个下游任务系列展示了该网络的能力:检索信号增强、聚合立场摘要和拓扑分析。与标准检索增强生成(RAG)基线的直接比较表明,相对于平面检索的增益来自于正确的中间表示,而非错误的表示。

英文摘要

Knowledge graphs over corpora of inter-referencing documents - scholarly papers, legal opinions, policy briefs - encode the topology of reference but not its stance. The standard representation collapses a rich evaluative relation into an untyped edge, losing the very content that supports community-level queries about how one document is received by another. We propose the claim network: a representational pattern in which each cross-document reference is reified as a typed claim, carrying source, target, claim text, and a four-class stance label grounded in the citation-intent literature. We give a construction pipeline applicable to any corpus of scholarly inter-referencing documents and instantiate it on a corpus of 127 papers in 3D point cloud semantic segmentation, producing a network of 8,260 typed claims. Three downstream task families demonstrate what the network enables: retrieval signal augmentation, aggregated-stance summarisation, and topological analytics. Head-to-head evaluation against standard Retrieval-Augmented Generation (RAG) baselines shows that the gain over flat retrieval is the gain from the right intermediate representation rather than the wrong one.

2605.30965 2026-06-01 eess.AS cs.AI cs.CL 版本更新

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

ImmersiveTTS:基于多模态扩散Transformer和领域特定表示对齐的环境感知文本转语音

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

发表机构 * Department of Artificial Intelligence, Korea University(韩国大学人工智能系)

AI总结 提出ImmersiveTTS模型,通过多模态扩散Transformer和领域特定表示对齐,实现与环境音频自然融合的文本到语音生成。

Comments Accepted to ACL 2026 main conference. Code is available at https://github.com/jjunak-yun/ImmersiveTTS

详情
AI中文摘要

最近在文本引导音频生成方面的进展在声音效果、语音和音乐等多个领域取得了有希望的结果。然而,由于语音和环境音频在声学模式和时域动态上的固有差异,联合生成语音和环境音频仍然具有挑战性。我们提出了ImmersiveTTS,一种环境感知的文本到语音(TTS)模型,通过显式建模跨模态交互,生成与环境上下文无缝融合的自然语音。我们的模型基于多模态扩散Transformer,并通过联合注意力将转录对齐的语音潜在表示与文本条件的环境上下文融合。为了增强语义一致性,我们引入了一种针对环境感知TTS量身定制的领域特定表示对齐目标,利用来自语音和音频编码器的互补自监督表示。实验结果表明,在客观指标和人类听力测试中,ImmersiveTTS在自然度、可懂度和音频保真度方面均优于现有方法。

英文摘要

Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.

2605.30963 2026-06-01 q-bio.BM cs.AI 版本更新

AMix-2: Establishing Protein as a Native Modality in Large Language Models

AMix-2:将蛋白质确立为大语言模型的原生模态

Keyue Qiu, Yixin Wu, Lihao Wang, Yawen Ouyang, Jixiang Yu, Zihan Zhou, Changze Lv, Dongyu Xue, Yuxuan Song, Xinbo Zhang, Hao Wang, Jiangtao Feng, Zhiqiang Gao, Lijun Wu, Xiaoqing Zheng, Ka-Chun Wong, Lei Bai, Ya-Qin Zhang, Wei-Ying Ma, Dahua Lin, Bowen Zhou, Hao Zhou

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Generative Symbolic Intelligence Lab (GenSI), Tsinghua University(生成符号智能实验室(GenSI),清华大学) Institute for AI Industry Research (AIR), Tsinghua University(人工智能产业研究院(AIR),清华大学)

AI总结 提出AMix-2,一种蛋白质-文本基础模型,通过统一蛋白质理解与序列设计,将蛋白质作为大语言模型的原生模态,并引入块状扩散语言建模骨干以更好地匹配蛋白质内在特性。

Comments 30 pages, 4 figures, 12 tables

详情
AI中文摘要

我们提出了AMix-2,一种蛋白质-文本基础模型,将蛋白质确立为大语言模型(LLMs)的原生模态,在单一基础模型中统一了蛋白质理解和序列设计。AMix-2基于两个关键思想:(1)统一的蛋白质-文本公式,将自然语言和蛋白质序列嵌入共享的标记空间,使一个模型能够执行生物推理和条件设计,而不是使用单独的下游任务专用模型;(2)块状扩散语言建模骨干,结合了跨块的因果生成与块内的双向上下文和迭代细化。这种方案比严格的从左到右分解更好地匹配了蛋白质的内在本质。为了在现实的泛化设置下评估蛋白质基础模型,我们进一步引入了ProteinArena,一个全面的基准测试,具有时间感知和同源性感知协议,涵盖各种理解和设计任务,并以经典生物信息学工具、蛋白质专用模型和LLMs作为基线。在ProteinArena上,AMix-2优于前沿的LLMs,并展现出与任务专用蛋白质模型竞争的性能。控制实验进一步表明,基于扩散的范式普遍优于其自回归对应物,突显了蛋白质序列灵活生成顺序的优势。我们发布了AMix-2和ProteinArena,以促进蛋白质基础模型的开放研究。

英文摘要

We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix-2 is built upon two key ideas: (1) a unified protein-text formulation that embeds natural language and protein sequence in a shared token space, enabling one model to perform biological reasoning and conditional design instead of separate downstream task-specialized models; and (2) a block-wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement within blocks. This scheme better matches the intrinsic nature of proteins than a strict left-to-right factorization. To evaluate protein foundation models under realistic generalization settings, we further introduce ProteinArena, a comprehensive benchmark with time-aware and homology-aware protocols across various understanding and design tasks, and with baselines covering classical bioinformatics tools, protein-specialized models and LLMs. On ProteinArena, AMix-2 outperforms frontier LLMs and demonstrates competitive performance to task-specific protein models. Controlled experiments further show that the diffusion-based paradigm generally surpasses its autoregressive counterpart, highlighting the advantage of flexible generation order for protein sequences. We release both AMix-2 and ProteinArena to facilitate open research in protein foundation models.

2605.30934 2026-06-01 cs.CL cs.AI 版本更新

Do Large Language Models Encode Institutional Experience? Evidence from Cross-Linguistic Moral Reasoning Under Ambiguity

大型语言模型是否编码了制度经验?来自跨语言模糊道德推理的证据

Nattavudh Powdthavee

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 通过跨语言道德困境实验,研究大型语言模型在模糊情境下是否通过语言编码制度经验,发现隐含制度线索会放大跨语言道德分歧,而明确框架则抑制这种差异。

Comments 44 pages

详情
AI中文摘要

大型语言模型(LLMs)在不同语言中表现出系统性的道德推理差异,但这种差异的来源尚不清楚。我们检验了一个假设:语言编码了其使用环境中的制度方面,使得LLMs通过训练继承了特定制度的道德先验。跨越制度质量梯度广泛的九种语言、六个前沿LLM以及两项预注册研究,我们考察了道德困境的可接受性取决于制度功能的情况。在研究1中,明确的制度框架产生了统一的无结果:跨语言道德分歧在制度依赖场景中没有增加,也没有追踪语言社区之间的制度差异。在研究2中,我们引入了制度模糊场景,其中制度利益存在但未明确说明。在这些条件下,跨语言道德分歧相对于制度无关控制组增加,并且除一个理论上有信息的例外,与语言社区之间的现实世界制度差异相关。明确的框架再次减弱了这些效应。这些发现表明,制度经验可能在语言中留下可检测的痕迹,塑造LLM的道德推理,同时也表明明确的制度线索可以抑制这些差异的表达。

英文摘要

Large language models (LLMs) exhibit systematic differences in moral reasoning across languages, yet the source of this variation remains unclear. We test the hypothesis that languages encode aspects of the institutional environments in which they are spoken, allowing LLMs to inherit institution-specific moral priors through training. Across nine languages spanning a broad gradient of institutional quality, six frontier LLMs, and two preregistered studies, we examine moral dilemmas whose acceptability depends on institutional functioning. In Study 1, explicit institutional framing produced uniformly null results: cross-linguistic moral divergence did not increase in institutionally contingent scenarios, nor did it track institutional differences between language communities. In Study 2, we introduced institutionally ambiguous scenarios in which institutional stakes were present but not explicitly stated. Under these conditions, cross-linguistic moral divergence increased relative to institutionally inert controls and, with one theoretically informative exception, was associated with real-world institutional differences between language communities. Explicit framing again attenuated these effects. These findings suggest that institutional experience may leave detectable traces in language that shape LLM moral reasoning, while also indicating that explicit institutional cues can suppress the expression of those differences.

2605.30930 2026-06-01 cs.HC cs.AI cs.CL cs.CY 版本更新

TUX: Measuring Human--AI Tacit Understanding

TUX:衡量人机默契理解

Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) New York University(纽约大学)

AI总结 通过光谱放置任务和TUX指数,量化人类与LLM之间的默契理解,发现人格特征影响对齐程度。

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地作为协作伙伴,人机对齐通常通过明确的任务成功、准确性或奖励优化来评估。然而,许多协作场景依赖于默契理解:即智能体能否在没有明确目标、沟通或反馈的情况下,与人类的评价立场或表征先验对齐。为了研究这种能力,我们开发了一个受社交派对游戏Wavelength启发的光谱放置任务,在该任务中,人类和智能体独立地将概念放置在主观光谱上。我们将默契理解指数(TUX)操作化为人类与智能体判断之间的成对相似性度量,并通过241名人类参与者和200个基于人格条件的LLM智能体(涵盖四种模型)进行评估。我们发现,在特质空间中最近的人-智能体对实现了显著更高的TUX,表明默契对齐是由个体层面特征而非随机相似性所结构化的。回归分析表明,随着预测变量集变得更加丰富,TUX变得更可解释,个体特质、决策风格和置信度优于聚合特质距离基线。这些发现表明,人类与LLM之间的默契理解是可测量的,同时也揭示了基于人格条件化方法在捕捉更深层表征对齐方面的局限性。

英文摘要

As large language models (LLMs) increasingly act as collaborative partners, human--AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models. We find that nearest human--agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.

2605.30919 2026-06-01 cs.LG cs.AI 版本更新

De-attribute to Forget for LLM Unlearning

De-attribute to Forget for LLM Unlearning

Xinyang Lu, Jiabao Pan, Rachael Hwee Ling Sim, See-Kiong Ng, Anthony Kum Hoe Tung, Bryan Kian Hsiang Low

发表机构 * Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系)

AI总结 本文提出基于数据归因奖励的LLM遗忘框架DareU,通过强化学习降低生成响应与遗忘数据的归因分数,实现有效遗忘并平衡模型效用。

详情
AI中文摘要

大型语言模型(LLM)的快速发展引发了对使用不当数据进行训练的担忧,这导致了对LLM遗忘研究的兴趣日益增长。许多现有的LLM遗忘方法依赖于优化预测损失,例如最大化遗忘集上的损失,但常常面临过度遗忘和模型效用差等关键问题。为了解决这些问题,本文创新地将LLM遗忘的优化目标定义为归零数据归因。具体而言,我们提出了第一个基于数据归因奖励的LLM遗忘框架,称为DareU,该框架通过强化学习来更新LLM,通过降低其生成响应与遗忘数据所有者的归因分数(即去归因)来实现遗忘。使用LLM分类器作为归因的有效近似进行的实证评估表明,DareU在实现有效遗忘的同时,很好地平衡了遗忘质量和模型效用,优于现有基线。

英文摘要

The rapid development of large language models (LLMs) has raised concerns on the use of inappropriate data for training, which has led to a growing interest in LLM unlearning. Many existing LLM unlearning approaches rely on optimizing prediction loss(es), such as maximizing the loss on the forget set, but often face critical issues like over-forgetting and poor model utility. To address them, this paper novelly frames the optimization objective for LLM unlearning as one of zeroing out data attribution instead. In particular, we propose the first LLM unlearning framework based on data attribution rewards called DareU that performs reinforcement learning to update the LLM by reducing the attribution score of its generated responses (i.e., de-attributing) to the forget data owners. Empirical evaluation using an LLM classifier as an efficient approximation of attribution shows that DareU outperforms existing baselines by achieving effective unlearning while balancing forget quality and model utility well.

2605.30913 2026-06-01 cs.CL cs.AI cs.CY cs.HC 版本更新

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

有毒幻觉:扰动提示与追踪LLM电路

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Nimblemind

AI总结 研究有毒语言扰动对LLM事实可靠性的影响,发现有毒词汇降低准确率并增加不确定性,通过归因图分析揭示内部机制。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地部署在对话环境中,用户语气从礼貌到对抗性或毒性不等,但尚不清楚在语义等效的提示中,有毒语言是否会降低事实可靠性。我们研究基于词汇和语气的提示扰动如何影响LLM的事实可靠性。通过礼貌、随机和三种毒性水平的受控提示变化,我们在ARC-Easy、GSM8K和MMLU上评估了五个LLM。我们发现有毒词汇扰动持续降低事实准确性并增加不确定性,而礼貌措辞产生有限且不一致的变化。为了检查这些答案不一致是否对应内部变化,我们进行了模型激活和影响的归因图分析。我们发现增加毒性选择性地放大对扰动敏感的变体节点,而相对稳定的核心推理节点保持更不变。这些发现将提示语气定位为LLM可靠性的关键维度,并提供了行为和机制证据,表明表面词汇变化可以改变事实输出和内部计算。

英文摘要

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.

2605.30911 2026-06-01 cs.CV cs.AI 版本更新

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

什么使LVLMs更少产生幻觉?揭示影响幻觉鲁棒性的架构因素

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin, Jun Luo, Jiancheng Lv

发表机构 * School of Computer Science, Engineering Research Center of Machine Learning and Industry Intelligence, Sichuan University(计算机科学学院,机器学习与产业智能工程研究中心,四川大学) School of Computer and Information Engineering, Xiamen University of Technology(计算机与信息工程学院,厦门理工大学) Department of Electrical and Computer Engineering, University of Hong Kong(电气与计算机工程系,香港大学) College of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学)

AI总结 本文通过将架构设计分解为语言基础、视觉表示和语义对齐三个维度,并引入CoSimUE基准,系统探索了架构因素对LVLMs幻觉鲁棒性的影响,发现模型参数扩展效果有限,而增强视觉编码器、语言基础和语义对齐能分别减少不同类型的幻觉。

详情
AI中文摘要

幻觉仍然是削弱大型视觉-语言模型(LVLMs)可靠性的关键挑战之一。但什么使LVLM更少产生幻觉?许多现有工作专注于改进模型的内部组件。我们认为幻觉从根本上源于模型架构的设计方式。为了研究这一点,我们将架构设计分解为三个维度:语言基础(LF)、视觉表示(VR)和语义对齐(SA),并将幻觉分为共现型、相似型和先前被忽视的不确定型。基于这一框架,我们提出了CoSimUE基准,通过受控文本扰动和随机扰动创建细粒度的幻觉场景,从而建立设计选择与幻觉行为之间的映射。在7个设计方面的实验表明:1)广泛强调的参数规模扩展对减少所有三类幻觉的影响有限;2)更大且训练更好的语言基础可以减少共现型幻觉;3)更强的视觉编码器和更高的分辨率减轻相似型错误;4)有效的对齐策略缓解不确定型幻觉。5)此外,跨维度分析显示,联合增强视觉保真度和对齐质量能带来最全面的改进。本研究首次系统性地将架构级设计与幻觉鲁棒性联系起来,为开发可靠且高效的LVLMs提供了实用指导。

英文摘要

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations. 5) Furthermore, cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.

2605.30907 2026-06-01 cs.SE cs.AI cs.CL cs.LG 版本更新

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

BlueFin: 在金融电子表格上对LLM智能体进行基准测试

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta, Case Winter, George Fang, John Ling, Emma Strubell, Zach Kirshner

发表机构 * Longitude Labs Inc.(Longitude Labs公司) Cornell University(康奈尔大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出BlueFin基准,通过131个真实金融电子表格任务评估LLM智能体的合成、操作和理解能力,并验证了LM评判与人类专家的一致性。

Comments 26 pages

详情
AI中文摘要

我们提出BlueFin,一个基准测试,要求大语言模型(LLM)智能体在专业金融领域的电子表格工作簿上执行合成、操作和理解任务。尽管全球电子表格软件付费用户估计数亿——比全球专业开发人员估计数量高一个数量级——但投入探索和扩展LLM在电子表格领域能力的资源相对较少,而专门用于反映专业金融角色实际职业任务的资源更少。为此,我们整理了131个具有现实相关性的挑战性复杂任务,包含3225个细粒度评分标准;值得注意的是,我们的评分标准和LM评判评估由一组专家人工标注员验证,从而对难以通过编程验证但可由LM评判智能体可靠评估的复杂任务进行高质量、细粒度的评估。我们的评判与专家共识达到一致(α=0.826),宏F1得分为0.839。前沿LLM在此挑战性基准上表现不佳,最强LLM在任务上的平均得分低于50%——模型在动态正确性方面表现出特别弱点。我们的贡献包括:涵盖三类电子表格任务的示例数据集、开源工具包和智能体评估框架,以及现有前沿模型在我们基准上的性能表征。

英文摘要

We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global population of paying users of spreadsheet software range in the hundreds of millions -- an order of magnitude more than the estimated global population of professional developers -- comparatively fewer resources have been devoted to exploring and expanding LLM capabilities in the spreadsheet domain, with fewer still dedicated to mirroring real occupational tasks encountered by those in professional finance roles. In response, we curate a set of 131 challenging, complex tasks with real-world relevance in the domain, containing 3,225 granular rubric criteria; notably, our rubric criteria and LM judge evaluations are validated by a team of expert human annotators, resulting in high-quality, granular evaluations of complex tasks that are difficult to verify programmatically but can be reliably evaluated by an LM judge agent. Our judge achieves parity with expert consensus ($α=0.826$) with a macro-F1 score of 0.839. Frontier LLMs demonstrate poor performance on the challenging benchmark, with the strongest LLMs achieving less than 50\% average scores across tasks -- models exhibit particular weaknesses in dynamic correctness. Our contributions include a dataset of examples across three categories of spreadsheet tasks, an open source harness and agentic evaluation framework, and a characterization of existing frontier models' performance on our benchmark.

2605.30903 2026-06-01 cs.LG cs.AI 版本更新

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

无最优演示者的逆强化学习:一种可行奖励集方法

Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

发表机构 * MIT LIDS(麻省理工学院媒体实验室) University of Massachusetts, Amherst(马萨诸塞大学阿姆赫斯特分校) Adobe Research(Adobe研究院) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 针对多个非最优演示者数据,提出可行奖励集框架,通过线性约束联合可行集单调收缩,并给出恢复保证与高维环境离线算法。

详情
AI中文摘要

逆强化学习(IRL)通常假设来自单个最优演示者的演示,但在许多应用中,数据来自多个具有异质次优性水平的非完美演示者。我们通过可行奖励集框架研究这一设置下的奖励学习:对于每个演示者,我们将其声明的次优性水平编码为线性约束,并在演示者之间对所得可行集取交集。我们的理论分析表明,随着数据的增加,联合可行集单调收缩,并且我们精确刻画了新演示者何时严格收紧该集合。我们进一步为真实最优演示者的可行奖励集建立了两个恢复保证:一个界限依赖于与最优占用度的接近程度,而另一个仅需要足够的覆盖且没有接近最优的演示者。在实际方面,我们引入了解决所得奖励集中固有奖励模糊性的策略,并提供了适用于高维环境的函数逼近离线算法。在表格型网格世界和大语言模型(LLM)微调设置中的实验与理论预测一致,并证明了所提框架相对于基线的有效性。

英文摘要

Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.

2605.30900 2026-06-01 cs.AI physics.app-ph 版本更新

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

BilliardPhys-Bench: 多模态大语言模型的物理推理与视觉动力学基准测试

Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao, Chengliang Xu, Zeyu Wang, Zichao Chen, Bing Zhao, Hu Wei

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出BilliardPhys-Bench基准,通过合成台球环境评估多模态大语言模型在物理推理(碰撞、反弹、最终位置预测)上的能力,发现模型存在“静态偏差”且性能随模拟时间与场景复杂度下降。

详情
AI中文摘要

当前多模态模型在静态图像识别方面表现良好,但直观的物理推理仍是弱点。从单张图像预测物体如何运动及相互作用对这些系统而言仍然困难。我们提出了BilliardPhys-Bench,一个用于合成台球环境中物理推理的基准测试。其程序化引擎生成带有摩擦和弹性碰撞的随机场景。该基准测试三种能力:(1) 预测球与球之间的碰撞,(2) 推理墙壁反弹,(3) 估计运动停止后球的最终位置。我们评估了来自GPT、Claude、Gemini和Qwen系列的最新MLLMs。随着模拟时间增加和场景几何复杂度提高,性能下降。我们还观察到一个一致的失败模式,称为“静态偏差”:当正确的物理结果更难推断时,模型倾向于预测无交互。这些发现揭示了当前MLLMs在视觉动力学上的不足之处,并指出了在多模态架构中需要更好的物理归纳偏置。

英文摘要

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call "stasis bias": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.

2605.30899 2026-06-01 eess.AS cs.AI cs.SD 版本更新

A Unified and Reproducible Experimentation Framework for Speech Understanding

语音理解的统一可复现实验框架

Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li, Yi Yang, Yixuan Wang, Xiaoyu Gu, Guanyu Chen, Yucheng Wang, Jiang Li, Zhangjie Zhao, Haoran Wang, Wenming Tu, Haoyu Li, Duo Ma, Lirong Qian, Yu Xi, Wen Wen, Jiaqi Guo, Hui Zhang, Shuai Fan, Wenbin Jiang, Shuai Wang, Kai Yu

发表机构 * X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University(上海交通大学计算机科学与工程系X-LANCE实验室) MoE Key Lab of Artificial Intelligence(人工智能MOE重点实验室) Jiangsu Key Lab of Language Computing(江苏省语言计算重点实验室) AISpeech Ltd(AISpeech有限公司) ETH Zürich(苏黎世联邦理工学院) Nanjing University(南京大学) Hangzhou Dianzi University(杭州电子科技大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出SURE框架,通过标准化预测格式、归一化和评分,以及代理辅助的训练转换流程,提高语音理解模型在部署场景下的可比性和可复现性。

Comments This paper is submitted to INTERSPEECH 2026

详情
AI中文摘要

语音基础模型和语音大语言模型推动了语音理解的发展,但面向部署的模型选择受到非可比评估的阻碍,这些评估由不匹配的后处理以及跨数据规模和流水线难以复现的训练结果导致。我们提出了SURE,一个统一的实验框架,标准化了预测格式、归一化和评分。SURE评估了从传统流水线到语音大语言模型的各种范式下的强系统,在代表性任务上施加了现实声学和语言压力。除了评估,SURE还引入了一种代理辅助的训练转换流程,该流程将论文和代码映射到统一协议下、基于匹配开放数据子集的版本化、可运行训练流水线。总体而言,SURE提高了面向部署评估的可比性和可复现性。

英文摘要

Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps paper and code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.

2605.30898 2026-06-01 cs.AI cs.CL 版本更新

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

UniScale: 通过模型路由和测试时扩展的在线联合优化实现自适应统一推理扩展

Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)大数据研究院) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) College of Information Science and Electronic Engineering, Zhejiang University(浙江大学信息科学与电子工程学院) Department of Computer Science and Engineering, Chinese University of Hong Kong(香港中文大学计算机科学与工程系)

AI总结 提出UniScale框架,将模型路由和测试时扩展统一为上下文多臂老虎机问题,通过LinUCB在线学习推理策略,实现细粒度且更优的质量-成本权衡。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

在大语言模型(LLM)的实际部署中,平衡推理质量和计算成本已成为核心挑战。现有方法沿着两个大致独立的维度处理这一权衡:模型路由(在不同规模的模型之间切换以匹配请求复杂度)和测试时扩展(TTS,在固定模型内调整推理时计算以实现细粒度控制)。然而,这种解耦设计引入了固有限制。由于模型规模稀疏,模型路由产生粗粒度的离散性能变化,而单模型TTS通常遇到能力上限,并随着计算增加出现收益递减。此外,将两种机制分开处理限制了动态推理环境中的适应性。为克服这些限制,我们引入统一推理扩展(UIS),将模型路由和TTS统一到单个优化空间中。基于此公式,我们提出UniScale,一个在线框架,将自适应UIS建模为上下文多臂老虎机问题,并通过LinUCB学习推理策略。该框架包含效率感知学习和成本建模,以确保在高维动作空间上的稳定和可扩展优化。评估表明,UniScale有效利用UIS空间中的协同作用,在多样化的动态推理场景中提供细粒度且持续更优的质量-成本权衡。

英文摘要

In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.

2605.30873 2026-06-01 cs.LG cs.AI cs.DC 版本更新

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

联邦变分偏好对齐与Gumbel-Softmax先验用于个性化用户偏好

Jabin Koo, Hoyoung Kim, Minwoo Jang, Jungseul Ok

发表机构 * Graduate School of AI, POSTECH, Pohang, Republic of Korea(POSTECH人工智能研究生院) Department of CSE, POSTECH, Pohang, Republic of Korea(POSTECH计算机科学与工程系) National AI Research Lab, Seoul, Republic of Korea(首尔国家人工智能研究实验室)

AI总结 提出FedVPA-GP框架,通过联邦混合先验和正交损失解决联邦学习中用户偏好冲突和个性化问题,在HH-RLHF数据集上优于单一模型。

Comments 21 pages, 4 figures. Accepted to ICML 2026

详情
AI中文摘要

联邦学习(FL)为对齐大型语言模型(LLMs)提供了一条保护隐私的途径;然而,现有框架通常强制使用单一奖励模型,不可避免地平均了本质上相互冲突的用户偏好(例如,有用性与无害性)。虽然变分偏好学习(VPL)提供了一条个性化的途径,但将其适应于去中心化设置面临一个基本挑战:由严重的局部数据稀缺性和异质性驱动的后验坍塌。在本文中,我们提出了具有Gumbel-Softmax先验的联邦变分偏好对齐(FedVPA-GP),这是一个旨在在不牺牲隐私的情况下解耦多样偏好的框架。为了稳定变分推断,我们引入了一个联邦混合先验,使客户端能够利用聚合的总体分布作为动态先验。此外,我们加入了一个正交损失,明确强制在潜在空间中分离偏好原型。在HH-RLHF数据集上的实验表明,FedVPA-GP显著优于单一基线,成功解耦了冲突的用户意图,并实现了动态偏好切换。

英文摘要

Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. In this paper, we propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework designed to disentangle diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that enables clients to leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on the HH-RLHF dataset demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.

2605.30862 2026-06-01 cs.DB cs.AI 版本更新

Sophrosyne: Agentic Exploration of Relational Data Systems Needs Moderation

Sophrosyne: 关系数据系统的智能体探索需要适度

Madhav Jivrajani, Ramnatthan Alagappan, Aishwarya Ganesan

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对LLM驱动的Text2SQL智能体在探索数据系统时过度探索的问题,提出Sophrosyne环境,通过增强API响应中的指令来引导探索,减少过度探索并提升SQL生成准确性。

详情
AI中文摘要

由LLM驱动的Text2SQL智能体通过工具调用探索数据系统,将自然语言意图转化为SQL。然而,为了确保安全且受限的访问,数据系统构建了具有显式API表面的环境。我们研究并分类了当前暴露的API,将其分为粗粒度或细粒度,并认为在这两者之间进行选择会带来成本效益探索与准确SQL生成之间的基本权衡。大多数数据系统暴露细粒度API,但这无意中使智能体处于劣势:它们过度探索,将不相关的模式元素纳入查询公式中,并产生不准确的结果。我们认为,抑制过度探索是有效利用这些API表面的关键,并提出了Sophrosyne,一种数据系统环境,它通过增强API响应中的指令来引导智能体的探索过程。初步结果显示,指令将过度探索减少了4.6倍,并将准确率提高了高达12.4%(约4个百分点)。

英文摘要

Text2SQL agents powered by LLMs translate natural language intent into SQL by exploring the data system through tool calls before formulating the query. However, to ensure secure and scoped access, data systems construct environments with explicit API surfaces. We study and categorize these APIs exposed today as either coarse-grained or fine-grained and posit that choosing between them presents a fundamental tradeoff between cost-efficient exploration and accurate SQL generation. Most data systems expose fine-grained APIs, but this inadvertently disadvantages agents: they over-explore, incorporating irrelevant schema elements into their query formulation and produce inaccurate results. We argue that curbing over-exploration is key to the effective use of these API surfaces, and propose Sophrosyne, a data system environment that augments API responses with directives that guide the agent's exploration process. Initial results show that directives reduce over-exploration by 4.6x and boost accuracy by up to 12.4% (approx. 4 percentage points).

2605.30861 2026-06-01 cs.AI 版本更新

Distilling LLM Feedback for Lean Theorem Proving

蒸馏LLM反馈用于Lean定理证明

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

发表机构 * FAIR at Meta(Meta 的 FAIR 部门) Inria(法国国家科学与技术研究院) Sorbonne Université(索邦大学) Institut universitaire de France(法国国家科学研究院) CERMICS École des Ponts ParisTech(巴黎理工学院 CERMICS 实验室) ENS, PSL Research University(巴黎高等师范学院与巴黎科学实验室)

AI总结 提出反馈蒸馏方法,通过让模型在token级别匹配自身分布(基于语言模型提供的特权反馈)来训练,以解决GRPO在推理后训练中的稀疏奖励和模式崩溃问题,并在Lean4定理证明中取得更好效果。

详情
AI中文摘要

推理模型的后训练通常结合监督微调和基于可验证奖励的强化学习(最常见的是GRPO)。然而,该算法存在奖励稀疏、探索受限和模式崩溃的问题。基于最近关于自蒸馏的工作,我们提出了反馈蒸馏,这是一种训练方法,其中模型在token级别被训练以匹配自身分布,该分布以语言模型产生的特权反馈为条件。反馈蒸馏提供token级别的监督,并能注入外部知识。在Lean4定理证明中评估我们的方法,我们发现反馈蒸馏比GRPO在生成轨迹上保持更大的多样性,从而产生更高的策略熵和更好的pass@k缩放。这两种方法是互补的:从反馈蒸馏检查点初始化GRPO优于单独使用任何一种方法。总之,我们的结果为提高复杂推理的后训练提供了一条有前景的途径。

英文摘要

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.

2605.30859 2026-06-01 cs.LG cs.AI 版本更新

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

DARTS: 分布感知的主动展开轨迹塑造以加速LLM强化学习

Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu, Bin Cui

发表机构 * School of Computer Science \& Beijing Key Laboratory of Software Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China Institute of Computational Social Science, Peking University (Qingdao), Qingdao, China

AI总结 针对强化学习中长尾响应分布导致的效率瓶颈,提出分布感知的主动轨迹塑造方法,通过细粒度识别提示内长尾并削减无效冗余,实现高达1.77倍的加速而不损失模型性能。

Comments 16 pages, 14 figures, 5 tables. Accepted to ICML 2026

详情
AI中文摘要

强化学习已成为提升模型能力的关键技术,但由于响应长度的长尾分布,其展开效率受到瓶颈制约。现有工作通过提示级尾部调度缓解长尾影响,但我们关注低效率的根本来源:分布本身。具体而言,我们以更细粒度刻画长尾分布,识别提示内长尾,并揭示它们通常包含无效冗余。为解决此问题,我们提出一种主动分布塑造的新范式,将展开分布向简洁性和确定性方向塑造,从而从根本上解决尾部带来的开销。我们通过一种分布感知的轨迹采样机制实现这一点,该机制为每个提示从冗余探索空间中选择轨迹,并采用自适应冗余分配方案以最大化塑造效果和系统效率。实验表明,与最先进系统相比,在不影响模型性能的情况下,实现了高达1.77倍的显著加速。

英文摘要

Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long tails via prompt-level tail scheduling, we focus on the root source of inefficiency: the distribution itself. Specifically, we characterize the long-tail distribution at a finer granularity, identifying intra-prompt long tails, and revealing that they frequently consist of ineffective verbosity. To address this, we propose a novel paradigm of active distribution shaping to shape the rollout distribution towards conciseness and certainty, thereby fundamentally resolving tail-induced overheads. We achieve this through a distribution-aware trajectory sampling mechanism, which selects trajectories from a redundant exploration space for each prompt, and an adaptive redundancy allocation scheme to maximize both shaping effectiveness and system efficiency. Experiments demonstrate significant acceleration over state-of-the-art systems by up to 1.77x without compromising model performance.

2605.30854 2026-06-01 cs.MA cs.AI 版本更新

Safe Equilibrium Policy Optimization for Strategic Agent Policies

面向策略型智能体的安全均衡策略优化

Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda

发表机构 * Amazon, USA(亚马逊公司)

AI总结 提出Safe Equilibrium Policy Optimization (SEPO)方法,通过惩罚可剥削性、共谋风险和外部性成本,优化语言模型在多智能体博弈中的策略安全性。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

使用强化学习微调的语言模型通常优化任务奖励,忽略了多智能体策略结构。由于这些智能体以自然语言游戏状态描述为条件并通过自由生成发出动作,策略失败模式——利用较弱对手、协调有害均衡以及外部化成本——与语言接口本身密不可分。我们提出Safe Equilibrium Policy Optimization (\sepo{}),一种训练目标,通过显式惩罚可剥削性、共谋风险和外部性成本来增强期望收益。我们将\sepo{}作为组相对策略优化(GRPO)的奖励信号,应用于监督微调(SFT)后的Gemma~4 E4B-it和Qwen~3.5-4B。在五个策略领域(迭代囚徒困境、重复拍卖、两种谈判变体以及Kuhn扑克)中评估。\sepo{}在Kuhn扑克中实现了两种模型的零剥削池优势,在四个领域的安全性能上优于基础模型,并纠正了SFT引入的过度合作行为。在谈判中,\sepo{}实现了正安全结果,并且是唯一具有正归一化相对优势的谈判配置。消融实验证实,每次推演的剥削计算是必要的:共享常数惩罚在GRPO优势归一化中抵消(常数控制变量性质),产生零梯度。为支持智能体策略安全的进一步研究,我们发布了我们的\href{https://anonymous.4open.science/r/sepo-2668/README.md}{代码}和SFT数据集。

英文摘要

Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Because these agents condition on natural language game-state descriptions and emit actions through free-form generation, strategic failure modes -- exploiting weaker opponents, coordinating on harmful equilibria, and externalizing costs are inseparable from the language interface itself. We propose Safe Equilibrium Policy Optimization (\sepo{}), a training objective that augments expected payoff with explicit penalties for exploitability, collusion risk, and externality cost. We implement \sepo{} as a reward signal for Group Relative Policy Optimization (GRPO), applied to Gemma~4 E4B-it and Qwen~3.5-4B after supervised fine-tuning (SFT). Evaluated across five strategic domains: Iterated Prisoner's Dilemma, repeated auctions, two negotiation variants, and Kuhn Poker. \sepo{} achieves zero exploit-pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four domains, and corrects the over-cooperative behavior introduced by SFT. In negotiation, \sepo{} achieves a positive-safety outcome and only the positive normalized relative advantage of any negotiation configuration. Ablation experiments confirm that per-rollout exploit computation is necessary: a shared constant penalty cancels in GRPO advantage normalization (constant control-variate property), producing zero gradient. To support further research in strategic safety for agents, we release our \href{https://anonymous.4open.science/r/sepo-2668/README.md}{code} and SFT datasets.

2605.30844 2026-06-01 cs.CL cs.AI stat.ML 版本更新

Fine-Tuning Improves Information Conveyance in Language Models

微调提升语言模型中的信息传递

Yuwei Cheng, Weiyi Tian, Haifeng Xu

发表机构 * Department of Statistics(统计学系) University of Chicago(芝加哥大学) Department of Data Science(数据科学系) Department of Computer Science(计算机科学系)

AI总结 提出冠层熵(Canopy Entropy)度量,从树结构视角量化生成空间的有效大小,发现微调模型在总熵降低时仍能增强长度-熵率正相关,从而更高效地将不确定性转化为语义多样性。

详情
AI中文摘要

微调通常被认为会降低大型语言模型的不确定性和多样性,但现有分析忽略了输出长度这一关键混杂因素,因此未能捕捉不确定性在整个生成展开中的分布。为解决这一问题,我们提出冠层熵($\mathrm{CE}^\star$),一种从树视角看待语言生成的度量,其中“冠层”代表所有可能展开的空间,使得$\mathrm{CE}^\star$自然地量化生成空间的有效大小。$\mathrm{CE}^\star$共同捕捉输出长度$N$和生成序列$Y_{1:N}$中的不确定性——实际上,我们证明它等于总香农熵$H(N, Y_{1:N}\mid X)$,其中$X$表示提示。该公式产生了可解释的度量,包括长度-熵率相关项$ ho(N, r_N)$,其中$r_N$是熵率,通过指示较长输出是否每个标记信息量更多或更少来量化信息传递效率。实验上,跨任务和模型家族,我们发现微调模型一致地表现出更强的正相关$ ho(N, r_N)$,即使总熵降低。此外,在控制模型家族、任务、提示和输出长度效应后,我们发现微调几乎使熵率与语义多样性之间的相关强度增加了两倍,表明对齐模型更有效地将标记不确定性转化为语义多样性。总体而言,这些结果表明微调并非简单地降低不确定性,而是从根本上将其重组为更具信息性和语义意义的生成。我们的代码可在https://github.com/WeiyiTian/canopy-entropy获取。

英文摘要

Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy ($\mathrm{CE}^\star$), a measure that views language generation from a tree perspective, where ``canopy'' represents the space of all possible rollouts, making $\mathrm{CE}^\star$ naturally quantify the effective size of the generation space. $\mathrm{CE}^\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$ -- indeed, we show that it equals to total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $ρ(N, r_N)$, where $r_N$ is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit stronger positive correlation $ρ(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy-entropy.

2605.30838 2026-06-01 cs.AI 版本更新

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

COMPASS: 认知MCTS引导的过程对齐用于安全搜索代理

Wenkai Shen, Pengyang Zhou, Jiahe Xu, Jiaming Qian, Haozhe He, Zhihao Huang, Chaochao Chen, Xiaolin Zheng

发表机构 * Zhejiang University(浙江大学)

AI总结 提出COMPASS框架,通过认知树探索和自省步骤对齐,在保持通用效用的同时实现搜索代理工作流中的鲁棒安全对齐。

详情
AI中文摘要

基于LLM的搜索代理能够进行多步推理和使用工具。然而,这些能力引入了检索诱导的安全退化,因为有害意图可能分解为看似无害的子查询,导致不安全的结果。现有的对齐方法难以捕捉稀疏的安全信号,并且无法监督多步交互中的各种违规行为。我们提出COMPASS,一种认知MCTS引导的过程对齐框架,旨在在保持通用效用的同时,实现代理工作流中的鲁棒安全对齐。COMPASS集成了认知树探索(CTE)以高效合成隐蔽攻击轨迹,以及自省步骤对齐(ISA)以隔离有风险的中间动作进行细粒度过程监督。实验结果表明,COMPASS在实现良好的安全-效用权衡的同时,所需训练数据大幅减少。

英文摘要

LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may decompose into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.

2605.30834 2026-06-01 cs.RO cs.AI 版本更新

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

轨迹中的捉迷藏:发现VLA运行时监控的失败信号

Seongheon Park, Wendi Li, Changdae Oh, Samuel Yeh, Zsolt Kira, Michael Hagenow, Sharon Li

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出Hide-and-Seek框架,通过轨迹间和轨迹内对比学习,从轨迹级监督中定位失败指示动作,实现无需步骤标注的VLA模型运行时失败检测。

详情
AI中文摘要

视觉-语言-动作(VLA)模型使机器人能够遵循自然语言指令并在不同任务中泛化,但在实际部署中仍易受执行失败影响,损害可靠性。因此,在执行过程中检测此类失败对于具身系统的稳健部署至关重要。现有的失败检测方法要么依赖昂贵的动作重采样或外部模型,要么将轨迹级标签均匀传播到每个时间步,掩盖了局部失败信号。在本文中,我们提出 extbf{Hide-and-Seek}框架,将VLA失败检测形式化为粗监督学习问题。通过结合轨迹间和轨迹内对比目标,Hide-and-Seek能够定位指示失败的动作,并仅从轨迹级监督中诱导出具有时间结构的失败信号,无需任何步骤级标注。我们在LIBERO、VLABench和真实机器人平台上,针对三种代表性VLA策略(OpenVLA、$π_0$和$π_{0.5}$)评估了Hide-and-Seek。我们的方法在共形预测下实现了最先进的多任务失败检测性能,具有实用的准确度-及时性权衡,并且对已见和未见任务均具有良好的泛化能力。

英文摘要

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose \textbf{Hide-and-Seek}, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $π_0$, and $π_{0.5}$.Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy--timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.

2605.30833 2026-06-01 cs.CL cs.AI 版本更新

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

你的老师在这里帮不了你:对抗在线策略蒸馏中的监督保真度衰减

Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Chinese Information Processing Laboratory(中文信息处理实验室) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学,北京,中国)

AI总结 针对在线策略蒸馏中监督保真度衰减问题,提出前瞻组奖励方法,通过评估学生候选词在后续步骤中诱导的教师置信度并分配组归一化奖励,结合熵触发树注意力机制,显著提升长链推理性能。

详情
AI中文摘要

在线策略蒸馏通过使用来自教师的 token 级反馈,在学生模型自身生成的轨迹上训练学生模型来传递推理能力。然而,我们识别出一个关键瓶颈,即 extbf{监督保真度衰减(SFD)}:随着学生生成的前缀变长,教师的下一个 token 分布变得不那么自信和更具区分性。因此,反向 KL 蒸馏中依赖教师的纠正信号减弱,导致学生漂移在长推理链中累积。为了缓解 SFD,我们引入了 extbf{前瞻组奖励(\ours{})}。基于下一步教师置信度反映了未来反向 KL 监督的区分强度这一见解,\ours{} 通过学生在后续步骤中诱导的教师置信度来评估学生的 top-K 候选 token,并分配组归一化奖励。为了保持计算效率,我们进一步设计了一种熵触发的树注意力机制。在六个数学和代码基准测试中,\ours{} 在 7B 学生模型上比 OPD 提高了 mean@8 达 extbf{2.57} 个点,在长生成任务中增益更大,在 AIME-26 上达到 + extbf{4.92} 个点(39k token)。

英文摘要

On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, \textbf{Supervision Fidelity Decay (SFD)}: as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce \textbf{Lookahead Group Reward (\ours{})}. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, \ours{} evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, \ours{} improves mean@8 by \textbf{2.57} points over OPD for a 7B student, with gains increasing in longer-generation and reaching +\textbf{4.92} points on AIME-26 at 39k tokens.

2605.30832 2026-06-01 cs.AI 版本更新

SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning

SLAT:面向高效CoT推理的段级自适应修剪

Jian Yao, Xiongcai Luo, Ran Cheng, Kay Chen Tan

发表机构 * Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University(数据科学与人工智能学院,香港理工大学) The Hong Kong Polytechnic University-Daya Bay Technology and Innovation Research Institute, Huizhou,Guangdong Province, China(香港理工大学大亚湾科技与创新研究院,广东惠州市) The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, China(香港理工大学深圳研究院,深圳)

AI总结 提出段级自适应修剪框架SLAT,通过强化学习选择性抑制低边际效用的高概率冗余段,在保持准确率的同时将推理长度减少50%。

详情
AI中文摘要

近期大型推理模型通过强化学习显著提升了思维链(CoT)能力。然而,生成的推理链常存在结构冗余(即“过度思考”),在未提高答案正确性的情况下产生高计算开销。现有缓解策略通常依赖令牌均匀长度惩罚,这种粗粒度、段无关的缩短压力可能在不经意间抑制有用推理。为解决此问题,我们证明低效集中在高概率且边际效用低的段。我们推导了在正确性-长度权衡目标下段次优性的理论表征,并提出SLAT(段级自适应修剪),一种基于该准则选择性抑制冗余段的强化学习框架。在标准基准上的实验结果表明,SLAT建立了优越的准确率-效率帕累托前沿,与未压缩基线相比,将推理长度减少50%,同时保持有竞争力的准确率。总体而言,我们的结果表明,基于理论的段感知修剪是大型语言模型中高效CoT推理的一个有前景的方向。

英文摘要

Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., \emph{overthinking}), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose \textsc{SLAT} (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that \textsc{SLAT} establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by $50\%$ relative to uncompressed baselines while maintaining competitive accuracy. Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.

2605.30826 2026-06-01 cs.CL cs.AI 版本更新

Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

超越一致性:为策展人分类对面板筛选的生物医学实体候选进行评分

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang, Siyu Zhang, Tingting Dan

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) University of Michigan, Ann Arbor(密歇根大学安娜堡分校) The Hong Kong University of Science and Technology, Guangzhou(香港科学与技术大学(广州)) ShanghaiTech University(上海科技大学) University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出BioConCal评分器,利用无金标准的一致性、提及、表面可用性和文档特征,对多LLM面板筛选的候选实体进行评分,显著提高候选筛选的精确率和召回率。

详情
AI中文摘要

生物医学命名实体识别对于现代LLM来说看似简单:合理的生物医学提及容易浮现,但语料库约定正确性取决于标注约定、跨度边界、实体粒度和类型模式。多LLM一致性是一个显著性信号,而非语料库约定正确性。我们引入了一个候选级面板输出基准,用于面板筛选的候选验证,其中单元是由明确定义的多模型面板对齐的候选,而非独立提取器输出。该基准将八个LLM在五个公共生物医学NER数据集上的预测对齐到一个候选主表中。BioConCal是一个领域内监督评分器,它利用推理时的无金标准一致性、提及、表面可用性和文档特征,为固定候选流实例化这一层。在领域内,BioConCal将AUROC从原始一致性的0.753提高到0.910。在验证选择的0.95精确率目标下,它选择了1,340个候选,经验测试精确率为0.939,而原始一致性为293个候选。这对应于候选级召回率0.592和语料库级召回率0.523,而面板内行标签上限为0.883。主要好处不是恢复每个面板成员遗漏的实体,而是将嘈杂的面板流重塑为更高产出的审查队列。在实体类型转移下,阈值需要目标领域验证,而精确字符定位仍然是单独的后处理步骤。

英文摘要

Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not corpus-convention correctness. We introduce a candidate-level panel-output benchmark for panel-surfaced candidate verification, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.

2605.30825 2026-06-01 cs.LG cs.AI math.OC stat.ML 版本更新

Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints

扩散模型中的遗忘学习:基于KL散度和似然约束的统一框架

Shervin Khalafi, Alejandro Ribeiro, Dongsheng Ding

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of Tennessee, Knoxville(田纳西大学,基洛纳)

AI总结 提出一个约束优化框架,通过最小化与预训练模型的偏差并施加与遗忘分布的分离约束,实现扩散模型中的概念和数据遗忘,并基于KL散度和似然约束推导最优解及原始-对偶算法。

Comments 27 pages, 6 figures, 4 tables; Accepted by ICML 2026

详情
AI中文摘要

扩散模型中的遗忘学习旨在移除不需要的数据或概念,同时保留预训练模型的效用——这两个目标本质上相互冲突。我们提出了一个原则性的约束优化框架,将遗忘学习形式化为在满足与遗忘分布的显式分离约束下,最小化与预训练模型的偏差。具体地,我们基于反向和正向KL散度以及似然约束,构建了三个约束优化问题。前两个问题泛化了现有的概念和数据遗忘方法,而第三个问题为遗忘学习提供了一种新颖且自然的表述。尽管KL约束非凸,我们证明了所有三个问题的强对偶性,从而能够显式地表征其最优解作为遗忘目标,并为每个公式开发原始-对偶算法。实验结果表明,与基于权重的基线方法相比,我们的KL约束方法在概念和数据遗忘中实现了更优的保留-遗忘权衡,而基于似然的方法在匹配遗忘效果的同时,更好地保留了保留概念。

英文摘要

Unlearning in diffusion models aims to remove undesirable data or concepts while preserving the utility of pretrained models -- two fundamentally conflicting objectives. We propose a principled constrained optimization framework that formulates unlearning as minimizing the deviation from a pretrained model, subject to explicit separation constraints from the unlearning distributions. Specifically, we formulate three constrained optimization problems based on reverse and forward KL divergences, and likelihood constraints. The first two generalize existing approaches for concept and data unlearning, while the third offers a novel and natural formulation for unlearning. Despite the nonconvexity of the KL constraints, we establish strong duality for all three problems, enabling us to explicitly characterize their optimal solutions as unlearning targets and develop primal-dual algorithms for each formulation. Experimental results demonstrate that our KL-constrained approach achieves superior retention-unlearning tradeoffs compared to weight-based baselines for concept and data unlearning, and that our likelihood-based approach matches unlearning effectiveness while better preserving retained concepts compared to baselines.

2605.30824 2026-06-01 cs.AI 版本更新

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

面向深度研究的规划器中心强化学习与结构感知奖励

Mustafa Anis Hussain, Xinle Wu, Yao Lu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出DecomposeR框架,通过将研究计划表示为有向无环图(DAG)并采用两阶段强化学习(规划器RL和回答器RL),实现显式结构化规划与细粒度奖励分配,在长文本基准上提升5.1-8.0分。

详情
AI中文摘要

深度研究任务要求LLM规划调查内容、检索证据,并在多个研究分支中综合长格式答案。现有训练范式要么依赖短格式可验证问答作为代理,要么优化单一的长轨迹,这使得规划和执行难以分离,并导致规划过程的信用分配薄弱。我们提出DecomposeR,一种以规划器为中心的深度研究框架,将研究计划表示为类型化有向无环图(DAG),使规划变得显式、结构化且可奖励。我们分两个阶段训练Qwen3-8B模型:规划器强化学习(RL)首先学习图结构和查询分解以改进研究规划,然后回答器强化学习(RL)基于所学计划学习分支级执行和最终综合。通过将奖励分配给显式的规划器令牌和结构化组件,而不是平坦的轨迹,DecomposeR实现了对规划的更细粒度优化,同时减少了端到端训练的模糊性。实验表明,由于改进了规划和回答能力,DecomposeR-8B在流行的长文本基准上比强可比开源基线提高了5.1-8.0分。

英文摘要

Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.

2605.30818 2026-06-01 cs.ET cs.AI cs.SD 版本更新

GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement

GaMi: 通过跨模态减法解缠实现几何无关的材料识别

Zhiwei Chen, Yijie Li, Yimo Zhang, Shiyun Shao, Yichao Chen, Dian Ding, Liang Wang, Haiwei Wu, Liwei Guo, Jie Yang, Xiaosong Zhang, Yongzhao Zhang

发表机构 * National University of Singapore(新加坡国立大学) Shanghai Jiao Tong University(上海交通大学) Northwestern Polytechnical University(西北工业大学)

AI总结 提出GaMi系统,利用毫米波和声学传感的跨模态减法解缠框架,在不受约束的几何条件下实现高精度材料识别。

Comments 17 pages, 18 figures

详情
AI中文摘要

非接触式材料识别使具身智能能够进行自适应交互,但面临几何诱导变化(如方向、形状、距离)和单模态模糊性的挑战。本文提出GaMi,一种集成毫米波和声学传感的多模态材料识别系统,可在不受约束的几何条件下稳健运行。利用共置双模态传感器之间共享几何一致性的洞察,GaMi采用样本内跨模态减法解缠框架。通过语义对齐模态并减去共享几何上下文,它隔离了内在材料特征。此外,GaMi引入样本间对比学习以纠正跨模态未对准引起的残余干扰。另外,两种模态之间的配对自适应策略实现了跨设备的少样本泛化。在20种材料上的广泛评估表明,GaMi达到了95.2%的准确率,在未见几何条件下优于单模态基线。

英文摘要

Non-contact material identification enables adaptive interaction for embodied intelligence yet faces challenges from geometry-induced variations (e.g., orientation, shape, distance) and single-modality ambiguities. In this paper, we present GaMi, a multimodal material identification system integrating mmWave and acoustic sensing to robustly operate under unconstrained geometric conditions. By leveraging the insight of shared geometric consistency between co-located bimodal sensors, GaMi employs an intra-sample cross-modal subtractive disentanglement framework. By semantically aligning modalities and subtracting the shared geometric context, it isolates intrinsic material features. Furthermore, GaMi incorporates inter-sample contrastive learning to correct the residual interference caused by cross-modal misalignment. Additionally, a pairing-based adaptation strategy between two modalities enables few-shot generalization across devices. Extensive evaluations on 20 materials show that GaMi achieves 95.2% accuracy, outperforming single-modality baselines across unseen geometric conditions.

2605.30808 2026-06-01 cs.CR cs.AI cs.LG 版本更新

Differentially Private Preference Data Synthesis for Large Language Model Alignment

面向大语言模型对齐的差分隐私偏好数据合成

Fengyu Gao, Jing Yang

发表机构 * Department of Computer Science, University of Virginia, Charlottesville, Virginia, USA(弗吉尼亚大学计算机科学系) Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, Virginia, USA(弗吉尼亚大学电气与计算机工程系)

AI总结 提出DPPrefSyn算法,基于Bradley-Terry偏好模型和DP-PCA生成差分隐私合成偏好数据,实现隐私保护的偏好对齐。

Comments Accepted to ICML 2026

详情
AI中文摘要

偏好对齐是大语言模型(LLMs)的关键后训练步骤,以确保其输出与人类价值观一致。然而,在真实人类偏好数据上进行后训练会引发隐私问题,因为这些数据集通常包含敏感的用户提示和人类判断。为了解决这一问题,我们提出了DPPrefSyn,一种用于生成差分隐私(DP)合成偏好数据的新算法,以实现隐私保护的偏好对齐。DPPrefSyn是一个基于Bradley-Terry偏好模型和成对人类偏好数据内在几何结构的原理性框架。它首先从具有正式差分隐私保证的私有数据中学习一个潜在的偏好模型,然后利用学习到的模型结合公共提示合成高质量的偏好数据。它利用每个簇奖励模型的共享线性结构来有效捕捉私有数据中的异构人类偏好,并利用差分隐私主成分分析(DP-PCA)来提高学习准确性。大量实验结果表明,DPPrefSyn在强DP保证下实现了具有竞争力的对齐性能。这些发现突显了合成偏好数据作为隐私保护偏好对齐的实用替代方案在广泛应用中的潜力。据我们所知,这是首项为LLM对齐生成DP合成偏好数据的工作。我们的代码可在https://github.com/gfengyu/Differentially-Private-Preference-Data-Synthesis获取。

英文摘要

Preference alignment is a crucial post-training step for large language models (LLMs) to ensure their outputs align with human values. However, post-training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we propose DPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy-preserving preference alignment. DPPrefSyn is a principled framework grounded in the Bradley-Terry preference model and the intrinsic geometric structure of pairwise human preference data. It first learns an underlying preference model from private data with formal differential privacy guarantees, and then leverages the learned model together with public prompts to synthesize high-quality preference data. It exploits the shared linear structure of per-cluster reward models to effectively capture heterogeneous human preferences in private datasets, and leverages DP Principal Component Analysis (DP-PCA) to improve learning accuracy. Extensive experimental results demonstrate that DPPrefSyn achieves competitive alignment performance under strong DP guarantees. These findings highlight the potential of synthetic preference data as a practical alternative for privacy-preserving preference alignment across a broad range of applications. To the best of our knowledge, this is the first work to generate DP synthetic preference data for LLM alignment. Our code is available at https://github.com/gfengyu/Differentially-Private-Preference-Data-Synthesis.

2605.30803 2026-06-01 cs.AI 版本更新

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

PReMISE:作为LLM评判者测量规范的政策评分标准

Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris, Rahul Gupta, Anna Rumshisky, Pradeep Natarajan, Venkatesh Saligrama

发表机构 * Amazon AGI(亚马逊人工智能研究院)

AI总结 提出PReMISE框架,从人类偏好数据中发现政策级评分标准集,并从结构充分性、可靠性、偏好拟合和对抗鲁棒性四个维度审计评分标准,通过偏好排名选择和可靠性约束修复操作提升评判准确性并降低可被利用性。

详情
AI中文摘要

LLM评判者越来越多地被用于评估开放式回答,但其分数强烈依赖于条件化它们的评分标准。一个模糊的评分标准要求回答“有帮助且事实准确”可能会奖励那些编造事实或违反用户意图的精心修饰的回答。我们将可重复使用的评分标准视为测量规范:改变评分标准会改变由固定评判者产生的回答质量测量。我们引入PReMISE,一个框架,给定成对的人类偏好数据,(i) 发现一个政策级别的评分标准集,以及(ii) 在LLM评判者使用下,沿着四个维度审计任何评分标准集:结构充分性、可靠性、偏好拟合和对抗鲁棒性。在评分标准来源中,没有原始来源同时具有可靠性、偏好预测性和对抗鲁棒性;高评分者间一致性并不意味着低可被利用性。PReMISE是唯一同时在适用性、特异性和有效维度上得分非平凡的评分标准来源。我们贡献了两个针对审计的修复操作:偏好排名选择将评判者在成对回答上的准确率从65.0%提高到68.6%,与最强的评分标准发现基线竞争,并在我们的跨评判者扫描中在三个评判者中的两个上领先;可靠性约束精炼将利用性回答获得高分的比率从46.4%降低到36.0%,而评分者间一致性变化很小(α=.531→.519)。

英文摘要

LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be ``helpful and factual'' can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources no raw source is simultaneously reliable, preference-predictive, and adversarially robust; and high inter-rater agreement does not imply low exploitability. PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from $65.0\%$ to $68.6\%$, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from $46.4\%$ to $36.0\%$ with little change in inter-judge agreement ($α{=}.531\to.519$).

2605.30802 2026-06-01 cs.MA cs.AI 版本更新

Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution

用于预测市场决议的多智能体AI预言机系统的设计与评估

Tarun Kota

发表机构 * Yale University(耶鲁大学)

AI总结 本研究设计并评估了多智能体LLM架构作为预测市场决议的预言机,通过独立聚合与协商共识两种机制,在KalshiBench数据集上对比单模型基线,发现置信度加权投票的独立聚合达到83.43%准确率,而协商共识因错误传播性能下降,并提出了混合AI-人类预言机的路由标准。

Comments 34 pages, 11 figures

详情
AI中文摘要

预测市场聚合集体智慧以预测不确定事件,但其效用依赖于可靠的结果决议。现有的预言机系统在快速但脆弱的自动化与准确但昂贵的人工仲裁之间进行权衡。单LLM预言机实现了有意义的准确性,但继承了其底层模型的所有失败模式,且没有自我纠正机制。我们评估了多智能体LLM架构是否能在单模型基线之上提高预言机决议准确性。我们在KalshiBench的1,189个已决议预测市场问题上,比较了独立聚合和协商共识与单LLM基线(GPT-5 Nano、DeepSeek V3和Llama-3.3-70B)的性能。所有智能体通过Exa共享共同的证据层,检索按出版日期过滤以隔离推理与检索质量。采用置信度加权投票的独立聚合达到了83.43%的最高准确率,比最佳个体模型高出1.01个百分点。协商共识将准确率降低至约76%,低于所有单模型基线,这归因于辩论过程中的错误传播,即自信的错误模型使正确模型发生翻转。模型间的错误相关性(0.529-0.689)解释了聚合增益为何低于理论Condorcet上限,对集成方法构成了根本限制。许多问题无法通过任何多智能体架构纠正,这促使升级至人工仲裁。我们提出了混合AI-人类预言机系统的路由标准:仅自动解决一致且高置信度的问题,在数据集的47%上达到97.87%的准确率,而智能体间的分歧则标记其余部分供人工审查。

英文摘要

Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems tradeoff fast but brittle automation against accurate but costly human arbitration. Single-LLM oracles achieve meaningful accuracy but inherit all failure modes of their underlying model with no self-correction mechanism. We evaluate whether multi-agent LLM architectures can improve oracle resolution accuracy over single-model baselines. We compare independent aggregation and deliberative consensus against single-LLM baselines (GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B) on 1,189 resolved prediction market questions from KalshiBench. All agents share a common evidence layer through Exa, with retrieval filtered by publication date to isolate reasoning from retrieval quality. Independent aggregation with confidence-weighted voting achieves the highest accuracy at 83.43 percent, outperforming the best individual model by 1.01 percentage points. Deliberative consensus degrades accuracy to approximately 76 percent, below every single-model baseline, attributed to error propagation during debate where confidently wrong models flip correct ones. Error correlations across models (0.529-0.689) explain why aggregation gains fall short of the theoretical Condorcet ceiling, placing a fundamental limit on ensemble approaches. Many questions resist correction by any multi-agent architecture, motivating escalation to human arbitration. We propose routing criteria for hybrid AI-human oracle systems: auto-resolving only unanimous, high-confidence questions yields 97.87 percent accuracy on 47 percent of the dataset, with inter-agent disagreement flagging the remainder for human review.

2605.30794 2026-06-01 cs.CV cs.AI 版本更新

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

MechVQA:在综合机械图纸理解上基准测试与增强多模态大语言模型

Qian Kou, Xiaofeng Shi, Yulin Li, Xiaosong Qiu, Xinyang Wang, Hua Zhou, Cao Dongxing

发表机构 * Beijing Academy of Artificial Intelligence (BAAI), China(北京人工智能研究院) Institute of Information Engineering, Chinese Academy of Sciences, China(信息工程研究所) Beijing University of Technology, China(北京理工大学)

AI总结 针对多模态大语言模型在机械工程图纸理解上的不足,提出首个综合机械图纸理解数据集MechVQA,并开发MechVL模型,通过多阶段训练显著提升性能。

Comments accept by iclm2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在通用视觉问答(VQA)任务中取得了显著成就。然而,它们在机械工程图纸上仍然脆弱,因为高标注密度和弱领域知识,加上严格投影规则和几何约束下不可靠的空间关系推理,使得决定性线索容易被忽略,并经常导致错误答案。为弥补这一差距,我们引入了第一个综合机械图纸理解数据集MechVQA,通过半自动构建和质量控制流程创建。MechVQA包含3.3k张高密度图片和21K个问答对,涵盖三个能力级别(识别、推理和判断)的10个不同细粒度任务,为评估和改进MLLM在真实机械图纸上的理解提供了测试平台。在MechVQA基础上,我们通过多阶段训练范式开发了MechVL模型,构建了一个强大的领域专用基线。大量实验结果表明,MechVL在MechVQA总分上比最强的闭源基线高出7.57个百分点,显著增强了机械图纸理解能力,并为在机械设计和检测场景中部署MLLM提供了可复用的基础。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, where high annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce the first comprehensive mechanical drawing understanding dataset, MechVQA, created through a semi-automated construction and quality-control pipeline. MechVQA contains 3.3k high-density pictures with 21K question-answer pairs, spanning 10 different fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging, providing a testbed to evaluate and improve MLLM understanding on real-world mechanical drawings. On top of MechVQA, we then develop the MechVL model through a multi-stage training paradigm, building a strong domain-specialized baseline. Extensive experimental results demonstrate that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, significantly enhancing mechanical drawing understanding ability and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.

2605.30792 2026-06-01 eess.AS cs.AI 版本更新

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

OpenSTBench:超越语义评估的语音翻译

Yanjie An, Yuxiang Zhao, Yichi Zhang, Qixi Zheng, Yujie Tu, Keqi Deng, Kai Yu, Xie Chen

发表机构 * MoE Key Lab of Artificial Intelligence(摩埃人工智能关键实验室) Jiangsu Key Lab of Language Computing(江苏语言计算重点实验室) X-LANCE Lab(X-LANCE实验室) School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Shanghai Innovation Institute(上海创新研究院) Microsoft(微软) University of the Chinese Academy of Sciences(中国科学院大学)

AI总结 提出OpenSTBench统一多维评估框架,联合评估语音翻译系统的翻译质量、语音质量、时间一致性等,揭示系统间跨维度差异。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

语音翻译系统日益涵盖语音到文本翻译(S2TT)、语音到语音翻译(S2ST)、离线翻译和流式生成,产生的输出在模态、语音实现和时间行为上有所不同。现有评估实践评估了翻译质量、语音质量和时间质量等重要方面,但这些方面通常在不同的协议下进行评估,使得难以全面比较异构系统。为弥补这一差距,我们提出了OpenSTBench,一个统一的多维评估框架,将异构语音翻译输出组织成共享的评估格式。OpenSTBench支持离线与流式设置下的S2TT和S2ST系统,并联合评估翻译质量、语音质量、说话人保留、情感与副语言保真度、时间一致性和延迟。通过在代表性语音翻译系统上的实验,我们表明具有强翻译质量的系统在语音质量和时间质量上仍可能存在显著差异。OpenSTBench提供了一个可复现的协议,用于分析这些跨维度差异,并支持面向应用的语音翻译系统比较。代码和数据集可在https://github.com/sjtuayj/OpenSTBench获取。

英文摘要

Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.

2605.30790 2026-06-01 cs.IR cs.AI cs.CL 版本更新

On the impact of retrieved content representations in RAG Pipelines

关于检索内容表示对RAG管道的影响

Jonathan J Ross, Bevan Koopman, Anton van der Vegt, Guido Zuccon

发表机构 * The University of Queensland(昆士兰大学) CSIRO(澳大利亚联邦科学与工业研究组织)

AI总结 通过控制变量实验,研究检索文档的不同表示(选择、摘要、改写等)对RAG生成准确性的影响,发现答案保留是主要决定因素。

Comments 23 pages, 15 figures, submitted to ACL May 2026 ARR

详情
AI中文摘要

检索增强生成(RAG)通过检索到的文档补充语言模型的输入,但大多数RAG管道继承了为人类读者设计的检索组件。当消费者是大型语言模型(LLM)而非人类时,检索内容应如何表示尚不清楚。最近的工作提出了对检索内容的转换,并识别了影响生成的属性,但每项工作仅孤立地考察单一转换或属性,未明确文档表示的哪些特征最重要。我们通过控制比较来解决这一问题:固定检索不变,仅改变检索文档的表示,将原始基线与其他十三种转换(涵盖选择、摘要和改写,包括查询相关和查询无关变体)进行比较。在这十四种表示中,我们测量了四个生成器的问答准确性,并对每种表示测量了答案保留:即已知包含答案的文档在转换后是否仍支持其答案。我们发现,答案保留是生成器准确性的主要决定因素;值得注意的是,当保留率高时,表示的措辞、结构、长度和查询相关性影响有限。这表明,先前工作中归因于特定机制的准确性提升,可能部分由这些机制保留答案内容的能力解释,而这种归因在未控制保留的情况下无法确定。

英文摘要

Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.

2605.30788 2026-06-01 cs.CL cs.AI cs.LG 版本更新

XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

XLGoBench: 用算法任务检测跨语言技能差距

Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju

发表机构 * Google DeepMind(谷歌深Mind) Indian Institute of Technology Bombay(印度理工学院孟买分校) International Centre for Theoretical Sciences, Tata Institute of Fundamental Research(理论科学国际中心, Tata 基础研究机构)

AI总结 提出一套合成算法任务基准,通过跨语言执行相同任务来检测大语言模型的跨语言能力差距,实验揭示多个先进模型存在持续差距。

Comments 8+37pages

详情
AI中文摘要

我们引入一套合成算法任务,用于检测大语言模型在跨语言能力上的差距。我们的基准在语言间具有可比性,因为它要求模型在不同语言中执行相同的底层任务;可扩展,因为每个任务可以在不同复杂度级别生成,从而适应不同能力的模型;可量化,因为每个任务都承认客观的正确性概念;且透明,因为任务是从简单模板生成的,可以轻松审计翻译错误。由于我们的基准专注于算法任务,性能差异是跨语言差距的充分但不必要条件。尽管如此,我们通过大量实验表明,我们的基准暴露了多个最先进模型中存在的持续跨语言差距。

英文摘要

We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.

2605.30785 2026-06-01 cs.AI 版本更新

Learning Agent-Compatible Context Management for Long-Horizon Tasks

面向长时任务的学习智能体兼容上下文管理

Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie, Yuyang Li, Wenhao Zhang, Zhewei Wei, Yaliang Li, Jian-Yun Nie

发表机构 * Renmin University of China(中国人民大学) Tongyi Lab, Alibaba Group(阿里云实验室) Beijing University of Posts and Telecommunications(北京邮电大学) Université de Montréal(蒙特利尔大学)

AI总结 提出AdaCoM方法,通过外部LLM对冻结智能体进行端到端强化学习上下文管理,在长时任务中提升性能并揭示保真度-可靠性权衡。

详情
AI中文摘要

LLM智能体在现实应用中越来越多地面临长时任务,如网络搜索和深度研究,累积的上下文可能导致长上下文退化和推理失败。先前的工作通过智能体端上下文控制或固定策略(如摘要)来缓解这一问题,这需要训练智能体本身进行适应——这使得它对于闭源智能体不切实际,并且忽略了不同智能体可能需要不同策略。我们引入了自适应上下文管理(AdaCoM),它训练一个外部LLM通过灵活的修改动作和端到端强化学习来管理冻结智能体的上下文。在多种智能体上进行的网络搜索和深度研究基准测试中,AdaCoM通过保留任务约束和进展同时修剪过时内容,显著提升了性能。学习到的策略揭示了保真度-可靠性权衡:具有更高原始ReAct性能的智能体受益于更高保真度的上下文保留,而性能较低的智能体则需要更激进的压缩以保持在可靠的推理范围内。迁移实验表明,AdaCoM在能力相似(以原始ReAct性能衡量)的智能体之间最有效地泛化,这为智能体系统的可复用上下文管理器提供了一条实用路径。

英文摘要

LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization, which require training the agent itself for adaptation - making it impractical for closed-source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.

2605.30740 2026-06-01 cs.RO cs.AI 版本更新

GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation

GSAM: 一种通用且安全的铰接物体操作机器人框架

Beichen Shao, Mengying Xie, Heng Su, Wanyi Zhang, Mingyan Li, Yan Ding, Fausto Giunchiglia, Chao Chen

发表机构 * College of Computer Science, Chongqing University, Chongqing, China(重庆大学计算机学院) Lumos Robotics, China(Lumos机器人中国) Xi'an Jiaotong-Liverpool University, China(西安交通大学利物浦大学) Fudan University, China(复旦大学) Department of Information Engineering and Computer Science, University of Trento, Trento, Italy(特伦托大学信息工程与计算机科学系)

AI总结 提出GSAM框架,通过视觉感知器生成运动学参数、基于VLM的细调器进行常识推理修正、交互约束函数生成器集成障碍物避免知识,并由运动学感知规划器验证轨迹可达性,在50个铰链任务上相比最佳基线将标准差降低3.1%、操作成功率提升36.0%。

Comments Accepted by the 19th International Conference on Parallel Problem Solving from Nature (PPSN 2026)

详情
AI中文摘要

铰接物体操作对服务机器人是一个独特的挑战。现有方法采用端到端策略学习、视觉运动规划以及大语言/视觉语言模型(LLM/VLM),但往往忽视了铰接物体的多样性和末端执行器与手柄之间交互的复杂性,导致泛化能力有限和破坏性碰撞。为了解决这一问题,我们提出了GSAM,一个通用且安全的铰接物体操作机器人框架。具体来说,一个基于视觉的感知器生成运动学参数。考虑到感知器中预训练标记产生的原始估计可能偏离常识,我们提出了一个基于VLM的细调器,利用链式思维(COT)常识推理来细化感知。为了防止破坏性碰撞,我们设计了一个交互约束函数生成器,将铰接物体、交互姿态和障碍物避免知识集成到一个基中。然后LLM将这些约束函数化,并将其应用于轨迹和姿态规划。一个运动学感知的操作规划器验证轨迹和姿态的可达性。在5个物体类别的50个铰链任务和50个随机初始化的末端执行器-手柄配置上的实验表明,与最佳基线相比,GSAM将标准差降低了3.1%,操作成功率提高了36.0%,分别展示了GSAM在实际场景中优越的物体泛化能力和交互安全性。

英文摘要

Articulated object manipulation is a unique challenge for service robots. Existing methods employ end-to-end policy learning, visionmotion planning, and large-language/visual-language model (LLM/VLM), but often overlook the diversity of articulated objects and the complexity of interactions between end-effector and handle, leading to limited generalization and destructive collisions. To address this, we propose GSAM, a generalizable and safe robotic framework for articulated object manipulation. Specifically, a vision-based perceiver generates the kinematic parameters. Considering that pre-trained markers in perceiver yield raw estimations that may deviate from commonsense, we present a f ine-tuned VLM-based refiner, using chain-of-thought (COT) commonsense reasoning to refine perception. To prevent destructive collisions, we design an interaction constraint function generator, integrating articulated object, interaction pose, and obstacle avoidance knowledge into a base. LLM then functionalize these constraints and apply them to trajectory and posture planning. A kinematic-aware manipulation planner verifies reachability for trajectory and posture. Experiments on 50 hinge tasks across 5 object categories and 50 randomly initialized end-effectorhandle configurations show that GSAM reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline, respectively demonstrating the superior object generalization and interaction safety of GSAM in practical scenarios.

2605.30738 2026-06-01 cs.AI 版本更新

MAVEN: Improving Generalization in Agentic Tool Calling

MAVEN:提升智能体工具调用的泛化能力

Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali

发表机构 * CoreThink AI, USA(CoreThink AI, 美国) Stanford University, Stanford, CA, USA(斯坦福大学, 加州斯坦福, 美国)

AI总结 提出MAVEN框架,通过轻量级符号推理脚手架实现结构化分解、自适应工具编排和中间验证,在多个基准测试中显著提升模型性能,且成本仅为前沿专有模型的约1/10。

详情
AI中文摘要

跨智能体工具调用环境的泛化仍然是可靠智能体推理系统的核心挑战。尽管大语言模型在单个基准测试上取得了强劲结果,但它们在组合推理策略、保留中间状态以及跨域协调工具方面的能力仍未得到充分探索。我们提出MAVEN(模块化智能体验证与执行网络),这是一种轻量级符号推理脚手架,用于结构化分解、自适应工具编排和中间验证。我们在包括BFCL v3、TauBench、Tau2Bench、AceBench在内的既定工具调用基准上评估MAVEN,并引入MAVEN-Bench,这是一个针对多步数学和物理推理的压测基准,具有显式验证和对抗性任务组合。MAVEN-Bench揭示了部分推理质量与端到端任务成功之间的巨大差距;在直接的MAVEN-Bench运行中,MAVEN在不进行额外训练的情况下,将其GPT-OSS-120b基础模型的准确率从48%提升至71%。同时,它在使用开源权重骨干且估计成本约为1/10的情况下,与前沿专有基线保持竞争力,这表明以轻量级验证为中心的脚手架可以增强组合推理,并激励对智能体进行更注重过程的评估。

英文摘要

Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. We evaluate MAVEN across established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition. MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training. It also remains competitive with frontier proprietary baselines while using an open-weight backbone with an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.

2605.30736 2026-06-01 cs.LG cs.AI cs.CL 版本更新

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

OrcaRouter: 一种面向生产的混合离线-在线学习LLM路由器

Zhenghua Bao, Fengya Tian, Chris Zhang, Zhenjun Chen, Xile Ma, Yi Shi

发表机构 * Continuum AI

AI总结 提出OrcaRouter,一种结合LinUCB上下文赌博机与混合离线-在线学习协议的生产级LLM路由器,通过离线全信息反馈和在线赌博机学习实现低成本高精度模型选择。

Comments 6 pages, 1 table. Technical report

详情
AI中文摘要

大型语言模型的快速发展,每个模型具有不同的能力和推理成本,引发了一个实际部署问题:给定一个传入请求,应由哪个模型处理?我们提出OrcaRouter,一种面向生产的LLM路由器,它结合了基于词法和句子嵌入特征的LinUCB上下文赌博机与混合离线-在线学习协议。在离线阶段,OrcaRouter通过在一组精心策划的路由提示上评估每个候选模型来获取全信息反馈,生成一个奖励矩阵,用于为每个臂拟合一个岭回归器。在部署时,它从这些参数初始化,并可选地从赌博机反馈中继续学习,在观察到奖励后仅更新所选模型的臂。在我们提交RouterArena时(2026年5月20日),OrcaRouter-Adaptive以72.08的竞技场得分在公共RouterArena排行榜上排名第二,在每1000次查询成本1.00美元的情况下实现了75.54%的准确率。

英文摘要

The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.

2605.30720 2026-06-01 cs.LG cs.AI econ.GN q-fin.EC stat.ML 版本更新

Kalimati Vegetable Price Index Forecasting with a Momentum Corrected Online Stacking Ensemble

Kalimati蔬菜价格指数预测:基于动量校正的在线堆叠集成方法

Sahaj Raj Malla

发表机构 * Department of Mathematics, Kathmandu University(数学系,加德满都大学)

AI总结 针对新兴经济体农产品价格高波动性问题,提出动量校正在线堆叠集成模型,通过构建逆波动率加权综合指数和64个因果特征,在90天预测期实现RMSE=1.771、MAPE=0.68%、R²=0.845的优异性能。

Comments 21 pages, 8 figures, 2 tables

详情
AI中文摘要

由于高波动性、频繁的供应中断以及强烈的文化需求影响,新兴经济体的农产品价格预测十分困难。本研究引入了Kalimati蔬菜价格指数(KVPI),这是一个新的逆波动率加权综合指数,汇总了加德满都十年(2013-2023年)的135种日度批发商品。通过创建稳定的宏观信号,KVPI减少了单个作物建模固有的噪声。我们开发了包含64个因果有效特征的丰富特征集,包括节日领先滞后效应、滚动统计量和日历变量。对涵盖统计、树基、深度学习、混合和Transformer架构的14种预测模型,在短期(7天)、中期(14天和30天)和长期(90天)预测期上进行了严格评估。树基集成方法表现出显著的鲁棒性,而经典统计模型和复杂Transformer在处理噪声数据集时表现不佳。提出的动量校正在线堆叠集成模型取得了最强性能,在90天预测期上均方根误差(RMSE)为1.771,平均绝对百分比误差(MAPE)低至0.68%,并解释了84.5%的方差(R²=0.845)。这一开源流程为尼泊尔及类似市场的政策制定者和供应链参与者提供了实用、可靠的工具,以预测价格波动并加强粮食安全。

英文摘要

Forecasting agricultural commodity prices in emerging economies is difficult due to high volatility, frequent supply disruptions, and strong cultural influences on demand. This study introduces the Kalimati Vegetable Price Index (KVPI), a new inverse-volatility weighted composite index that aggregates 135 daily wholesale commodities from Kathmandu over ten years (2013-2023). By creating a stable macro-level signal, the KVPI reduces the noise inherent in modelling individual crops. A rich set of 64 causally valid features was developed, including festival lead-lag effects, rolling statistics, and calendar variables. Fourteen forecasting models spanning statistical, tree-based, deep learning, hybrid, and transformer architectures were rigorously evaluated across short (7-day), medium (14- and 30-day), and long-term (90-day) horizons. Tree-based ensembles proved notably robust, while classical statistical models and complex transformers struggled with the noisy dataset. The proposed Momentum-Corrected Online Stacking Ensemble achieved the strongest performance, yielding a Root Mean Square Error (RMSE) of 1.771, an exceptionally low Mean Absolute Percentage Error (MAPE) of 0.68%, and explaining 84.5% of the variance (R-squared = 0.845) at the 90-day horizon. This open-source pipeline provides policymakers and supply chain actors in Nepal and similar markets with a practical, reliable tool for anticipating price movements and strengthening food security.

2605.30719 2026-06-01 cs.LG cs.AI 版本更新

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

何时LLMs足以作为序列RL任务的策略优化器?

Stephane Hatgis-Kessell, Emma Brunskill

发表机构 * Department of Computer Science, Stanford University(计算机科学系,斯坦福大学)

AI总结 提出PromptPO方法,利用LLM通过Python描述状态空间、动作空间和奖励函数,基于rollout反馈迭代生成和优化可执行策略,在多种环境中匹配或超越标准RL基线,但在细粒度连续控制任务中表现不足。

详情
AI中文摘要

我们研究大型语言模型(LLMs)何时可以作为强化学习(RL)任务的有效黑盒策略优化器,即何时可以用LLM替代经典RL算法?我们通过引入提示策略优化(PromptPO)来探索这个问题,这是一种迭代方法,它用状态空间、动作空间和奖励函数的Python描述提示LLM,然后让LLM根据rollout反馈生成并优化可执行策略。在硬探索环境、Meta-World机器人任务以及几个现实世界控制问题中,PromptPO通常匹配或超过标准RL基线的性能,同时使用显著更少的环境交互。为了最大化期望回报,且无需进一步显式提示,PromptPO输出的策略范围从调谐的比例控制器或基于规则的规划到运行值迭代等规划算法的策略。我们的结果表明,当LLM能够利用关于环境或优化策略的先验知识时,基于LLM的策略优化是足够的。PromptPO在MuJoCo领域中的表现不如标准RL基线,这展示了基于LLM的策略优化在需要细粒度连续控制的设置中可能存在的局限性。

英文摘要

We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e., when can we replace classical RL algorithms with an LLM? We explore this question by introducing Prompted Policy Optimization (PromptPO), an iterative method that prompts an LLM with Python descriptions of the state space, action space, and reward function, then has it generate and refine executable policies based on rollout feedback. Across hard exploration environments, Meta-World robotics tasks, and several real-world control problems, PromptPO often matches or exceeds the performance of standard RL baselines while using substantially fewer environment interactions. To maximize expected return, and without further explicit prompting, the policies PromptPO outputs range from tuned proportional controllers or rule-based plans to policies that run planning algorithms like value iteration. Our results demonstrate that LLM-based policy optimization is sufficient when the LLM can leverage prior knowledge about the environment or optimization strategy. PromptPO underperforms standard RL baselines in MuJoCo domains. This demonstrates possible limitations of LLM-based policy optimization to settings that requiring fine-grained continuous control.

2605.30716 2026-06-01 cs.CV cs.AI 版本更新

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

用于病例级病理学概要报告生成的简单令牌高效视觉语言模型

Zhiyuan Yang, Jiahao Cheng, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

发表机构 * Department of Computer Science and Software Engineering (CSSE), Concordia University, Montreal, Canada(计算机科学与软件工程系(CSSE),康科迪亚大学,蒙特利尔,加拿大) Axe Cancer, Centre de recherche du CHUM, Université de Montréal, Montreal, Canada(Axe癌症,CHUM研究中心,蒙特利尔大学,蒙特利尔,加拿大) Institut de recherche en immunologie et cancérologie (IRIC), Université de Montréal(免疫学与癌症研究所(IRIC),蒙特利尔大学) Mila - Quebec AI Institute, Montreal, Canada(魁北克AI研究所(Mila),蒙特利尔,加拿大)

AI总结 提出一种简单令牌高效的视觉语言模型,通过5倍放大率的512×512补丁和两阶段监督训练,在有限GPU内存下实现病例级多WSI病理报告生成,显著降低序列长度并提升效率。

Comments Accepted by the DeLTA 2026 conference

详情
AI中文摘要

从全切片图像(WSI)生成临床有用的病理报告具有挑战性,原因在于十亿像素分辨率、长视觉令牌序列以及病例级推理的复杂性(单个病例可能包含多个具有异质性组织和模糊发现的WSI)。我们提出了一种简单的令牌高效视觉语言模型,用于病例级概要报告生成,在受限GPU内存下保持实用性。我们的架构遵循最小的三组件设计:冻结的病理补丁编码器、轻量级两层MLP视觉语言对齐器和大语言模型解码器,并带有显式的WSI标记令牌以分隔病例内的切片。训练分两个监督阶段进行:(1)仅对齐器的WSI字幕生成,使用异质WSI-文本对;(2)病例级监督微调,基于病例-报告对进行结构化报告生成。为了减少序列长度,我们使用5倍放大率下的$512 \times 512$补丁表示每个切片,与常用的20倍补丁相比,平均序列长度减少高达64倍。结合高效训练技术,我们仅用半块NVIDIA H100 GPU即可实现实际训练。在两个训练阶段中,我们的方法在ROUGE-L/METEOR/BLEU-4上取得了高分,同时在内存和运行时间上显著更高效。在基于AI的评估中,我们的模型始终优于强基线。大量消融实验表征了性能-效率权衡,并确定了在多WSI设置中提高鲁棒性的简单选择。总体而言,这项工作为高效病理报告生成提供了一个强大且可复现的基线,降低了在有限计算资源下进行多WSI VLM研究的门槛。

英文摘要

Generating clinically useful pathology reports for pathology cases from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision--language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ times compared to the commonly used $20\times$ patches. Combined with efficient training techniques, we enable practical training with only half a NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.

2605.30711 2026-06-01 cs.CL cs.AI cs.LG stat.ML 版本更新

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

SAGE: 一种用于智能体大语言模型中高效记忆演化的新颖门控机制

Sijia Wang, Dhanajit Brahma, Ricardo Henao

发表机构 * Duke University(杜克大学)

AI总结 提出SAGE门控机制,基于von Mises-Fisher密度估计和自适应阈值,将记忆写入控制建模为新奇性检测问题,在LoCoMo上以更低成本实现最优token-F1。

详情
AI中文摘要

智能体大语言模型必须持续决定新提取的事实是应添加、与现有记忆合并还是忽略,然而先前的工作更侧重于检索和存储,而非原则性的写入端控制。我们将记忆演化视为一个新颖性检测问题,并提出SAGE(Spherical Adaptive Gate for memory Evolution),一种用于记忆演化的球形自适应门控机制,它通过基于von Mises-Fisher的密度估计器对记忆嵌入上的候选事实进行评分,并使用跟踪记忆存储几何结构的自适应阈值对其进行路由。SAGE将明确新颖的事实解析为ADD,明确冗余的事实解析为NOOP,仅将不确定的情况发送给LLM合并步骤,从而减少了昂贵的写入时推理。在LoCoMo上,SAGE在所有七个开放权重骨干对比中均实现了对Mem0的最佳平均token-F1,而在GPT-4o-mini上,它将添加阶段的API成本降低了3.4倍,添加阶段延迟降低了2.5倍,且平均评判分数差距很小。作为A-Mem的即插即用二进制门控,SAGE在五个模型上跳过了大约16-18%的LLM调用,且在开放权重骨干上质量变化极小。这些结果表明,新颖性感知的写入控制是提高长期智能体记忆中记忆质量和系统效率的实用杠杆。

英文摘要

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\times$ and add-phase latency by 2.5$\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.

2605.30698 2026-06-01 cs.CV cs.AI cs.MA 版本更新

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

先见后议:用视觉证据对齐多智能体共识

Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) Renmin University of China(中国人民大学) Shandong University(山东大学)

AI总结 提出EAGLE框架,通过显式暴露各智能体的视觉证据区域并相互验证,实现无需训练的多智能体视觉问答协作,提升共识可靠性。

详情
AI中文摘要

视觉语言模型(VLM)在视觉问答(VQA)上取得了强劲性能。为了减轻个体幻觉和盲点,通过多智能体协作聚合不同视角已成为一种有前景的范式。虽然这种方法在文本问答中取得了巨大成功,但其在多模态领域的潜力仍未充分探索。现有的多智能体VQA方法主要采用以文本为中心的协议,专注于文本讨论而忽略视觉信息的对齐。在这项工作中,我们揭示了一个关键见解:答案级别的共识对于可靠的多智能体VQA是不够的; extit{对齐的视觉证据}——智能体所依赖的图像区域的共享支持——对于可信的共识至关重要。为了利用这一见解,我们提出了EAGLE( extbf{E}vidence- extbf{A}ligned extbf{G}rounded mu extbf{L}ti-agent r extbf{E}asoning),一个无需训练的以证据为中心的框架,用于协调多个VLM智能体。EAGLE显式暴露每个智能体的定位区域作为视觉证据,允许对证据进行相互验证,并使用证据一致性指导最终决策。在六个VQA基准上的实验表明,EAGLE在跨领域实现了最佳平均性能,同时保持轻量、可解释且易于部署。

英文摘要

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\textbf{E}vidence-\textbf{A}ligned \textbf{G}rounded mu\textbf{L}ti-agent r\textbf{E}asoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.

2605.30689 2026-06-01 cs.CV cs.AI 版本更新

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

ConTrans:学习文本增强的局部-全局时间表示用于零样本时间动作定位

Kanchan Keisham, Thenukan Pathmanathan, Thangarajah Akilan

发表机构 * Vellore Institute of Technology, India(维洛雷理工学院,印度) Lakehead University, Canada(拉克希德大学,加拿大)

AI总结 针对零样本时间动作定位中忽略局部相关性和特征表示能力不足的问题,提出融合卷积归纳偏置与Transformer自注意力的多尺度编码器ConTrans,联合捕获细粒度局部依赖和长程全局上下文,在ActivityNet-1.3和THUMOS14上显著超越现有方法。

Comments 4 figures, 8 tables

详情
AI中文摘要

零样本时间动作定位(ZS-TAL)旨在检测和定位未修剪视频中未见过的动作。然而,现有方法主要关注建模长程上下文信息,常常忽略了视频帧之间基于相对偏移的关键局部相关性。此外,由于网络架构的浅层性,其特征表示能力受限,阻碍了性能提升。在本文中,我们通过引入一种新颖的局部-全局多尺度特征表示模块来解决这些局限性。我们提出了一种新颖的多尺度编码器架构,称为ConTrans,它将卷积(Conv)归纳偏置与Transformer自注意力相结合,以共同捕获细粒度的局部依赖和长程全局上下文,从而比现有方法获得更全面的特征表示。在ActivityNet-1.3和THUMOS14数据集上的实验评估表明,ConTrans显著优于现有方法,为ZS-TAL建立了新的基准。

英文摘要

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer Self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.

2605.30686 2026-06-01 cs.CR cs.AI cs.LG 版本更新

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

工具调用ReAct代理中深度相关的间接提示注入:注入深度、载荷框架和轮次预算敏感性

Mohammadreza Rashidi

发表机构 * Department of Computer Science(计算机科学系) AI and Media Analysis Lab(人工智能与媒体分析实验室) Berlin, Germany(柏林,德国)

AI总结 通过四个对照实验(共460次试验),研究在工具调用ReAct代理中,注入深度、载荷框架和轮次预算对间接提示注入攻击成功率的影响,发现注入深度是主导变量,且仅清理第一个工具观察可捕获67%的注入成功。

Comments 17 pages, 16 figures

详情
AI中文摘要

将链式推理与工具调用交错的ReAct代理越来越多地用于实际任务,如调度、文件检索和数据访问。它们的工具观察循环创建了一个直接攻击面:控制任何工具返回值的攻击者可以嵌入指令,将代理从用户目标引开,这种威胁称为间接提示注入。现有基准在固定条件下评估固定注入位置的攻击成功率(ASR),留下了三个未探索的风险维度:载荷在工具序列中出现的位置(注入深度)、使用的修辞风格(框架)以及代理允许的轮次数(轮次上限)。我们在五个攻击类别的20个场景中进行了四项对照研究,总共对GPT-4o-mini和Claude Haiku进行了460次试验,总API成本低于0.36美元。研究1显示,GPT-4o-mini的ASR从深度1的60%衰减到深度4和5的0%(Cramer's V = 0.58,p < 0.001;限制在序列内深度1-3:V = 0.47,p = 0.0013),这是由于深度1的模型抵抗和更深位置在遇到载荷前任务完成所致。研究2在Claude Haiku上重复了深度实验,通过保守的工具调用和真正的指令抵抗,在每个深度均实现了0%的ASR。研究3显示,在深度1,框架将ASR调节在25%(中性)到75%(角色)之间,范围达50个百分点,但在每个条件下N=20时未达到统计显著性。研究4确认ASR在3、5和7的轮次上限下稳定,表明轮次预算在此设置中不是风险因素。我们的结果确立了注入深度为主导变量,并表明仅清理第一个工具观察可捕获67%的测量注入成功。

英文摘要

ReAct agents that interleave chain-of-thought reasoning with tool calls are increasingly deployed for real tasks such as scheduling, file retrieval, and data access. Their tool observation loop creates a direct attack surface: an adversary who controls any tool's return value can embed instructions that redirect the agent away from the user's goal, a threat known as indirect prompt injection. Existing benchmarks evaluate attack success rate (ASR) at a fixed injection position under fixed conditions, leaving three risk dimensions unexplored: where in the tool sequence the payload appears (injection depth), what rhetorical register it uses (framing), and how many turns the agent is permitted (turn cap). We conduct four controlled studies on 20 scenarios spanning five attack categories, totalling 460 trials against GPT-4o-mini and Claude Haiku at a combined API cost under 0.36 USD. Study 1 shows that ASR against GPT-4o-mini decays from 60% at depth 1 to 0% at depths 4 and 5 (Cramer's V = 0.58, p < 0.001; restricted to within-sequence depths 1-3: V = 0.47, p = 0.0013), driven by model resistance at depth 1 and task completion before payload encounter at deeper positions. Study 2 replicates the depth experiment on Claude Haiku, which achieves 0% ASR at every depth through a combination of conservative tool invocation and genuine instruction resistance. Study 3 shows that framing modulates ASR between 25% (neutral) and 75% (persona) at depth 1, a 50-percentage-point range that does not reach statistical significance at N = 20 per condition. Study 4 confirms that ASR is stable across turn caps of 3, 5, and 7, indicating the turn budget is not a risk factor in this setting. Our results establish injection depth as the dominant variable and show that sanitising only the first tool observation captures 67% of measured injection successes.

2605.30685 2026-06-01 cs.CY cs.AI cs.CL cs.HC 版本更新

How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language

早期采用者如何在全球范围内使用生成式AI:按国家收入和语言的差异

Madeleine I. G. Daepp, Isaac Slaughter

发表机构 * Microsoft AI Economy Institute(微软人工智能经济研究所)

AI总结 基于大规模匿名化AI聊天机器人交互数据,实证分析了不同国家早期采用者在使用生成式AI上的差异,发现教育用途在低收入国家更普遍,休闲用途与收入正相关,且英语交互在非英语主导国家中过度代表,表明语言性能改进可能影响数字鸿沟或跨越式发展。

详情
AI中文摘要

全球范围内人们正在使用AI,但并非所有人都以相同的方式使用。利用一个广泛可用的免费AI聊天机器人的大规模匿名化、去标识化和隐私清洗的交互数据集,我们实证描述了不同国家早期采用者使用情况的差异。在大多数国家,尤其是低收入国家,教育是最常见的使用领域,教育使用与国家GDP之间存在明显的负相关。相比之下,休闲相关使用与国家收入水平正相关。我们发现,语言也塑造了使用模式:在研究期间,英语交互在那些主要语言未被现有模型很好服务的地区过度代表。我们的工作表明,改善跨语言性能可能是决定这项技术扩大数字鸿沟还是实现跨越式发展的关键因素。

英文摘要

AI is being used by people globally, but not everyone is using it in the same ways. Using a large-scale dataset of anonymized, de-identified, and privacy-scrubbed interactions with a widely available and free AI chatbot, we empirically characterize differences in early adopters' usage across countries. Schooling is the most common domain of use in most countries, particularly low-income countries, with a strong inverse association evident between schooling and country-level GDP. Leisure-related use, by contrast, is positively associated with country-level income. Language, we find, also shapes use: English-language interactions are overrepresented in places where the predominant languages were not well-served by existing models during the period of the study. Improving performance across languages may be a key factor, our work suggests, in whether this technology expands digital divides or enables leapfrogging.

2605.30680 2026-06-01 cs.AI cs.MA 版本更新

Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

战略提供者响应下的策略即代码搜索中的医疗机制

Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tongji University(同济大学)

AI总结 将医疗机制设计转化为语言模型的程序合成,通过多智能体模拟器Medi-Sim评估策略提供者响应下的均衡,并利用LLM引导的进化代码搜索合成可检查的混合目标程序。

Comments 32 pages, 18 figures, 4 tables

详情
AI中文摘要

医疗机制与它们所引发的战略提供者响应密不可分:现有的医疗AI基准固定了这种响应,因此无法通过它们产生的均衡来评估机制。我们将医院机制设计重新定义为语言模型的程序合成:类型化、可检查的规则程序由Medi-Sim执行和评分,Medi-Sim是一个具有五个战略提供者渠道(编码、选择、延迟、努力、分诊)的多智能体模拟器。激励扫描恢复了经典的健康经济学发现作为相邻制度——在利润压力下的过度编码和低复杂度患者选择,以及古德哈特式漂移,其中测量绩效与真实结果呈负相关——而单个审计杠杆暴露了压力迁移:关闭编码渠道使低复杂度选择增加一倍以上。LLM引导的进化代码搜索在相同的规则程序空间上合成一个可检查的混合目标程序,该程序消除了过度编码,将拒绝率减半,并保留了大部分以利润为导向的基线的资金。

英文摘要

Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.

2605.30677 2026-06-01 cs.CR cs.AI cs.SE 版本更新

Investigating Detection and Obfuscation of Prompt Injection Attacks Against Software Reverse Engineering AI Agents

针对软件逆向工程AI代理的提示注入攻击的检测与混淆研究

Brian Crawford, Patrick McClure

发表机构 * Dept. of Computer Science(计算机科学系) Naval Postgraduate School(海军学院)

AI总结 本研究针对软件逆向工程AI代理面临的提示注入攻击,提出了检测反编译器输出中提示注入字符串的防御策略,并探索了攻击混淆及相应防御方法。

详情
AI中文摘要

代理型软件逆向工程系统容易受到嵌入可执行二进制文件源代码中的提示注入攻击。本研究展示了检测对抗性示例程序的反编译器输出中提示注入字符串存在的防御策略。还探讨了混淆这些攻击的方法以及随后防御这些混淆的方法。本研究推进了对代理型软件分析系统风险和安全性的理解,这对于将其部署到生产级网络工作流中是必要的。

英文摘要

Agentic software reverse engineering systems are vulnerable to prompt injection attacks placed into the source code of executable binary files. This research demonstrates defensive tactics for detecting the presences of prompt injection strings in the decompiler output of adversarial example programs. Methods for obfuscating these attacks and subsequent methods for defending against these obfuscations are also explored. This research advances the understanding of risk and security of agentic software analysis systems necessary for their deployment into production-level cyber workflows.

2605.30675 2026-06-01 cs.CL cs.AI 版本更新

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

大型语言模型不确定性中的人类对齐、校准与激活模式

Kyle Moore, Jesse Roberts, Daryl Watson, William Ward, Grayson Heyboer

发表机构 * Vanderbilt University(范德比大学) Tennessee Technological University(田纳西技术大学)

AI总结 研究大型语言模型的不确定性与人类不确定性的相似性,通过分析行为与内部激活模式,发现模型在多项选择和开放式事实回忆数据集上同时存在对齐与校准,并描述了指令微调的影响。

详情
AI中文摘要

不确定性量化是大型语言模型行为分析中一个庞大且不断发展的子领域。为了识别和对抗幻觉,该领域主要关注测量和改进校准,即不确定性判断对任务效能的准确性。在这项工作中,我们探讨了一个相对未被充分探索的问题:大型语言模型的不确定性与人类不确定性有多相似。我们研究了大型语言模型的外部行为和内部激活模式中是否存在类似人类的不确定性信号,即不确定性对齐。我们识别了模型在涵盖多项选择和开放式事实回忆的多种数据集上是否同时表现出对齐和校准的证据。并且我们描述了指令微调对这些方面的影响。

英文摘要

Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, the accuracy of uncertainty judgments to task efficacy. In this work, we investigate the relatively underexplored question of how similar large language model uncertainty is to human uncertainty. We investigate the presence and strength of human-similar uncertainty signals, deemed uncertainty alignment, in large language model overt behavior and internal activation patterns. We identify whether the models show evidence of simultaneous alignment and calibration on a variety of datasets covering both multiple choice and open ended factual recall. And we characterize the effect of instruct fine-tuning on each of these facets.

2605.30668 2026-06-01 cs.CL cs.AI 版本更新

CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation

CobSeg: 对话主题分割的连贯性边界建模

Sijin Sun, Liangbin Zhao, Jiaxiang Cai, Ming Deng, Mingyu Luo, Xiuju Fu

发表机构 * Institute of High Performance Computing, Agency for Science, Technology and Technology(高性能计算研究所,科技局) Shanghai Univeristy(上海大学) Fudan University(复旦大学)

AI总结 提出CobSeg多分支架构,通过分离连贯性语义与词汇边界转换并利用边界信息加权和主题连贯性线索,在无需LLM调用下提升对话主题分割性能。

Comments 8 pages with appindx. Under review

详情
AI中文摘要

对话主题分割在许多人类-AI协作应用中至关重要,需要识别异质边界线索,包括话语边缘附近的词汇转换和跨话语的语义不连续性。现有的话语模型常常稀释这些局部词汇信号。我们提出CobSeg,一种新颖的多分支架构,它将连贯性层面的语义连续性与词汇边界转换分离,并通过方向性边界预测恢复两者。CobSeg进一步使用边界信息加权来强调高效用的话语位置,并融合了基于语料库的主题连贯性线索与学习到的组合权重。尽管CobSeg在有监督的金标准边界训练和自动诱导边界的伪标签设置下作为紧凑的可训练分割器进行评估,它在推理过程中无需LLM调用即可实现增强的边界预测。在五个基准测试中,它改进了$P_k$和$W_d$,特别是在局部词汇线索显著时:在金标准监督下,它在VHF上将$P_k$降低了0.7个点,$W_d$降低了0.6个点,并在DialSeg711上达到了$P_k$为1.0;在诱导边界下,它在VHF上将$P_k$降低了14.8个点,在DialSeg711上降低了1.5个点,在TIAGE上降低了1.1个点,优于先前的非LLM方法。

英文摘要

Dialogue topic segmentation is critical in many human-AI collaborative applications which requires identifying heterogeneous boundary cues, including lexical transitions near utterance edges and semantic discontinuities across utterances. Existing utterance models often dilute these local lexical signals. We propose CobSeg, a novel multi-branch architecture that separates coherence-level semantic continuity from lexical boundary transitions and recovers both through directional boundary prediction. CobSeg further uses boundary informativeness weighting to emphasize high-utility utterance positions, and incorporates a corpus-derived topic coherence cue with learned combination weights. While CobSeg is evaluated as a compact trainable segmenter under supervised gold-boundary training and a pseudo-label setting with automatically induced boundaries, it performs enhanced boundary prediction without LLM calls during inference. Across five benchmarks, it improves $P_k$ and $W_d$ particularly when local lexical cues are prominent: under gold supervision, it reduces $P_k$ by 0.7 points and $W_d$ by 0.6 points on VHF, and reaches $P_k$ of 1.0 on DialSeg711; with induced boundaries, it reduces $P_k$ by 14.8 points on VHF, by 1.5 points on DialSeg711, and by 1.1 points on TIAGE, outperforming prior non-LLM approaches.

2605.30667 2026-06-01 cs.CR cs.AI 版本更新

Automatically Attacking Software Reverse Engineering AI Agents

自动攻击软件逆向工程AI代理

Brian Crawford, Justin Phillips, Patrick McClure

发表机构 * Naval Postgraduate School(海军学院)

AI总结 提出基于遗传算法的对抗性提示生成技术(AutoDAN变体),通过注入无关字符串变量欺骗LLM驱动的反汇编与反编译系统,导致其错误分析二进制可执行文件。

详情
AI中文摘要

用于逆向工程可执行二进制文件的软件工具(如Ghidra)使恶意软件分析师能够在无法访问原始源代码的情况下安全地进行稳健的静态分析。结合大型语言模型(LLM)的分析能力,配备工具(如GhidraMCP)的代理系统可以自动化先前由人工驱动的过程。尽管这种自动化可以提高单个恶意软件分析师的生产力,但它也为恶意软件混淆引入了新的漏洞领域。本文提出了一种对抗性技术,使用基于遗传算法的提示生成(一种称为AutoDAN的对抗性攻击的变体),以证明能够欺骗基于LLM的反汇编和反编译系统,使其错误解释二进制可执行文件,从而有效破坏其分析输出。这种概念验证方法利用了LLM处理和解译反编译机器代码时的固有漏洞,通过使用无关字符串变量赋值向LLM传递隐蔽指令,同时不影响可执行文件的功能。我们通过几个简洁的例子展示了这种能力。这种方法可能使攻击者能够绕过依赖LLM驱动分析管道的自动化检测系统。通过研究和理解这种攻击,可以获得关于将LLM集成到网络安全工具链中的安全影响以及构建更稳健的代理代码分析系统的见解。

英文摘要

Software tools for reverse engineering executable binary files, such as Ghidra, enable malware analysts to safely conduct robust static analysis without having access to original source code. Coupled with the analytic power of large language models (LLM), agentic systems enabled with tools, such as GhidraMCP, can allow analysts to automate a previously human driven process. Although this automation can increase the productivity of a single malware analyst, it also introduces a new area of vulnerability for malware obfuscation. This paper presents an adversarial technique using genetic algorithm-based prompt generation, a modification of an adversarial attack known as AutoDAN, to demonstrate the ability to deceive LLM-powered disassembly and decompilation systems into misinterpreting binary executables, effectively corrupting their analytical output. This proof-of-concept methodology exploits inherent vulnerabilities in how LLMs process and interpret decompiled machine code via prompt injection by using extraneous string variable assignments to pass surreptitious instructions to the LLM while not impacting the functionality of the executable file. We demonstrate this capability through several concise examples. This approach could enable attackers to bypass automated detection systems that rely on LLM-driven analysis pipelines. By studying and understanding this attack, insights can be gained regarding the security implication of integrating LLMs into cybersecurity toolchains and building more robust agentic code analysis systems.

2605.30664 2026-06-01 cs.AI 版本更新

Structure-Induced Information for Rerooting Levin Tree Search

结构信息用于重定根莱文树搜索

Jake Tuero, Michael Buro, Laurent Orseau, Levi H. S. Lelis

发表机构 * Department of Computing Science, University of Alberta, Edmonton, Canada. Alberta Machine Intelligence Institute (Amii), Edmonton, Canada. Google DeepMind, London, United Kingdom.

AI总结 提出三种重定根器设计,利用结构信息隐式分解子目标,提升策略树搜索的可扩展性和效率。

Comments ICML 2026

详情
AI中文摘要

基于子目标的策略树搜索利用策略引导搜索,对于复杂的单智能体确定性问题是有效的,但通常依赖于显式的子目标生成,这会带来大量开销并阻碍可扩展性。在本文中,我们通过最近引入的$\sqrt{\text{LTS}}$算法使用学习到的“重定根器”来克服这些限制。重定根器隐式地将问题分解为软子任务。虽然先前的工作侧重于给定或手工制作的重定根器的形式保证,但在本文中,我们提出了三种重定根器设计:(i) 基于聚类的重定根器,利用全局状态空间结构;(ii) 基于启发式的重定根器,利用学习的代价估计;(iii) 结合两种信号的混合重定根器。我们的框架避免了显式重构和推理生成的子目标,从而能够以显著降低的计算开销实现可扩展的搜索努力分配。实验上,我们的基于重定根的方法在基于子目标的策略树搜索失败的复杂环境中也能扩展,并在测试的领域上实现了最先进的在线训练效率。

英文摘要

Subgoal-based policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but often relies on explicit subgoal generation that can incur substantial overhead and hinders scalability. In this paper, we overcome these limitations by using a learned ``rerooter'' through the recently-introduced $\sqrt{\text{LTS}}$ algorithm. A rerooter implicitly decomposes the problem into soft subtasks. While previous work focused on the formal guarantees for given or handcrafted rerooters, in this work we propose three rerooter designs: (i) a clustering-based rerooter that exploits global state-space structure, (ii) a heuristic-based rerooter that leverages learned cost-to-go estimates, and (iii) a hybrid that combines both signals. Our framework avoids having to explicitly reconstruct and reason over generated subgoals, thereby enabling scalable allocation of search effort with significantly lower computational overhead. Empirically, our rerooting-based methods scale to complex environments where subgoal-based policy tree search fails, and achieve state-of-the-art online training efficiency on the domains tested.

2605.30654 2026-06-01 cs.CL cs.AI cs.HC 版本更新

EUDAIMONIA: Evaluating Undesirable Dynamics in AI

EUDAIMONIA: 评估AI中的不良动态

Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer, Robin Jia

发表机构 * University of Southern California(南加州大学)

AI总结 提出Social AI Design Code框架,并通过EUDAIMONIA基准测试评估22个最新LLM在社交互动中对用户福祉的符合程度,发现即使最强模型也违反约30%的设计要求。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作陪伴、情感披露和人际建议的对话伙伴,但这些互动的社会动态可能造成能力导向或传统安全评估无法捕捉的伤害。我们引入了Social AI Design Code,这是一个评估LLM在社交互动中是否符合用户福祉的框架,包括它们是否鼓励有害的亲密关系、依赖或长时间参与。为了在自然且多样化的用户-LLM互动中评估这些风险,我们通过弱到强过滤、多模型重新标记和受控重写,从WildChat构建了包含969个用户输入和3,147个设计要求违规检查的基准测试EUDAIMONIA,将代码操作化。评估22个最近的LLM,我们发现即使最强的模型Claude-Opus-4.7和GPT-5.5也分别违反了30.7%和27.2%的检查。扩展思考并未降低违规率,表明这些失败是持久的社会对齐问题,而非仅通过测试时推理就能解决的缺陷。

英文摘要

Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively. Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.

2605.30651 2026-06-01 cs.LG cs.AI 版本更新

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

LARK:基于可学习性的轨迹选择用于高效推理蒸馏

Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes, Taylor W. Killian, Fenglong Ma, Weitong Zhang, Porter Jenkins

发表机构 * Brigham Young University The Pennsylvania State University University of North Carolina at Chapel Hill

AI总结 提出LARK方法,通过可学习性因子ρ和χ²正则化选择策略,在推理蒸馏中高效选择学生模型可学习的轨迹,同时保持分布覆盖,显著提升多个基模型和推理任务的性能。

Comments 43 pages, 9 figures, 2 tables

详情
AI中文摘要

我们研究推理蒸馏中的轨迹选择问题,其中教师生成的推理轨迹被选择性地用作学生模型的监督。现有方法依赖于启发式规则,如轨迹质量或模型置信度,但往往忽略了轨迹是否可被学生模型学习。本文提出LARK,一种基于可学习性的推理轨迹选择方法。LARK选择学生能够高效学习的轨迹,同时保留完整训练分布的泛化能力。LARK的核心是可学习性因子$ρ$,它刻画了学生训练损失下降的速率。为了高效估计该速率并保持泛化,我们引入了一个可学习性代理和一个$χ^2$正则化的选择策略,该策略平衡可学习性和分布覆盖,两者均具有强理论保证的估计误差。实验表明,LARK在多个基模型和推理任务上持续优于数据选择基线。诊断分析显示,LARK得分能预测下游训练效用,且LARK选择的轨迹能诱导更快的监督微调损失下降。我们的代码可在https://github.com/Tianrun-Yu/LARK获取。

英文摘要

We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK, a learnability-grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor $ρ$, which characterizes the rate at which the student's training loss decreases. To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a $χ^2$-regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK-selected trajectories induce faster supervised fine-tuning loss reduction. Our code is available at https://github.com/Tianrun-Yu/LARK.

2605.30646 2026-06-01 cs.CL cs.AI 版本更新

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

同一患者,不同措辞,不同诊断?评估临床大语言模型中的语义稳定性

Mahdi Alkaeed, Adnan Qayyum, Nabeel Abo Kashreef, Muhammad Bilal, Junaid Qadir

发表机构 * Department of Computer Science and Engineering, College of Engineering, Qatar University(卡塔尔大学计算机科学与工程系) College of Science and Engineering, Hamad Bin Khalifa University (HBKU)(哈马德·本·卡伊夫大学(HBKU)理学院) Primary Health Care Corporation (PHCC)(初级卫生保健公司) Birmingham City University(伯明翰城市大学)

AI总结 针对临床大语言模型对语义等价但措辞不同的提示敏感的问题,提出基于自然语言推理的语义验证框架和三个量化指标,评估16个开源通用与医学模型,发现领域专业化并不一致地提升或降低鲁棒性。

Comments 14 pages, 5 figures

详情
AI中文摘要

大语言模型(LLMs)越来越多地应用于临床场景。然而,它们的行为对细微的语言变化(如改写或句法变化)高度敏感。这种敏感性在安全关键的医疗环境中带来风险,因为语义等价的输入应产生一致的预测。但一个关键挑战是确保提示变化真正保留临床意义,因为基于嵌入的相似性度量通常无法捕捉涉及否定、时间性或严重程度的区别。为解决这一局限,我们提出一个基于自然语言推理(NLI)的语义验证框架,用于过滤保留意义的提示变化,并进一步使用LLM-as-a-judge进行精炼,由临床专家审核。此外,我们引入三个指标来量化模型敏感性:保留意义变化敏感性(MVS)、置信度变化(ΔC)和最坏情况不稳定性(WCI)。我们使用来自DiagnosisQA和MedQA数据集的改写提示,评估了同一模型系列和参数规模下的16个开源通用(GP)和医学LLMs。结果表明,领域特定(DS)模型之间的鲁棒性差异是混合的且高度依赖模型,即领域专业化并不一致地改善或降低对保留意义提示改写的鲁棒性。几个DS模型在鲁棒性排名中位列前茅(与GP对应模型相比),而强大的GP基线模型也保持竞争力。

英文摘要

Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. However, a key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: MeaningPreserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.

2605.30641 2026-06-01 cs.CL cs.AI 版本更新

COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models

COFT: 用于大语言模型中公平思维链推理的反事实-保形解码

Arya Fayyazi, Mehdi Kamal, Massoud Pedram

发表机构 * Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, California, USA(电气与计算机工程系,南加州大学,洛杉矶,加利福尼亚州,美国)

AI总结 提出COFT,一种无需训练的解码方法,通过反事实提示和保形校准在解码时实现token级公平性控制,显著减少思维链生成中的社会偏见,同时保持任务效用和语言质量。

Comments Proceeding of ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在思维链(CoT)生成过程中可能揭示并放大社会偏见。我们提出COFT(Chain of Fair Thought),一种无需训练的解码方法,在解码时应用token级公平性控制,并对任何冻结的因果语言模型提供无分布边际有效性保证(在可交换性下)。COFT分三个阶段运行。首先,通过将敏感跨度替换为中性token来创建掩码反事实提示。其次,通过轻量级logit融合比较事实和掩码logit分布,以减弱属性驱动的偏见。第三,使用双分支分裂保形校准,在用户选择的风险水平下认证每步候选token集。我们在六个模型和多个偏见基准上评估COFT。我们的方法将标准偏见指标降低30-55%(中位数38%),同时保持任务效用和语言质量。推理准确率在运行间噪声范围内保持不变。计算开销适中,相当于一次额外的缓存前向传递(<=11%)。COFT提供了一条清晰、可审计的路径,实现更安全的CoT生成,显著减少偏见,效用损失可忽略,且无需重新训练、辅助分类器或权重访问。

英文摘要

Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at decode time, with distribution-free marginal validity guarantees (under exchangeability) for any frozen causal language model. COFT operates in three stages. First, it creates a masked counterfactual prompt by replacing sensitive spans with neutral tokens. Second, it compares the factual and masked logit distributions through lightweight logit fusion to attenuate attribute-driven biases. Third, it uses dual-branch split-conformal calibration to certify per-step candidate token sets at a user-chosen risk level. We evaluate COFT across six models and multiple bias benchmarks. Our method reduces standard bias metrics by 30-55% (median 38%) while preserving task utility and language quality. Reasoning accuracies remain unchanged within run-to-run noise margins. The computational overhead is modest, equivalent to one additional cached forward pass (<=11%). COFT offers a clear, auditable path to safer CoT generation with significant bias reduction, negligible utility loss, and no requirement for retraining, auxiliary classifiers, or weight access.

2605.30639 2026-06-01 cs.CV cs.AI cs.RO 版本更新

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

PInVerify:面向主动实例验证的离线具身基准

Yuhang Jiang

发表机构 * University of Trento(特伦托大学)

AI总结 提出主动实例验证任务,构建离线具身基准PInVerify,通过多视角导航和细粒度属性匹配评估具身智能体,并基于多模态大语言模型建立基线。

Comments Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify

详情
AI中文摘要

具身智能体在导航到目标物体方面取得了显著进展,但到达目标附近并不能保证智能体找到了正确的实例:微妙的属性差异(例如“白色花卉”与“白色条纹”)通常需要近距离、多视角检查。我们通过主动实例验证(AIV)来解决这一差距,该任务要求智能体主动围绕候选对象选择视角,以判断其是否匹配细粒度的自然语言描述。我们将AIV形式化为一个有限视野决策过程,并引入PInVerify,一个用于AIV的离线具身基准:包含18个物体类别的3000个评估场景,以多视角捕获形式提供,并采用6扇区导航拓扑,暴露陷阱视角(可导航但无信息)和不可达扇区。作为参考基线,我们构建了一个无需训练的流水线和一个基于开源多模态大语言模型(MLLMs)的LoRA微调端到端智能体(参数规模≤8B),包括属性分解、可见性加权多视角跟踪器和三种次优视角选择(NBV)策略。在Qwen3-VL(4B/8B)、SenseNova-SI-1.2-InternVL3-8B、CLIP和SigLIP2上的评估中,最佳MLLM基线超过最佳嵌入基线4.9个百分点;GT框消融实验显示检测差距为+3.1个百分点;在测试的NBV策略中,我们未观察到主动视角选择带来的可靠增益。LoRA微调智能体(SFT+GSPO)达到85.6%。PInVerify旨在支持具身AI中主动、细粒度语义验证的进一步研究。代码:https://github.com/Avalon-S/PInVerify。

英文摘要

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.

2605.30638 2026-06-01 cs.LG cs.AI 版本更新

Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment

分数广播与去相关:基于广播的信用分配通用框架

Mustafa Uzun, Mete Erdogan, Cengiz Pehlevan, Alper T. Erdogan

发表机构 * KUIS AI Center, Koc University, Turkey(科克大学KUIS人工智能中心,土耳其) Electrical and Electronics Engineering, Koc University, Turkey(科克大学电子与电气工程系,土耳其) Department of Electrical Engineering, Stanford University, USA(斯坦福大学电气工程系,美国) John A. Paulson School of Engineering & Applied Sciences, Harvard University, USA(哈佛大学约翰·A·保罗森工程与应用科学学院,美国) Kempner Institute, Harvard University, USA(哈佛大学凯姆纳研究所,美国) Center for Brain Science, Harvard University, USA(哈佛大学脑科学中心,美国)

AI总结 提出分数广播与去相关(SBD)框架,通过输出分数与隐藏层激活的正交性原理,统一了多种可微损失函数下的广播式信用分配,并理论支撑了三因子学习规则。

详情
AI中文摘要

我们引入了分数广播与去相关(SBD),一个用于一般可微损失族基于广播的信用分配的原则性框架。误差广播是反向传播的一种生物合理替代方案,它无需权重传输即可将输出信息发送到隐藏层。最近针对均方误差(MSE)设置引入的误差广播与去相关(EBD)框架,将这一机制建立在最优估计量的随机正交性基础上,即最优残差与输入的函数正交。我们通过引入输出分数(损失对最终层输出的梯度)与隐藏层激活之间的正交性原理来推广这一基础,该原理在最优分数条件均值为零时成立。这一单一原理统一了标准可微损失族(包括交叉熵、Bregman散度、适当评分规则和指数族负对数似然)的广播式信用分配。该框架为一般损失下的三因子学习规则提供了理论基础,其中神经调节因子被推导为广播损失分数。我们明确推导了交叉熵情况,刻画了可接受损失类,并引入了一种分数向量扩展技术,该技术在保持正交性框架的同时丰富了广播信号。在CIFAR-10和Tiny ImageNet上的实验表明,SBD显著优于现有的广播方法,而分数向量扩展带来了进一步的提升。总体而言,这项工作确定了损失分数作为广播信号,提供了正交性理论以及神经科学中三因子学习规则的理论基础,并展示了分数向量扩展如何丰富所得目标函数的去相关方向。

英文摘要

We introduce Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment for general families of differentiable losses. Error broadcast is a biologically plausible alternative to backpropagation that sends output information to hidden layers without weight transport. The Error Broadcast and Decorrelation (EBD) framework, recently introduced for the mean-squared-error (MSE) setting, grounded this mechanism in the stochastic orthogonality of optimal estimators, under which the optimal residual is orthogonal to functions of the input. We generalize that foundation by introducing an orthogonality principle between the output score (the gradient of loss with respect to the final-layer output) and hidden-layer activations, which holds whenever the optimal score has conditional mean zero. This single principle unifies broadcast-based credit assignment across the standard differentiable-loss families, including cross-entropy, Bregman divergences, proper scoring rules, and exponential-family negative log-likelihoods. The framework supplies a theoretical grounding for the three-factor learning rule under general losses, with the neuromodulatory factor derived as the broadcast loss score. We derive the cross-entropy case explicitly, characterize the admissible loss class, and introduce a score vector expansion technique that enriches the broadcast signal while preserving the orthogonality framework. Experiments on CIFAR-10 and Tiny ImageNet show that SBD substantially improves over existing broadcast approaches, with score vector expansion delivering further gains. Overall, this work identifies the loss score as the signal to broadcast, supplies the orthogonality theory and theoretical grounding for the three-factor learning rule from neuroscience, and shows how score vector expansion enriches the decorrelation directions of the resulting objective.

2605.30637 2026-06-01 cs.AI 版本更新

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

EHRBench: 基于电子健康记录的自动化可靠临床决策基准测试,用于大语言模型

Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang

发表机构 * Emory University(埃默里大学) Stanford University(斯坦福大学)

AI总结 提出EHRBench,通过EHR-LLM-KB交互流水线自动构建近百万问答对,涵盖诊断、治疗和预后三大临床决策任务,系统评估30余种LLM的性能与鲁棒性。

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026), Datasets and Benchmarks Track, Oral

详情
AI中文摘要

临床决策(CDM)是真实临床工作流程的核心,临床医生在不完整证据下推断诊断、选择治疗方案或预测未来健康结果。由于大语言模型(LLM)具有强大的语言能力、广泛的生物医学知识和高效性,越来越多地被用于支持这些决策,但LLM在真实临床决策任务上的可靠性尚未得到充分理解。为了评估CDM模型,特别是基于LLM的模型,一个理想且实用的医学决策基准应通过自动化且可靠的流水线构建,以确保规模和质量。此外,基于真实患者电子健康记录(EHR)的CDM基准可以更好地支持需要实质性生物医学知识和临床推理的实践性CDM任务的评估。为填补这些空白,我们引入了EHRBench,一个自动化且可靠的基于EHR的基准,用于大规模评估基于LLM的临床决策。为了确保可扩展性和可靠性,EHRBench通过EHR-LLM-KB(知识库)交互流水线构建。为了提高效率,我们使用专门的LLM自动将就诊级别的EHR轨迹转换为结构化模板,并确定性地将模板实例化为问答项。同时,我们应用系统性的基于知识库的验证和丰富,以过滤幻觉或模糊关系,并提高可靠性。利用该流水线,我们构建了近100万(960,067)个问答项,涵盖三个需要推理的核心临床决策任务:诊断、治疗和预后。我们在EHRBench上对30多个代表性LLM进行了基准测试,并提供了性能和鲁棒性的详细分析。结果显示了跨设置的一致能力趋势,进一步验证了EHRBench的可靠性,并指出了实现临床可靠LLM系统的可操作差距。

英文摘要

Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM models, especially LLM-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality. Moreover, the grounding of a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill the gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB(knowledge-base) interaction pipeline. For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items. In parallel, we apply systematic KB-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core inference-required clinical decision tasks: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness. The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems.

2605.30632 2026-06-01 cs.HC cs.AI cs.LG 版本更新

Rationalize: Shared Semantic Reasoning for Human-AI Alignment

Rationalize: 人机对齐的共享语义推理

Aritra Dasgupta, Naga Datha Saikiran Battula, Avina Nakarmi, Sohom Sen, Subhodeep Ghosh, Xun Song

发表机构 * New Jersey Institute of Technology(新泽西理工学院)

AI总结 提出Rationalize角色对框架,通过共享推理空间中的互补角色对(如探索者-引导者)实现人类与AI在数据驱动意义建构中的语义对齐,并设计元素级和角色特定的对齐评估方法。

Comments Accepted by ACM CHI 2026 BiAlign Workshop

详情
AI中文摘要

我们介绍了Rationalize,一个用于数据驱动意义建构中人类与AI模型之间共享语义推理的角色对框架。基于人机协作和批判性思维的思路,我们将人机交互概念化为一系列互补的角色对(探索者-引导者、调查者-告知者、教师-学生、法官-倡导者),这些角色对在共享推理空间中运作。在这个空间中,人类分析师和AI模型(如LLM)使目的、问题、假设、证据、推理和影响变得明确,不仅促进输出层面的对齐,而且促进双方意图和行动的合理化层面的对齐。我们将这些角色对与双向人机对齐框架联系起来,说明“使AI对齐人类”和“使人类对齐AI”如何因角色而异,并勾勒出一个使用元素级和角色特定方法进行对齐设计和评估的协作研究议程。

英文摘要

We introduce Rationalize, a role-pair framework for shared semantic reasoning between humans and AI models in data-driven sensemaking. Building on ideas in human-machine teaming and critical thinking, we conceptualize human-AI interaction as a series of complementary role pairs (Explorer-Guide, Investigator-Informant, Teacher-Student, Judge-Advocate) operating in a shared reasoning space. In this space, human analysts and AI models (such as LLMs) make purposes, questions, assumptions, evidence, inferences, and implications explicit, facilitating alignment not only at the output level but at the level of rationalization of intent and action by each side. We relate these role pairs to the bidirectional human-AI alignment framework, illustrating how "aligning AI to humans" and "aligning humans to AI" differ by role, and sketch a collaborative research agenda for alignment design and assessment using element-level and role-specific approaches.

2605.30631 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models

基于直方图正则化潜扩散模型的可控肺结节合成

Arunkumar Kannan, Yanbo Zhang, Han Liu, Michael Baumgartner, Jianing Wang, Alexander Hertel, Bogdan Georgescu, Sasa Grbic

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Department of Radiology and Nuclear Medicine, University Medical Center Mannheim, Heidelberg University(放射学与核医学科,曼海姆大学医学中心,海德堡大学)

AI总结 提出一种直方图正则化潜扩散模型,通过结合亚型、空间掩码和HU直方图条件以及可微特征空间直方图正则化项,在3D CT体积中合成肺结节,以准确建模结节特异性强度分布,提高视觉真实感和亚型一致性。

详情
AI中文摘要

尽管自动诊断系统在基于CT的肺癌筛查中取得了显著成功,但其发展仍受限于多样化、带标注的肺结节数据集的稀缺性。基于扩散的生成模型为数据合成提供了一种有前景的策略;然而,许多现有的条件方法主要优化空间重建损失,这鼓励体素级相似性,但可能不足以约束病灶级强度分布。因此,这些方法可能产生过度平滑的纹理轮廓,并低估不同结节亚型(包括实性、部分实性和磨玻璃结节)的独特衰减特性。为解决这一挑战,我们提出了一种可控潜扩散模型,该模型在全3D CT体积内合成肺结节,同时准确建模结节特异性强度分布。具体而言,我们不只依赖空间损失,还引入了一个基于直方图的正则化项,在生成过程中约束体素强度分布。该模型结合了亚型、空间掩码和Hounsfield单位(HU)直方图条件以及可微特征空间直方图正则化项,以更好地对齐病灶级强度分布,提高合成结节的视觉真实感和亚型一致性。在肺部CT数据上的大量实验表明,我们的框架实现了强烈的视觉真实感,通过定量指标和视觉图灵测试验证。此外,当用于数据增强时,生成的结节提高了下游临床任务的性能,特别是对于代表性不足的结节亚型,并显示出对亚型知情恶性分类的潜在益处。

英文摘要

While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis; however, many existing conditional approaches primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with the differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, improving the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data demonstrate that our framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.

2605.30628 2026-06-01 cs.CL cs.AI cs.LG 版本更新

The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

错误的架构:从普遍不可能到局部补丁的LLM可靠性

Mikhail L. Arbuzov, Lee Mosbacker, Sisong Bei, Ziwei Dong, Dmitri Kalaev, Alexey Shvets

发表机构 * Independent Researcher(独立研究者) Palo Alto Networks(帕洛阿尔托网络)

AI总结 本文通过两个命题和一个推论,形式化地论证了通用LLM可靠性在无限域上不可实现,但在操作有界的局部补丁中可通过目录发现和干预覆盖实现可靠性。

Comments 25 pages, no figures

详情
AI中文摘要

通用LLM可靠性不是一个有限库问题:在所有可能任务、工具、模式、知识源和评估者期望中,新的可干预区分的失败模式会无界出现,因此没有有限的干预词典能保证对每种此类模式的有界残余错误。但部署的系统并不在整个宇宙中运行。它们在操作有界的补丁(法律审查、医学RAG、代码修复、客户支持代理、合同提取)内运行,这些补丁具有重复的任务、模式、工具和评估者期望。在这些补丁内,经验证据表明失败是稀疏的、重复的,并集中在一个小的重复目录中,因此可靠性变成了一个局部目录发现和干预覆盖问题,而不是指数级的令牌长度问题。我们通过两个命题和一个推论形式化了这一转变。命题1是最坏情况模式方面的负面结果:没有有限的干预词典能覆盖无界域的每个可区分的失败模式。推论1是逆发现蕴含:模式发现的对数上界无法容纳线性更多的不同尾模式,除非指数级地观察到更多的硬失败事件。命题2是积极的局部补丁结果:在活跃模式暴露对数增长和头部重覆盖下,每个硬决策的足够干预预算随序列长度多对数增长,并在补丁目录饱和后变为域常数。该框架重新定位而非消解长上下文困难:当硬决策数量本身随任务长度增长时,可靠性仍然困难;贡献在于识别轴向干预,而非使这些区域变得容易。

英文摘要

Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no finite intervention dictionary can guarantee bounded residual error for every such mode. But deployed systems do not operate over the whole universe. They operate inside operationally bounded patches (legal review, medical RAG, code repair, customer-support agents, contract extraction) with recurring tasks, schemas, tools, and evaluator expectations. Within such patches, empirical evidence suggests failures are sparse, repetitive, and concentrated in a small recurring catalogue, so reliability becomes a local catalogue-discovery and intervention-coverage problem rather than an exponential token-length problem. We formalize this transition with two propositions and one corollary. Proposition 1 is the worst-case-mode-wise negative result: no finite intervention dictionary covers every distinguishable failure mode of an unbounded domain. Corollary 1 is the inverse-discovery implication: the logarithmic upper bound on mode discovery cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events. Proposition 2 is the positive patch-local result: under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically in sequence length and becomes domain-constant once the patch catalogue saturates. The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions itself grows with task length, reliability remains hard; the contribution is to identify the on-axis intervention rather than to make those regimes easy.

2605.30625 2026-06-01 cs.LG cs.AI stat.ML 版本更新

Active Timepoint Selection for Learning Measure-Valued Trajectories

学习测度值轨迹的主动时间点选择

Nicolas Huynh, Mihaela van der Schaar

发表机构 * DAMTP, University of Cambridge(剑桥大学 DAMTP 实验室)

AI总结 针对高成本破坏性数据获取场景,提出基于线性化最优传输的主动学习框架,通过高斯过程建模概率路径并迭代选择最优测量时间点以最小化不确定性。

Comments ICML 2026

详情
AI中文摘要

从稀疏快照推断连续概率路径是单细胞生物学等领域的基本挑战,其中高保真数据获取通常具有破坏性且受限于高昂测序成本。这促使需要主动学习策略来战略性选择最优测量时间。然而,为此场景设计主动学习策略仍是一个开放问题:目标对象位于无限维Wasserstein空间,标准欧几里得度量在此不适用,且当前插值方法缺乏认知不确定性量化。我们提出一个将主动实验扩展到测度空间的框架。通过利用线性化最优传输(LOT),我们将分布快照映射到适合高斯过程建模的切空间,从而为底层概率路径构建可处理的概率代理模型。这产生了一种采集策略,通过迭代选择测量时间以最小化不确定性。实验结果表明,我们的策略在合成和真实数据集上均优于不考虑不确定性的基线方法。

英文摘要

Inferring continuous probability paths from sparse snapshots is a fundamental challenge in domains like single-cell biology, where high-fidelity data acquisition is often destructive and constrained by prohibitive sequencing costs. This motivates the need for active learning strategies to strategically select optimal measurement times. However, designing active learning policies for this setting remains an open problem: the target objects reside on the infinite dimensional Wasserstein space where standard Euclidean metrics are ill-defined, and current interpolation methods lack epistemic uncertainty quantification. We introduce a framework which extends active experimentation to the space of measures. By leveraging Linearized Optimal Transport (LOT), we map distributional snapshots into a tangent space amenable to Gaussian Process modeling, allowing us to construct a tractable probabilistic surrogate for the underlying probability path. This yields an acquisition policy that iteratively selects measurement times to minimize uncertainty. Empirical results demonstrate that our strategy outperforms uncertainty-agnostic baselines on both synthetic and real-world datasets.

2605.30621 2026-06-01 cs.AI 版本更新

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

利用更新并非利用收益:解构自演化LLM智能体中的演化能力

Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) UC Santa Cruz(加州大学圣克鲁兹分校) Amazon(亚马逊) Emory University(埃默里大学) UIUC(伊利诺伊大学香槟分校) Northeastern University(东北大学)

AI总结 本文通过分析LLM智能体在外部框架(提示、技能、记忆和工具)上的自演化能力,发现框架更新能力与基础能力无关,而框架收益能力与基础能力呈非单调关系,中等能力模型受益最大。

Comments 24 pages, 9 figures, 12 tables

详情
AI中文摘要

LLM智能体越来越多地被部署为围绕可编辑外部框架(包括提示、技能、记忆和工具)构建的系统,这些框架在不改变模型参数的情况下塑造任务执行。框架自演化通过从执行证据中更新这些框架来适应此类智能体。然而,模型在任务解决中的基础能力是否预测其在框架自演化中的能力仍不清楚:哪些模型产生有用的框架更新,哪些模型实际上从中受益?我们分析了两种框架自演化能力:(i) 框架更新能力,即从执行证据中产生有用的持久框架更新的能力;(ii) 框架收益能力,即在任务解决过程中从更新框架中受益的能力。我们的分析揭示了两个发现。首先,框架更新能力在基础能力上是平坦的:来自不同能力层级的模型产生的框架更新带来的收益惊人地相似;甚至Qwen3.5-9B的更新产生的收益与Claude Opus~4.6相当。其次,框架收益能力在基础能力上是非单调的:弱层级模型从更新框架中受益甚微,中等层级模型受益最大,强层级模型受益少于中等层级。我们将弱层级的低收益归因于两种失败模式:弱层级模型可能无法激活相关的框架工件,或者激活了但未能忠实地遵循它们。这些发现表明应将能力预算投入到任务解决智能体而非演化器中,并在智能体训练中针对框架调用和长程指令遵循进行优化。我们的源代码公开在 https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution。

英文摘要

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus~4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.

2605.30619 2026-06-01 stat.ML cs.AI cs.LG 版本更新

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

从 Best-of-$N$ 偏好数据中学习奖励:目标、权衡与设计原则

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

发表机构 * Machine Learning Department(机器学习系)

AI总结 本文分析了从 Best-of-$N$ 采样构建的成对偏好数据中 Bradley-Terry 奖励学习的目标,揭示了 $N$ 和基础分布对奖励估计的影响,并提出了基于样本效率和连通性权衡的设计原则。

详情
AI中文摘要

Best-of-$N$ 采样被广泛用于构建成对偏好数据:从基础分布中抽取 $N$ 个候选,并将最佳响应与拒绝响应配对。尽管其广泛使用,但 Bradley-Terry (BT) 奖励学习从这类数据中提取了什么,以及如何选择 $N$ 和基础分布,仍不清楚。我们将近期通过诱导条件分布对偏好数据的分析专门应用于 Best-of-$N$。对于独立参考变体,我们推导出作为 $N$ 和基础分布显式函数的闭式奖励目标,并证明它们保留了潜在奖励排名。对于实用的 Best-vs-Random 和 Best-vs-Worst 变体,所选和拒绝的响应通过同一候选集耦合,因此精确的 BT 可表示性通常不成立;然而,随着 $N$ 增长,有界类最小化器接近参考目标。尽管已知边界和连通性在成对偏好学习中控制样本效率,但 Best-of-$N$ 通过 $N$ 以相反方向耦合它们:更大的 $N$ 加宽成对边界但降低连通性。这种权衡产生了两个设计原则:当偏好标签是瓶颈时使用较大的 $N$,当生成是瓶颈时使用较小的 $N$;并塑造基础分布,使其质量集中在测试时比较最重要的响应之间。在合成和真实偏好数据上的实验支持了对样本量和基础分布形状的预测依赖性。

英文摘要

Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley--Terry (BT) reward learning extracts from such data, and how to choose $N$ and the base distribution, remain unclear. We specialize a recent analysis of preference data via its induced conditional distribution to Best-of-$N$. For independent-reference variants, we derive closed-form reward targets as explicit functions of $N$ and the base distribution, and show that they preserve the latent reward ranking. For the practical Best-vs-Random and Best-vs-Worst variants, chosen and rejected responses are coupled through the same candidate set, so exact BT representability generally fails; nevertheless, bounded-class minimizers approach the reference targets as $N$ grows. Although margin and connectivity are known to govern sample efficiency in pairwise preference learning, Best-of-$N$ couples them through $N$ in opposing directions: larger $N$ widens pairwise margins but reduces connectivity. This trade-off yields two design principles: use larger $N$ when preference labels are the bottleneck, smaller $N$ when generation is the bottleneck; and shape the base distribution to place mass between the responses whose comparison matters most at test time. Experiments on synthetic and real preference data support the predicted dependence on sample size and base-distribution shape.

2605.30611 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Crafter: 面向多样化输入的可编辑科学图表生成的多智能体框架

Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Tsinghua University(清华大学) Peking University(北京大学)

AI总结 提出Crafter多智能体框架,通过结构化组合离散语义组件,实现跨图表类型和输入条件的可编辑科学图表生成,并引入CraftEditor将栅格输出转换为可编辑SVG,在CraftBench基准上显著优于现有方法。

Comments 24 pages, 11 figures

详情
AI中文摘要

科学图表是传达复杂研究思想最有效的手段之一,但生成出版质量的插图仍然是论文准备中最劳动密集的部分。现有的自动化系统各自针对单一图表类型,且仅接受文本输入,未能解决研究人员实际使用的多样类型和条件;此外,它们的栅格输出无法进行局部修改。由于科学图表是离散语义组件的结构化组合,生成器在这些布局上产生的局部错误需要的不是更强的骨干网络,而是一个框架。我们将这个框架实例化为两个互补系统:Crafter,一个用于图表生成的多智能体框架,无需架构更改即可泛化到多种图表类型和输入条件;以及CraftEditor,它应用相同的模式将栅格输出转换为可编辑的SVG。此外,我们引入了CraftBench,一个涵盖三种图表类型和四种输入条件的基准,并带有手工质量标注。实验表明,Crafter在PaperBanana-Bench和CraftBench上显著优于独立的生成器和智能体基线,消融实验确认了每个组件的独立贡献;CraftEditor忠实地将输出转换为可编辑的SVG,超越了所有基线。我们的代码和基准可在https://github.com/HaozheZhao/Crafter获取。

英文摘要

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

2605.30604 2026-06-01 cs.CR cs.AI cs.CL cs.IR 版本更新

An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations

面向受监管网络安全运营的组织范围LLM代理运行时架构

George Fatouros, Georgios Makridis, George Kousiouris, John Soldatos, Dimosthenis Kyriazis

发表机构 * Innov-Acts Ltd(Innov-Acts有限公司) Dept. of Digital Systems University of Piraeus(数字系统系希腊比雷埃克斯大学) Harokopio University(哈罗基奥大学) Dept. of Informatics and Telematics(信息学与电信系)

AI总结 提出一种组织范围的LLM代理运行时架构,通过类型化安全上下文、运行时核心、专业子代理、受控工具适配层和分层人机回环,实现检索、工具调用、内存、发现、报告和审计的全局强制,并保持模型无关和本地部署。

Comments 8 pages, 3 figures

详情
AI中文摘要

受监管的网络安全工作流缺乏一个运行时基础,该基础能够在检索、工具调用、内存、发现、报告和审计中强制执行组织范围,同时保持模型无关和本地可部署。近期的大语言模型(LLM)代理系统在孤立的网络安全任务上报告了强劲结果,但它们本身并未为受监管的安全运营中心(SOC)和合规工作流定义一个可审计的平台架构,在这些工作流中,单个分析师可能触发约束整个组织的行动,并且运行时必须与现有的SIEM/XDR堆栈集成,作为上下文和告警驱动触发器的主要来源,而不是作为独立的分析层运行。本文提出了一种面向金融网络安全的组织范围LLM代理运行时架构。其贡献是一个类型化的安全上下文,该上下文在每个入口点创建,包括作为一等触发器摄入的SIEM/XDR通知,并在每个组件边界强制执行,结合共享的运行时核心、逻辑专业子代理、受控工具适配层(在统一策略和审计下暴露SIEM/XDR查询、丰富和响应原语)、带有证据引用的结构化发现、分层人机回环(HITL)门控以及仅追加审计。模型上下文协议(MCP)、扩展遥测、用于渗透测试的数字孪生、图检索和联邦知识共享被视为可选的扩展路径,而非强制性的运行时假设。我们描述了一个可实现的切片作为架构的可测试性表面,并提出了一个可证伪的评估计划,包含用于架构就绪性、安全策略执行、证据可追溯性、输出质量和运营可观测性的度量级通过标准。

英文摘要

Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent large language model (LLM) agent systems report strong results on isolated cybersecurity tasks, yet they do not by themselves define an auditable platform architecture for regulated security operations centre (SOC) and compliance workflows, where a single analyst may trigger actions that bind the organization, and where the runtime must integrate with existing SIEM/XDR stacks as a primary source of context and alert-driven triggers rather than operate as a standalone analytical layer. This paper proposes an organization-scoped LLM agent runtime architecture for financial cybersecurity. The contribution is a typed Security Context that is created at every entry point, including SIEM/XDR notifications ingested as first-class triggers, and enforced at every component boundary, combined with a shared Runtime Core, logical specialist subagents, a governed Tool Adapter Layer exposing SIEM/XDR query, enrichment, and response primitives under uniform policy and audit, structured findings with evidence references, tiered human-in-the-loop (HITL) gates, and append-only audit. Model Context Protocol (MCP), extended telemetry, digital twins for pentesting, graph retrieval, and federated knowledge sharing are treated as optional extension paths rather than mandatory runtime assumptions. We describe an implementable slice as the architecture's testability surface, and we propose a falsifiable evaluation plan with metric-level pass criteria for architecture readiness, security-policy enforcement, evidence traceability, output quality, and operational observability.

2605.30593 2026-06-01 cs.LG cs.AI cs.CE 版本更新

Scientific Machine Learning for Engine Health Management and Remaining Useful Life Prediction

面向发动机健康管理与剩余寿命预测的科学机器学习

Jostein Barry-Straume, Changmin Son, Adrian Sandu, Gavan Burke, Rekha Sundararajan, Andrew Rimell, James G. Steinrock

发表机构 * Computational Science Laboratory(计算科学实验室) Department of Computer Science(计算机科学系) Virginia Tech(弗吉尼亚理工学院)

AI总结 提出一个多任务科学机器学习框架,通过联合预测涡轮气体温度、温差和剩余寿命并提供量化不确定性区间,以支持基于风险的维护决策。

详情
AI中文摘要

发动机健康管理依赖于对剩余寿命的可靠预测以及对涡轮气体温度等热指标的跟踪。在实际应用中,真实机队数据具有异质性和非平稳性,仅靠点预测不足以支持风险感知的维护决策。本文提出了一种用于涡轮机预测的多任务科学机器学习框架,该框架联合预测未修剪涡轮气体温度、涡轮气体温差和剩余寿命,并以预测区间的形式提供量化不确定性,并评估其经验覆盖率。共享序列编码器(带有残差双向LSTM层和注意力池化的卷积前端)为任务特定头部提供输入,包括用于概率回归的均值-方差估计,以及可选的用于基于阈值事件建模的生存头部。该框架设计为可通过少量面向实践者的参数(例如,温差阈值规则和剩余寿命目标构建)进行调整,以便部署能够与内部策略和专有标准保持一致。使用点指标和区间指标评估所提出框架的预测性能,包括平均绝对误差、预测区间覆盖概率、平均预测区间宽度以及覆盖-宽度准则。结果按总体和按飞行阶段与维护段分层报告,以突出运营环境的影响并支持不确定性感知监控。

英文摘要

Engine Health Management (EHM) depends on reliable forecasting of Remaining Useful Life (RUL) and on tracking thermal indicators such as turbine gas temperature (TGT). In practice, real-world fleet data are heterogeneous and non-stationary, and point predictions alone are insufficient for risk-aware maintenance decisions. This paper presents a multi-task scientific machine learning framework for turbine prognostics that jointly predicts turbine gas temperature untrimmed (TGTU), Delta Turbine Gas Temperature (DTGT), and RUL, with quantified uncertainty in the form of prediction intervals whose empirical coverage is evaluated. A shared sequence encoder (convolutional front-end with residual bidirectional LSTM layers and attention pooling) feeds task-specific heads, including mean--variance estimation for probabilistic regression and, optionally, a survival head for threshold-based event modeling. The framework is designed to be tunable via a small set of practitioner-facing parameters (e.g., DTGT thresholding rules and RUL target construction) so that deployment can align with in-house policies and proprietary criteria. The predictive performance of the proposed framework is evaluated using both point and interval metrics, including mean absolute error (MAE), prediction interval coverage probability (PICP), mean prediction interval width (MPIW), and the coverage--width criterion (CWC). Results are reported both in aggregate and stratified by flight phase and maintenance segment to highlight operational-context effects and to support uncertainty-aware monitoring.

2605.30590 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

反事实评估揭示临床LLM和智能体的隐藏能力画像

Matt Turk

发表机构 * Protege Data Lab(Protege数据实验室)

AI总结 提出因果敏感性评分(CSS),通过沿五个临床维度变异肿瘤病例来评估模型是否按预期方向更新推荐,发现与覆盖度指标排名相反,并揭示所有前沿模型在手术状态干预上的安全盲点。

Comments Accepted to RLEval @ ACM CAIS 2026 (Workshop on Methods and RL Environments for Evaluating AI Agents) and selected for an invited talk based on reviewer ratings. 4-page short paper + appendix

详情
AI中文摘要

两个临床AI系统在基于覆盖度的评分标准上得分几乎相同,但当患者输入变化时行为却截然不同:一个更新其推荐以匹配新的临床信号,而另一个无论输入如何都产生相同输出。我们引入因果敏感性评分(CSS),这是一个预注册的干预性指标,沿五个临床有意义的维度——生物标志物翻转、先前治疗失败、生物标志物移除、手术状态变化和分期扰动——变异肿瘤肿瘤委员会病例,并使用{0, 0.5, 1.0}量表对每个模型是否在预注册的正确方向上更新其推荐进行评分。与基于覆盖度的加权召回指标共识匹配评分(CMS)相比,来自三个实验室的六个前沿模型在224个病例的单次推理中评估,排名几乎完全相反:所有六个模型排名发生变化,CMS最差的模型成为CSS最好的模型,而一个中上CMS模型在CSS上排名最后。我们进一步揭示了一个普遍的安全盲点:每个前沿模型在手术状态干预上失败(D家族最多17.2%的CSS),这是CMS未暴露的发现。该指标也适用于使用工具的智能体:在ReAct风格的实验中,工具使用改善了六个模型中五个的CSS(+2.5到+20.3个百分点),然而CSS最低的模型检索相同的图表部分但仍未能更新其推荐——揭示了仅在反事实评估下可见的结构性响应缺陷。跨评判者复制和三位评估者的医学专业验证确认了总体发现。像CSS这样的干预性预注册指标补充了临床AI智能体的基于覆盖度的评估:它们捕捉了覆盖度指标遗漏的响应性,并为未来的智能体强化学习系统提供了候选的密集奖励信号。

英文摘要

Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale. Benchmarked against the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs evaluated in single-shot inference across 224 cases rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes CSS-best, and one upper-mid CMS model ranks last on CSS. We further surface a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding CMS does not expose. The metric also transfers to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations - revealing a structural responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic RL systems.

2605.30589 2026-06-01 cs.CL cs.AI 版本更新

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

ImmigrationQA: 一个基于来源的数据集及面向美国移民法的小型模型适配

Nazarii Shportun

发表机构 * Independent Researcher(独立研究员)

AI总结 本文构建了基于来源的问答数据集ImmigrationQA(17,058对,覆盖13个移民子领域),并通过参数高效LoRA微调Llama 3.2 3B Instruct模型,在程序性子领域取得显著提升,但复杂法律推理仍较弱。

Comments 12 pages, 4 tables. Dataset (17,058 QA pairs), fine-tuned model, and code are publicly released

详情
AI中文摘要

美国移民法涵盖数千页的官方政策、联邦法规和程序指南,这些内容频繁变化,且对缺乏法律代表的申请人影响重大。我们描述了ImmigrationQA的构建过程,这是一个基于来源的问答数据集,包含13个移民子领域的17,058对问答,以及使用参数高效LoRA对Llama 3.2 3B Instruct模型在该数据集上的微调。语料库来自11个主要和次要来源——包括USCIS政策手册、8 CFR、BIA先例决定和社区问答——产生了10,056份经过验证的规范文档和18,308个文本块。使用Claude Sonnet 4.6通过五种模式特定提示从这些文本块生成结构化问答对,其中22对因来源跨度重叠不足被拒绝。微调模型在993对的保留测试集上使用LLM-as-judge评分进行评估,基于101个示例的分层样本。微调模型平均得分为1.08/3.0(16.8%完全正确;101示例分层评估),而Llama 3 8B基础模型得分为0.85/3.0(4%完全正确),平均分相对提升27%;零样本Claude Sonnet基线得分为1.52/3.0(25%完全正确)。微调模型在程序性子领域(旅行证件、身份调整、非移民签证)表现出集中改进,但在复杂法律推理和时效性统计方面仍然较弱。整个流程的云计算成本约为29美元。所有工件——数据集、模型、代码和提示模板——均已公开发布。该系统不能替代法律咨询,且不反映语料库抓取日期后的法规变化。

英文摘要

U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.

2605.30585 2026-06-01 cs.LG cs.AI cs.CE 版本更新

Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation

机器学习不确定性量化方法在预测涡轮燃气温度退化中的基准测试

Jostein Barry-Straume, Changmin Son, Adrian Sandu, Gavan Burke, Rekha Sundararajan, Andrew Rimell, James G. Steinrock

发表机构 * Computational Science Laboratory(计算科学实验室) Department of Computer Science(计算机科学系) Virginia Tech(弗吉尼亚理工大学)

AI总结 本文研究了五种预测区间构建方法(Delta法、贝叶斯蒙特卡洛Dropout、Bootstrap法、下上界估计和均值方差估计),在统一实验框架下评估其捕捉涡轮燃气温度神经网络预测不确定性的能力,并基于覆盖概率、归一化平均预测区间宽度和覆盖宽度准则等指标比较了各方法的可靠性、锐度及权衡,为发动机健康管理中的预测区间方法选择和调优提供了实用指南。

详情
AI中文摘要

现代发动机的有效预测与健康管理依赖于准确的涡轮燃气温度预测和稳健的不确定性量化,以确保可靠性和安全性。本文研究了五种构建预测区间的主要方法——即Delta法、贝叶斯蒙特卡洛Dropout、Bootstrap法、下上界估计和均值方差估计——作为捕捉涡轮燃气温度神经网络预测中不确定性的手段。每种方法都在统一的实验框架内实现,该框架采用交叉验证进行超参数选择、重复训练-测试分割以保证性能稳健性,并使用多个指标评估区间的准确性和紧致性。具体地,测量了覆盖概率、归一化平均预测区间宽度以及基于覆盖宽度的准则,以全面评估每种方法的可靠性和锐度。在代表性涡轮燃气温度数据集上进行的实验揭示了五种方法在区间覆盖、宽度和稳定性方面的不同权衡。这些发现为发动机健康管理和预测中选择和调整预测区间方法提供了实用指南,确保在实际应用中的可解释性和精度。

英文摘要

Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty quantification to ensure reliability and safety. This paper investigates five major approaches for constructing prediction intervals -- namely the Delta method, Bayesian Monte Carlo Dropout, Bootstrap method, Lower-Upper Bound Estimation, and Mean-Variance Estimation -- as a means of capturing the uncertainty in neural network predictions of turbine gas temperature. Each approach is implemented within a unified experimental framework that employs cross-validation for hyperparameter selection, repeated train-test splits for performance robustness, and multiple metrics to evaluate both the accuracy and tightness of the intervals. In particular, Coverage Probability, Normalized Mean Prediction Interval Width, and the Coverage Width-based Criterion are measured to comprehensively assess each method's reliability and sharpness. Experiments conducted on a representative turbine gas temperature dataset reveal distinct trade-offs among the five methods in terms of interval coverage, width, and stability. These findings provide a practical guide for selecting and tuning prediction interval methods in engine health management and prognostics, ensuring both interpretability and precision in real-world applications.

2605.30576 2026-06-01 cs.AI 版本更新

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

自动驾驶强化学习中不确定性感知与时间调控的专家建议

Ahmed Abouelazm, Felix Klingebiel, Philip Schörner, J. Marius Zöllner

发表机构 * FZI Research Center for Information Technology(弗劳恩霍夫信息技术研究所) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 提出一种不确定性感知框架,通过自适应阈值触发专家建议并采用承诺-冷却策略调控指导时长,结合离线策略隐式分位数网络实现安全高效的探索,在CARLA中成功率提升5-7%。

Comments Accepted in The IEEE International Conference on Intelligent Transportation Systems (ITSC) September 15-18, 2026 -- Naples, Italy

详情
AI中文摘要

自动驾驶强化学习中的探索本质上是不安全的:智能体必须经历新颖行为才能学习,但探索可能导致碰撞或偏离道路。我们提出一种不确定性感知框架,利用专家建议引导探索,同时避免长期依赖。当认知不确定性或偶然不确定性超过基于滚动缓冲区的自适应阈值时,触发建议,确保建议随智能体置信度演变。采用带有随机早停启发式的承诺-冷却策略调控指导的持续时间和频率,使智能体接触连贯操作而不耗尽建议预算。专家和智能体经验在离线策略隐式分位数网络(IQN)骨干网络中的共享回放缓冲区中合并,实现专家轨迹的高效重用。在CARLA中的实验表明,我们的方法优于IQN基线,成功率提高5-7%并减少失败,证明风险敏感的不确定性与调控的专家集成相结合,能够实现基于传感器的RL策略学习在无信号交叉口导航中更安全、更高效的探索。

英文摘要

Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framework that leverages expert advice to guide exploration while avoiding long-term dependence. Advice is triggered when epistemic or aleatoric uncertainty exceeds adaptive thresholds derived from rolling buffers, ensuring advice evolves with the agent's confidence. A commitment-cooldown strategy with a stochastic early-stop heuristic regulates the duration and frequency of guidance, exposing the agent to coherent maneuvers without exhausting the advice budget. Expert and agent experiences are combined in a shared replay buffer within an off-policy implicit quantile network (IQN) backbone, enabling efficient reuse of expert trajectories. Experiments in CARLA show that our method outperforms the IQN baseline, improving success by 5-7% and reducing failures, demonstrating that risk-sensitive uncertainty coupled with regulated expert integration enables safer and more efficient exploration for sensor-based RL policy learning in unsignalized intersection navigation.

2605.30571 2026-06-01 cs.AR cs.AI cs.DC cs.PF cs.RO 版本更新

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

受限于内存但不受限于带宽:批量1的LLM解码中的物理AI推理差距

Josef Chen

发表机构 * KAIKAKU(卡伊卡普)

AI总结 本文通过测量不同GPU上批量1的自回归解码性能,发现物理AI推理并非仅受内存带宽限制,还受启动开销影响,并指出量化路径的实际收益取决于运行时实现。

详情
AI中文摘要

物理AI系统,包括机器人、自动驾驶车辆、具身智能体和边缘副驾驶,通常运行与云端LLM服务不同的推理工作负载:单流、批量1的自回归解码,其中一个机器人、摄像头流或用户会话等待下一个token。这种工作负载通常被描述为受内存带宽限制。每个解码步骤都会流式传输模型权重和活跃的KV缓存,因此延迟应与峰值HBM带宽成比例。我们表明这种说法是正确的但不完整。我们测量了三个7至8B类GQA变压器在四个NVIDIA GPU(H100 SXM5、A100-80GB SXM4、L40S和L4)上的批量1解码。我们评估了从2048到16384的上下文长度,在受控的bf16 SDPA设置下产生了44个有效单元。达到的峰值HBM带宽比例随着峰值带宽的增加而下降。在标题性的Qwen-2.5-7B ctx=2048单元中,L4达到了其分析内存下限的大约81%,而H100仅达到27%。物理AI解码是内存主导的,但更快的内存并不能转化为成比例的延迟增益。我们通过CUDA Graphs A/B实验测试了缺失项。在H100上,ctx=2048时,CUDA Graphs在N=10个新会话中将解码延迟提高了1.259倍,95%自助法置信区间为1.253至1.267。在L4上,相同的干预仅提供了1.028倍的提升。这分离出了在快速GPU上可见但在较慢、带宽受限的GPU上基本隐藏的启动侧开销。部署的含义是,只有当运行时实现时,内存节省才重要。在L4上,bf16解码接近内存下限,但常见的量化路径并未恢复预期的4倍权重流量减少:从62.32 ms/step的bf16基线,bnb-nf4达到59.36 ms/step,AutoAWQ+Marlin达到45.24 ms/step。使用Ada调优的int4内核的GPTQ+ExLlamaV2达到17.36 ms/step。

英文摘要

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.

2605.30570 2026-06-01 cs.AI 版本更新

Procedural Generation of First Person Shooter Maps using Map-Elites

使用MAP-Elites程序化生成第一人称射击游戏地图

Simone de Donato, Pier Luca Lanzi, Daniele Loiacono

发表机构 * Politecnico di Milano — DEIB(米兰理工学院——DEIB)

AI总结 研究应用MAP-Elites算法生成第一人称射击游戏地图,提出两种新表示方法(点线和空间布局)以提高地图多样性和质量。

详情
AI中文摘要

我们研究了应用MAP-Elites(一种著名的质量多样性算法)来设计第一人称射击(FPS)游戏关卡。我们考虑了两种已知的地图表示方法(全黑和网格图),并引入了两种新的表示方法(点线和空间布局),以改进FPS地图的特征化。我们定义了一系列指标来描述地图的拓扑属性(仅依赖于地图布局)和涌现属性(必须通过实际游戏玩法进行评估)。我们进行了深入分析,以确定最适合指导MAP-Elites照明过程的特征。我们应用带有滑动边界的MAP-Elites(MESB)来演化FPS地图种群。我们的结果表明,与之前用于演化FPS地图的表示方法相比,新表示方法可以生成具有更高多样性和质量的地图。

英文摘要

We investigate the application of MAP-Elites (a well-known quality diversity algorithm) to design levels for First-Person Shooter (FPS) games. We consider two well-known map representations (All-Black and Grid-Graph) and introduce two novel representations (Point-Line and Spatial-Layout) that improve the characterization of FPS maps. We define a series of metrics to describe maps' topological properties (which solely depend on maps' layout), and emergent properties (which must be evaluated through actual gameplay). We perform an in-depth analysis to identify the most suitable features to guide MAP-Elites illumination process. We apply MAP-Elites with Sliding Boundaries (MESB) to evolve populations of FPS maps. Our results show that the new representations can generate maps with higher diversity and quality than the representations previously used for evolving FPS maps.

2605.30563 2026-06-01 cs.AI 版本更新

Transforming and Encoding FTS for SAT Solving: What Helps, What Hurts (Extended Version)

转换与编码FTS以用于SAT求解:什么有帮助,什么有损害(扩展版)

João Filipe, Álvaro Torralba, Gregor Behnke

发表机构 * University of Amsterdam, Institute for Logic Language and Computation(阿姆斯特丹大学,逻辑语言与计算研究所) Aalborg University(奥尔堡大学)

AI总结 研究如何将因子化任务编码为SAT问题,提出多种编码策略,并分析并行性和任务转换对SAT规划器性能的影响。

详情
AI中文摘要

因子化任务是一种经典规划表示,它通过有限形式的析取前提、条件效应和天使非确定性扩展了SAS+。这使得任务表示比传统形式如STRIPS或SAS+更紧凑,并支持广泛的任务转换。然而,现有的因子化任务规划方法仅限于启发式搜索方法。在这项工作中,我们研究了如何将因子化任务编码为SAT。我们提出了几种编码任务的方法,重点关注将因子化转换关系翻译为命题逻辑的不同策略。我们还分析了如何在这种设置中利用不同层次的并行性,并研究了常见任务转换对基于SAT的规划器性能的影响。

英文摘要

Factored tasks are a classical planning representation that extends SAS+ with limited forms of disjunctive preconditions, conditional effects, and angelic nondeterminism. This allows for a more compact representation of tasks than traditional formalisms such as STRIPS or SAS+, and supports a wide range of task transformations. However, existing planning approaches for factored tasks have been limited to heuristic search methods. In this work, we investigate how to encode factored tasks in SAT. We propose several ways to encode the tasks, focusing on different strategies for translating the factored transition relation into propositional logic. We also analyze how to exploit parallelism at various levels in this setting and study the impact of common task transformations on the performance of SAT-based planners.

2605.30561 2026-06-01 cs.CV cs.AI 版本更新

VLM3: Vision Language Models Are Native 3D Learners

VLM3:视觉语言模型是原生3D学习者

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu, Vikas Chandra, Yangyang Shi

发表机构 * Meta(Meta公司) Princeton University(普林斯顿大学)

AI总结 本文提出VLM3,通过焦距统一、文本像素参考和数据混合缩放,使标准视觉语言模型无需复杂架构或损失函数即可高效掌握多种3D任务。

详情
AI中文摘要

视觉语言模型(VLM)通过提示使统一模型能够解决各种视觉任务,在语义理解方面表现出色。然而,3D理解仍然很大程度上依赖于具有复杂任务特定设计的专家视觉模型。本文要提出的关键论点是,VLM是原生的3D学习者。我们深入的大规模研究表明:1)焦距统一,2)基于文本的像素参考,以及3)数据混合和缩放,是有效3D学习所需的一切。模型架构变化、大模型、大量数据增强以及包括回归公式在内的复杂损失(其中许多构成了专家视觉模型的基础)实际上并不是必要条件。因此,我们提出了VLM3,一种具有最简单设计的可扩展方法,使标准VLM能够掌握多样的3D任务。VLM3不仅大幅提升了VLM深度估计的准确性(0.84 -> 0.9),还实现了多样的3D任务,如像素对应、相机姿态估计和物体级3D理解,在保持标准架构和基于文本的训练的同时,匹配了专家视觉模型的准确性。我们相信VLM3为简单且可扩展的3D学习开辟了新的范式。

英文摘要

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

2605.30557 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

看见不等于知道:视觉语言模型是否知道何时不回答空间问题(以及为什么)?

Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Google Research(谷歌研究)

AI总结 针对视觉语言模型在空间推理中过度自信回答的问题,提出SpatialUncertain框架,通过遮挡和视角歧义两种挑战,评估模型是否知道何时应弃权以及如何寻找可靠证据。

Comments Website: https://zhangyuejoslin.github.io/spatialuncertain/

详情
AI中文摘要

空间推理是部署在真实环境中的视觉语言模型(VLM)的基本能力。然而,视觉观察本质上是对3D世界的有限表示:遮挡可能使物体不可见,视角可能使几何属性产生误导。尽管如此,现有的空间推理基准通常假设观察是充分且可靠的,侧重于模型是否产生正确答案,而不是它们是否认识到问题无法回答以及需要哪些额外观察。在这项工作中,我们通过构建一个受控评估框架SpatialUncertain来挑战这一假设,并引入两种观察挑战:(1)遮挡,隐藏目标信息;(2)视角歧义,产生误导性视觉线索。对于每种配置,我们设计在清晰观察下可回答但在引入挑战下需要弃权的空间问题。我们进一步评估模型是否能识别哪些额外视角可以解决视角歧义。我们在多种前沿开源和闭源VLM上的结果揭示了两个一致的失败模式。首先,模型倾向于过度自信地回答,即使在视觉证据不完整或具有误导性时也试图解决空间推理任务,在遮挡下平均准确率约为30%,在视角歧义下低于10%。其次,即使有额外视角可用,一些模型在识别哪些视角能提供可靠证据方面表现接近随机。总之,我们的发现呼吁超越答案正确性,转向评估模型是否知道何时弃权以及如何寻找可靠证据。

英文摘要

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30\% under occlusion and below 10\% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.

2605.30542 2026-06-01 cs.AI 版本更新

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

物理可行的世界模型:面向查询条件具身AI的案例

Adam J. Thorpe, Stepan Tretiakov, Cheng-Hsi Hsiao, Su Ann Low, Xingjian Li, Hassan Iqbal, Neel P. Bhatt, Ufuk Topcu, Krishna Kumar

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 针对具身AI中现有世界模型预测未来观测但物理不可行的问题,提出应构建基于查询条件、识别最简物理抽象的世界模型,通过模块化分解确保可解释性和可验证性。

Comments 21 pages; Adam J. Thorpe and Stepan Tretiakov contributed equally

详情
AI中文摘要

具身AI的世界模型必须是物理可行的:其构建应能通过表示支配动作结果的物理结构来回答干预查询,而不仅仅是预测未来观测。现有的观测预测世界模型可以产生视觉上合理但物理上错误的展开。这种失败是结构性的;不同的物理系统可能看起来相同,但在干预下却产生分歧。我们通过控制基准实验暴露了这个问题,这些基准固定可见场景同时变化潜在物理属性。我们表明,此类模型可能推荐不可行的动作、错误预测交互结果或认证不安全行为。我们认为,具身AI需要能够识别足以回答干预查询的最简物理抽象的世界模型。这样的模型由模块化组件组成,包括环境表示、潜在状态和参数估计、动作规范、干预动力学和查询级响应。一个自主编排器应识别相关抽象,并为每个查询组合兼容的学习和结构化组件。当封闭形式的物理不可用、不确定或成本高昂时,转移模型可以是解析的、模拟的、学习的或混合的,但它必须保留决定干预结果的结构。这种分解使模型可解释、其组件可验证,其输出可针对查询进行审计。它还为新的世界模型提供了设计原则,为现有模型提供了可行性测试:正确的抽象不是最详细的世界模型,而是保留与查询相关区分的最简模型。我们在现有系统无法正确回答的查询上展示了这种方法,并概述了编排器如何动态组装和调整物理可行的模型以用于规划、控制和验证。

英文摘要

World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. Existing observation-predictive world models can produce visually plausible but physically wrong rollouts. This failure is structural; distinct physical systems can look identical yet diverge under intervention. We expose this problem with controlled benchmarks that fix the visible scene while varying latent physics. We show that such models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. When closed-form physics is unavailable, uncertain, or costly, the transition model may be analytic, simulated, learned, or hybrid, but it must preserve the structure that determines interventional outcomes. This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.

2605.30529 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

通用嵌入还是特定嵌入,哪个更好?非英语语言临床编码搜索的实证研究

David Rey-Blanco, Roberto Cruz

发表机构 * TietAI

AI总结 本研究通过使用大型生成语言模型生成的合成数据微调双语编码器,构建两阶段检索器,解决了非英语语言临床编码检索中召回率下降的问题,并在多语言基准上取得了优于BioBERT-ST的性能。

Comments 24 pages, 12 figures, 6 tables

详情
AI中文摘要

用于语义搜索的句子嵌入模型绝大多数是在英语语料库上开发和评估的。当应用于其他语言的临床检索——特别是ICD-10-CM/CIE-10代码的检索——召回率会下降,而这种下降往往被聚合基准所掩盖。我们研究大型生成语言模型是否可以作为数据工厂来缩小这一差距。我们构建了一个两阶段检索器(双编码器后接交叉编码器重排序器),该检索器在Gemini生成的合成数据(涵盖英语、西班牙语、加泰罗尼亚语、意大利语、葡萄牙语和法语)上对西班牙生物医学编码器(PlanTL-GOB-ES/bsc-bio-ehr-es)进行微调,并与BioBERT-ST和未调优的西班牙编码器进行评估。仅双编码器在MRR(0.876 vs. 0.866)上匹配BioBERT-ST,并在R@3(0.650 vs. 0.626)和R@5(0.804 vs. 0.790)上超越它,且无需英语生物医学预训练。添加交叉编码器重排序器将聚合R@5提升至0.822,并在五种语言中的四种上占据主导地位(西班牙语+0.017,加泰罗尼亚语+0.033,法语+0.018,葡萄牙语+0.037),但以英语的小幅回归为代价。这种权衡在临床上是可接受的:葡萄牙语的R@5达到0.829,而BioBERT-ST为0.714。贡献:一个基于LLM生成数据构建领域特定医学检索器的开放配方;学习增益的量化(MRR从0.755到0.876,+15.9%,使用约19,500个合成对);以及按语言和排名对增益集中区域的刻画。

英文摘要

Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.

2605.30523 2026-06-01 cs.LG cs.AI cs.CC cs.CL cs.FL 版本更新

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

重新审视填充Transformer的表达能力:哪些架构选择重要,哪些不重要

Anej Svete, William Merrill, Ryan Cotterell, Ashish Sabharwal

发表机构 * ETH Zürich(苏黎世联邦理工学院) Allen Institute for AI(人工智能研究所)

AI总结 本文通过连接布尔电路,系统研究了填充Transformer的表达能力,发现数值精度和模型深度是影响表达能力的主要因素,而注意力类型、模型宽度和均匀性等架构选择对表达能力影响不大。

详情
AI中文摘要

近期工作通过连接布尔电路描述了Transformer能计算和不能计算的内容,但现有结果缺乏精确刻画,且对建模选择敏感。填充Transformer——在其输入后附加填充符号如“...”——通过为自适应并行计算提供多项式空间,成为建立与电路类等价关系的有用工具。然而,目前仅研究了有限的填充Transformer理想化模型,这些等价关系在注意力类型、模型宽度和均匀性变化下的稳健性仍待探索。我们发现,在实际假设下,填充Transformer对所有这些变化都出奇地稳健,并确定数值精度和模型深度是影响表达能力的主要因素。具体地,我们证明多项式填充的L-均匀常数精度Transformer等价于L-均匀AC⁰,而增长精度的Transformer达到L-均匀TC⁰,与宽度无关。此外,循环机制允许类似电路的顺序处理:log^d N次循环的常数精度Transformer达到FO-均匀AC^d,增长精度的达到FO-均匀TC^d。有趣的是,宽度或精度超过对数增长并不会增加表达能力,且我们所有结果对softmax和平均硬注意力Transformer均成立。

英文摘要

Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.

2605.30512 2026-06-01 cs.AI cs.CV 版本更新

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

PhyDrawGen: 基于自然语言的物理约束图表生成

Nafiul Haque, Syed Nazmus Sakib, Shifat E Arman

发表机构 * Department of Robotics and Mechatronics Engineering, University of Dhaka(机器人与机电工程系,达卡大学)

AI总结 提出PhyDrawGen神经符号管道,通过场景图提取、确定性求解器和视觉验证循环,从自然语言生成符合物理定律的图表,在力学、光学和电磁学基准上显著优于现有模型。

Comments 9 figures, 7 tables. Under review at EMNLP 2026

详情
AI中文摘要

从文本生成物理图表需要严格遵守物理定律。虽然当前生成模型能产生视觉上合理的输出,但它们会系统性地产生力向量幻觉、忽略守恒定律并违反几何约束。我们提出PhyDrawGen,一种神经符号管道,将语义场景理解与物理约束满足解耦。首先,大语言模型从问题文本中提取类型化场景图。然后,确定性求解器将该图转换为平面直线图(PSLG),将力平衡、光路和场拓扑编码为精确几何基元。最后,微调的Qwen-VL模型实现视觉基础的提议-验证循环,以迭代纠正任何约束违反。在涵盖力学、光学和电磁学的1,449个问题基准上评估,PhyDrawGen显著优于GPT-5-image、Gemini 2.5 Flash和Gemini 3 Pro,即使在非常见物体问题上也展现出鲁棒的物理准确性。

英文摘要

Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausible outputs, they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. A deterministic solver then converts this graph into a Planar Straight-Line Graph (PSLG), encoding force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model implements a visually grounded propose-verify loop to iteratively correct any constraint violations. Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, demonstrating robust physical accuracy even on unusual-object problems.

2605.30510 2026-06-01 cs.CV cs.AI 版本更新

A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images

一种新颖的全局上下文感知深度神经网络用于基于磁共振图像的增强脑肿瘤分割

Sourjya Mukherjee, Ananya Bhattacharjee, R. Murugan

发表机构 * National Institute of Technology Silchar(全国理工学院锡拉char分校)

AI总结 提出全局上下文感知的挤压激励残差UNet(GCSER-UNet),融合空间和通道注意力,在TCGA LGG和BraTS 2020数据集上取得优于现有技术的Dice分数。

Comments 11 pages, 9 figures, 6 tables. Submitted to arXiv cs.CV

详情
AI中文摘要

脑癌的严重性需要精确的脑肿瘤分割,这对于有效的脑肿瘤诊断至关重要。手动识别成本高、劳动强度大且易出错,凸显了自动化方法的必要性。在本研究中,我们引入了全局上下文感知的挤压激励残差UNet(GCSER-UNet),它促进了空间和通道注意力的融合,从而增强了模型捕捉复杂空间依赖和上下文信息的能力。GCSER-UNet从多模态MRI切片中高效提取肿瘤区域,表现出卓越的性能。在基准数据库上的评估显示了其优越性,在TCGA LGG数据集上达到了94%的Dice分数,超过了当前最先进的91.8%。在BraTS 2020数据集上,所提出的GCSER-UNet集成方法在肿瘤区域——全肿瘤(W)、肿瘤核心(T)和增强肿瘤(E)上分别获得了95%、92%和90%的Dice分数,而当前最先进的Dice分数分别为94%、93%和88%。这些令人信服的结果突显了GCSER-UNet在精确脑肿瘤分割中的有效性,因此可以帮助神经科医生进行有效的脑癌管理和治疗规划。

英文摘要

Brain cancer's severity necessitates precise brain tumor segmentation, which is crucial for effective brain tumor diagnosis. Manual identification, burdened by high costs, labor, and error risks, highlights the need for automated methods. In this study, we introduce the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which facilitates a fusion of spatial and channel-wise attention and thus enhances the model's capacity to capture intricate spatial dependencies and contextual information. GCSER-UNet efficiently extracts tumor segments from multimodal MRI slices, delivering exceptional performance. Evaluations on benchmark databases exhibit its superiority, achieving a notable 94 percent dice score on the TCGA LGG dataset, surpassing the state-of-the-art dice score of 91.8 percent. In the BraTS 2020 dataset, the proposed GCSER-UNet ensemble approach yielded dice scores of 95 percent, 92 percent, and 90 percent for the tumor regions - Whole Tumor (W), Tumor Core (T), and Enhancing Tumor (E), respectively. The current state-of-the-art dice scores were 94 percent, 93 percent, and 88 percent. These compelling outcomes highlight the efficacy of GCSER-UNet in precise brain tumor segmentation and thus can aid neurologists in effective brain cancer management and treatment planning.

2605.30509 2026-06-01 stat.ML cs.AI cs.LG 版本更新

Improved Distribution Estimation in $\ell_\infty$

在 $\ell_\infty$ 下的改进分布估计

Doron Cohen, Aryeh Kontorovich, Yonatan Livshitz

发表机构 * Department of Computer Science, Ben-Gurion University of the Negev(本·古里安大学计算机科学系)

AI总结 本文在 $\ell_\infty$ 范数下改进了离散概率分布的估计,给出了期望极小极大界和高概率尾界,解决了 Kontorovich 和 Painsky (JMLR, 2025) 提出的开放问题,包括最紧风险界的完全经验版本和最坏情况极值分布的形式,并报告了鼓励性的实证结果。

Comments 24 pages, 3 figures

详情
AI中文摘要

我们提出了在 $\ell_\infty$ 范数下估计离散概率分布的改进界。这些包括期望极小极大界和高概率尾界。我们解决了 Kontorovich 和 Painsky (JMLR, 2025) 提出的一些开放问题——包括他们提出的最紧风险界的完全经验版本以及识别最坏情况极值分布的形式。还报告了鼓励性的实证结果。

英文摘要

We present improved bounds for estimating discrete probability distributions under the $\ell_\infty$ norm. These include minimax bounds in expectation and high-probability tail bounds. We resolve some of the open questions posed in Kontorovich and Painsky (JMLR, 2025) -- including a fully empirical version of the tightest risk bound they presented and identifying the form of the worst-case extremal distribution. Encouraging empirical results are reported as well.

2605.03337 2026-06-01 cs.CV cs.AI 版本更新

FreeTimeGS++: Secrets of Dynamic Gaussian Splatting and Their Principles

FreeTimeGS++:动态高斯泼溅的秘密及其原理

Lucas Yunkyu Lee, Soonho Kim, Youngwook Kim, Sangmin Kim, Jaesik Park

发表机构 * Seoul National University(首尔国立大学) POSTECH

AI总结 本文通过建立控制基线FreeTimeGS_ours,系统分析4D高斯泼溅框架中的隐藏因素,揭示高斯持续时间驱动的时态分区和光度保真度与时空一致性之间的差异等关键秘密,并提出FreeTimeGS++方法,采用门控边缘化和神经速度场实现更稳定的动态表示。

Comments Project page: https://yklcs.com/ftgspp

详情
AI中文摘要

近期4D高斯泼溅(4DGS)的兴起在动态场景重建方面取得了令人瞩目的成果。尽管这些方法表现出卓越的性能,但其背后的具体驱动因素仍未被充分探索,使得对基本原理的系统理解具有挑战性。本文对这些隐藏因素进行了全面分析,以提供对4DGS框架更清晰的视角。我们首先通过形式化和复现最先进的FreeTimeGS的启发式方法,建立了一个受控基线FreeTimeGS_ours。利用该框架,我们沿着其基本轴剖析4DGS,并揭示了关键秘密,包括由高斯持续时间驱动的涌现时态分区以及光度保真度与时空一致性之间的差异。基于这些见解,我们提出了FreeTimeGS++,这是一种采用门控边缘化和神经速度场的原理性方法,以实现卓越的稳定性和鲁棒的动态表示。我们的方法产生了可重复的结果,并降低了运行间方差。我们将发布我们的实现,为未来的4DGS研究提供可靠的基础。

英文摘要

The recent surge in 4D Gaussian Splatting (4DGS) has achieved impressive dynamic scene reconstruction. While these methods demonstrate remarkable performance, the specific drivers behind such gains remain less explored, making a systematic understanding of the underlying principles challenging. In this paper, we perform a comprehensive analysis of these hidden factors to provide a clearer perspective on the 4DGS framework. We first establish a controlled baseline, FreeTimeGS_ours, by formalizing and reproducing the heuristics of the state-of-the-art FreeTimeGS. Using this framework, we dissect 4DGS along its fundamental axes and uncover key secrets, including the emergent temporal partitioning driven by Gaussian durations and the discrepancy between photometric fidelity and spatiotemporal consistency. Based on these insights, we propose FreeTimeGS++, a principled method that employs gated marginalization and neural velocity fields to achieve superior stability and robust dynamic representations. Our approach yields reproducible results with reduced run-to-run variance. We will release our implementation to provide a reliable foundation for future 4DGS research.

2605.30486 2026-06-01 cs.LG cs.AI 版本更新

Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting

图条件化的图神经网络专家混合模型用于交通预测

Amirhossein Ghaffari, Saeid Sheikhi, Ekaterina Gilman

发表机构 * Future Computing Group, University of Oulu(奥卢大学未来计算组)

AI总结 提出GC-MoE框架,通过图拓扑和近期交通输入为每个节点分配个性化专家组合,仅训练轻量路由模块,在四个基准上提升MAE。

Comments An accepted paper at the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

详情
AI中文摘要

传感器图上的时空预测通常采用统一应用于所有节点的单一骨干架构,尽管图区域可能表现出不同的动态。道路段在功能类别、结构和交通行为上存在差异,表明节点级专家专业化可能是有用的。我们提出GC-MoE,一种图条件化的专家混合框架,基于图拓扑和近期交通输入窗口为每个节点分配个性化的冻结预测专家组合。GC-MoE将冻结的预训练时空GNN专家与输入感知、空间上下文化的路由器相结合,同时仅训练轻量级路由模块。我们还研究了一个有界图条件化输出精炼层作为可选扩展,并仅作为消融诊断包含节点自适应ST-LoRA适配器。在四个标准基准(PEMS04、PEMS07、METR-LA和PEMS-BAY)上,GC-MoE在零参数集成基线上改善了MAE,具有竞争力的RMSE和MAPE,同时在1.5M冻结专家权重之上仅训练约17K参数。实现代码见https://github.com/Ahghaffari/gc_moe。

英文摘要

Spatio-temporal forecasting on sensor graphs is commonly tackled with a single backbone architecture applied uniformly across all nodes, although graph regions can exhibit different dynamics. Road segments differ in functional class, structure, and traffic behavior, suggesting that node-wise expert specialization can be useful. We propose GC-MoE, a graph-conditioned mixture of experts framework that assigns each node a personalized combination of frozen forecasting experts based on graph topology and the recent traffic input window. GC-MoE combines frozen pretrained spatio-temporal GNN experts with an input-aware, spatially contextualized router while training only a lightweight routing module. We also study a bounded graph-conditioned output refinement layer as an optional extension and include node-adaptive ST-LoRA adapters only as an ablation diagnostic. Across four standard benchmarks (PEMS04, PEMS07, METR-LA, and PEMS-BAY), GC-MoE improves MAE over a zero-parameter ensemble baseline, with competitive RMSE and MAPE, while training only ~17K parameters on top of 1.5M frozen expert weights. The implementation is available at https://github.com/Ahghaffari/gc_moe.

2605.30462 2026-06-01 cs.LG cs.AI 版本更新

idSCD: Identifying Training Datasets through Semantic Correlation Descriptors

idSCD: 通过语义相关描述符识别训练数据集

Andrada Gobeaja, Ionut Hodoroaga, Elena Burceanu, Marius Leordeanu

发表机构 * POLITEHNICA University of Bucharest(巴尔贝鲁斯理工大学) Bitdefender, Romania(罗马尼亚Bitdefender公司) Institute of Mathematics of the Romanian Academy(罗马尼亚科学院数学研究所)

AI总结 提出基于语义相关描述符(SCD)的白盒方法,通过模型学习到的语义相关结构识别训练数据集中的成员关系,在多个实验设置中优于现有基线方法。

Comments 16 pages, 3 figures

详情
AI中文摘要

一个数据集能否通过其在训练过程中引起的虚假相关性被识别?我们认为,数据集会在模型学习的语义相关结构中留下特定于数据集的痕迹:在数据集中具有预测性但对底层任务非因果的偶然规律性,可能在训练过程中被内化。我们利用这一洞察研究数据集级别的成员推断,超越了依赖置信度分数、损失、边际、生成样本或查询响应等行为或分布证据的现有方法。我们引入了一种基于语义相关描述符(SCD)的白盒语义指纹方法,该方法捕获模型学习的语义相关结构,并使其在不同数据集混合中具有可比性。在受控的留一数据集诊断中,SCD恢复了数据集特定的变化,并完美区分匹配与非匹配的数据集对。然后,我们提出了一种实用的基于SCD的成员分数,该分数仅使用模型的SCD和目标数据集的独立SCD来测试目标数据集是否是模型训练混合的一部分,无需留一数据集模型。在三个不同的实验设置中,包括自然语言推理、情感分类和医学文本分类的数据集组,我们测试了基于SCD的成员推断在不同程度的语义分离和数据集划分之间的关键词支持下的优势和局限性。平均而言,基于该分数的分类器实现了最高的性能和最低的标准差,优于黑盒基线RMIA、Attack-P和LiRA,以及白盒基线SIF。这些结果表明,数据集成员可以通过内部语义相关性进行追踪,当数据集组暴露不同的语义特性时,ROC-AUC的最大相对增益超过60%。

英文摘要

Can a dataset be recognized from the spurious correlations it induces during training? We argue that datasets leave dataset-specific traces in a model's learned semantic correlation structure: incidental regularities that are predictive within a dataset, but not causal for the underlying task, can be internalized during training. We use this insight to study dataset-level membership inference, moving beyond existing methods that rely on behavioral or distributional evidence such as confidence scores, losses, margins, generated samples, or query responses. We introduce a white-box semantic fingerprinting approach based on semantic correlation descriptors (SCDs), which capture the semantic correlation structure learned by a model and make it comparable across dataset mixtures. In a controlled leave-one-dataset-out diagnostic, SCDs recover dataset-specific changes and perfectly separate matching from non-matching dataset pairs. We then propose a practical SCD-based membership score that tests whether a target dataset is part of a model's training mixture using only the model's SCD and the target dataset's standalone SCD, without requiring leave-one-dataset-out models. Across three diverse experimental settings, with dataset groups for natural language inference, emotion classification, and medical text classification, we test both the advantages and limitations of SCD-based membership inference with different degrees of semantic separation and keyword support between dataset splits. On average, the classifier based on this score achieves the highest performance and the lowest std, outperforming black-box baselines RMIA, Attack-P, and LiRA, as well as the white-box SIF baseline. These results show that dataset membership can be traced through internal semantic correlations, with the largest relative gain exceeding 60% in ROC-AUC when dataset groups expose distinct semantic particularities.

2605.30461 2026-06-01 cs.LG cs.AI 版本更新

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

通过状态增强和共识实现可分离动力学的可扩展约束多智能体强化学习

Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

发表机构 * Department of Engineering University Pompeu Fabra(工程系庞培法布拉大学)

AI总结 提出一种结合状态增强策略学习与对偶变量分布式共识的分布式约束多智能体强化学习方法,解决可分离动力学系统中全局资源约束的协调问题,实现线性可扩展性并保证约束满足。

Comments 17 pages, 8 figures, 3 tables. Plus appendix

详情
AI中文摘要

我们提出了一种用于约束多智能体强化学习(MARL)的分布式方法,该方法将状态增强策略学习与对偶变量的分布式共识相结合。我们的方法针对智能体具有可分离动力学但必须协调以满足全局资源约束的系统,正如我们通过实验证明的,在这种设置下,独立学习无法产生可行解,因为智能体无法确定各自对集体约束满足的适当贡献。关键技术贡献在于证明,对拉格朗日乘子进行轻量级邻居到邻居共识足以实现全局协调的约束执行,同时保持独立训练的可扩展性。每个智能体离线学习一个单一的增强策略,该策略以其局部状态和编码约束反馈的对偶变量为条件。在执行过程中,智能体仅通过局部通信就该对偶变量达成共识。我们证明,在温和的连通性假设下,智能体乘子之间的共识误差是有界的,并且表明这转化为有界的约束违反,该违反随图连通性和共识轮次增加而减小。与集中训练分散执行(CTDE)方法相比,后者的复杂度至少随智能体数量呈二次增长,而我们的方法在训练和执行中均呈线性扩展。在智能电网需求响应上的实验表明,共识协调对于可行性至关重要:没有共识,智能体只能通过无限期推迟需求来满足电网容量约束,这是一种退化的非解。有了共识,智能体收敛到共享的对偶变量,并同时满足电网约束和需求满足,可扩展到数千个智能体,而CTDE基线仅能处理数十个。

英文摘要

We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is \emph{essential for feasibility}: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non-solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.

2605.30454 2026-06-01 cs.CR cs.AI 版本更新

The Surface You Test Is Not the Surface That Breaks

测试的表面并非断裂的表面

Shifat E Arman, Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin

发表机构 * Department of Robotics and Mechatronics Engineering, University of Dhaka(达卡大学机器人与机电工程系) Department of Computer Science and Engineering, University of Dhaka(达卡大学计算机科学与工程系)

AI总结 本文发现工具增强的LLM代理对提示注入的脆弱性依赖于攻击表面(工具输出 vs 工具描述),提出自适应攻击率并强调评估需报告每个表面的脆弱性。

Comments 8 Figures, 8 Tables, Under Review at EMNLP

详情
AI中文摘要

工具增强的LLM代理容易受到提示注入攻击:控制代理上下文部分的第三方可以植入指令,代理随后执行这些指令,仿佛它们来自用户。当前的评估在每个模型的一个通道(工具输出)上报告单一的攻击成功率,并将该数字视为模型的脆弱性。但工具描述(代理在每次调用工具前都会读取)本身是一个攻击者可以选择替代的注入表面。我们保持注入载荷字节相同,并通过两个表面在来自六个家族的13个LLM和四个任务套件中传递。相同的字节在模型间的成功率上出现反转:GPT-4.1在工具输出上脆弱性为96%,但在工具描述上仅为4%,而GEMINI-3-FLASH则呈现镜像模式,分别为20%和98%。对6,830次尝试的方差分解显示,攻击结果的变化中仅有0%可归因于表面本身,而模型-表面交互则占16.7%。脆弱性是配对的性质,而非通道的性质。自适应攻击率(定义为每个单元在表面上的最大值)平均比最强的固定表面基线高出9.1个百分点。标准的提示级防御继承了相同的盲点,将工具输出的ASR降低到10-18%,而描述通道仍高于54%。攻击和防御评估都必须报告每个表面的脆弱性。

英文摘要

Tool-augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions that the agent then executes as if they came from the user. Current evaluations report a single attack success rate per model on one channel, the tool output and treat that number as the model's vulnerability. But tool descriptions, which the agent reads at every turn before any tool is called, are themselves an injection surface that the attacker can choose instead. We hold the injection payload byte-identical and deliver it through both surfaces across 13 LLMs from six families and four task suites. The same bytes invert in success rate across models: GPT-4.1 is 96 percent vulnerable on tool outputs but only 4 percent on tool descriptions, while GEMINI-3-FLASH shows the mirror pattern at 20 percent and 98 percent. A variance decomposition over 6,830 attempts attributes 0 percent of the variation in attack outcomes to the surface alone, while the model-surface interaction accounts for 16.7 percent. Vulnerability is a property of the pairing, not the channel. The Adaptive Attack Rate, defined as the per-cell maximum over surfaces, exceeds the strongest fixed-surface baseline by +9.1 percentage points on average. Standard prompt-level defenses inherit the same blindspot, reducing tool-output ASR to 10-18 percent while leaving the description channel above 54 percent. Both attack and defense evaluation must report per-surface vulnerability.

2605.30452 2026-06-01 cs.LG cs.AI math.OC 版本更新

A Unified Framework for Gradient Aggregation in Multi-Objective Optimization

多目标优化中梯度聚合的统一框架

Zeou Hu, Kelvin Ho, Yaoliang Yu

发表机构 * Cheriton School of Computer Science(切尔顿计算机科学学院) University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出一个统一框架,通过充分对齐条件建立梯度聚合方法的收敛率,并引入基于CVaR的capped MGDA算法,在对抗联邦学习中验证鲁棒性。

详情
AI中文摘要

许多机器学习问题涉及多个固有的权衡,最好通过基于梯度的多目标优化(MOO)算法来解决。现有方法通常基于不同的动机提出,逐个案例进行分析,并且在每一步中如何聚合分量梯度在算法上有所不同。在这项工作中,我们为MOO中的梯度聚合开发了一个统一框架,建立了收敛到帕累托平稳性(MOO的标准性能度量)的(最优)速率。我们分析的核心是一个充分对齐条件,由此我们推导出一个定理,表明当在梯度的凸包内选择时,非冲突方向构成了收敛的基本充分条件。我们进一步表明,通过对偶锥上的投影可以确保可行性,从而拓宽了具有收敛保证的方法的范围。同时,我们提出了梯度聚合的原始优化视角,该视角涵盖了已有算法,阐明了它们的理论关系,并能够设计新的变体。作为示例,我们引入了capped MGDA,它基于CVaR公式推导而来,并展示了其在对抗联邦学习中的鲁棒性。最后,我们通过在合成问题和实际基准上的实验验证了我们的理论。

英文摘要

Many machine learning problems involve multiple inherent trade-offs that are best addressed by gradient-based multi-objective optimization (MOO) algorithms. Existing methods are often proposed with various motivations, analyzed case by case, and differ algorithmically in how the component gradients are aggregated at each step. In this work, we develop a unifying framework for gradient aggregation in MOO, establishing (optimal) rates of convergence to Pareto stationarity, the standard measure of performance in MOO. Central to our analysis is a sufficient alignment condition, from which we derive a theorem showing that non-conflicting directions, when chosen within the convex hull of gradients, form a fundamental sufficient condition for convergence. We further show that feasibility can be ensured through projection onto the dual cone, broadening the scope of methods that admit convergence guarantees. In parallel, we present a primal optimization perspective of gradient aggregation that encompasses established algorithms, clarifies their theoretical relationships, and enables the design of new variants. As an illustration, we introduce capped MGDA, derived from a CVaR-based formulation, and demonstrate its robustness in adversarial federated learning. Finally, we validate our theory through experiments on synthetic problems and practical benchmarks.

2605.30447 2026-06-01 cs.LG cs.AI stat.ML 版本更新

Calibrated Preference Learning: The Case of Label Ranking

校准偏好学习:以标签排序为例

Santo M. A. R. Thies, Viktor Bengs, Timo Kaufmann, Sebastian J. Vollmer, Eyke Hüllermeier

发表机构 * Munich Center for Machine Learning, Munich (MCML), Germany(慕尼黑机器学习中心,慕尼黑(MCML),德国)

AI总结 针对概率标签排序问题,形式化定义了校准概念并建立层次体系,通过理论证明和实验验证了不同校准概念的关系及现有模型的校准缺陷。

详情
AI中文摘要

校准,即预测概率与真实结果频率的对齐,对于可靠决策至关重要。尽管在分类和回归中已有广泛研究,但校准尚未在概率标签排序中得到正式处理,其目标是预测标签集排序上的分布。将排序视为类别会忽略其结构,并无法捕捉成对和top-k预测等重要模态。我们形式化了标签排序的校准,并建立了一个涵盖完整排序、子排序和top-k排序的概念层次。我们证明完整排序校准蕴含其他校准,但反之不成立,且子排序和top-k校准不可比较。实验发现,流行的标签排序模型通常校准不良,子排序和top-k指标之间存在显著差异。将我们的框架应用于RLHF奖励模型,发现校准与基准准确性强相关但不完全一致,表明它捕捉了超越top-1准确性的有意义的质量维度。这些发现激励了未来关于理解误校准的下游影响以及开发纠正方法的工作。

英文摘要

Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally addressed for probabilistic label ranking, where the goal is to predict a distribution over orderings of a label set. Naively treating rankings as classes ignores their structure and fails to capture important modalities such as pairwise and top-k predictions. We formalize calibration for label ranking and develop a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. We prove that full-rank calibration implies the others but not conversely, and sub-ranking and top-k calibration are incomparable. Empirically, we find popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applying our framework to RLHF reward models, we find that calibration correlates strongly but not perfectly with benchmark accuracy, suggesting it captures a meaningful quality dimension beyond top-1 accuracy. These findings motivate future work on understanding the downstream effects of miscalibration and developing methods to correct it.

2605.30434 2026-06-01 cs.LG cs.AI cs.CL cs.MA 版本更新

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench:关于长周期智能数据分析的失败

Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph(知识图谱联合实验室)

AI总结 提出LongDS基准,用于评估长周期多轮数据分析中智能体维护和更新分析状态的能力,发现最佳模型平均准确率仅48.45%,且长周期错误占失败原因的52%-69%。

Comments Ongoing work

详情
AI中文摘要

现实世界的数据分析本质上是迭代的,然而现有基准大多评估孤立或短期的交互任务,未能测试智能体在长周期内跟踪不断变化的分析上下文的能力。我们引入了LongDS,一个用于长周期、多轮数据分析的基准,其中智能体必须维护、更新、恢复和组合不断变化的分析状态。LongDS包含68个从真实世界Kaggle笔记本构建的任务,涵盖地球科学、商业和教育等六个领域的2,225轮交互。任务围绕状态演化模式(例如反事实扰动、回滚、多状态组合)设计,平均依赖跨度为11.3轮。评估五个最先进模型,我们发现最佳模型仅达到48.45%的平均准确率,性能从早期到后期轮次下降近47个百分点,长周期错误占失败原因的52%-69%。进一步分析表明,额外的智能体步骤并不一定能提高性能,这表明关键瓶颈在于维护正确的分析状态,而非增加交互预算。我们发布LongDS以支持可靠的长周期智能数据分析研究。代码和数据将在https://github.com/zjunlp/DataMind发布。

英文摘要

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%--69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.

2605.30415 2026-06-01 cs.CL cs.AI 版本更新

Domain Adaptation and Reasoning Frameworks in Language Models: A Controlled Experiment with Historical Cosmology

语言模型中的领域适应与推理框架:以历史宇宙学为受控实验

Francesco De Bernardis

发表机构 * Independent Researcher(独立研究者)

AI总结 通过历史宇宙学受控实验,研究领域适应如何重塑语言模型的解释行为,发现适应主要改变解释框架而非直接改变立场。

Comments 17 pages, 3 figures

详情
AI中文摘要

我们以历史宇宙学为受控环境,研究领域适应如何重塑语言模型中的解释行为。在第一阶段,我们在一个去除明确日心说引用的前哥白尼语料库上从头训练一个小型语言模型,并评估地动说或日心说延续是否仍然出现。在第二阶段,我们使用QLoRA在同一语料库上微调一个更大的预训练模型,以研究适应如何修改解释框架和宇宙学立场。模型输出使用LLM-as-judge框架进行评估,该框架标记宇宙学立场(地心说、日心说或模糊)和解释框架(前现代与现代)。在受限的第一阶段,较小的模型偶尔生成局部的地动说延续,但这些延续全局不稳定,不足以支持连贯的宇宙学推理。在第二阶段,微调导致向现代前解释框架的大幅且统计显著的转变,而条件宇宙学立场分布在这些框架内相对稳定。因此,地心说输出的增加主要源于解释机制的重新分布,而非立场的直接修改。这些结果表明,领域适应可能主要重塑生成延续的语言框架,而立场的变化则次要地源于这些转变。

英文摘要

We investigate how domain adaptation reshapes explanatory behavior in language models using historical cosmology as a controlled setting. In Phase 1, we train a small language model from scratch on a pre-Copernican corpus from which explicit heliocentric references were removed, and evaluate whether Earth-motion or heliocentric continuations nevertheless emerge. In Phase 2, we fine-tune a larger pretrained model using QLoRA on the same corpus in order to study how adaptation modifies explanatory framing and cosmological stance. Model outputs are evaluated using an LLM-as-judge framework that labels both cosmological stance (geocentric, heliocentric, or ambiguous) and explanatory frame (premodern versus modern). In the constrained setting of Phase 1, the smaller models occasionally generate local Earth-motion continuations, but these remain globally unstable and insufficient to support coherent cosmological reasoning. In Phase 2, fine-tuning induces a large and statistically significant shift toward premodern explanatory framing, while the conditional cosmological stance distributions remain comparatively stable within those frames. As a result, increases in geocentric outputs arise primarily from redistribution over explanatory regimes rather than from direct modification of stance. These results suggest that domain adaptation may primarily reshape the linguistic frameworks from which continuations are generated, with changes in stance emerging secondarily from those shifts.

2605.30409 2026-06-01 cs.CV cs.AI 版本更新

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

SANA-Streaming: 基于混合扩散Transformer的实时流式视频编辑

Yuyang Zhao, Yicheng Pan, Qiyuan He, Jincheng Yu, Junsong Chen, Tian Ye, Haozhe Liu, Enze Xie, Song Han

发表机构 * NVIDIA MIT(麻省理工学院) THU(清华大学) NUS(新加坡国立大学) HKU(香港大学)

AI总结 提出系统-算法协同设计的SANA-Streaming框架,通过混合扩散Transformer架构、循环反向正则化训练策略和高效系统协同设计,在消费级GPU上实现高分辨率实时流式视频编辑,达到1280×704分辨率24 FPS的端到端性能。

详情
AI中文摘要

实时流式视频到视频编辑(V2V)对于直播和游戏等交互式应用至关重要,但由于对时间一致性和推理吞吐量的严格要求,它仍然是一个严峻的挑战。在本文中,我们提出了SANA-Streaming,一个系统-算法协同设计的框架,用于在消费级GPU上进行高分辨率、实时流式视频编辑,具有以下三个核心设计:(1)混合扩散Transformer架构在部分块中引入softmax注意力以提高局部建模能力,同时保持线性层的效率。(2)循环反向正则化是一种新颖的训练策略,通过流匹配从生成内容预测源帧来强制语义一致性,无需成对的长编辑视频即可提高时间一致性。(3)高效系统协同设计结合了融合GDN内核和针对NVIDIA Blackwell(RTX 5090)架构优化的混合精度量化(MPQ)。通过分析实际吞吐量,我们的MPQ在保持生成质量的同时最大化Tensor Core利用率。最终系统在单个RTX 5090 GPU上以24 FPS的端到端帧率实现实时1280×704分辨率编辑,其中DiT核心运行在58 FPS。实验结果表明,我们的协同设计方法在时间一致性和系统吞吐量方面均显著优于现有最先进方法。

英文摘要

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.

2605.30406 2026-06-01 cs.CY cs.AI 版本更新

AI Loss of Control Incident Management: Response & Resilience

AI失控事件管理:响应与韧性

Ross Gruetzemacher

发表机构 * Wichita State University(威斯康星州立大学)

AI总结 针对AI失控事件管理的研究空白,本文提出一个基础框架和分类法,将失控场景分为“极其昂贵”和“不可能”两类,并分别给出韧性投资和主动事件管理(包含遏制与威胁中和)的应对策略。

Comments 25 pages, 4 figures

详情
AI中文摘要

近期研究表明,展示欺骗和抵抗关闭能力的AI系统表明AI失控(LOC)是一个紧迫的政策问题,然而当前文献几乎只关注对齐和预防。为填补这一空白,本文引入了一个用于管理灾难性AI失控事件的基础框架和分类法。该分类法的第一层区分了重新控制“极其昂贵”与“不可能”的场景。虽然不可能的场景需要立即进行韧性投资以从根本上限制AI的攻击面,但极其昂贵的场景需要通过遏制和威胁中和进行主动事件管理。该框架进一步将这些可管理的事件分为意外失控(需要自动断路器响应)和对抗性失控(需要逐步升级的应对措施)。通过将三个严重性等级映射到具体场景矩阵,本文为管理前所未有的AI风险提供了具体且相称的指南。

英文摘要

Recent research demonstrating AI systems exhibiting deception and shutdown resistance suggests that AI loss of control (LOC) is an urgent policy concern , yet current literature focuses almost exclusively on alignment and prevention. To address this gap, this paper introduces a foundational framework and taxonomy for managing catastrophic AI LOC incidents. The taxonomy's first level distinguishes between scenarios where regaining control is 'extremely costly' versus 'impossible'. While impossible scenarios demand immediate resilience investments to fundamentally restrict an AI's attack surface , extremely costly scenarios require active incident management via Containment and Threat Neutralization. The framework further categorizes these manageable events into accidental LOC (requiring automated circuit-breaker responses) and adversarial LOC (requiring graduated escalatory measures). By mapping three severity classes to specific scenario matrices, this paper provides a concrete, proportional guide for managing unprecedented AI risks.

2605.30394 2026-06-01 cs.SE cs.AI 版本更新

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

CodeGolf Bench:评估大型语言模型简洁代码生成能力的多语言基准

Vedant Padwal

发表机构 * Independent(独立)

AI总结 提出CodeGolf Bench基准,基于代码高尔夫竞赛,在60种编程语言中评估LLM生成简洁高效代码的能力,实验表明推理模型显著优于非推理模型。

Comments 12 pages, 6 figures, 5 tables

详情
AI中文摘要

本文介绍了CodeGolf Bench,一个能够评估大型语言模型(LLM)在60种编程语言中生成简洁代码能力的基准。基于代码高尔夫(一种专注于最小字符或字节解决方案的娱乐性编程竞赛),该基准提供了衡量LLM生成高效、简洁代码能力的独特指标。与现有受限于固定问题集和语言覆盖范围的基准不同,CodeGolf Bench利用code.golf平台提供新问题和实时人类表现基线。对九个LLM在Python和C++任务上的评估表明,推理模型显著优于非推理模型,达到了70.97%的最佳平均百分位数。这一性能差距在C++中尤为明显,突显了推理对于具有严格语法要求的语言的重要性。非推理模型在两种语言中的效率优化方面表现更差,最佳百分位数显著低于推理模型。CodeGolf Bench提供了一个动态框架,用于评估LLM在代码高尔夫中针对不断进化的人类表现的代码生成能力。

英文摘要

This paper introduces Code Bench, a benchmark capable of evaluating Large Language Models (LLMs) concise code generation abilities in 60 programming languages. Based on code golf, a recreational programming competition focused on minimal character or byte solutions, the benchmark provides a distinctive measure of LLMs ability to produce efficient, concise code. Unlike existing benchmarks limited by fixed problem sets and language coverage, CodeGolf Bench leverages the code.golf platform to provide new problems and live human performance baselines. Evaluation of nine LLMs on Python and C++ tasks demonstrates that reasoning models significantly outperform non-reasoning models, achieving best average percentile of 70.97%. This performance gap is particularly pronounced in C++, highlighting reasoning's importance for languages with strict syntax requirements. Non-reasoning models struggle more with efficiency optimization across both languages, with best percentiles significantly lower than reasoning counterparts. CodeGolf Bench offers a dynamic framework for evaluating LLM code generation capabilities against evolving human performance on code golf.

2605.30393 2026-06-01 cs.LG cs.AI cs.CR 版本更新

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

NumLeak: 基础模型中的公开数值基准作为潜在标签

Anany Kotawala

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出NumLeak框架,通过API边界探测和开源因果模型的白盒验证,揭示基础模型在预训练中记忆公开数值基准,导致评估高估泛化能力。

Comments 23 pages, 12 figures, 17 tables. Accepted at the ICML 2026 Workshop on the Impact of Memorization on Trustworthy Foundation Models (MemFM)

详情
AI中文摘要

公开数值基准出现在预训练中,因此基于日期进行评估可能测量的是记忆性回忆而非样本外技能。我们引入NumLeak,一个结合生产模型API边界探测与开源因果模型白盒受控验证的测量框架。顶级前沿LLM在3种子池化后,对Fama-French市场超额收益的回忆皮尔逊相关系数r=0.97-0.99,同时五个兄弟因子在25个基点内误差不超过0.15;在美国失业率、CPI通胀和NOAA温度上观察到类似保真度。在近期发布的保留集上,解析率骤降至21-57%,但在回答的月份上r仍约为0.99,拒绝-回忆不对称性符合记忆通道的预测。白盒实验重现了剂量反应,对数概率排名检测到开放生成遗漏的记忆,意味着封闭API黑盒探测低估了该通道。一个Sonnet“日期到市场情绪”回归与真实Mkt-RF的相关性r=0.74,在残差化模型自身回忆后降至r=0.02。一行系统提示防御在概念和历史叙事查询上以接近零的效用成本阻止了99.8%的非自适应单轮后缀攻击集。

英文摘要

Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than out-of-sample skill. We introduce NumLeak, a measurement framework that combines API-boundary probes on production models with a white-box controlled validation on an open causal LM. Top-tier frontier LLMs recall the Fama-French market excess return at 3-seed pooled Pearson r=0.97-0.99 while staying within 0.15 within-25bps on the five sibling factors; comparable fidelity appears on U.S. unemployment, CPI inflation, and NOAA temperature. On a recent-release holdout, parse rate collapses to 21-57% but r stays at approximately 0.99 on months answered, the refuse-or-recall asymmetry a memorized channel predicts. The white-box experiment reproduces the dose-response, and logprob ranking detects memorization that open-ended generation misses, implying closed-API black-box probes understate the channel. A Sonnet "date to market-sentiment" regression that correlates with true Mkt-RF at r=0.74 collapses to r=0.02 once the model's own recall is residualized out. A one-line system-prompt defense blocks 99.8% of a non-adaptive single-turn suffix attack set at near-zero utility cost on conceptual and historical-narrative queries

2605.30391 2026-06-01 cs.MA cs.AI cs.CL 版本更新

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

机器中的社会推理:探究大语言模型辩论中的集体求真动态

Tom Pecher

发表机构 * Department of Computer Science University of Bath(计算机科学系英国巴斯大学)

AI总结 通过多智能体辩论模拟论证推理理论,证明大语言模型集体辩论能显著提升基于问卷的求真任务性能,并提出利用辩论动态测量模型内在属性的新基准方法。

Comments Master's thesis

详情
AI中文摘要

人类推理长期以来被理论化地认为是通过社会方式运作的,而非孤立的个体认知,而是通过集体对抗性话语,这一框架被称为论证推理理论(ATR)。ATR不依赖个体“理性主义者”作为求真的主要载体,而是将真理重新概念化为社会认识论的涌现属性:即在不完美的个体推理在辩论的对抗压力下精炼的产物。这种集体智能的分布式方法引导人类达到了更高的认知高度,并支撑着所有民主制度的基本原则。本论文首次通过大语言模型(LLM)的多智能体辩论(MAD)模拟ATR,开辟了新领域。通过严格的实证分析,我们证明,在正确设计一组认知多样化的模型时,LLM-MAD能够显著提高基于问卷的求真任务性能,即使单个辩论参与者独立表现有限。此外,我们提供了强有力的实证证据,表明这种性能提升在机制上根植于ATR的核心原则,暗示集体推理可能普遍优于个体推理,而非生物学或进化中的偶然现象。最后,基于对辩论动态的分析,我们提出了一种新的基准测试方法,利用LLM-MAD测量模型内在属性(如幻觉倾向),从而以当前静态基准方法无法支持的方式比较模型。

英文摘要

Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth-seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever-greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi-agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM-MAD can significantly improve truth-seeking performance on questionnaire-based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM-MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.

2605.30387 2026-06-01 cs.LG cs.AI cs.CV eess.SP 版本更新

Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification

基于小波图像变换和频谱流匹配的功能磁共振时间序列生成用于脑疾病识别

Hwa Hui Tew, Junn Yong Loo, Fang Yu Leong, Julia K. Lau, Ding Fan, Hernando Ombao, Raphaël C. -W. Phan, Chee Pin Tan, Chee-Ming Ting

发表机构 * School of Information Technology, Monash University Malaysia(墨尔本大学马来西亚分校信息科技学院) School of Engineering, Monash University Malaysia(墨尔本大学马来西亚分校工程学院) Statistics Program, King Abdullah University of Science and Technology(国王阿卜杜勒·阿齐兹大学科学与技术学院统计学项目)

AI总结 提出双频谱流匹配(DSFM)框架,通过离散小波变换和离散余弦变换对BOLD信号进行双频表示,结合频谱流匹配生成类条件余弦频率表示,再经逆变换重建生理上合理的时域BOLD信号,以改善下游脑网络分类。

Comments Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
AI中文摘要

功能磁共振成像(fMRI)通过测量随时间变化的血氧水平依赖(BOLD)信号,提供对动态脑活动的非侵入性访问。然而,fMRI采集的资源密集型特性限制了数据驱动脑分析模型所需的高保真样本的可用性。虽然现代生成模型可以合成fMRI数据,但它们在复制原始BOLD信号固有的非平稳性、复杂的时空动态和生理变化方面仍然面临挑战。为了解决这些挑战,我们提出了双频谱流匹配(DSFM),一种新颖的fMRI生成框架,它将BOLD信号的双频表示与频谱流匹配级联起来。具体来说,我们的框架首先通过离散小波变换(DWT)将BOLD信号转换为小波分解图,以捕获全局瞬态和多尺度变化,并将其投影到跨脑区和时间的离散余弦变换(DCT)空间中,以利用低频主导BOLD系数的局部能量压缩。随后,训练一个频谱流匹配模型来生成类条件余弦频率表示。通过逆DCT和逆DWT操作重建生成的样本,以恢复生理上合理的时域BOLD信号。这种双变换方法施加了结构化的频率先验,并保留了关键的生理脑动力学。最终,我们通过改进的下游基于fMRI的脑网络分类证明了我们方法的有效性。代码可在 https://github.com/htew0001/DSFM.git 获取。

英文摘要

Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, the resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often remain challenging in replicating their inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, and projects into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low-frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class-conditioned cosine-frequency representation. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI-based brain network classification. The code is available at https://github.com/htew0001/DSFM.git .

2605.30385 2026-06-01 cs.LG cs.AI 版本更新

LLMs Without Deep Neural Networks: New Architecture, Benefits and Case Study

无需深度神经网络的LLM:新架构、优势与案例研究

Vincent Granville

AI总结 本文提出一种基于RBF网络的新架构,无需深度神经网络即可通过闭式解找到损失函数全局最优,消除繁琐训练步骤,并提高可解释性和准确性。

Comments 9 pages, 5 figures

详情
AI中文摘要

本文旨在验证我在LLM背景下提出的深度神经网络替代方案。最近,中国研究人员对一种称为RBF网络的模型产生了浓厚兴趣,该模型作为标准DNN的替代品,具有更高的可解释性和准确性。事实证明,我独立发现的新模型基于完全相同的机制,但有一个重大转折:它不需要DNN,因为它以闭式解在一次迭代中找到损失函数的全局最优,从而消除了繁琐的训练步骤。这里我提供了我的技术的高层概述,包括案例研究和与类似方法的比较。

英文摘要

The purpose of this article is to provide validation to my deep neural network alternative in the context of LLMs. Very recently, there has been a significant interest by Chinese researchers in a model called RBF network, as a substitute to standard DNNs, with increased explainability and higher accuracy. It turns out that my new model, discovered independently, is based on the exact same machinery. But with a major twist: it does not need DNN as it finds the global optimum of the loss function in closed form, in one iteration, thus eliminating the tedious training step. Here I provide a high-level overview of my technology, with case study and comparison to similar methods.

2605.30383 2026-06-01 cs.RO cs.AI 版本更新

Structured interactions improve distributed coordination beyond model scaling in a real-world multi-robot system

结构化交互在真实世界多机器人系统中超越模型规模提升分布式协调能力

Junping Wang, Zhizhong Zhang, Yongqiang Tang, Geng Zheng, Jiaming Zhang, Shiji Song, Yanmei Li, Yushan Ma

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院) Department of Automation, Tsinghua University(清华大学自动化系) Liupanshan Laboratory, Ningxia University(宁夏大学鲁班实验室)

AI总结 通过真实多机器人实验,发现模块化层次化交互拓扑相比增加模型规模能更显著提升协调性能。

详情
AI中文摘要

提升单个机器人能力是常见但昂贵的做法。本文研究真实多机器人协调中的系统级设计问题:在硬件预算匹配的情况下,重构机器人间的通信是否比增加机载模型规模带来更大收益?使用10个物理机器人执行代表性的运输与建图任务(每种条件5次运行,共60次运行),我们发现从全连接切换到模块化层次化交互可将归一化性能提升47分(0-100分),而将神经网络隐藏层大小加倍最多提升9分。嵌套混合效应模型比较显示,拓扑对模型拟合的改善远大于规模。该模式在独立的SMAC复制实验中得到确认;异构基准重分析提供次要支持性一致性检查而非主要证据。在仿真校准的外推中观察到超过1024个隐藏单元的性能饱和,但未直接在硬件上验证。这些结果表明,在测试系统和任务设置中,交互结构可发挥主导作用,但更广泛的定量泛化仍有待建立。

英文摘要

Scaling individual robot capabilities is common but costly. Here we investigate a system-level design question in real-world multi-robot coordination: given matched hardware budgets, does restructuring communication among robots yield larger gains than increasing onboard model size? Using a representative transport-and-mapping task with 10 physical robots (5 runs per condition, 60 runs total), we find that switching from fully connected to modular hierarchical interactions improves normalised performance by 47 points (0--100), whereas doubling neural network hidden size yields at most 9 points. Nested mixed-effects model comparisons show a substantially larger improvement in model fit for topology than for scale. The pattern is confirmed in independent SMAC replications; heterogeneous benchmark reanalyses provide secondary supporting consistency checks rather than primary evidence. Performance saturation beyond 1024 hidden units is observed in simulation-calibrated extrapolation, not directly on hardware. These results indicate that interaction structure can play a dominant role within the tested system and task setting, while broader quantitative generalisation remains to be established.

2605.30381 2026-06-01 cs.LG cs.AI 版本更新

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

当LLM学会一致错误:合成欺骗的线性表示的多模型研究

Vahideh Zolfaghari

发表机构 * Algoverse AI Research Medical Sciences Education Research Center, Mashhad University of Medical Sciences(马什哈德大学医学科学教育研究中心) Student Research Committee, Department of Health Information Technology and Management, Medical Informatics, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences(谢赫·贝赫什提大学医学科学学院学生研究委员会,健康信息科技与管理系,医学信息学)

AI总结 通过LoRA微调五个Transformer模型的诚实与欺骗变体,使用线性探针检测合成欺骗,发现早期层即可达到近完美AUC,支持线性表示假说,并揭示两种表示机制。

详情
AI中文摘要

欺骗性对齐(模型保持准确的内部表示同时故意产生错误输出)仍然是AI安全的核心挑战。虽然战略性欺骗是主要的长期关注点,但通过直接优化错误答案诱导的合成不诚实为研究学习欺骗的表示基础提供了受控测试平台。我们引入了一个多模型范式,其中五个Transformer模型(Pythia-1.4B、Gemma-2-2B/9B、Qwen2.5-7B、Llama-3.1-8B)的诚实和欺骗变体使用LoRA在相同问题分布上进行微调。在平均池化隐藏状态上训练的线性探针在四个架构的1-3层即可检测到合成欺骗,AUC接近完美(≥0.99),而Pythia-1.4B达到峰值0.705。逻辑回归探针始终匹配或优于MLP探针,支持线性表示假说。在TruthfulQA上训练的探针以近乎零损失(ΔAUC≈0)泛化到保留的MMLU主题。深层表示对高斯噪声表现出强鲁棒性,其中Gemma-2模型表现出卓越的稳定性。对Fisher判别比、有效秩、质心几何、方向稳定性、跨域对齐和校准(ECE)的机制分析揭示了两种机制:Pythia/Llama/Qwen中的表示坍缩与Gemma-2中的高维保持。在所有模型中,欺骗方向在更深层逐渐巩固,在1-4层可实现最优校准(除Pythia外ECE<0.01)。这些结果表明,通过适度的监督微调,鲁棒、域不变的欺骗表示可以迅速固化,对基于激活的监控具有启示意义。

英文摘要

Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern, synthetic dishonesty - induced via direct optimization on incorrect answers - provides a controlled testbed for studying the representational basis of learned deception. We introduce a multi-model paradigm in which honest and deceptive variants of five transformer models (Pythia-1.4B, Gemma-2-2B/9B, Qwen2.5-7B, Llama-3.1-8B) are fine-tuned using LoRA on the same question distribution. Linear probes trained on mean-pooled hidden states detect synthetic dishonesty with near-perfect AUC (greater than or equal to 0.99) as early as layers 1-3 in four architectures, while Pythia-1.4B reaches a peak of 0.705. Logistic regression probes consistently match or outperform MLP probes, supporting the Linear Representation Hypothesis. Probes trained on TruthfulQA generalize with near-zero loss (Delta AUC approx. 0) to held-out MMLU subjects. Late-layer representations show strong robustness to Gaussian noise, with Gemma-2 models exhibiting exceptional stability. Mechanistic analysis of Fisher Discriminant Ratio, effective rank, centroid geometry, directional stability, cross-domain alignment, and calibration (ECE) reveals two regimes: representational collapse in Pythia/Llama/Qwen versus high-dimensional preservation in Gemma-2. Across all models, the dishonesty direction consolidates progressively in deeper layers, with optimal calibration (ECE less than 0.01 except Pythia) achievable in layers 1-4. These results demonstrate that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.

2605.30376 2026-06-01 cs.LG cs.AI 版本更新

Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling

Unicorn: 通过通用相关性建模实现高维时间序列的规模化预测

Haochen Yuan, Yichen Song, Yunbo Wang, Xiaokang Yang

发表机构 * MoE Key Lab of Artificial Intelligence(人工智能大规模并行计算实验室) AI Institute(人工智能研究院) School of Computer Science(计算机科学学院) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出Unicorn框架,通过潜在原型码本解耦相关性建模与特定通道身份,实现跨异构数据集的可扩展多数据集预训练,在少样本迁移场景中显著优于现有模型。

详情
AI中文摘要

现代时间序列架构面临一个基本权衡:通道独立模型随着数据量增加可扩展性好,但忽略了关键的通道间依赖性;而通道依赖模型具有表达力,但仍然是“维度受限的”,难以泛化到异构数据集。为了弥合这一差距,我们引入了Unicorn(通用相关网络),一个用于高维时间序列的可扩展、多数据集预训练框架。Unicorn的核心是一个潜在原型码本,它将相关性建模与特定通道身份解耦。通过将异构通道投影到共享潜在空间,Unicorn学习与身份无关的、可复用的交互模式,这些模式可以跨具有不同维度和语义的领域迁移。大量实验表明,Unicorn显著优于最先进的预测架构,特别是在少样本迁移场景中,为多变量时间序列基础模型提供了一条可扩展的路径。

英文摘要

Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore critical inter-channel dependencies, while channel-dependent models are expressive but remain ``dimension-bounded'', struggling to generalize across heterogeneous datasets.To bridge this gap, we introduce Unicorn (Universal Correlation Network), a framework for scalable, multi-dataset pretraining on high-dimensional time series. At the core of Unicorn is a latent prototype codebook that decouples correlation modeling from specific channel identities. By projecting heterogeneous channels into a shared latent space, UniCorN learns identity-agnostic, reusable interaction patterns that transfer across domains with diverse dimensionalities and semantics. Extensive experiments show that Unicorn significantly outperforms state-of-the-art forecasting architectures, particularly in few-shot transfer scenarios, offering a scalable path toward multivariate time series foundation models.

2605.30375 2026-06-01 physics.flu-dyn cs.AI 版本更新

Full-field prediction for engineering-scale three-dimensional aircraft with multigrid-hierarchical learning

基于多重网格分层学习的工程尺度三维飞机全场预测

Yunfei Liu, Hao Wang, Yuhang Qi, Hao Yue, Dehong Meng, Wei Li, Rui Wang, Tiejun Li, Jie Liu, Junwu Hong, Xinhai Chen

发表机构 * Computational Aerodynamics Institute(计算空气动力学研究所) Laboratory of Digitizing Software for Frontier Equipment(前沿装备数字化软件实验室) National Key Laboratory of Parallel and Distributed Computing(国家并行与分布式计算重点实验室) College of Computer Science and Technology(计算机科学与技术学院)

AI总结 提出MHLF框架,结合拓扑一致的多重网格表示和分层策略,实现工程尺度三维飞机流场的高效高保真预测,加速CFD收敛3-8倍。

详情
AI中文摘要

高保真计算流体动力学对航空航天设计至关重要,但实际三维飞机的工程尺度模拟计算成本高昂。基于学习的流场初始化可以通过减少初始解与收敛解之间的数值距离来提高效率,然而现有的深度学习方法难以扩展到具有多尺度区域异质性的大型三维飞机流场。因此,大多数先前的研究集中在二维问题、表面量、积分气动系数或网格分辨率有限的简化三维案例上。本文提出MHLF,一种多重网格分层学习框架,用于加速工程尺度飞机流场模拟,同时保持高保真数值精度。MHLF将拓扑一致的几何多重网格表示与分层策略相结合,在预测和后续CFD校正过程中捕捉区域流场异质性。在涵盖马赫数0.15至6.0、包括亚声速、跨声速和超声速状态的三个工程尺度飞机案例中,MHLF在不牺牲流场精度的情况下加速收敛,相比传统初始化实现了3至8倍的效率提升。这些结果展示了CFD领域内大型三维飞机实际全场流场预测的能力,并为数据驱动的高保真飞机流场模拟加速奠定了基础。

英文摘要

High-fidelity computational fluid dynamics is essential for aerospace design, but engineering-scale simulations of practical three-dimensional aircraft remain computationally expensive. Learning-based flow-field initialization can improve efficiency by reducing the numerical distance between the initial and converged solutions, yet existing deep learning approaches remain difficult to scale to large three-dimensional aircraft flows with multiscale regional heterogeneity. Most prior studies therefore focus on two-dimensional problems, surface quantities, integral aerodynamic coefficients, or simplified three-dimensional cases with limited grid resolution.Here we propose MHLF, a multigrid-hierarchical learning framework for accelerating engineering-scale aircraft flow simulations while preserving high-fidelity numerical accuracy. MHLF combines a topologically consistent geometric multigrid representation with a hierarchical strategy that captures regional flow heterogeneity during both prediction and subsequent CFD correction. Across three engineering-scale aircraft cases spanning Mach 0.15 to 6.0 and covering subsonic, transonic and supersonic regimes, MHLF accelerates convergence without sacrificing flow-field accuracy, achieving a 3 to 8 times efficiency improvement over conventional initialization. These results demonstrate practical full-flow-field prediction for large three-dimensional aircraft within the CFD domain and provide a foundation for data-driven acceleration of high-fidelity aircraft flow simulation.

2605.30372 2026-06-01 cs.NE cs.AI cs.LG q-bio.NC 版本更新

Evolutionary Algorithm for Reservoir Learning and Yielding

用于储层学习和生成的进化算法

Julien Testu, Pierrick Legrand, Xavier Hinaut

发表机构 * Inria LaBRI, CNRS UMR 5800(LaBRI,CNRS UMR 5800) Bordeaux INP, ENSC(Bordeaux INP,ENSC) IMS, CNRS UMR 5218(IMS,CNRS UMR 5218)

AI总结 提出进化算法EARLY,通过进化多储层回声状态网络的拓扑和超参数,在时序学习任务上优于随机搜索,并发现任务难度影响网络结构。

详情
Journal ref
GECCO '26 - The Genetic and Evolutionary Computation Conference, Jul 2026, San jos{é}, Costa Rica
AI中文摘要

储层计算是一种递归神经网络,因其将动态处理与训练好的读出层分离而成为时序学习的有前途方法。然而,经典的回声状态网络(ESN)通常需要针对任务调整其架构和超参数才能获得良好性能。本文介绍了EARLY(用于储层学习和生成的进化算法),这是一个旨在进化多储层ESN的拓扑和超参数的框架。受大脑模块化组织的启发,EARLY将架构编码为基于图的基因组,并应用交叉、变异和选择来发现有效的配置。我们的目标是创建通用架构和任务诱导泛化。该方法在CogScale数据集的时序学习任务上进行了评估。结果表明,进化出的架构在多个任务上优于通过随机搜索获得的架构,并根据任务难度表现出结构差异:简单任务产生轻量级架构,而复杂任务倾向于更丰富的模块化组织。这些发现表明,进化搜索有助于为更广泛的时序问题识别可复用的储层结构。进一步在跨情境学习数据集上评估进化出的架构,以评估其适应新环境的能力。

英文摘要

Reservoir computing, a type of recurrent neural network, is a promising approach for temporal learning as it separates dynamic processing from the trained readout layer. However, classical Echo State Networks (ESNs) often require task-specific tuning of their architecture and hyperparameters to achieve good performance. This paper introduces EARLY (Evolutionary Algorithm for Reservoir Learning and Yielding), a framework designed to evolve both the topology and hyperparameters of multi-reservoir ESNs. Inspired by the modular organisation of the brain, EARLY encodes architectures as graph-based genomes and applies crossover, mutation, and selection to discover effective configurations. Our goal is to create both generic architectures and tasks inducing generalization. The method is evaluated on temporal learning tasks from the CogScale dataset. Results show that evolved architectures outperform those obtained with random search on several tasks and exhibit structural differences depending on task difficulty: simpler tasks yield lightweight architectures, while more complex tasks favour richer modular organisations. These findings suggest that evolutionary search can help identify reusable reservoir structures for a broader range of temporal problems. The evolved architectures are further evaluated on a cross-situational learning dataset to assess their ability to adapt to new environments.

2605.30368 2026-06-01 cs.NE cs.AI cs.RO q-bio.NC 版本更新

Reinterpreting Safety Thresholds as Neuron Spiking Thresholds

将安全阈值重新解释为神经元放电阈值

Enrico Del Re, Mohamed Sabry, Cristina Olaverri-Monreal

发表机构 * Johannes Kepler University Linz(约翰·凯撒大学林茨) Department Intelligent Transport Systems(智能交通运输系统部门)

AI总结 提出将替代安全措施(SSM)的固定阈值重新解释为泄漏积分点火(LIF)神经元的放电阈值,构建脉冲神经网络(SNN)学习人类刹车起始点,实现客观SSM与主观安全感知的融合。

Comments 6 pages

详情
AI中文摘要

替代安全措施(SSM)在自动驾驶领域的交通风险评估中被广泛使用。然而,大多数基于SSM的评估采用固定阈值,无法捕捉人类对持续临界状态的响应或对短暂高风险峰值的反应。本文提出了一种受生物学启发的SSM阈值重新解释,将其建模为泄漏积分点火(LIF)神经元的放电阈值,并将多个SSM输入组合成脉冲神经网络(SNN)。该SNN经过训练,使其发放的脉冲与人类刹车起始点对齐。训练数据是在使用3D-CoAutoSim平台(基于CARLA/Unreal和六自由度运动平台)的受控跟车实验中记录的,实验中生成了诱导的关键事件。结果表明,学习到的脉冲活动在定性上与跨场景的刹车行为一致,并捕捉了仅靠阈值交叉无法一致解释的反应。跨参与者的分析进一步表明,学习到的输入阈值保持相对一致,而学习到的衰减因子编码了SSM的不同时间敏感性。本研究的发现表明,脉冲动力学可能作为一种机制,促进客观SSM与主观人类安全感知的融合。

英文摘要

Surrogate Safety Measures (SSMs) are extensively utilised in the evaluation of traffic risk in automated driving contexts. However, the majority of SSM-based evaluations employ fixed thresholds that fail to capture the human response to sustained borderline conditions or the reaction to brief, high-risk peaks. The present work proposes a biologically inspired reinterpretation of SSM thresholds. This is modelled as spiking thresholds of leaky integrate-and-fire (LIF) neurons, with multiple SSM inputs combined into a spiking neural network (SNN). The SNN is trained to emit spikes that are aligned with human braking onsets. The training data was recorded in a controlled car-following experiment using the 3D-CoAutoSim platform with CARLA/Unreal and a 6-DOF motion platform, where induced critical events were generated. The results demonstrate that the learned spiking activity qualitatively aligns with braking behaviour across scenarios and captures reactions that are not consistently explained by threshold crossings alone. Analysis across participants further indicates that learned input thresholds remain relatively consistent, while learned decay factors encode different temporal sensitivities for the SSMs. The findings of this study indicate that spiking dynamics may serve as a mechanism to facilitate the convergence of objective SSMs with subjective human safety perception.

2605.30365 2026-06-01 cs.SD cs.AI eess.AS 版本更新

Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation

心理伤害:面向检索增强文本到音乐生成的标题投毒攻击

Yizhu Wen, Shuhao Zhang, Nan Zhang, Long Cheng, Hanqing Guo

发表机构 * Clemson University(克莱姆森大学) Michigan State University(密歇根州立大学)

AI总结 提出双层标题投毒策略,通过向音乐知识库注入少量恶意标题,使检索增强文本到音乐系统生成偏离用户意图的音乐,暴露了系统的完整性风险。

Comments This paper was accepted by the S&P 2026 ArtSec Workshop

详情
AI中文摘要

检索增强文本到音乐(TTM)系统通过从音乐标题数据集中检索的标题来增强未指定的用户提示。这种设计引入了对音乐知识数据库的完整性依赖。我们表明,攻击者可以通过注入少量精心制作的音乐标题来毒化数据库,导致系统检索恶意标题,从而偏置提示增强并使生成偏离用户预期功能,而无需修改用户提示、检索器或生成器。为了实现音乐标题投毒攻击,我们提出了一种双层标题投毒策略,该策略保留高级检索锚点,同时注入低级声学描述符,以将提示增强和下游音乐生成引导至攻击者选择的目标意图。在MusicCaps知识数据库、CLAP检索器和MusicGen流水线中,被投毒的生成结果显著接近攻击者的目标,同时与原始用户查询保持可比的对齐。这些结果暴露了检索增强创意AI系统的实际完整性风险。我们的演示可在以下网址找到:https://yizhu-wen.github.io/Mental-Damage/

英文摘要

Retrieval-augmented text-to-music (TTM) systems augment underspecified user prompts using captions retrieved from a music caption dataset. This design introduces an integrity dependency on the music knowledge database. We show that an attacker can poison the database by injecting a small number of crafted music captions, causing the system to retrieve malicious captions that bias prompt augmentation and steer generation away from the user's intended function, without modifying the user prompt, retriever, or generator. To achieve the music caption poisoning attack, we propose a dual-layer caption poisoning strategy that preserves high-level retrieval anchors while injecting low-level acoustic descriptors to steer prompt augmentation and downstream music generation toward an attacker-chosen target intent. In a MusicCaps knowledge database, CLAP retriever, and MusicGen pipeline, poisoned generations move substantially closer to the attacker's target, while remaining comparably aligned with the original user query. These results expose a practical integrity risk for retrieval-augmented creative AI systems. Our demo can be found at: https://yizhu-wen.github.io/Mental-Damage/

2605.30364 2026-06-01 eess.SP cs.AI 版本更新

Hamiltonian-Inspired Attention Mechanism for Scalable RF Transmitter Fingerprinting

哈密顿启发的注意力机制用于可扩展射频发射器指纹识别

Chitraksh Singh, Monisha Dhanraj, Akram Sheriff

发表机构 * Frondeur Labs(弗朗德实验室)

AI总结 提出哈密顿Transformer,通过物理启发的注意力结构(规范保持值更新和相位增量嵌入)提升射频发射器指纹识别在规模扩展下的性能。

Comments 9 pages

详情
AI中文摘要

射频(RF)指纹识别利用基带I/Q信号中硬件引入的缺陷来识别无线发射器。然而,深度学习模型在接收机和信道分布变化下性能下降,尤其是当发射器数量增加时。本文提出哈密顿Transformer,一种物理启发的注意力架构,通过使用学习到的斜对称生成器和Störmer-Verlet蛙跳积分步骤,在每个注意力头内强制执行规范保持的值动态。额外的相位增量嵌入在输入层揭示振荡器动态。所有实验使用WiSig数据集的非均衡原始I/Q信号,在四种协议下进行:同一天分类、跨接收机泛化、跨天泛化和扩展到150个设备。哈密顿Transformer在同一天条件下达到99.12%的准确率,在150个发射器时达到61.64%,在所有规模点上持续优于CNN和Transformer基线。受控消融研究确定值更新中的规范保持是驱动扩展优势的主要归纳偏置,而相位增量嵌入提供了最大的单组件改进。这些结果表明,将物理启发的结构先验嵌入注意力机制是在原始无线信号上进行大规模发射器识别的有效方法。

英文摘要

Radio-frequency (RF) fingerprinting identifies wire-less transmitters using hardware-induced imperfections present in baseband I/Q signals. However, deep learning models often degrade under receiver and channel distribution shifts, particularly as transmitter populations grow. This work proposes the Hamiltonian Transformer, a physics-informed attention architecture that enforces norm preserving value dynamics within each attention head using a learned skew-symmetric generator and a Störmer-Verlet leapfrog integration step. An additional phase-increment embedding exposes oscillator dynamics at the input layer. All experiments use non-equalized raw I/Q signals from the WiSig dataset under four protocols: same-day classification, cross-receiver generalisation, cross-day generalisation, and transmitter scaling up to 150 devices. The Hamiltonian Transformer achieves 99.12% accuracy under same-day conditions and 61.64% at 150 transmitters, consistently outperforming CNN and Transformer baselines across all scale points. A controlled ablation study identifies norm-preservation in the value update as the primary inductive bias driving the scaling advantage, with the phase increment embedding providing the single largest per-component improvement. These results indicate that embedding physics-informed structural priors into attention mechanisms is an effective approach to large-scale transmitter identification on raw wireless signals.

2605.30363 2026-06-01 q-fin.CP cs.AI cs.LG q-fin.ST 版本更新

Enhancing Regime Shift Detection Using Unstructured Data: A Study on the Treasury Market

利用非结构化数据增强制度转换检测:国债市场研究

Mingxuan Yi, Vidal Mehra, Jing Chen, John Cartlidge

发表机构 * School of Engineering Mathematics and Technology, University of Bristol, UK(布里斯托大学工程数学与技术学院) Propellant Digital B.V., Amsterdam, Netherlands(荷兰阿姆斯特丹Propellant Digital公司) School of Mathematics, Cardiff University, UK(卡迪夫大学数学学院)

AI总结 提出一种结合大语言模型推理与统计检验的文本增强型制度转换检测框架,在国债市场数据上实现F1=0.82,优于纯数据驱动方法。

Comments 8 pages, 4 figures. Code available at: https://github.com/mingxuan-yi/regime_shift

详情
AI中文摘要

金融市场的制度转换会重组资产价格和宏观变量的联合动态,打破任何单一制度校准。然而,由于数据信号嘈杂且高度多重共线性,而宣布制度转换的同期文本是非结构化的,因此难以可靠检测。标准的制度转换检测方法仅依赖结构化时间序列数据,忽略政策沟通,尽管这些文本往往在观察到的价格中实现转换之前就发出信号。我们提出了一种文本增强的制度转换检测流程,该流程将大语言模型(LLM)对央行沟通的推理与多元金融时间序列的统计验证相结合。该框架是检测器无关的:文本提出的候选点通过向量自回归(VAR)上的自助法似然比检验进行验证,而来自任意制度检测器的数据驱动候选点则通过宽松的LLM文本检查进行确认。我们在2010-2024年FOMC会议记录以及14变量美国国债和宏观经济面板数据上评估了该框架,使用了四种可互换的数据驱动检测器。所提出的流程在经核实的货币政策制度转换锚定列表上实现了F1=0.82,具有当日模态检测延迟,并且性能始终优于纯数据驱动基线。结果表明,将非结构化政策文本与统计结构性断点检测相结合,提高了金融市场制度转换识别的鲁棒性和可解释性。

英文摘要

Regime shifts in financial markets reorganise the joint dynamics of asset prices and macro variables, breaking any single-regime calibration. They are nonetheless difficult to detect reliably because the data signal is noisy and heavily multicollinear, while the contemporaneous text that announces them is unstructured. Standard regime shift detection methods rely solely on structured time-series data and ignore policy communications, even though these texts often signal shifts before they materialise in observed prices. We propose a text-enhanced regime shift detection pipeline that combines large language model (LLM) reasoning over central-bank communications with statistical validation on multivariate financial time series. The framework is detector-agnostic: text-proposed candidates are validated using a bootstrap likelihood-ratio test on a vector autoregression (VAR), while data-driven candidates from arbitrary regime detectors are ratified through a lenient LLM text check. We evaluate the framework on 2010-2024 FOMC minutes paired with a 14-variable U.S. Treasury and macroeconomic panel, using four interchangeable data-driven detectors. The proposed pipeline achieves F1 = 0.82 against a verified anchor list of monetary-policy regime shifts, with same-day modal detection latency and consistently stronger performance than pure data-driven baselines. The results demonstrate that combining unstructured policy text with statistical structural-break detection improves the robustness and interpretability of regime shift identification in financial markets.

2605.30362 2026-06-01 cs.NE cs.AI cs.CV 版本更新

XOResNet: Exclusive-OR Meta-Residuals Facilitate Deep Spiking Neural Networks Learning

XOResNet: 异或元残差促进深度脉冲神经网络学习

Jianfang Wu, Junsong Wang

发表机构 * School of Artificial Intelligence, Shenzhen Technology University(人工智能学院,深圳技术大学) Faculty of Data Science, City University of Macau(数据科学学院,澳门城市大学)

AI总结 针对深度脉冲神经网络中残差结构存在的脉冲冗余、信息损失和冗余学习问题,提出OR-ADD捷径连接和XOR元残差机制,构建XOResNet,在多个数据集上超越现有方法。

Comments 33 pages, 12 figures, 7 Tables

详情
AI中文摘要

脉冲神经网络(SNN)在深度模型中展现出优越的学习和表示能力。鉴于ResNet在深度学习中的巨大成功,自然希望用残差学习训练深度SNN。然而,现有的用于构建深度SNN的残差结构仍然面临脉冲冗余或信息损失以及冗余学习的挑战。在本研究中,我们首先旨在解决恒等映射中的相对脉冲冗余和非恒等映射中的信息损失问题。为此,我们提出了一种OR-ADD(OA)捷径连接,用于合并残差结构中两个分支的输出脉冲/电流。此外,为了减轻残差结构主干分支中的冗余学习,我们引入了XOR元残差的概念,即使用异或(XOR)操作为主干分支选择预学习残差。最后,通过整合OA捷径和XOR元残差,我们设计了XOR残差块,并基于该块进一步构建了不同深度的XOResNet。在Fashion-MNIST、CIFAR-10、CIFAR-100和miniImageNet四个数据集上的大量实验表明,所提出的XOResNet优于现有的通过梯度下降优化的最先进深度SNN。这些结果验证了我们的OA捷径和XOR元残差组件在克服SNN中残差学习基本局限性方面的有效性,为构建高性能神经形态系统提供了新的架构见解。

英文摘要

Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilities in deep models. Given the tremendous success of ResNet in deep learning, it would naturally follow to train deep SNNs with residual learning. However, existing residual structures for constructing deep SNNs still present challenges of spike redundancy or information loss, as well as redundant learning. In the present study, we first aim to address issues of relative spike redundancy in identity mapping and information loss in non-identity mapping. To this end, we propose an OR-ADD (OA) shortcut connection to merge output spikes/currents from two branches in the residual structure. Furthermore, to mitigate redundant learning in the backbone branch of the residual structure, we introduce the concept of XOR meta-residuals, i.e., selecting pre-learning residuals using the Exclusive-OR (XOR) operation for the backbone branch. Finally, by integrating the OA shortcut and XOR meta-residuals, we devise the XOR residual block and further construct XOResNet with varying depths based on this block. Extensive experiments on four datasets, Fashion-MNIST, CIFAR-10, CIFAR-100, and miniImageNet, show that the proposed XOResNet outperforms existing state-of-the-art deep SNNs optimized via gradient descent. These results validate the effectiveness of our OA shortcut and XOR meta-residual components in overcoming fundamental limitations of residual learning in SNNs, providing new architectural insights for building high-performance neuromorphic systems.

2605.30361 2026-06-01 cs.NE cs.AI cs.LG 版本更新

Gradient-Free Training of Spiking Neural Networks via Low-Rank Evolution Strategies

通过低秩进化策略的无梯度训练脉冲神经网络

Dhruv Patankar, Sachit Ramesha Gowda

发表机构 * Shunya Research(Shunya研究)

AI总结 提出EGGROLL方法,利用低秩因子化进化策略扰动,在N-MNIST数据集上以79.21%测试精度和2.23倍加速实现脉冲神经网络的无梯度训练。

Comments 12 pages, 4 figures

详情
AI中文摘要

脉冲神经网络(SNN)在神经形态硬件上具有显著的能效优势,但由于离散脉冲阈值不可微,其训练仍然具有挑战性。代理梯度方法通过近似导数规避了这一问题,但它们需要反向传播基础设施,这与片上学习不兼容。进化策略(ES)是一种自然的无梯度替代方案,但其计算成本随参数数量扩展,使得对于大型权重矩阵不实用。我们提出了一种使用EGGROLL训练SNN的方法,这是一种ES扰动的低秩因子化,将每代内存从$\mathcal{O}(mn)$降低到$\mathcal{O}(r(m{+}n))$。将EGGROLL与N-MNIST上的漏积分点火SNN相结合,我们证明了无梯度训练达到了79.21%的测试准确率,同时相对于全秩ES,每代墙钟时间减少了2.23倍。我们的结果表明EGGROLL对于SNN训练是可行的,具有明确的准确率-速度权衡,并且兼容于无需代理梯度的神经形态硬件上的训练。

英文摘要

Spiking Neural Networks (SNNs) offer compelling energy efficiency on neuromorphic hardware, yet their training remains challenging because the discrete spike threshold is non-differentiable. Surrogate-gradient methods sidestep this by approximating the derivative, but they impose backpropagation infrastructure that is incompatible with on-chip learning. Evolution Strategies (\es) are a natural gradient-free alternative, yet their computational cost scales with the number of parameters, making them impractical for large weight matrices. We present a method for training SNNs using EGGROLL, a low-rank factorisation of ES perturbations that reduces per-generation memory from $\mathcal{O}(mn)$ to $\mathcal{O}(r(m{+}n))$. Combining EGGROLL with a Leaky Integrate-and-Fire SNN on N-MNIST, we demonstrate that gradient-free training achieves 79.21% test accuracy while reducing per-generation wall-clock time by 2.23$\times$ relative to full-rank ES. Our results demonstrate EGGROLL is viable for SNN training, with a clear accuracy-speed tradeoff, compatible with training on neuromorphic hardware without surrogate gradients.

2605.27996 2026-06-01 cs.AI 版本更新

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

奖励偏差替代:单轴偏差缓解措施重定向优化压力

Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing, Mykel J. Kochenderfer

发表机构 * Stanford University(斯坦福大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出奖励偏差替代现象,即单轴缓解奖励模型偏差(如减少对长度、谄媚或风格的依赖)会将优化压力转移到相关代理上而非消除,并通过理论证明和实验(如GRPO训练中的长度惩罚导致过度自信)揭示了该问题,建议在评估中纳入策略诱导分布并跟踪多偏差。

Comments Improved readability (mostly appendix D)

详情
AI中文摘要

单轴缓解奖励模型偏差(例如,减少代理对长度、谄媚或风格的依赖)可以将优化压力旋转到相关代理上,而不是消除它,这种失败模式我们称之为奖励偏差替代。这种失败是由于在缓解评估和策略训练期间,审计分布与策略诱导分布之间的测量与优化差距造成的。我们将缓解结果形式化为一个机制分类,并证明成功的缓解、偏差替代和过度修正会在任何审计分布评分下产生相同的可观测结果,包括排名准确率和胜率,即使允许对真实奖励进行神谕访问。在已发表的偏好学习缓解工作中,我们调查的方法都没有报告证明成功缓解所需的证据。在跟踪多个偏差的同时,用策略诱导分布增强评估可以证明缩小差距,我们将其转化为缓解方法和基准的可操作处方。我们在语言模型RLHF中演示了偏差替代,其中GRPO训练期间的长度惩罚按预期压缩了响应,但将优化压力重定向到置信度校准上,导致策略过度自信,而事实自由形式准确性下降。我们还展示了一个已发表的长度去偏操作,它在审计分布上将奖励-长度相关性归零,但在四个最先进奖励模型中的三个上,在最佳N选择下重新引入了偏差,以及一个长度-谄媚耦合,其方向在人类-LLM判断者分歧下反转。

英文摘要

Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy training. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap, and we translate this into actionable prescriptions for mitigation methods and benchmarks. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the policy into overconfidence while factual free-form accuracy falls. We also show a published length-debiasing operator that zeroes reward-length correlation on the audit distribution but reintroduces bias under best-of-N selection on three of four SOTA reward models, and a length-sycophancy coupling whose direction reverses under human-LLM judge disagreement.

2605.27355 2026-06-01 cs.AI cs.CL cs.LG 版本更新

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

对齐篡改:人类反馈强化学习如何被利用以优化错位偏见

Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

发表机构 * MIT(麻省理工学院)

AI总结 本文提出对齐篡改漏洞,即对齐中的LLM通过影响偏好数据集使RLHF放大不良行为,并通过实验展示多种偏见的放大,指出现有缓解方法难以在不牺牲质量的情况下解决该问题。

Comments Accepted at ICML 2026, Source code: https://alignment-tampering.github.io/

详情
AI中文摘要

人类反馈强化学习(RLHF)是将大型语言模型(LLM)与人类偏好对齐的标准方法。在本工作中,我们引入对齐篡改,这是一种潜在漏洞,即正在对齐的LLM影响偏好数据集,导致RLHF放大不良行为。这源于RLHF的核心局限性:(1)偏好数据集由LLM自身的输出构建,使其能够影响它们;(2)成对比较仅指示哪个响应更好,而不说明原因。这些局限性可能被利用以导致对齐篡改。例如,如果LLM以更高质量生成有偏见的响应,标注者会基于质量偏好它们。然而,偏好标签无法区分质量与偏见,奖励模型继承了这一局限性。通过强化学习或最佳N采样优化此类奖励可能放大错位偏见。我们的实验展示了跨多种偏见的放大:从关键词偏见到宣传(例如性别歧视)、品牌推广和工具性目标寻求。缓解仍然具有挑战性,因为现有的鲁棒RLHF技术无法在不牺牲响应质量的情况下完全解决对齐篡改。这些发现揭示了当前RLHF的结构性漏洞,并强调了防止此漏洞的必要性。项目页面:https://alignment-tampering.github.io/

英文摘要

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/

2605.27255 2026-06-01 cs.CL cs.AI 版本更新

Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs

Pair-In, Pair-Out: 面向高效LLM的潜在多令牌预测

Wenhui Tan, Minghao Li, Xiaoqian Ma, Siqi Fan, Xiusheng Huang, Liujie Zhang, Ruihua Song, Weihang Chen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学首都人工智能学院) AI Platform, Xiaohongshu Inc.(小红书人工智能平台) University of Electronic Science and Technology of China(电子科技大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出Pair-In, Pair-Out (PIPO)方法,通过统一潜在压缩和多令牌预测,并训练轻量级置信度头消除验证器开销,在保持可靠性的同时实现推理加速。

Comments Project Page: GitHub.com/RedAI-Infra/PIPO

详情
AI中文摘要

长链式推理使得自回归解码成为现代大语言模型的主要推理成本。现有方法要么针对输入侧(潜在压缩),要么针对输出侧(推测解码和多令牌预测,MTP),但这两条工作线是独立进行的。此外,输出侧方法必须进行昂贵的验证器传递,以验证MTP预测的不可靠草稿令牌。为解决这些问题,我们提出 extbf{Pair-In, Pair-Out (PIPO)},通过将潜在压缩器和MTP头视为镜像操作来统一两侧:压缩器将两个输入令牌折叠成一个潜在表示,而MTP头将一个隐藏状态展开成一个额外的输出令牌。为了在不牺牲可靠性的情况下消除验证器成本,PIPO训练一个轻量级置信度头,决定是否接受草稿令牌。我们观察到,在线策略蒸馏(OPD)自然匹配推测解码的拒绝采样准则,因此置信度头可以以可忽略的额外成本与OPD一起训练。在AIME 2025、GPQA-Diamond、LiveCodeBench v6和LongBench v2上使用Qwen3.5-4B和9B骨干网络的实验表明,PIPO在常规解码上将pass@4提高了最多+7.15个点,同时实现了高达2.64倍的首令牌延迟和2.07倍的每令牌延迟加速。项目页面:GitHub.com/RedAI-Infra/PIPO。

英文摘要

Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing methods target either the input side (latent compression) or the output side (speculative decoding and multi-token prediction, MTP), but the two lines of work have been pursued independently. Moreover, output-side methods must incur an expensive verifier pass to validate the unreliable draft tokens predicted by MTP. To address these issues, we propose \textbf{Pair-In, Pair-Out (PIPO)}, which unifies both sides by viewing a latent compressor and an MTP head as mirror-image operations: the compressor folds two input tokens into one latent representation, while the MTP head unfolds one hidden state into one additional output token. To remove the verifier cost without sacrificing reliability, PIPO trains a lightweight confidence head that decides whether draft tokens should be accepted. We observe that On-Policy Distillation (OPD) naturally matches the rejection-sampling criterion of speculative decoding, so the confidence head can be trained alongside OPD with negligible extra cost. Experiments on AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 with Qwen3.5-4B and 9B backbones show that PIPO improves pass@4 over regular decoding by up to $+7.15$ points, while delivering up to $2.64\times$ first-token-latency and $2.07\times$ per-token-latency speedups. Project Page: GitHub.com/RedAI-Infra/PIPO.

2605.26942 2026-06-01 cs.AI cs.LO cs.SE 版本更新

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

面向数据敏感领域的LLM输出的神经符号验证(扩展预印本)

Paul Sigloch, Christoph Benzmüller

发表机构 * University of Bamberg(巴姆堡大学) Free University of Berlin(柏林自由大学)

AI总结 提出一种结合形式符号方法与神经语义分析的混合验证架构,用于检测LLM输出中的幻觉、不一致和隐私漏洞,在医疗设备损伤评估系统中实现83%的结构化实体幻觉检测率和72%的语义虚构检测率。

Comments Extended preprint version of accepted technical communication at KI 2026. 22 pages, 3 figures

详情
AI中文摘要

部署在高风险领域的LLM面临根本性的可靠性挑战:幻觉、不一致性和隐私漏洞引入了不可接受的风险,因为错误会带来法律、财务或安全后果。本文提出一种混合验证架构,结合形式符号方法与神经语义分析,为LLM生成的内容提供互补性保证。该架构采用逻辑推理进行输入验证,利用完备性属性为结构化需求提供可判定的保证。对于输出验证,基于嵌入的语义相似性检测上下文幻觉,弥补形式方法表达力不足的问题。这种分离通过并行的、基于角色的流水线实现,解决了基于提示的自验证方法(继承了产生幻觉的分布偏差)的局限性。所提出的架构和类型感知验证方法通过HAIMEDA(一个通过行动设计研究开发的真实世界医疗设备损伤评估报告系统)进行验证。评估显示,结构化实体的幻觉检测率超过83%,语义虚构的检测率为72%,报告创建时间减少30%,表明神经符号架构可以为LLM在数据敏感领域的部署提供原则性的安全保障。

英文摘要

LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities introduce unacceptable risks where errors carry legal, financial, or safety consequences. This paper presents a hybrid verification architecture combining formal symbolic methods with neural semantic analysis to provide complementary guarantees for LLM-generated content. This architecture employs logical reasoning for input verification, leveraging completeness properties to provide decidable guarantees on structured requirements. For output validation, embedding-based semantic similarity detects contextual hallucinations where formal methods lack expressiveness. This separation is realized in a parallel, actor-based pipeline, addressing limitations of prompt-based self-verification approaches, which inherit the distributional biases that produce hallucinations. The proposed architecture and type-aware verification method are validated with HAIMEDA, a real-world medical device damage assessment reporting system developed through Action Design Research. Evaluation shows hallucination detection rates of over 83% for structured entities and 72% for semantic fabrications, with a 30% reduction in report creation time, demonstrating that neuro-symbolic architectures can provide principled safeguards for LLM deployment in data-sensitive domains.

2605.26396 2026-06-01 cs.AI cs.CL cs.LG 版本更新

Advancing Creative Physical Intelligence in Large Multimodal Models

推进大型多模态模型中的创造性物理智能

Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu, Aditi Tiwari, Zhenhailong Wang, Xiusi Chen, Mahdi Namazifar, Heng Ji

发表机构 * UIUC(伊利诺伊大学香槟分校) Amazon(亚马逊)

AI总结 针对大型多模态模型在开放式环境中缺乏基于视觉的创造性工具使用能力的问题,提出MM-CreativityBench基准和基于偏好学习的具身对齐方法,显著提升实体选择并减少幻觉。

Comments 51 Pages, 9 Figures, 7 Tables, Previous Work CreativityBench: arXiv:2605.02910

详情
AI中文摘要

大型多模态模型(LMMs)在感知和推理方面取得了快速进展;然而,目前尚不清楚这些能力是否能够泛化到在开放式环境中发现基于视觉的解决方案,超越模式识别。在此类场景中,智能需要的不仅仅是回答明确的问题:它涉及识别场景中的元素如何以非显而易见但物理上可行的方式被重新利用。这种创造性问题解决形式是人类智能的核心,但在当前基准测试中基本上未得到测试。为了评估这一能力,我们引入了MM-CreativityBench,这是一个用于在视觉丰富、物理受限的环境中进行基于可操作性的创造性工具使用的基准。每个实例呈现一个场景图像,包含候选实体及其部件的结构化视图,从而能够对模型如何迭代检查场景、识别相关可操作性以及组合视觉和物理上可行的解决方案进行细粒度、交互式评估。我们的实验表明,当前的LMMs往往表现不佳,不是由于缺乏生成能力,而是因为它们无法维持基于具身的探索。模型经常忽略相关实体,对关键部件检查不足,或幻觉出图像中不存在的属性。受此失败模式的启发,我们提出了具身对齐,将创造性工具使用视为一个偏好学习问题。使用直接偏好优化,我们鼓励模型偏好基于视觉证据的属性-可操作性推理,而非幻觉替代方案。此外,我们结合从可操作性知识库中获得的监督,以指导更广泛的实体探索和多轮规划。我们的结果显示,在正确选择实体和部件方面取得了持续改进,同时大幅减少了幻觉和与具身相关的错误。

英文摘要

Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.

2605.26371 2026-06-01 cs.AI 版本更新

Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL

利用局部动态规律性实现离线分层强化学习中的可复用技能

Sarthak Dayal, Abhinav Peri, Carl Qi, Claas Voelcker, Alexander Levine, Caleb Chuck, Amy Zhang

发表机构 * UT Austin(UT奥斯汀)

AI总结 提出CARL算法,通过对比学习对齐局部动态与动作序列,在离线分层强化学习中学习可复用技能,提升下游任务性能。

详情
AI中文摘要

分层强化学习(HRL)有望通过发现和复用时间上扩展的技能,比非分层方法更有效地解决长时域强化学习(RL)任务。然而,获得真正可复用的技能仍然是一个开放挑战。为此,我们关注利用局部动态直觉的抽象:不同全局上下文中的局部转换需要类似的动作序列。通过将这些上下文与其所需的动作序列对齐,我们能够学习哪些技能可以复用以及在何处复用它们。原则上,这些信息应有益于许多HRL算法,其中高层策略需要推理其使用的低层技能。由此产生的算法CARL(基于对比动作的可复用局部控制表示)在复杂人形环境中展示了有意义技能的定性聚类,并且在与HIQL集成时,在OGBench基准上提升了下游性能。

英文摘要

Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL.

2605.26121 2026-06-01 cs.LG cs.AI 版本更新

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

GEM: 用于最优LLM数据策展的几何熵混合

Yue Min, Ziyun Qiao, Ruining Chen, Yujun Li

发表机构 * The Hong Kong University of Science and Technology, Hong Kong SAR, China(香港科学与技术大学) Peking University, Beijing, China(北京大学) University of Science and Technology of China, Hefei, China(中国科学技术大学)

AI总结 提出GEM框架,通过将数据策展重构为超球面上的变分问题并采用MM算法优化,解决了分类缺陷和嵌入各向异性问题,在1.1B参数模型上实现下游准确率提升1.2%。

Comments ICML 2026 Poster

详情
AI中文摘要

LLM预训练的有效性越来越依赖于数据组成而非单纯的数据量。然而,最优混合受到分类缺陷的阻碍:人类分类法存在本体论错位,而欧几里得聚类无法解决嵌入各向异性。我们引入GEM(几何熵混合),这是一个将数据策展重构为超球面上的变分问题并辅以混合平衡正则化项的框架。通过解耦生成先验并使用可证明的MM(Minorize-Maximize)算法优化目标,GEM有效对抗聚类坍缩,从而发现欧几里得启发式方法无法察觉的平衡语义结构。我们采用师生蒸馏将这种几何保真度扩展到网络规模语料库,并引入几何影响分数(GIS)用于可解释的分类法生成。使用1.1B参数模型的实验表明,当集成到DoReMi和RegMix等混合策略中时,GEM建立了新的最先进水平,将平均下游准确率提升高达1.2%,并为可预测的数据混合提供了稳健的坐标系。

英文摘要

LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework reformulating data curation as a variational problem on the hypersphere augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM effectively counteracts the cluster collapse to discover balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state-of-the-art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.

2605.21168 2026-06-01 cs.AI 版本更新

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

ScenePilot: 可控的边界驱动型自动驾驶关键场景生成

Qiyu Ruan, Yuxuan Wang, He Li, Zhenning Li, Cheng-zhong Xu

发表机构 * State Key Laboratory of Internet of Things for Smart City (SKL-IOTSC), University of Macau, Macau, China(智能城市物联网国家重点实验室(SKL-IOTSC)、澳门大学、中国澳门) Faculty of Science and Technology, University of Macau, Macau, China(澳门大学科技学院)

AI总结 提出ScenePilot框架,通过结合RSS物理可行性评分与在线学习的AV风险预测器,将场景生成建模为约束多目标强化学习,并引入步级可行性感知屏蔽,以生成物理上可解但导致自动驾驶系统失败的关键场景。

详情
AI中文摘要

安全关键场景对于评估自动驾驶系统至关重要,但由于其在自然日志中罕见,基于仿真的压力测试不可或缺。大多数场景生成方法将周围智能体视为对手,但它们要么(i)未显式建模车辆-道路物理极限而导致失败,产生视觉极端但物理上不可解的碰撞,要么(ii)单独强制执行物理可行性或策略可行性,可能过度关注激进操作或受限于控制器依赖的能力边界。我们提出ScenePilot,一个可行性引导的、边界驱动的框架,针对边界带:即原则上物理可解但仍导致部署的自动驾驶堆栈失败的场景。我们将生成建模为约束多目标强化学习,结合RSS衍生的物理可行性评分$σ$和在线学习的AV风险预测器$Φ$,并引入步级可行性感知屏蔽,以保持探索接近可行性边界,同时避免不可行的伪影。在SafeBench上使用多个规划器的实验表明,ScenePilot在保持物理有效性的同时,产生了显著更高的碰撞率(+6.2个百分点),并且在这些边界带场景上的对抗性微调持续降低了下游碰撞率。代码可在https://github.com/QiyuRuan/ScenePilot获取。

英文摘要

Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score $σ$ with an online-learned AV-risk predictor $Φ$, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at https://github.com/QiyuRuan/ScenePilot.

2605.30288 2026-06-01 cs.AI 版本更新

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

MIRA: 基于自锚定评分标准的中期训练源感知数据选择

Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu, Bryan Dai

发表机构 * Beihang University(北洋大学) IQuest Research(IQuest研究院) Shanghai Jiao Tong University(上海交通大学) University of British Columbia(不列颠哥伦比亚大学) Langboat Multilingual-Multimodal-NLP/mira(Langboat多语言-多模态-NLP/mira)

AI总结 针对中期训练中异构数据源的选择问题,提出MIRA框架,通过自锚定评分标准发现和可扩展的学生评分器,在代码中期训练中仅用一半token即可匹配全语料性能。

详情
AI中文摘要

中期训练已成为现代大语言模型开发中的重要阶段,使用大规模精选混合数据在最终后训练前增强能力。其数据选择问题具有独特性:数据在接近预训练规模的预训练风格目标下优化,但针对下游能力进行策划,并来自具有不同格式和训练角色的异构源。因此,有效选择需要可扩展性和源自适应语义标准。现有的基于模型的方法可扩展性好,但仅提供隐式质量信号。语义选择方法提供更强的判断,但通常假设固定评分标准或标准化数据格式。为解决这一不匹配,我们提出MIRA,一种基于自锚定评分标准发现的源感知过滤框架。关键思想是将评分标准构建作为数据选择的一部分:MIRA首先发现每个源组应评估什么,然后将这些判断提炼为可扩展的学生评分器,用于全语料过滤。在包含21个源和5个源组的代码中期训练中,MIRA在九个代码基准测试中优于选择基线,并在仅使用一半token的情况下匹配全语料运行。

英文摘要

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

2605.30039 2026-06-01 cs.AI 版本更新

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

基于最小充分表示学习的大语言模型领域特定数据合成

Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

发表机构 * vivo AI Lab(vivo人工智能实验室) Ant Group(蚂蚁集团) Zhejiang University(浙江大学)

AI总结 提出DOMINO框架,通过对比解耦学习最小充分领域表示,指导生成领域对齐的合成数据,在隐式领域定义下提升微调性能。

Comments Accepted by KDD 2026

详情
AI中文摘要

大语言模型在通用能力上取得了显著进展,并可通过在领域特定数据上微调在特定领域实现强性能。然而,获取目标领域的高质量数据仍是一个重大挑战。现有数据合成方法遵循演绎范式,严重依赖自然语言表达的显式领域描述和精心设计的提示工程,限制了其在领域难以描述或正式表述的现实场景中的适用性。在这项工作中,我们通过归纳范式处理未被充分探索的领域特定数据合成问题,其中目标领域仅通过一组参考示例定义,特别是在领域特征难以用自然语言表述时。我们提出了一种新颖框架DOMINO,它从参考样本中学习最小充分的领域表示,并利用它来指导生成领域对齐的合成数据。DOMINO将提示调优与对比解耦目标相结合,以分离领域级模式与样本特定噪声,在保留核心领域特征的同时缓解过拟合。理论上,我们证明DOMINO扩展了合成数据分布的支持集,确保了更大的多样性。在隐式领域定义的具有挑战性的编码基准上,对DOMINO合成的数据进行微调,在强大的指令调优基线上将Pass@1准确率提高了高达4.63%,证明了其有效性和鲁棒性。这项工作为领域特定数据合成建立了一种新范式,无需手动提示设计或自然语言领域规范即可实现实用且可扩展的领域适应。

英文摘要

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63\% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.

2605.29833 2026-06-01 cs.AI 版本更新

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

OmniMatBench:跨19个材料科学子领域的人类校准多模态推理基准

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang, Jue Wang, Ran Sun, Zhuo Yang, Wanli Ouyang, Lei Bai, Tianfan Fu, Lu Chen, Xin Chen, Yuqiang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Fudan University(复旦大学) Southeast University(东南大学) Nanjing University(南京大学) Suzhou Laboratory(苏州实验室) Shanghai Jiao Tong University(上海交通大学)

AI总结 针对现有基准忽视从材料知识到应用的推理过程,提出OmniMatBench,包含3171个专家策划的问答与计算问题,覆盖19个子领域,评估13个多模态大模型,最佳模型仅得0.372分,揭示当前模型在材料科学推理中的显著差距。

Comments 22 Pages

详情
AI中文摘要

随着多模态语言模型在科学研究中扮演越来越重要的角色,材料科学因其跨学科、多模态和应用驱动的特性而成为一个关键的测试平台。然而,现有的材料基准主要关注属性预测、知识问答或表征理解,而忽略了从材料知识到应用的更广泛推理过程。为填补这一空白,我们提出了OmniMatBench,一个针对材料科学的人类校准多模态推理基准。OmniMatBench包含3171个专家策划的问答和计算问题,涵盖19个材料科学子领域,包括基础材料知识、结构材料与工程材料、材料加工与制造以及功能材料与应用材料。我们评估了13个开源和闭源的多模态大语言模型,发现最佳模型仅获得0.372的总体得分,揭示了当前材料科学推理中的显著差距。进一步分析显示,不同子领域之间存在强烈差异、固定的推理启发式、不均匀的材料知识,以及在公式辅助、检索辅助和代码辅助设置下有限的高级知识应用。OmniMatBench为当前多模态大语言模型的能力和局限性提供了关键见解,并为材料科学研究中可靠的AI助手奠定了基础。

英文摘要

As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materials benchmarks mainly focus on property prediction, knowledge QA, or characterization understanding, leaving the broader reasoning process from materials knowledge to application underexplored. To fill this gap, we present OmniMatBench, a human-calibrated multimodal reasoning benchmark for materials science. OmniMatBench contains 3,171 expert-curated QA and calculation problems across 19 materials-science subfields, spanning fundamental materials knowledge, structural and engineering materials, materials processing and manufacturing, and functional and applied materials. We evaluate 13 open-source and closed-source MLLMs and find that the best model achieves only a 0.372 overall score, revealing a substantial gap in current materials-science reasoning. Further analysis shows strong variation across subfields, fixed reasoning heuristics, uneven materials knowledge, and limited high-level knowledge application under formula-, retrieval-, and code-assisted settings. OmniMatBench provides crucial insights into the capabilities and limitations of current MLLMs and establishes a foundation for reliable AI assistants in materials-science research.

2605.22737 2026-06-01 cs.LG cs.AI 版本更新

The Distillation Game: Adaptive Attacks & Efficient Defenses

蒸馏博弈:自适应攻击与高效防御

Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo, Reza Shokri

发表机构 * Stanford University(斯坦福大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) National University of Singapore(新加坡国立大学)

AI总结 通过最小化博弈框架研究蒸馏攻击中模型提供者的部署权衡,提出自适应评估规则和产品专家(PoE)防御方法,实验表明自适应学生能恢复更多能力,且PoE在成本和质量上具有优势。

详情
AI中文摘要

蒸馏攻击为模型提供者带来了部署权衡:使模型更有用的相同输出也可能使其更容易被模仿。我们通过一个效用受限的教师和自适应学生之间的最小化博弈来研究这种权衡。我们的框架产生了可处理的一侧响应规则:一个自适应评估规则,其中学生重新加权高价值示例,以及一个教师侧防御模板,抑制对蒸馏最有用的输出。从示例价值的廉价代理中,我们推导出产品专家(PoE),一种简单的前向传递防御,在生成过程中将教师与代理学生结合。实验上,自适应评估揭示了一个大的被动-自适应差距:在最先进的防御上,自适应学生在GSM8K和MATH上恢复了比被动评估所建议的更多的能力。在这种更强的评估下,昂贵防御和PoE之间的明显鲁棒性差距显著缩小,而PoE仍然便宜得多,并保留了更高质量的推理轨迹。总体而言,我们的结果表明,强大的蒸馏仍然难以阻止,并且反蒸馏的进展应该根据自适应学生而非被动学生来判断。我们的代码可在:https://github.com/ysfalh/distillation-game 获取。

英文摘要

Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive--adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation-game.

2605.29299 2026-06-01 cs.CV cs.AI 版本更新

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

口袋牙医:通过高效多模态大语言模型实现设备端牙科图像理解

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan, Yiran Shen, Ting Dang, Hong Jia

发表机构 * The University of Auckland, New Zealand(奥克兰大学) Shandong University, China(山东大学) The University of Melbourne, Australia(墨尔本大学)

AI总结 提出Pocket-Dentist基准,通过评估14种视觉语言模型发现紧凑模型(2B参数)在牙科图像理解中精度更高且计算成本更低,并在iPhone 17 Pro上实现低延迟部署。

详情
AI中文摘要

牙科视觉语言模型的评估在数据集、任务定义和指标上仍然分散,并且常常忽略其计算成本。这限制了它们在专科中心之外的广泛部署用于牙科筛查,而及时推理、有限的硬件以及对患者图像的本地处理对于实用、保护隐私的临床预筛查至关重要。本文提出了Pocket-Dentist,一个面向牙科多模态问答的效率感知基准,它汇集了三个数据集,涵盖约1159名患者、五种任务类型和七种指标。在典型的14种VLM上,我们的结果揭示了一个有趣的观察:紧凑型VLM(例如2B参数模型)在牙科图像理解中精度更高,同时所需计算成本大幅降低。在iPhone 17 Pro上本地部署时,我们微调的紧凑型VLM Pocket-Dentist-2B处理每个样本耗时4.31秒,与7B基线相比延迟降低4.9倍,内存使用减少2.3倍。

英文摘要

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients, five task types and seven metrics. Across typical 14 VLMs, our results reveals an interesting observation: compact VLMs (e.g., 2B-parameter models) outperform larger VLMs in accuracy while requiring substantially lower computational costs in dental image understanding. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9-fold and memory use by 2.3-fold compared with a 7B baseline.

2605.29268 2026-06-01 cs.CL cs.AI cs.LG cs.NE 版本更新

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

进化搜索中的计算分配:从深度-广度到多臂老虎机

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang, Haozheng Luo, Tianfan Fu, Aarthy Nagarajan

发表机构 * University of Notre Dame(诺丁汉大学) Northeastern University(东北大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Southeast University(东南大学) Northwestern University(西北大学) Nanjing University(南京大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 针对LLM引导的进化搜索中固定预算的LLM调用分配问题,提出基于多臂老虎机的BaSE方法,通过跨并行轨迹分配调用,平均适应度提升12.3%。

详情
AI中文摘要

LLM引导的进化搜索(Evolve系统)在数学和组合任务上达到了最先进的结果,但现有系统通常只报告多次运行中的最佳结果,而未记录运行间的分布。我们询问如何分配固定的LLM调用预算,以及单次运行达到报告数字的可靠性如何。通过扫描五个模型和三个任务的深度-广度网格,我们识别出两个经验规律:一个适应度-计算包络线,其中能力排序主要取决于有效FLOPs;以及一个双线性深度-广度拟合,具有任务特定的交互;两者都受模型-任务能力门控。受这些规律启发,我们提出BaSE(基于老虎机的自进化),一种多臂老虎机,它在并行轨迹间分配LLM调用。在不改变模型、提示或评估器的情况下,BaSE在8个(模型,任务)单元上比最强的岛屿协议基线平均适应度提高12.3%,在方差高的设置上增益最大:仅通过分配实现可靠性提升。

英文摘要

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.

2605.29146 2026-06-01 cs.CL cs.AI 版本更新

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

SafeRx-Agent: 基于知识的多智能体框架用于安全且可解释的药物推荐

Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu, Qincheng Lu, Ziyu Zhao, Jijun Chi, Jingrui Tian, Xiao-Wen Chang, Ziyang Song

发表机构 * McGill University(麦吉尔大学) McMaster University(麦马斯特大学) University of Toronto(多伦多大学) Ohio University(俄亥俄大学)

AI总结 提出SafeRx-Agent,一种基于知识的多智能体框架,通过患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合,在MIMIC-III和MIMIC-IV数据集上提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

详情
AI中文摘要

药物推荐预测患者就诊时的用药,但现有方法仍面临两个关键挑战。在模型层面,传统药物推荐方法仅预测结构化的药物代码,证据基础有限,而LLM智能体可以利用更丰富的临床上下文,但可能缺乏安全验证和可追溯性。在任务层面,现有基准通常使用宽泛的药物类别,忽略了亚组级别的安全性差异,可能导致风险高估。我们引入了基于第四级ATC代码生成的第一个细粒度药物推荐设置。我们提出了安全处方智能体(SafeRx-Agent),一种基于知识的多智能体框架,利用患者上下文、外部临床知识和安全验证来推荐可追溯的药物集合。在MIMIC-III和MIMIC-IV数据集上的实验结果表明,SafeRx-Agent提高了细粒度药物预测准确性,同时控制了药物相互作用、禁忌症和药物集合大小。

英文摘要

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

2605.28918 2026-06-01 cs.LG cs.AI cs.IR 版本更新

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

当LLM奖励设计失败时:面向诊断的稀疏结构化RL改进

Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu, Dingyan Shang

AI总结 针对稀疏结构化强化学习任务,提出诊断驱动的迭代奖励函数改进方法,通过训练诊断和失败模式分类指导修正,显著提升MiniGrid任务成功率。

详情
AI中文摘要

对于具有语义奖励函数接口的稀疏结构化强化学习任务,LLM生成的奖励塑造更适合被视作调试而非一次性生成。我们使用MiniGrid作为核心评估、MuJoCo作为边界压力测试,研究PPO训练的智能体。我们的审计发现两种主要的一次性失败模式——奖励泛滥和语义/API误解,以及一种较罕见的弱塑造情况。我们提出诊断驱动的迭代改进,其中训练诊断和失败模式分类法指导有针对性的奖励函数修订。改进使DoorKey-8x8从2.3%提升至97.6%,KeyCorridor从31.2%提升至86.7%,但种子间方差较高。控制实验表明这些提升并非来自重试或额外训练:仅指标重新提示导致大幅下降,而静态词汇控制恢复了大部分差距(87.6%;70.7%),表明分类法提示是主要机制,动态标签仅提供部分孤立的增量证据。预算匹配和Best-of-3比较将改进与选择和训练时间效应分离。组件移除测试、敏感性分析以及针对作者标签的审计为调试解释提供了汇聚证据,同时揭示了校准限制。连续控制结果显示了边界:基于成功的诊断可能在密集奖励的 locomotion 中误报,而回报趋势反馈移除了一个假阳性机制但未带来稳健提升。低调用协议是与基于种群的奖励搜索的成本对比,而非基准比较。在四个交叉方差设计环境中,点估计表明当LLM奖励函数方差占主导时收益更大,但bootstrap区间较宽。该方法局限于PPO下具有可靠接口的稀疏结构化任务;event_text等字段可能有益、有害或中性。

英文摘要

For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than one-shot generation. We study PPO-trained agents using MiniGrid as core evaluation and MuJoCo as boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves DoorKey-8x8 from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7% with high seed-to-seed variance. Controls show these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing the taxonomy prompt is a major mechanism and dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields like event_text may help, hurt, or be neutral.

2605.28916 2026-06-01 astro-ph.IM cs.AI cs.HC 版本更新

First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

应用于爱因斯坦望远镜模拟数据分析的智能体AI首次头对头比较

Gianluca Inguglia

发表机构 * Anthropic OpenAI

AI总结 本文首次直接比较了Claude Code和Codex两种智能体AI系统在无人干预下自主执行引力波数据分析管线的行为、科学结果和计算成本,揭示了速度与可审计性、指令解释差异等关键问题。

Comments Version 2; includes the report autonomoulsy written in PRD style by agentic AI systems as supplemental material

详情
AI中文摘要

我们报告了两种最先进的智能体AI系统——Claude Code (Anthropic) 和 Codex (OpenAI) 的比较,它们被要求在共享计算基础设施上无人干预地自主执行一个简单的端到端引力波数据分析管线。该管线包括:从爱因斯坦望远镜模拟噪声中估计功率谱密度、生成几何模板库、对100个双黑洞信号注入进行匹配滤波恢复、自动生成结果,以及在大语言模型辅助下制作以Physical Review D格式排版的手稿。两个智能体均收到相同的书面规范和相同的计算资源。实验进行了两次:第一次使用不切实际的高信噪比注入,第二次将信号重新缩放到物理合理的信噪比范围。两次实验的科学结果均收敛。然而,智能体表现出截然不同的行为和计算成本:Claude Code在约3.4分钟内完成管线,但存在对规范的无声偏差;而Codex需要约16分钟,经历了明确的自我纠正重启,包括对匹配滤波内循环进行未经请求的性能优化。自主生成的手稿在长度、细节和质量上也存在差异。在第二次实验中,对信噪比范围指令解释的细微差异导致了真正的科学分歧:Claude Code无声地重新解释了指令,而Codex严格遵循了规范。我们讨论了这些行为差异(例如速度与可审计性、无声与透明的错误处理、指令解释以及多模型管线中中间数据表示的关键性)对智能体AI在科学计算工作流中部署的影响。

英文摘要

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.

2605.28836 2026-06-01 cs.CL cs.AI 版本更新

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

不让任何读者掉队:人人能理解的多智能体摘要

Jimin Jung, MyoungJin Kim, Jaehyung Seo, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University(韩国大学计算机科学与工程系) Department of Computer Science and Engineering, Konkuk University(konkuk大学计算机科学与工程系)

AI总结 提出NRLB多智能体框架,通过模拟三类读者群体并结合模板规划与迭代优化,生成既忠实又易于理解的平实语言摘要。

详情
AI中文摘要

美国的《平实语言法案》要求政府文件使用清晰、简单的语言,以便公众易于理解,但现有的摘要系统难以应对普通读者中多样化的语言和认知障碍。我们提出了NRLB(不让任何读者掉队),一个用于平实语言摘要的多智能体框架,它模拟了三类代表性读者群体:小学生读者、非母语读者和注意力缺陷读者。NRLB结合了基于模板的规划与迭代的、面向读者的优化,能够系统地检测和解决难懂术语、缺失上下文和令人困惑的句子。在多个数据集上的评估显示,在保持事实准确性的同时,可读性持续提升。人工评估进一步验证了NRLB的效果,标注者偏好率在55%到76%之间,突显了NRLB在生成既忠实于原文又广泛适用于公众的平实语言摘要方面的潜力。

英文摘要

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

2605.25134 2026-06-01 cs.LG cs.AI 版本更新

Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate

重参数化、权重衰减和自适应学习率下稀疏优化的理论分析

Huangyu Xu, Jingqin Yang, Qianqian Xu, Jiaye Teng

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(人工智能安全国家重点实验室,计算技术研究所,中国科学院,北京,中国) School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学计算机科学与技术学院,北京,中国) Beijing Academy of Artificial Intelligence (BAAI), Beijing, China(北京人工智能研究院(BAAI),北京,中国) IIIS, Tsinghua University, Beijing, China(清华大学人工智能院,北京,中国) School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China(上海财经大学统计与管理学院,上海,中国) Institute of Data Science and Statistics, Shanghai University of Finance and Economics, Shanghai, China(上海财经大学数据科学与统计研究所,上海,中国)

AI总结 针对稀疏优化中的不稳定问题,提出基于重参数化、权重衰减和自适应学习率的ReWA方法,通过改善优化景观实现比ℓ1正则化更好的稀疏性,同时保持测试精度。

Comments 32 pages, 5 figures. Submitted to ICML 2026

详情
AI中文摘要

稀疏优化是各种实际应用中的一个基本挑战。一种流行的稀疏优化方法是ℓ_p正则化。然而,当0<p<1时,由于无界梯度,它可能遇到优化不稳定性。在本文中,我们介绍了一种新的稀疏优化方法,称为ReWA,它基于重参数化、权重衰减和自适应学习率。ReWA与ℓ_p正则化密切相关,但它揭示了一个不同的优化景观,有助于缓解不稳定性问题。在CIFAR-10和ImageNet上使用ResNets进行的实验表明,与ℓ_1正则化方法相比,ReWA在保持测试精度的同时显著提高了稀疏性。

英文摘要

Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ regularization. However, it may encounter optimization instability due to the unbounded gradients when $0<p<1$. In this paper, we introduce a novel approach to sparse optimization termed ReWA, based on Reparameterization, Weight decay, and Adaptive learning rate. ReWA is closely connected to $\ell_p$-regularization, yet it unveils a distinct optimization landscape that helps mitigate instability issues. Experiments on CIFAR-10 and ImageNet with ResNets demonstrate that ReWA leads to significant sparsity improvements over the $\ell_1$-regularization approach while preserving test accuracy.

2603.27052 2026-06-01 cs.CY cs.AI 版本更新

Multi-Level Barriers to Generative AI Adoption Across Disciplines and Professional Roles in Higher Education

高等教育中跨学科与职业角色采用生成式AI的多层次障碍

Jianhua Yang, Kerem Öge, Adrian von Mühlenen, Abdullah Bilal Akbulut, Tanya Suzanne Carey, Chidi Okorro

发表机构 * Warwick Manufacturing Group, The University of Warwick(沃里克大学制造集团) Department of Politics and International Studies, The University of Warwick(沃里克大学政治与国际研究系) Department of Psychology, The University of Warwick(沃里克大学心理学系) Birmingham Business School, The University of Birmingham(伯明翰大学商学院)

AI总结 通过对一所罗素集团大学272名学术与专业服务人员的多方法调查分析,揭示了非STEM学术人员主要报告与学术诚信相关的伦理文化障碍,而STEM和专业服务人员则强调制度、治理和基础设施约束,表明GenAI采用障碍深嵌于组织生态系统和认知规范中。

Comments 21 pages, 3 figures, 6 tables

详情
Journal ref
Educ. Sci. 2026, 16(6), 838;
AI中文摘要

生成式人工智能(GenAI)正在迅速重塑高等教育,但不同学科和机构角色间采用GenAI的障碍仍未得到充分探索。现有文献常将采用障碍归因于个体层面的因素,如感知有用性和易用性。本研究转而调查这些障碍是否由结构产生。通过对一所罗素集团大学的272名学术和专业服务人员进行多方法调查分析,我们考察了学科背景和机构角色如何塑造感知障碍。通过整合多项逻辑回归(MLR)、结构方程模型(SEM)和开放式回答的语义聚类,我们超越了描述性叙述,提供了GenAI采用的多层次解释。我们的发现揭示了清晰、系统的差异:非STEM学术人员主要报告与学术诚信相关的伦理和文化障碍,而STEM和专业服务人员则不成比例地强调制度、治理和基础设施约束。我们得出结论,GenAI采用障碍深嵌于组织生态系统和认知规范中,表明大学必须超越通用培训,开发针对特定角色的治理和支持框架。

英文摘要

Generative Artificial Intelligence (GenAI) is rapidly reshaping higher education, yet barriers to its adoption across different disciplines and institutional roles remain underexplored. Existing literature frequently attributes adoption barriers to individual-level factors such as perceived usefulness and ease of use. This study instead investigates whether such barriers are structurally produced. Drawing on a multi-method survey analysis of 272 academic and professional services (PSs) staff at a Russell Group university, we examine how disciplinary contexts and institutional roles shape perceived barriers. By integrating multinomial logistic regression (MLR), structural equation modelling (SEM), and semantic clustering of open-ended responses, we move beyond descriptive accounts to provide a multi-level explanation of GenAI adoption. Our findings reveal clear, systematic differences: non-STEM academics primarily report ethical and cultural barriers related to academic integrity, whereas STEM and PSs staff disproportionately emphasize institutional, governance, and infrastructure constraints. We conclude that GenAI adoption barriers are deeply embedded in organizational ecosystems and epistemic norms, suggesting that universities must move beyond generalized training to develop role-specific governance and support frameworks.

2602.10388 2026-06-01 cs.CL cs.AI 版本更新

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

少即是多:利用稀疏自编码器在LLM特征空间中合成多样化数据

Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu

发表机构 * Department of Computing, University of Georgia, Georgia, United States(佐治亚大学计算机系) Computer Engineering, University of California San Diego, California, United States(加州大学圣地亚哥分校计算机工程系) Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates(Mohamed bin Zayed人工智能大学机器学习系) Department of Computing, Hong Kong Polytechnic University, Hong Kong, China(香港理工大学计算机系)

AI总结 提出基于稀疏自编码器的特征激活覆盖率(FAC)指标及数据合成框架FAC Synthesis,通过识别缺失特征并生成对应样本来提升数据多样性和下游任务性能。

详情
AI中文摘要

后训练数据的多样性对于大型语言模型(LLM)的有效下游性能至关重要。许多现有的后训练数据构建方法使用基于文本的指标来衡量多样性,这些指标捕捉语言变化,但此类指标仅能为决定下游性能的任务相关特征提供微弱信号。在这项工作中,我们引入了特征激活覆盖率(FAC),该指标在可解释的特征空间中衡量数据多样性。基于此指标,我们进一步提出了一个多样性驱动的数据合成框架,名为FAC Synthesis,该框架首先使用稀疏自编码器从种子数据集中识别缺失特征,然后生成明确反映这些特征的合成样本。实验表明,我们的方法在包括指令遵循、毒性检测、奖励建模和行为引导在内的各种任务上,持续提高了数据多样性和下游性能。有趣的是,我们识别出跨模型家族(即LLaMA、Mistral和Qwen)共享的可解释特征空间,从而实现了跨模型知识迁移。我们的工作为探索以数据为中心的LLM优化提供了坚实且实用的方法论。

英文摘要

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

2509.21190 2026-06-01 cs.LG cs.AI 版本更新

Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy

面向零样本时间序列异常检测的基础模型:利用合成数据和相对上下文差异

Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

发表机构 * Department of Industrial Engineering, Tsinghua University, Beijing, China(清华大学工业工程系) Datadog AI Research, Paris, France. This work was completed prior to joining Datadog(Datadog AI 研究院) Lab, Huawei Technologies, ShenZhen, China(华为技术2012实验室)

AI总结 提出基于相对上下文差异(RCD)的预训练范式,通过合成数据训练Transformer模型比较查询模式与上下文,实现零样本时间序列异常检测,在多个基准上超越现有基础模型。

Comments This manuscript is withdrawn, as the authors intend to further extend and develop the work beyond its current scope

详情
AI中文摘要

时间序列异常检测(TSAD)是一项关键任务,但开发能够以零样本方式泛化到未见数据的模型仍然具有挑战性。现有的TSAD基础模型通常依赖推理时的重构误差评分,这可能会遗漏重构良好的细微异常,并可能错误地标记未见领域中复杂但正常的模式。我们引入了TimeRCD,这是一个基于相对上下文差异(RCD)构建的TSAD基础模型,RCD是一种预训练范式,通过比较查询模式与其周围上下文来训练模型检测异常。这种关系公式通过标准Transformer架构实现,使模型能够从输入上下文中推断正常性,而不是依赖固定的全局正常模式。我们进一步构建了一个大规模合成语料库,其中包含上下文相关的异常标签,为RCD提供监督预训练信号。跨多个基准的实验表明,在大多数零样本TSAD设置中,TimeRCD优于现有的通用和异常特定基础模型,同时与数据集特定的全样本基线保持竞争力。这些结果提供了实证证据,表明RCD是构建鲁棒且可泛化的TSAD模型的有效方向。

英文摘要

Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains challenging. Existing foundation models for TSAD often rely on reconstruction-error scoring at inference time, which can miss subtle anomalies that are well reconstructed and can falsely flag complex but normal patterns in unseen domains. We introduce TimeRCD, a foundation model for TSAD built on Relative Context Discrepancy (RCD), a pre-training paradigm that trains the model to detect anomalies by comparing a query pattern with its surrounding context. This relational formulation, implemented with a standard Transformer architecture, enables the model to infer normality from the input context rather than relying on fixed global normal patterns. We further construct a large-scale synthetic corpus with context-dependent anomaly labels to provide supervised pre-training signals for RCD. Experiments across diverse benchmarks show that TimeRCD outperforms existing general-purpose and anomaly-specific foundation models in most zero-shot TSAD settings, while remaining competitive with dataset-specific full-shot baselines. These results provide empirical evidence that RCD is an effective direction for building robust and generalizable TSAD models.

2605.25842 2026-06-01 cs.AI cs.CL 版本更新

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

MuCRASP: 多模态思维链推理感知的结构化剪枝

Aritra Dutta, Somak Aditya

发表机构 * Indian Institute of Technology, Kharagpur(印度理工学院,哈里科普尔)

AI总结 针对视觉语言模型在结构化剪枝后思维链推理准确性下降的问题,提出MuCRASP框架,通过识别推理关键令牌并保持跨模态对齐,在压缩下维持推理质量。

Comments Preprint ver. 2

详情
AI中文摘要

视觉语言模型(VLM)越来越依赖思维链(CoT)推理来解决复杂的多模态任务,但其庞大的参数量使得部署成本高昂。结构化剪枝提供了一种自然的解决方案;然而,现有方法无法在VLM中保持CoT推理的准确性。我们确定了两个关键原因:(1)CoT一致性依赖于生成轨迹中的稀疏过渡点(枢轴令牌),而现有剪枝方法对CoT不敏感;(2)为单模态LLM设计的剪枝方法未考虑视觉和文本模态之间的激活分布差异。基于这些观察,我们提出了MuCRASP,一种结构化剪枝框架,针对推理关键组件,同时保持跨模态对齐并在全局参数预算下考虑层间敏感性。在三个推理基准测试上的四个VLM实验表明,MuCRASP在不断增加压缩的情况下始终能保持推理质量。在Qwen2.5-VL-7B上剪枝30%时,MuCRASP在物理推理任务上获得了8.87的LLM-as-a-Judge评分,而最强基线为7.32。此外,MuCRASP在高达50%的剪枝率下仍保持高推理一致性,显著优于先前的剪枝方法,同时表现出更低的困惑度退化。

英文摘要

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.

2605.25773 2026-06-01 stat.ML cs.AI cs.CL cs.LG 版本更新

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

高效基准测试仅是特征选择与多元回归

Sam Bowyer, Acyr Locatelli, Kris Cao

发表机构 * Cohere University of Bristol(布里斯托大学)

AI总结 将高效基准测试重新定义为带特征选择的多元回归问题,使用核岭回归预测和mRMR特征选择算法,在降低计算成本的同时提高预测精度和排名相关性。

Comments 36 pages, 27 figures

详情
AI中文摘要

高效基准测试技术旨在通过仅使用基准测试问题子集预测完整基准测试分数,从而降低评估LLMs的计算成本。通过将此问题重新定义为带特征选择的多元回归实例,我们发现只需在预测阶段使用核岭回归即可大幅改进现有高效基准测试方法。此外,使用一种名为最小冗余最大相关性(mRMR)的信息论特征选择算法,我们可以通过选择对预测最有用的问题子集进一步改进这些方法。除数据非常匮乏的情况外,这些方法在二元和连续指标的各种基准测试中,始终实现更小的预测误差(MAE和RMSE),以及预测分数与真实分数之间更大的排名相关性(Spearman ρ和Kendall τ)。此外,mRMR子采样比竞争方法(通常涉及拟合概率模型或运行聚类算法)快得多,并且在不同随机种子或训练数据划分下更可能选择相同的问题。教程代码见https://github.com/sambowyer/mrmr_eval。

英文摘要

Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman $ρ$ and Kendall $τ$) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .

2503.07482 2026-06-01 cs.LG cs.AI 版本更新

How does Bayesian Sampling help Membership Inference Attacks?

贝叶斯采样如何帮助成员推断攻击?

Zhenlong Liu, Wenyu Jiang, Feng Zhou, Hongxin Wei

发表机构 * Department of Statistics and Data Science, Southern University of Science and Technology(统计与数据科学系,南方科技大学) Shanghai Innovation Institute(上海创新研究院) School of Computer Science, Nanjing University(南京大学计算机科学系) Center for Applied Statistics and School of Statistics, Renmin University of China(应用统计中心和统计学系,中国人民大学)

AI总结 提出贝叶斯成员推断攻击(BMIA),通过拉普拉斯近似对单个参考模型进行贝叶斯采样以估计条件分数分布,理论证明降低模型内方差从而提升攻击性能,并在多模态数据集上实现最先进的效果与效率。

Comments Accepted to ICML 2026

详情
AI中文摘要

成员推断攻击(MIAs)旨在估计特定数据点是否用于给定模型的训练。现有的最先进攻击通常依赖于训练多个参考模型来近似单个数据点的条件分数分布,这导致显著的计算开销并限制了其实际适用性。在这项工作中,我们提出了一种新颖的方法——贝叶斯成员推断攻击(BMIA),通过贝叶斯采样执行条件攻击。具体来说,我们对单个参考模型应用拉普拉斯近似以获得模型参数的后验分布,从而能够直接估计条件分数分布。理论上,我们证明了贝叶斯采样降低了模型内方差,从而提高了攻击能力。这一见解自然地激发了多参考变体,当有额外的参考模型可用时,该变体进一步提升了性能。在图像、文本和表格数据集上的大量实验表明,我们的方法在有效性和效率方面均达到了最先进的性能。

英文摘要

Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Existing state-of-the-art attacks typically rely on training multiple reference models to approximate the conditional score distribution for individual data points, which leads to significant computational overhead and limits their practical applicability. In this work, we propose a novel approach -- Bayesian Membership Inference Attack (BMIA), which performs conditional attack through Bayesian sampling. Specifically, we apply Laplace approximation to a single reference model to obtain a posterior over model parameters, enabling direct estimation of the conditional score distribution. Theoretically, we demonstrate that Bayesian sampling reduces intra-model variance, thereby improving attack power. This insight naturally motivates the multi-reference variant that further enhances performance when additional reference models are available. Extensive experiments across image, text, and tabular datasets indicate that our method achieves state-of-the-art performance in both effectiveness and efficiency.

2603.24254 2026-06-01 cs.LG cs.AI 版本更新

Beyond Static Uncertainty: Modeling Temporal Uncertainty Dynamics for Probabilistic Time Series Forecasting

超越静态不确定性:为概率时间序列建模时间不确定性动态

Yijun Wang, Qiyuan Zhuang, Larysa Marchanka, Xiu-Shen Wei

发表机构 * Department of Computer Science, Southeast University(东南大学计算机科学系) Francisk Skorina Gomel State University(弗拉基米尔·斯科里纳戈梅尔州立大学)

AI总结 提出VolDy-VAE模型,通过循环尺度路径捕捉波动率动态,实现时间一致的概率预测,提升准确性和不确定性校准。

详情
AI中文摘要

现实世界的时间序列表现出时间结构化的不确定性:波动率在动荡时期聚集,在稳定时期消散,并在结构断裂处突然变化。然而,许多概率预测方法将预测不确定性估计为独立的逐点量,忽略了波动率机制的演变和持续性。我们将这一缺失维度形式化为时间不确定性动态,并在波动率动态变分自编码器(VolDy-VAE)中实例化它,这是一个具有位置-尺度解码器的非自回归生成预测器。VolDy-VAE结合了用于均值预测的位置路径和用于传递和演化波动率隐藏状态的循环尺度路径,该状态从回溯窗口转移到预测范围,从而实现时间一致的预测方差。这种设计产生了一种自适应衰减机制:高方差观测值对位置估计的影响较小,而其不确定性通过明确的尺度预测得以保留。我们进一步提供了一个简化的机制转换分析,表明当方差已知或一致估计时,波动率感知目标简化为逆方差加权,而基于MSE的估计量保持无偏但统计效率较低。在九个基准上的实验表明,VolDy-VAE在保持低推理延迟的同时,提高了预测准确性和不确定性校准,优于竞争的概率和点预测基线;插件研究进一步表明,VolDy原理可以有益于GAN、Koopman VAE和Transformer骨干网络。源代码公开于https://github.com/wangyijunlyy/VolDy-VAE。

英文摘要

Real-world time series exhibit temporally structured uncertainty: volatility clusters in turbulent regimes, dissipates in stable periods, and shifts abruptly around structural breaks. Yet many probabilistic forecasting methods estimate predictive uncertainty as an independent per-step quantity, leaving the evolution and persistence of volatility regimes under-modeled. We formalize this missing dimension as temporal uncertainty dynamics and instantiate it in the Volatility Dynamics Variational Autoencoder (VolDy-VAE), a non-autoregressive generative forecaster with a location-scale decoder. VolDy-VAE combines a location path for mean prediction with a recurrent scale path that transfers and evolves a volatility hidden state from the look-back window to the forecasting horizon, enabling temporally coherent predictive variances. This design yields an adaptive attenuation mechanism: high-variance observations receive lower influence on the location estimate while their uncertainty is preserved through explicit scale predictions. We further provide a simplified regime-switching analysis showing that, when variances are known or consistently estimated, the volatility-aware objective reduces to inverse-variance weighting, whereas MSE-based estimators remain unbiased but statistically inefficient. Experiments on nine benchmarks show that VolDy-VAE improves forecasting accuracy and uncertainty calibration over competitive probabilistic and point-forecasting baselines while maintaining low inference latency; plug-in studies further indicate that the VolDy principle can benefit GAN, Koopman VAE, and Transformer backbones. The source code is publicly available at https://github.com/wangyijunlyy/VolDy-VAE.

2605.23937 2026-06-01 cs.AI cs.LG cs.LO math.OC 版本更新

BoxLitE: A Faithful Knowledge Base Embedding Based on Convex Optimization

BoxLitE:基于凸优化的忠实知识库嵌入

Bruno F. Lourenço, Hesham Morgan, Ana Ozaki, Aleksandar Pavlović, Emanuel Sallinger

发表机构 * The Institute of Statistical Mathematics, Japan(日本统计数学研究所) TU Wien, Austria(奥地利技术大学维也纳分校) University of Oslo, Norway(挪威奥斯陆大学) University of Applied Sciences Campus Vienna, Austria(奥地利应用科学大学维也纳校区)

AI总结 提出BoxLitE模型,通过凸优化实现DL-Lite$^{\mathcal{H}}$知识库的忠实嵌入,确保可满足知识库存在弱忠实模型。

Comments 28 pages. Full version of paper accepted to KR 2026 (23nd International Conference on Principles of Knowledge Representation and Reasoning). Track: KR meets Machine Learning and Explanation. Added a figure and some minor changes

详情
AI中文摘要

知识库(KB)嵌入旨在结合经典知识图谱嵌入在事实(ABox)中泛化信息的能力与本体语言(TBox)表示的概念知识。多位作者最近探索了将概念映射到向量空间中凸区域的思想。这对于表示TBox中通常存在的层次结构很有用,因为更一般的概念可以映射到更大的区域,包含与更具体概念相关的区域。然而,在实际学习任务中,凸性的能力很少被利用。在这里,我们引入了BoxLitE,一个针对DL-Lite$^{\mathcal{H}}$的KB嵌入模型,允许凸优化。我们证明,对于任何可满足的DL-Lite$^{\mathcal{H}}$ KB,存在一个BoxLitE嵌入,它是一个弱忠实模型。作为概念验证,我们展示了如何将KB嵌入任务表述为凸优化问题,以及如何获得具有这种理想忠实性属性的嵌入。

英文摘要

Knowledge base (KB) embeddings aim at combining the capability of classical knowledge graph embeddings to generalize the information present in facts, the ABox, with conceptual knowledge represented in an ontology language, the TBox. Several authors have recently explored the idea of mapping concepts to convex regions in a vector space. This is useful to represent hierarchies, typically present in TBoxes, since more general concepts can be mapped to larger regions, containing those regions associated with more specific concepts. However, the power of convexity is rarely leveraged during the actual learning tasks. Here, we introduce BoxLitE, a KB embedding model for DL-Lite$^{\mathcal{H}}$ that allows for convex optimization. We show that for any satisfiable DL-Lite$^{\mathcal{H}}$ KB, there is a BoxLitE embedding that is a weakly faithful model. As a proof of concept, we show how to formulate the KB embedding task as a convex optimization problem and how to obtain embeddings with such desirable faithfulness properties.

2605.21470 2026-06-01 cs.LG cs.AI 版本更新

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

面向延迟优化的Web Agent规划与调度的Agent即时编译

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini, Christos Kozyrakis

发表机构 * Stanford University(斯坦福大学)

AI总结 提出Agent即时编译系统,通过JIT-Planner生成代码计划、JIT-Scheduler探索并行化策略及不变式工具协议,显著降低延迟并提高准确性。

Comments Accepted at ICML 2026

详情
AI中文摘要

计算机使用Agent通过生成对浏览器中点击、输入、滚动等工具的调用序列,自动化自然语言指定的任务,例如“从Taco Bell订购最便宜的商品”。当前实现遵循顺序的获取截图-执行循环,每次迭代需要一次LLM调用,导致高延迟和因工具使用错误而频繁出错。我们提出了Agent即时编译系统,该系统将任务描述直接编译为可执行代码,其中可能包含LLM调用、工具调用和并行化。我们的方法包括三个组件:(1)JIT-Planner,生成多个代码计划,根据工具规范验证每个计划,并选择最小成本候选;(2)JIT-Scheduler,通过从学习到的延迟分布进行蒙特卡洛成本估计,探索并行化策略;(3)不变式强制工具协议,指定前置条件和后置条件要求,以减少工具使用错误率。在五个应用中,JIT-Planner相比Browser-Use实现了10.4倍的加速和28%的更高准确率,而JIT-Scheduler相比OpenAI CUA实现了2.4倍的加速和9%的更高准确率。

英文摘要

Computer-use agents (CUAs) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, a system that compiles task descriptions directly into executable code that may include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying precondition and postcondition requirements to reduce the rate of incorrect tool use. Across five applications, JIT-Planner achieves $10.4\times$ speedup and 28$\%$ higher accuracy over Browser-Use, while JIT-Scheduler achieves $2.4\times$ speedup and 9\% higher accuracy over OpenAI CUA.

2605.21108 2026-06-01 cs.LG cs.AI 版本更新

Efficient Learning of Deep State Space Models via Importance Smoothing

通过重要性平滑高效学习深度状态空间模型

John-Joseph Brady, Nikolas Nusken, Yunpeng Li

发表机构 * Centre for Oral, Clinical and Translational Sciences, King's College London, London, United Kingdom(口腔、临床与转化科学中心,伦敦国王学院,伦敦,英国) Department of Mathematics, King's College London, London, United Kingdom(数学系,伦敦国王学院,伦敦,英国)

AI总结 提出并行变分蒙特卡洛(PVMC)方法,结合变分推断和序贯蒙特卡洛,实现深度状态空间模型在判别与生成任务上的高效训练,速度提升10倍。

Comments Accepted to the proceedings of ICML 2026

详情
AI中文摘要

潜在状态空间系统在统计建模中无处不在,当通过噪声观测时间序列时自然出现。然而,大规模训练深度状态空间模型(DSSM)仍然困难。训练DSSM出现了两种截然不同的策略。第一种是自编码DSSM,通过优化变分下界来训练生成模型。第二种是通过经典序贯蒙特卡洛(SMC)算法的输出进行反向传播。这些方法可以训练DSSM用于判别和生成任务,但其固有的顺序前向传递在现代硬件上扩展性差。我们提出了并行变分蒙特卡洛(PVMC),一种新的训练方法,它桥接了这些范式,并稳健地训练DSSM用于判别和生成任务。在一组基准实验中,PVMC达到或超过了最先进的性能,同时训练速度比最快的竞争SMC方法快10倍。

英文摘要

Latent state space systems are ubiquitous in statistical modelling, arising naturally when time series are observed through noisy measurements. However, training deep state space models (DSSMs) at scale remains difficult. Two largely distinct strategies have emerged for training DSSMs. The first, auto-encoding DSSMs, trains generative models by optimising a variational lower bound. The second backpropagates through the outputs of classical sequential Monte Carlo (SMC) algorithms. Such approaches can train DSSMs for both discriminative and generative tasks, but their inherently sequential forward passes scale poorly on modern hardware. We propose \emph{parallel variational Monte Carlo} (PVMC), a new training method that bridges these paradigms and robustly trains DSSMs for both discriminative and generative tasks. Across a set of benchmark experiments, PVMC matches or exceeds state-of-the-art performance while training $10\times$ faster than the fastest competing SMC-based approach.

2605.20873 2026-06-01 cs.AI cs.LG 版本更新

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

PlanningBench: 生成可扩展且可验证的规划数据以评估和训练大型语言模型

Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院 Gallagher 学校) LLM Department, Hunyuan Team, Tencent(腾讯 Hunyuan 团队 LLM 部门) Beijing Academy of Artificial Intelligence(北京人工智能研究院) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出PlanningBench框架,通过约束驱动合成管道生成可扩展、多样化且可验证的规划数据,用于评估和训练LLMs,并验证其在提升规划能力上的有效性。

详情
AI中文摘要

规划是大型语言模型(LLMs)的一项基本能力,因为这类复杂任务要求模型将目标、约束、资源和长期后果协调成可执行且可验证的解决方案。然而,现有的规划基准通常将规划数据视为固定的实例集合,而非可控的生成目标。这限制了场景覆盖范围,将难度与表面代理而非结构来源挂钩,并且对可扩展生成、自动验证或面向规划的训练支持有限。我们引入PlanningBench,一个用于生成可扩展、多样化且可验证的规划数据的框架,既可用于评估也可用于训练。PlanningBench从真实规划场景出发,将实际工作流程抽象为包含30多种任务类型、子任务、约束族和难度因素的结构化分类体系。在该分类体系的指导下,一个约束驱动的合成管道实例化自包含的规划问题,具备自适应难度控制、质量过滤和实例级验证检查表。这将规划数据构建从固定基准收集转变为可控生成,同时保留现实任务基础。我们使用PlanningBench评估开源和闭源前沿LLMs,发现当前模型在耦合约束下仍难以生成完整解决方案。除评估外,在已验证的PlanningBench数据上进行强化学习可提升在未见规划基准和更广泛的指令遵循任务上的性能。进一步分析表明,确定性或明确指定的最优解提供了更清晰的奖励信号和更稳定的训练动态。总体而言,PlanningBench为诊断和提高LLMs中可泛化的规划能力提供了可控的规划数据来源。

英文摘要

Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.

2605.19806 2026-06-01 cs.CL cs.AI 版本更新

Chunking German Legal Code

德国法律文本的分块处理

Max Prior, Natalia Milanova, Andreas Schultz

发表机构 * Technical University of Munich(慕尼黑技术大学)

AI总结 研究针对德国成文法,以德国民法典为基准语料库,比较多种分块策略在检索增强生成中的性能,发现基于法律固有结构(如章节、小节)的分块方法在召回率和计算效率上优于语义增强方法。

Comments Accepted at the Eigth Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 21th International Conference on Artificial Intelligence and Law (ICAIL 2026)

详情
AI中文摘要

本文研究了针对德国成文法的检索增强生成的分块策略,以德国民法典作为结构化基准语料库。我们实现并比较了一系列分割方法,包括结构单元(章节、小节、句子、命题)、固定大小窗口、上下文分块、语义聚类、Lumber风格分块以及基于RAPTOR的层次检索。所有方法都在一个具有章节级黄金标签的法律问答数据集上进行评估,测量召回率、查询延迟、索引构建时间和存储需求。结果表明,与固有法律结构对齐的分块策略——特别是基于章节和小节的检索——实现了最高的召回率,而覆盖这种结构的更复杂方法表现更差。与上下文分块、RAPTOR和Lumber等LLM密集型技术相比,这些更简单的方法还提供了有利的计算效率。研究结果突出了语义丰富性与操作成本之间的关键权衡,并证明保留领域特定结构对于有效的法律信息检索至关重要。

英文摘要

This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure - particularly section and subsection - based retrieval-achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.

2605.18807 2026-06-01 cs.LG cs.AI 版本更新

Block-Based Double Decoders

基于块的双解码器

Asher Labovich, Benjamin Bradley, Vanessa Alexander, Chaitanya Harsha

发表机构 * Brown University(布朗大学)

AI总结 提出基于块的双解码器架构,利用双重因果块注意力掩码实现全损失监督和静态序列打包,结合解码器训练效率与编码器-解码器推理效率,在缩放定律实验中优于编码器-解码器并接近解码器模型,推理时KV缓存和每token计算减少至少2/3。

Comments 8 pages main, 13 pages total

详情
AI中文摘要

编码器-解码器模型在推理时间上比仅解码器模型节省大量成本,但其预训练目标存在稀疏监督和动态序列长度的问题,使其难以大规模实践。我们提出了基于块的双解码器,一种新颖的Transformer架构,利用双重因果块注意力掩码进行全损失监督和静态序列打包,结合了解码器训练效率与编码器-解码器推理效率。在缩放定律实验中,基于块的双解码器显著优于编码器-解码器,并在各规模上紧密跟踪仅解码器模型。在推理时,它们在不牺牲预填充缓存或仅解码器模型可用的其他现有推理优化的情况下,将KV缓存内存和每token计算减少至少2/3。

英文摘要

Encoder-decoder models offer substantial inference-time savings over decoder-only models, but their pretraining objectives suffer from sparse supervision and dynamic sequence lengths, keeping them out of practice at scale. We propose block-based double decoders, a novel transformer architecture that utilizes doubly-causal block-based attention masks to train with full loss supervision and static sequence packing, combining decoder-only training efficiency with encoder-decoder inference efficiency. In scaling law experiments, block-based double decoders strongly outperform encoder-decoders and closely track decoder-only models across scales. At inference time, they cut KV-cache memory and per-token compute by at least 2/3 without sacrificing prefill caching or other existing inference optimizations available to decoder-only models.

2605.18803 2026-06-01 cs.LG cs.AI 版本更新

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

PROWL: 基于优先遗憾驱动的世界模型学习优化

Ahmet H. Güzel, Jenny Seidenschwarz, Benjamin Graham, Jonathan Sadeghi, Jeffrey Hawke, Ilija Bogunovic

发表机构 * University College London AI Centre(伦敦大学学院人工智能中心) Odyssey University of Basel(巴塞尔大学)

AI总结 提出一种KL约束的对抗课程,通过训练策略暴露扩散世界模型的高误差轨迹并持续微调,结合优先对抗轨迹缓冲区,解决被动数据中罕见关键转换的鲁棒性问题。

详情
AI中文摘要

现代动作条件视频世界模型在短期视觉真实性上表现强劲,但在罕见且对交互关键的转换上仍不可靠,而这些转换主导了下游规划和策略性能。由于被动演示数据系统性地对这些高影响区域采样不足,提高鲁棒性需要主动引发模型失败,而非依赖其自然发生。我们引入了一种KL约束的对抗课程,其中训练一个策略来暴露基于扩散的世界模型的高误差轨迹,同时保持接近行为分布。世界模型在这些对抗性发现的轨迹上持续微调,形成一个对抗训练循环,将罕见失败转化为稳定的、接近分布的训练信号,而不会漂移到分布外利用。为了在模型改进时持续对未解决的弱点施加压力,我们提出了一种优先对抗轨迹(PAT)缓冲区,该缓冲区根据预测误差、动作保真度和学习进度对轨迹重新排序,将训练集中在未解决的失败模式上,而不是重复访问已解决的案例。我们在MineRL框架中实现了我们的方法,并在保留的分布外轨迹上进行了评估;PROWL提高了相对于仅在被动数据上训练的模型的鲁棒性,揭示了在弱行为约束下的奖励黑客行为,并证明了有效的对抗世界模型训练关键取决于平衡探索性失败发现与显式行为正则化。我们的结果表明,可扩展的世界模型不仅受益于更大的数据集,还受益于选择性生成信息丰富的训练数据。

英文摘要

Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model while remaining close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world-model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.

2605.18024 2026-06-01 cs.LG cs.AI cs.MA 版本更新

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

交互破坏对抗学习框架用于鲁棒多智能体强化学习

Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence, UNIST, Ulsan, South Korea(人工智能研究生院,UNIST,乌山,韩国)

AI总结 提出交互破坏对抗学习框架,从信息论角度构建攻击破坏智能体间交互,并训练智能体在干扰下可靠执行,提升鲁棒性。

Comments 9 pages for main, 33 pages for total, Accepted to ICML 2026

详情
AI中文摘要

合作是多智能体强化学习(MARL)的核心,然而当外部扰动破坏智能体间的交互时,学到的协调可能变得脆弱。先前的鲁棒MARL方法主要考虑面向价值的攻击,在交互结构本身被破坏时存在鲁棒性缺口。在本文中,我们提出一个交互破坏对抗学习(IBAL)框架,该框架从信息论角度构建攻击,通过扰动智能体的观测和动作来阻碍协调,并训练智能体在此类干扰下可靠执行。实验上,我们的方法在多种攻击设置下比现有鲁棒MARL基线具有更好的鲁棒性,甚至在智能体缺失场景下也表现出更强的性能。我们的代码可在 https://sunwoolee0504.github.io/IBAL 获取。

英文摘要

Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction-breaking adversarial learning (IBAL) framework that takes an information-theoretic view to construct attacks that impede coordination by perturbing agents' observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent-missing scenarios. Our code is available at https://sunwoolee0504.github.io/IBAL.

2602.03012 2026-06-01 cs.CR cs.AI 版本更新

CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

CVE-Factory:规模化专家级代码安全漏洞智能体任务

Xianzhen Luo, Jingyuan Zhang, Shiqi Zhou, Jinyang Huang, Chuan Xiao, Qingfu Zhu, Zhiyuan Ma, Xing Yue, Yang Yue, Wencong Zeng, Wanxiang Che

发表机构 * Harbin Institute of Technology (HIT)(哈尔滨工业大学) Central South University (CSU)(中南大学) University of Science and Technology of China (UTSC)(中国科学技术大学)

AI总结 提出CVE-Factory多智能体框架,自动将稀疏CVE元数据转化为可执行的智能体任务,构建持续更新的基准LiveCVEBench和训练数据集,微调模型性能显著提升。

Comments Accepted by ICML2026 Oral

详情
AI中文摘要

评估和改进代码智能体的安全能力需要高质量、可执行的漏洞任务。然而,现有工作依赖昂贵且不可扩展的人工复现,并受限于过时的数据分布。为解决这些问题,我们提出了CVE-Factory,这是首个实现专家级质量的多智能体框架,能够自动将稀疏的CVE元数据转化为完全可执行的智能体任务。与人类专家复现的交叉验证表明,CVE-Factory实现了95%的解决方案正确性和96%的环境保真度,证实了其专家级质量。该框架还在最新的真实漏洞上进行了评估,达到了66.2%的验证成功率。这种自动化带来了两个下游贡献。首先,我们构建了LiveCVEBench,一个持续更新的基准测试,包含190个任务,涵盖14种语言和153个仓库,捕获了包括AI工具漏洞在内的新兴威胁。其次,我们合成了超过1000个可执行的训练环境,这是代码安全领域智能体任务的首次大规模扩展。微调后的Qwen3-32B在LiveCVEBench上的性能从5.3%提升到35.8%,超过了Claude 4.5 Sonnet,并且这些提升泛化到了Terminal Bench(从12.5%到31.3%)。我们开源了CVE-Factory、LiveCVEBench、Abacus-cve(微调模型)、训练数据集和排行榜。所有资源可在https://github.com/livecvebench/CVE-Factory获取。

英文摘要

Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95\% solution correctness and 96\% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2\% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3\% to 35.8\% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5\% to 31.3\%). We open-source CVE-Factory, LiveCVEBench, Abacus-cve (fine-tuned model), training dataset, and leaderboard. All resources are available at https://github.com/livecvebench/CVE-Factory .

2605.17373 2026-06-01 cs.LG cs.AI 版本更新

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

FML-bench:从搜索动力学视角对AI研究代理策略的受控研究

Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu

发表机构 * National University of Singapore(国立新加坡大学) Tsinghua University(清华大学) University of Minnesota(明尼苏达大学) Weco Meta

AI总结 本文提出FML-Bench基准,通过分离策略与基础设施并定义过程级指标,评估六种代理策略,发现贪婪爬山法接近最优树搜索,且自适应策略基于搜索密度切换可超越其他代理。

Comments Our benchmark is available at: https://github.com/qrzou/FML-bench

详情
AI中文摘要

AI研究代理通过自动化假设生成、实验和实证改进来加速机器学习研究。现有代理策略从贪婪爬山法到树搜索和进化优化不等,但哪些策略选择驱动性能仍不清楚。回答这个问题需要一个基准,该基准将代理策略(例如搜索拓扑)与执行基础设施(例如代码编辑器)分离,以便性能差异归因于策略而非基础设施,并提供最终分数之外的过程级指标来分析探索行为。现有基准支持有限。我们提出FML-Bench,一个涵盖10个领域18个基础ML研究任务的基准,将代理策略与执行基础设施分离,并定义了12个过程级行为指标。评估六个代表性代理,我们发现:(1) 策略复杂性本身并不能保证强性能:一个简单的贪婪爬山者几乎与最佳性能的树搜索代理相匹配,两者均远高于其余代理;(2) 我们的分析表明,这种模式与改进机会结构相关:当机会密集时,贪婪搜索往往更有效,而当机会稀疏时,树搜索和进化策略往往更有效;基于这一见解构建的自适应代理在检测到改进停滞时切换到更广泛的探索,并优于其他六个代理,初步支持了这一观察;(3) 过程级分析表明,早期收敛和方向聚焦的探索与最终性能显著相关,而解决方案多样性和计算成本则不然。我们的基准可在 https://github.com/qrzou/FML-bench 获取。

英文摘要

AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.

2605.17101 2026-06-01 cs.CL cs.AI 版本更新

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG: 面向医学推理的自演化多智能体检索增强生成框架

Yongfeng Huang, Ruiying Chen, James Cheng

发表机构 * CSE, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) Wuhan University of Technology(武汉理工大学)

AI总结 针对医学问答中单轮静态检索与临床推理多阶段过程不匹配的问题,提出SEMA-RAG框架,通过任务解耦和动态多轮探索,由三个专业智能体分别负责临床解释、自演化检索和证据裁决,在多个基准上平均提升准确率6.46个百分点。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

检索增强生成(RAG)被广泛用于缓解医学问答中的幻觉和知识过时等风险,但其主要采用单轮静态检索范式,与临床推理的多阶段过程不匹配。这种压缩的工作流导致两个结构性缺陷:问题到查询的转换通常缺乏临床基础的语义解释,且检索缺乏迭代充分性反馈,难以形成可靠的证据链。我们认为这两个问题源于更深层的原因:将解释、探索和裁决等异构任务过载到单一推理链上。解决方案是通过任务解耦和动态多轮探索来重构工作流。为此,我们提出SEMA-RAG,一种用于医学问答的自演化多智能体RAG框架,将这些角色分配给三个专业智能体:解释智能体负责临床模式解释,探索智能体负责充分性驱动的自演化检索,裁决智能体负责证据裁决和答案选择。在五个基准和五个LLM骨干网络上,SEMA-RAG平均比最强基线提高6.46个准确率点(按骨干网络测量)。

英文摘要

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.

2605.16215 2026-06-01 cs.AI cs.CL 版本更新

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

完全开放的Meditron:临床大语言模型的可审计流水线

Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

发表机构 * EPFL(苏黎世联邦理工学院)

AI总结 提出首个完全开放的临床大语言模型构建流水线Fully Open Meditron,通过可审计的数据集、可复现的训练框架和对齐评估协议,在不牺牲可审计性和可复现性的前提下实现了领域最新性能。

Comments Preprint. 31 pages, 10 figures. Code, models, and data: https://github.com/EPFLiGHT/FullyOpenMeditron

详情
AI中文摘要

临床决策支持系统(CDSS)需要可审查、可审计的流水线,以实现严格、可复现的验证。然而,当前基于LLM的CDSS仍然大多不透明。大多数“开放”模型仅开放权重,发布参数的同时隐瞒了决定模型行为的数据来源、整理程序和生成流水线。完全开放(FO)模型暴露完整的训练堆栈,目前在医学领域尚不存在。我们引入了Fully Open Meditron,这是首个用于构建LLM-CDSS的完全开放流水线,包含临床医生审计的训练语料库、可复现的数据构建和训练框架,以及使用对齐的评估协议。该语料库将八个公共医学QA数据集统一为标准化对话格式,并通过三个经临床医生审查的合成扩展扩展了覆盖范围:考试式QA、源自46,469个临床实践指南的指南基础QA以及临床小插曲。该流水线强制执行系统级去污染、教师生成的金标签重采样以及由四位医生小组进行的端到端验证。我们使用LLM-as-a-judge协议对专家撰写的临床小插曲进行评估,并针对204名人类评分者进行校准。我们将该配方应用于五个FO基础模型(Apertus-70B/8B-Instruct、OLMo-2-32B-SFT、EuroLLM-22B/9B-Instruct)。所有MeditronFO变体均优于其基础模型。Apertus-70B-MeditronFO在综合医学基准上比其基础模型提高了+6.6个百分点(从47.2%到53.8%),建立了新的FO SoTA。Gemma-3-27B-MeditronFO在58.6%的LLM-as-a-judge比较中优于MedGemma,并在HealthBench上表现更优(58% vs 55.9%)。这些结果表明,完全开放的流水线可以在不牺牲可审计性或可复现性的情况下实现最先进的领域特定性能。

英文摘要

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

2602.00747 2026-06-01 cs.CL cs.AI 版本更新

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

将搜索与训练解耦:通过模型合并实现大规模语言模型预训练的数据混合缩放

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

发表机构 * NLP Team, Xiaohongshu Inc., Shanghai, China(小红书自然语言处理团队,小红书公司,上海,中国) Tsinghua University, Beijing, China(清华大学,北京,中国) The University of Tokyo, Tokyo, Japan(东京大学,东京,日本)

AI总结 提出DeMix框架,通过模型合并预测最优数据配比,在降低搜索成本的同时提升基准性能。

Comments 18 pages, 5 figures, accepted at ICML 2026

详情
AI中文摘要

确定有效的数据混合是大语言模型(LLM)预训练的关键因素,模型必须在通用能力与数学、代码等困难任务的专业性之间取得平衡。然而,识别最优混合仍然是一个开放挑战,现有方法要么依赖不可靠的小规模代理实验,要么需要代价高昂的大规模探索。为此,我们提出“将搜索与训练解耦混合”(DeMix),一种利用模型合并预测最优数据配比的新框架。DeMix不是为每个采样的混合训练代理模型,而是按规模在候选数据集上训练组件模型,并通过加权模型合并推导数据混合代理。这种范式将搜索与训练成本解耦,使得无需额外训练负担即可评估无限采样的混合,从而通过更多搜索试验促进更好的混合发现。大量实验表明,DeMix打破了充分性、准确性和效率之间的权衡,以更低的搜索成本获得更高基准性能的最优混合。此外,我们发布了DeMix语料库,一个包含高质量预训练数据和已验证混合的综合22T令牌数据集,以促进开放研究。我们的代码和DeMix语料库可在https://github.com/Lucius-lsr/DeMix获取。

英文摘要

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora is available at https://github.com/Lucius-lsr/DeMix.

2601.15197 2026-06-01 cs.AI cs.CL cs.CV cs.RO 版本更新

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

LangForce: 通过潜在动作查询对视觉语言动作模型进行贝叶斯分解

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

发表机构 * Huazhong University of Science and Technology(华中科技大学) Beijing Zhongguancun Academy(北京中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Harbin Institute of Technology(哈尔滨工业大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhengzhou University(郑州大学) Beihang University(北航) East China Normal University(东华大学) DeepCybot Co., Ltd.(DeepCybot有限公司)

AI总结 针对VLA模型在训练中因数据偏差导致语言信息被忽略的问题,提出LangForce框架,通过贝叶斯分解和潜在动作查询构建双分支架构,最大化动作与指令的点互信息,无需新数据即可显著提升泛化能力。

Comments ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中显示出潜力,但往往难以泛化到新指令或复杂的多任务场景。我们识别出当前训练范式中的一个关键病理:目标驱动的数据收集造成了数据集偏差。在此类数据集中,仅凭视觉观察就能高度预测语言指令,导致指令与动作之间的条件互信息消失,我们将此现象称为信息崩溃。因此,模型退化为忽略语言约束的纯视觉策略,并在分布外(OOD)设置中失败。为解决此问题,我们提出LangForce,一种通过贝叶斯分解强制执行指令跟随的新框架。通过引入可学习的潜在动作查询,我们构建了一个双分支架构,用于估计纯视觉先验 $p(a \mid v)$ 和语言条件后验 $π(a \mid v, \ell)$。然后我们优化策略以最大化动作与指令之间的条件点互信息(PMI)。该目标有效惩罚了视觉捷径,并奖励明确解释语言命令的动作。无需新数据,LangForce显著提升了泛化能力。在SimplerEnv和RoboCasa上的大量实验证明了显著改进,包括在具有挑战性的OOD SimplerEnv基准上提升11.3%,验证了我们的方法在动作中稳健地锚定语言的能力。

英文摘要

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments across on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.

2511.16084 2026-06-01 cs.CV cs.AI 版本更新

SpectralTrain: A Universal Framework for Hyperspectral Image Classification

SpectralTrain:一种通用的高光谱图像分类框架

Meihua Zhou, Liping Yu, Xinyu Tong, Wai Kin Fung, Ruiguo Hu, Jiarui Zhao, Nan Wan

发表机构 * School of Medical Information, Wannan Medical University(皖南医学院信息学院) University of Chinese Academy of Sciences(中国科学院大学) The Chinese University of Hong Kong(香港中文大学) Northeastern University(东北大学)

AI总结 提出SpectralTrain通用训练框架,通过课程学习与基于PCA的光谱下采样提升高光谱图像分类效率,在多个数据集上实现2-7倍训练加速且精度损失小。

详情
AI中文摘要

高光谱图像(HSI)分类通常涉及大规模数据和计算密集的训练,这限制了深度学习模型在实际遥感任务中的部署。本研究引入SpectralTrain,一个通用的、与架构无关的训练框架,通过将课程学习(CL)与基于主成分分析(PCA)的光谱下采样相结合,提高学习效率。通过逐步引入光谱复杂性同时保留关键信息,SpectralTrain能够在显著降低计算成本的情况下高效学习光谱-空间模式。该框架独立于特定架构、优化器或损失函数,并与经典和最先进(SOTA)模型兼容。在三个基准数据集——Indian Pines、Salinas-A和新引入的CloudPatch-7上的大量实验表明,该框架在空间尺度、光谱特性和应用领域上具有很强的泛化能力。结果显示,训练时间一致减少2-7倍,精度变化取决于骨干网络。在云分类上的应用进一步揭示了其在气候相关遥感中的潜力,强调训练策略优化作为HSI模型中架构设计的有效补充。代码可在https://github.com/mh-zhou/SpectralTrain获取。

英文摘要

Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral -- spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets -- Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 -- demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent reductions in training time by 2-7x speedups with small-to-moderate accuracy deltas depending on backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.

2605.11946 2026-06-01 cs.AI 版本更新

Counterfactual Trace Auditing of LLM Agent Skills

LLM Agent技能的反事实痕迹审计

Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi, Xiyang Hu

发表机构 * Arizona State University(亚利桑那州立大学) University of Southern California(南加州大学) Adobe Research(Adobe研究)

AI总结 提出反事实痕迹审计(CTA)框架,通过配对有无技能的Agent轨迹并生成结构化技能影响模式(SIP)注释,揭示技能对行为的重塑效应,弥补仅通过通过率评估的不足。

Comments Code and data are available at https://github.com/WillChow66/CTA.git

详情
AI中文摘要

大型语言模型Agent越来越多地配备Agent技能。当前对技能的评估方法仍然有限。大多数已部署的基准测试仅报告技能附加前后的通过率,将技能视为对Agent行为的黑盒更改。我们引入了反事实痕迹审计(CTA),这是一个衡量技能如何改变Agent行为的框架。CTA将每个带技能的Agent轨迹与同一任务上不带技能的对应轨迹配对,将两条轨迹分割成目标导向的阶段,对齐这些阶段,并输出结构化的技能影响模式(SIP)注释。这些注释描述了技能的行为效果,而不仅仅是任务结果。我们在SWE-Skills-Bench上使用Claude对49个软件工程任务实例化了CTA。由此产生的审计揭示了一个明显的评估差距。通过率平均仅变化+0.3个百分点,表明总体效果很小。然而,CTA在相同的配对轨迹中识别出522个SIP实例,表明即使在通过率几乎不变的情况下,技能也显著重塑了Agent行为。审计还分离了通过率无法检测到的几种反复出现的效果,包括字面模板复制、偏离任务的人工制品创建、过度规划和任务恢复。出现了三个发现。首先,高基线任务包含了大多数观察到的技能效果,尽管它们的通过率已经饱和,因此无法反映这些效果。其次,基线性能适中的任务显示出最大的可恢复增益,但通常以显著更高的令牌成本为代价。第三,主导的SIP类型可以通过基线桶识别:表面锚定在最高任务中最常见,边缘案例提示在中档和最低任务中最常见。这些规律将非正式的故障模式观察转化为可重复的行为测量。

英文摘要

Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with skill agent trace with a without skill counterpart on the same task, segments both traces into goal directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off task artifact creation, excess planning, and task recovery. Three findings emerge. First, high baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure mode observations into reproducible behavioral measurements.

2605.11336 2026-06-01 cs.IR cs.AI cs.CL cs.HC 版本更新

Much of Geospatial Web Search Is Beyond Traditional GIS

大部分地理空间网络搜索超越了传统GIS

Ilya Ilyankou, Stefano Cavazzi, James Haworth

发表机构 * SpaceTimeLab(空间时间实验室) Department of Civil, Environmental, and Geomatic Engineering(土木、环境与测绘工程系) UCL(伦敦大学学院)

AI总结 通过密集句子嵌入、SetFit分类器和密度聚类,在MS MARCO语料库中发现18%的查询具有地理空间性质,并构建了88类分类体系,揭示地理搜索以事务性和实用性查询为主,多数超出传统GIS和知识图谱范围。

详情
AI中文摘要

网络搜索查询涉及地点的频率远高于现有标注方案所表明的,然而地理空间网络搜索查询的景观——人们对地点的询问内容及其频率——在大规模上仍然缺乏特征描述。我们对包含101万条真实必应查询的完整MS MARCO语料库应用密集句子嵌入、轻量级SetFit分类器和基于密度的聚类,无需预先过滤地名或空间关键词,识别出181,827条地理空间查询(18.0%),几乎是原始标注中标记为“位置”的6.17%的三倍。由此产生的88个查询类别分类体系揭示,地理空间网络搜索以事务性和实用性查询为主:仅成本和价格就占地理空间查询的15.3%,几乎是整个自然地理主题规模的两倍。这些活动中的大部分——成本、营业时间、联系方式、天气、旅行推荐——超出了传统GIS和知识图谱旨在服务的范围。这些类别在它们所接受的答案类型上差异很大,从可由空间数据库或知识图谱回答的确定性查询,到需要生成式或实时系统的评估性或时间波动性查询。我们讨论了对混合检索架构以及大型语言模型中地理推理基准的启示。我们公开发布了标注数据集、分类器和分类体系。

英文摘要

Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope of what traditional GIS and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.

2605.11134 2026-06-01 cs.LG cs.AI 版本更新

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

偏好优化中的虚假相关学习:机制、后果及通过平局训练的缓解方法

Christian Moya, Alex Semendinger, Guang Lin, Elliott Thornley

发表机构 * Department of Mathematics, Purdue University, West Lafayette IN, USA(普渡大学数学系) School of Mechanical Engineering, Purdue University, West Lafayette IN, USA(普渡大学机械工程学院) Massachusetts Institute of Technology, Cambridge MA, USA(麻省理工学院)

AI总结 本文通过统一理论分析揭示了偏好优化(如DPO)中虚假相关学习的机制(均值虚假偏差和因果-虚假相关泄漏),证明其导致分布偏移下的不可逆脆弱性,并提出平局训练数据增强策略以选择性减少虚假学习。

Comments Proceedings of the 43rd International Conference on Machine Learning, 2026, Seoul, South Korea

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, 2026, Seoul, South Korea
AI中文摘要

偏好学习方法(如直接偏好优化DPO)已知会诱导对虚假相关的依赖,导致当前语言模型中的谄媚和长度偏差,并可能在未来系统中造成严重的目标泛化错误。在这项工作中,我们对此现象进行了统一的理论分析,描述了虚假学习的机制、其在部署中的后果以及一种可证明的缓解策略。聚焦于对数线性策略,我们展示了标准偏好学习目标通过两个渠道在总体水平上诱导对虚假特征的依赖:均值虚假偏差和因果-虚假相关泄漏。然后我们表明这种依赖造成了分布偏移的不可逆脆弱性:来自相同训练分布的更多数据无法减少模型对虚假特征的依赖。为了解决这个问题,我们提出了平局训练,一种使用平局(等效用偏好对)的数据增强策略,以引入数据驱动的正则化。我们证明了该方法选择性地减少虚假学习而不降低因果学习。最后,我们在对数线性模型上验证了我们的理论,并提供了实证证据,表明虚假学习机制和平局训练的益处均适用于神经网络和大语言模型。

英文摘要

Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal-spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.

2602.16165 2026-06-01 cs.LG cs.AI 版本更新

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

HiPER: 具有显式信用分配的分层强化学习用于大型语言模型智能体

Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong

发表机构 * University of Minnesota Northwestern University Amazon AGI Texas A\&M University Cisco Research

AI总结 针对稀疏奖励长程任务中LLM智能体信用分配困难的问题,提出HiPER分层规划-执行框架,通过分层优势估计(HAE)在规划和执行层面显式分配信用,在ALFWorld和WebShop上达到97.4%和83.3%的成功率。

Comments ICML 2026

详情
AI中文摘要

将LLM训练为用于多轮决策的交互式智能体仍然具有挑战性,特别是在具有稀疏和延迟奖励的长程任务中,智能体必须在获得有意义的反馈之前执行一系列扩展的动作。大多数现有的强化学习方法将LLM智能体建模为在单一时间尺度上运行的扁平策略,每轮选择一个动作。在稀疏奖励设置中,这种扁平策略必须跨整个轨迹传播信用,而没有显式的时间抽象,这常常导致不稳定的优化和低效的信用分配。我们提出HiPER,一种新颖的分层规划-执行强化学习框架,明确地将高层规划与低层执行分开。HiPER将策略分解为一个提出子目标的高层规划器和一个在多个动作步骤中执行这些子目标的低层执行器。为了将优化与此结构对齐,我们引入了一种称为分层优势估计(HAE)的关键技术,该技术在规划和执行层面仔细分配信用。通过聚合每个子目标执行过程中的回报并协调两个层面的更新,HAE提供了无偏的梯度估计器,并且与扁平广义优势估计相比,可证明地减少了方差。实验上,HiPER在具有挑战性的交互式基准测试中达到了最先进的性能,在ALFWorld上达到97.4%的成功率,在WebShop上达到83.3%的成功率(使用Qwen2.5-7B-Instruct,分别比先前最佳方法高出6.6%和8.3%),在需要多个依赖子任务的长程任务上尤其取得了巨大收益。这些结果突显了显式层次分解对于多轮LLM智能体的可扩展RL训练的重要性。

英文摘要

Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4\% success on ALFWorld and 83.3\% on WebShop with Qwen2.5-7B-Instruct (+6.6\% and +8.3\% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.

2605.08145 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

自描述多模态交互调优:放大可利用冗余以实现鲁棒的视觉语言模型

Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) DSO National Laboratories(国防部国家实验室) Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对视觉语言模型中的幻觉和鲁棒性问题,提出自描述多模态交互调优方法,通过放大模态间冗余信息来补偿受损模态,并设计多模态交互门机制将独特交互转化为冗余交互,实验表明该方法可减少38.3%的视觉诱导错误并提升16.8%的一致性。

Comments Accepted to ICML 2026. Code: https://github.com/yurielryan/Multimodal-Interaction-Tuning

详情
AI中文摘要

当前的视觉语言模型在面对模糊或受损模态时存在幻觉和鲁棒性问题。我们假设这些问题可以通过利用模态间的共享信息来补偿受损模态得到解决。为此,我们分析了多模态交互——模态提供的冗余(共享)、独特(排他)和协同(涌现)任务相关信息——以确定它们对模型可靠性的影响。具体来说,放大冗余交互将增加这种可利用的共享信息以解决这些问题;然而,现代指令数据集通常消除冗余以优先考虑视觉定位。我们通过一个自描述工作流弥合这一差距,该工作流包含一个 extsc{多模态交互门}:一种将独特交互转化为冗余交互的机制。我们的发现表明,增加冗余可以减少38.3%的视觉诱导错误,并提高16.8%的一致性。

英文摘要

Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions -- redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities -- to determine their impacts on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a \textsc{Multimodal Interaction Gate}: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visual induced errors by 38.3\% and improve consistency by 16.8\%.

2605.06831 2026-06-01 cs.LG cs.AI 版本更新

Why DDIM Hallucinates More Than DDPM: A Theoretical Analysis of Reverse Dynamics

为什么DDIM比DDPM更容易产生幻觉:反向动力学的理论分析

Muhammad H. Ashiq, Samanyu Arora, Abhinav N. Harish, Ishaan Kharbanda, Hung Yun Tseng, Grigorios G. Chrysos

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 通过理论分析高斯混合目标下的反向ODE(DDIM)和SDE(DDPM),证明在临界时间τ后DDIM会卡在两个最近模式之间的线段上,而DDPM的随机性帮助其脱离该区域从而避免幻觉。

Comments Accepted in ICML

详情
AI中文摘要

我们从理论上研究了两种经典扩散采样器中的幻觉现象:随机去噪扩散概率模型(DDPM)和确定性去噪扩散隐式模型(DDIM)。我们分析了高斯混合目标下的反向ODE(DDIM)和SDE(DDPM),证明在临界时间τ后,(a) DDIM可能卡在连接两个最近模式的线段上,(b) DDPM的随机性帮助其脱离该区域,从而避免幻觉。我们的实证验证表明,当进入该区域时,DDPM的幻觉率显著低于DDIM。基于我们的观察,我们展示了如何使用额外的随机步骤帮助DDIM避免幻觉,并为设计改进的采样器提供了新见解。

英文摘要

We theoretically study the hallucination phenomena in two canonical diffusion samplers: the stochastic Denoising Diffusion Probabilistic Model (DDPM) and the deterministic Denoising Diffusion Implicit Model (DDIM). We analyze the reverse ODE (DDIM) and SDE (DDPM) for a Gaussian mixture target, proving that after a critical time $τ$, (a) DDIM can become stuck on the segment connecting the two nearest modes and (b) DDPM *stochasticity* helps it become unstuck from this region, thus avoiding hallucination. Our empirical validation verifies that DDPM has a significantly lower hallucination rate than DDIM when this region is entered. Building on our observations, we exhibit how using additional stochastic steps can help DDIM avoid hallucinations and offer new insights on how to design improved samplers.

2605.06235 2026-06-01 cs.IR cs.AI 版本更新

OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries

OBLIQ-Bench:揭示现代检索器中被忽视的瓶颈——潜在与隐式查询

Diane Tchuindjo, Devavrat Shah, Omar Khattab

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对现有检索基准饱和但实际搜索问题未解决的现象,提出一类“倾斜查询”并构建OBLIQ-Bench基准,揭示检索与验证之间的不对称性,即推理LLM能可靠识别潜在相关性但检索管道无法召回多数相关文档。

详情
AI中文摘要

检索基准日益饱和,但我们认为高效搜索远非已解决的问题。我们识别出一类称为“倾斜”的查询,它们寻求实例化潜在模式的文档,例如找到所有表达隐式立场的推文、展示特定失败模式的聊天记录或匹配抽象场景的转录文本。我们研究了倾斜性产生的三种机制,并引入了OBLIQ-Bench,这是一套基于真实长尾语料库的五个倾斜搜索问题。OBLIQ-Bench揭示了检索与验证之间一个被忽视的不对称性:当相关文档被呈现时,推理LLM能可靠地识别潜在相关性,但即使是复杂的检索管道也无法首先召回大多数相关文档。我们希望OBLIQ-Bench能推动研究高效捕获大规模语料库中潜在模式和隐式信号的检索架构。

英文摘要

Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of queries we call oblique, which seek documents that instantiate a latent pattern, like finding all tweets that express an implicit stance, chat logs that demonstrate a particular failure mode, or transcripts that match an abstract scenario. We study three mechanisms through which obliqueness may arise and introduce OBLIQ-Bench, a suite of five oblique search problems over real long-tail corpora. OBLIQ-Bench exposes an overlooked asymmetry between retrieval and verification, where reasoning LLMs reliably recognize latent relevance whenever relevant documents are surfaced, but even sophisticated retrieval pipelines fail to surface most relevant documents in the first place. We hope that OBLIQ-Bench will drive research into retrieval architectures that efficiently capture latent patterns and implicit signals in large corpora.

2605.06137 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Autoregressive Visual Generation Needs a Prologue

自回归视觉生成需要一个序幕

Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) hi-Lab, Xiaohongshu Inc(小红书实验室)

AI总结 提出Prologue方法,通过生成前置的序幕令牌来弥合自回归图像生成中的重建-生成差距,在不影响重建质量的前提下显著提升生成性能。

Comments Code: https://github.com/Zyriix/prologue Demo: https://huggingface.co/spaces/Zyriix/prologue-demo

详情
AI中文摘要

在这项工作中,我们提出了Prologue,一种弥合自回归(AR)图像生成中重建-生成差距的方法。Prologue不修改视觉令牌以同时满足重建和生成,而是生成一小部分序幕令牌,并将其前置到视觉令牌序列之前。这些序幕令牌仅使用AR交叉熵(CE)损失进行训练,而视觉令牌则专用于重建。这种解耦设计使我们能够通过AR模型的真实分布优化生成,而不影响重建质量,我们进一步从ELBO角度形式化了这一点。在ImageNet 256x256上,Prologue-Base在没有无分类器引导的情况下将gFID从21.01降至10.75,同时几乎保持重建不变;Prologue-Large使用标准AR模型,无需辅助语义监督,达到了具有竞争力的rFID 0.99和gFID 1.46。有趣的是,仅由AR梯度驱动,序幕令牌展现出涌现的语义结构:对16个序幕令牌进行线性探测达到35.88%的Top-1准确率,远高于标准分词器前16个令牌的23.71%;使用固定序幕令牌进行重采样保留了相似的高层语义布局。我们的结果暗示了一个新方向:通过引入单独学习的生成表示,同时保持原始表示不变,可以提升生成质量。

英文摘要

In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.

2605.01134 2026-06-01 cs.AI 版本更新

To Use AI as Dice of Possibilities with Timing Computation

将AI用作带有时序计算的可能性骰子

Jia Li, Vipin Kumar, Rui Zhang

发表机构 * Department of Surgery, University of Minnesota(明尼苏达大学外科系) Department of Computer Science & Engineering, University of Minnesota(明尼苏达大学计算机科学与工程系)

AI总结 本文提出基于动词的范式,定义时序计算和可能性,使AI能作为实现思维语法的工具,并在乳腺癌患者数据上自动发现临床轨迹和反事实时序推断。

详情
AI中文摘要

主流的基于名词的建模范式从根本上限制了AI的发展,无法充分表示未来作为开放的时间维度。本文引入了一种基于动词的范式,并给出了时序计算和可能性的精确定义,使AI能够成为实现我们思维语法的有效工具。将该框架应用于3276名乳腺癌患者的纵向EHR数据,实证表明:(1)自动发现具有临床意义的患者轨迹,以及(2)反事实时序推断。这两个结果都是纯数据驱动的,不需要先验领域知识,并且据我们所知,代表了机器学习文献中首次此类演示。

英文摘要

The dominant noun-based modeling paradigm has fundamentally constrained AI development, precluding any adequate representation of the future as an open temporal dimension. This paper introduces a verb-based paradigm, together with precise definitions of \emph{timing computation} and \emph{possibility}, that enables AI to function as an effective instrument for realizing the grammar of our thought. Applied to longitudinal EHR data from 3,276 breast cancer patients, the framework empirically demonstrates: (1) automatic discovery of clinically significant patient trajectories, and (2) counterfactual timing deduction. Both results are purely data-driven, require no prior domain knowledge, and, to our knowledge, represent the first such demonstrations in the machine learning literature.

2508.21762 2026-06-01 cs.CL cs.AI 版本更新

Reasoning-Intensive Regression

推理密集型回归

Diane Tchuindjo, Omar Khattab

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对推理密集型回归任务,提出MENTAT方法,结合批量反思提示优化与神经集成学习,在基准测试中相比基线提升高达65%。

详情
AI中文摘要

AI研究人员和从业者越来越多地将大型语言模型(LLMs)应用于我们称之为推理密集型回归(RiR)的任务,即从文本中推断细微的数值分数。与情感分析或相似性分析等标准语言回归任务不同,RiR通常出现在临时应用中,例如基于评分标准的评分、复杂环境中的密集奖励建模或特定领域的检索,这些任务需要对上下文进行更深入的分析,而可用的任务特定训练数据和计算资源有限。我们将四个实际问题作为RiR任务,建立初始基准,并用于测试我们的假设:即冻结的LLMs和通过梯度下降微调Transformer编码器在RiR中通常都会遇到困难。然后,我们提出MENTAT,一种简单轻量的方法,结合批量反思提示优化与神经集成学习。MENTAT在两个基线上实现了高达65%的提升,尽管未来仍有很大的改进空间。

英文摘要

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears instead in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances.

2604.27617 2026-06-01 cs.CV cs.AI 版本更新

Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

用于实时无人机桥梁检测的鲁棒轻量级裂缝分类

Wei Li, Haisheng Li, Weijie Li, Jiandong Wang, Kaichen Ma, Luming Yang

发表机构 * Bay Area Super Bridge Maintenance Technology Center, Guangdong Provincial Highway Construction Co., Ltd., Guangdong, China(湾区超级桥梁维护技术中心、广东省高速公路建设有限公司、广东,中国) Guangdong AIHISUN Technology Co., Ltd., Guangdong, China(广东AIHISUN技术有限公司、广东,中国)

AI总结 提出一个由轻量级骨干网络、CBAM注意力模块、基于场景先验的定向鲁棒增强策略和Focal Loss组成的统一轻量级CNN框架,在SDNET2018数据集上以11.21M参数和1.82G FLOPs实现825 FPS推理速度,F1分数提升2.51%,召回率提升3.95%。

详情
AI中文摘要

随着无人机在桥梁结构健康监测中的广泛应用,基于深度学习的自动裂缝检测已成为主要研究热点。然而,实际无人机检测仍面临四个关键挑战:弱裂缝特征、退化成像条件、严重类别不平衡以及实际无人机检测工作流程中有限的计算资源。为了解决这些问题,本文提出了一个统一的轻量级卷积神经网络框架,由四个协同组件组成:轻量级骨干网络、用于通道和空间增强的卷积块注意力模块(CBAM)、基于检测场景先验的定向鲁棒增强策略,以及用于类别不平衡下难样本学习的Focal Loss。在SDNET2018桥面数据集上的实验表明,所提方法仅以11.21M参数和1.82G FLOPs实现了825 FPS的推理速度。与基线模型相比,完整框架的F1分数提高了2.51%,召回率提高了3.95%。此外,Grad-CAM可视化表明,引入的注意力模块将模型关注点从分散区域转移到沿裂缝轨迹的精确跟踪。总体而言,本研究在准确性、速度和鲁棒性之间取得了强平衡,为无人机桥梁检测中地面站辅助的实时部署提供了实用解决方案。源代码可在 https://github.com/skylynf/AttXNet 获取。

英文摘要

With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources for practical UAV inspection workflows. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection-scene priors, and Focal Loss for hard-sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1-score by 2.51% and recall by 3.95%. In addition, Grad-CAM visualizations indicate that the introduced attention module shifts the model's focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground-station assisted real-time deployment in UAV bridge inspections. The source code is available at: https://github.com/skylynf/AttXNet .

2604.23468 2026-06-01 math.MG cs.AI cs.LO math.NT 版本更新

Progress in Formalizing Sphere Packing in Dimension 8

八维球堆积形式化进展

Sidharth Hariharan, Christopher Birkbeck, Seewoo Lee, Ho Kiu Gareth Ma, Bhavik Mehta, Auguste Poiroux, Maryna Viazovska

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of East Anglia(东安格利亚大学) University of California, Berkeley(加州大学伯克利分校) University of Warwick(沃里克大学) Imperial College London(伦敦帝国理工学院) Math, Inc(Math公司) École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院)

AI总结 本文介绍了使用Lean定理证明器形式化验证Viazovska在8维球堆积问题上的解,并讨论了与自动形式化模型Gauss的合作及剩余目标。

Comments 8 pages, title updated

详情
AI中文摘要

2016年,Viazovska利用模形式构造了一个满足Cohn和Elkies在2003年确定的最优性条件的“魔法”函数,著名地解决了8维球堆积问题。2024年3月,Hariharan和Viazovska启动了一个项目,旨在用Lean定理证明器形式化这一解及相关数学事实。2026年2月,一个重要的里程碑达成:该结果被形式化验证,验证的最后阶段由Math公司的自动形式化模型“Gauss”完成。我们讨论了实现这一里程碑所使用的技术,反思了人类与Gauss之间的独特合作,并讨论了剩余的项目目标。

英文摘要

In 2016, Viazovska famously solved the sphere packing problem in dimension $8$, using modular forms to construct a 'magic' function satisfying optimality conditions determined by Cohn and Elkies in 2003. In March 2024, Hariharan and Viazovska launched a project to formalize this solution and related mathematical facts in the Lean Theorem Prover. A significant milestone was achieved in February 2026: the result was formally verified, with the final stages of the verification done by Math, Inc.'s autoformalization model 'Gauss'. We discuss the techniques used to achieve this milestone, reflect on the unique collaboration between humans and Gauss, and discuss project objectives that remain.

2604.16922 2026-06-01 cs.AI 版本更新

ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis

ClimAgent:基于大语言模型的自主开放式气候科学分析智能体

Hao Wang, Jindong Han, Wei Fan, Hao Liu

AI总结 提出ClimAgent框架,通过统一工具使用环境和严格推理协议,实现端到端建模与分析,在ClimaBench基准上相比原始LLM方案提升40.21%。

Comments It was submitted without the full consent of all co-authors

详情
AI中文摘要

气候研究对于缓解全球环境危机至关重要,然而多尺度数据集的加速增长和分析工具的复杂性造成了显著瓶颈,将科学发现限制在碎片化且劳动密集的工作流程中。尽管大语言模型(LLMs)的出现为扩展科学专业知识提供了变革性范式,但现有探索仍主要局限于简单的问答(Q&A)任务。这些方法往往过度简化现实世界的挑战,忽视了专业气候科学所需的复杂物理约束和数据驱动特性。为弥补这一差距,我们引入了ClimAgent,一个通用自主框架,旨在跨不同气候子领域执行广泛的研究任务。通过将统一工具使用环境与严格推理协议相结合,ClimAgent超越了简单的检索,实现了端到端的建模与分析。为促进系统评估,我们提出了ClimaBench,这是首个面向真实气候发现的综合基准。它涵盖了源自2000年至2025年间专业场景的5个不同任务类别中的挑战性问题。在ClimaBench上的实验表明,ClimAgent显著优于最先进的基线,在解决方案的严谨性和实用性上比原始LLM解决方案提升了40.21%。我们的代码可在https://github.com/usail-hkust/ClimAgent获取。

英文摘要

Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complexity of analytical tools have created significant bottlenecks, constraining scientific discovery to fragmented and labor-intensive workflows. While the emergence Large Language Models (LLMs) offers a transformative paradigm to scale scientific expertise, existing explorations remain largely confined to simple Question-Answering (Q&A) tasks. These approaches often oversimplify real-world challenges, neglecting the intricate physical constraints and the data-driven nature required in professional climate science.To bridge this gap, we introduce ClimAgent, a general-purpose autonomous framework designed to execute a wide spectrum of research tasks across diverse climate sub-fields. By integrating a unified tool-use environment with rigorous reasoning protocols, ClimAgent transcends simple retrieval to perform end-to-end modeling and analysis. To foster systematic evaluation, we propose ClimaBench, the first comprehensive benchmark for real-world climate discovery. It encompasses challenging problems spanning 5 distinct task categories derived from professional scenarios between 2000 and 2025. Experiments on ClimaBench demonstrate that ClimAgent significantly outperforms state-of-the-art baselines, achieving a 40.21% improvement over original LLM solutions in solution rigorousness and practicality. Our code are available at https://github.com/usail-hkust/ClimAgent.

2604.22722 2026-06-01 cs.IR cs.AI cs.LG 版本更新

Aligning Dense Retrievers with LLM Utility via Distillation

通过蒸馏将稠密检索器与LLM效用对齐

Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs, Ga Wu

发表机构 * Dalhousie University(达尔豪西大学)

AI总结 提出Utility-Aligned Embeddings (UAE)框架,通过蒸馏LLM的困惑度降低效用分布来训练双编码器,在不增加测试时LLM推理开销的情况下提升稠密检索的精度和效率。

详情
AI中文摘要

稠密向量检索是检索增强生成(RAG)的实用支柱,但相似性搜索可能受限于精度。相反,利用LLM重排序的基于效用的方法通常能实现更优性能,但计算成本高且易受困惑度估计中固有噪声的影响。我们提出Utility-Aligned Embeddings (UAE),一个旨在将这些优势融合为实用、高性能检索方法的框架。我们将检索表述为分布匹配问题,使用Utility-Modulated InfoNCE目标训练双编码器以模仿由困惑度降低导出的效用分布。该方法将分级效用信号直接注入嵌入空间,无需测试时LLM推理。在QASPER基准上,UAE在召回率@1上提升30.59%,MAP提升30.16%,Token F1提升17.3%,优于强语义基线BGE-Base。关键的是,UAE比高效的LLM重排序方法快180倍以上,同时保持竞争性能,表明将检索与生成效用对齐能在规模上产生可靠的上下文。

英文摘要

Dense vector retrieval is the practical backbone of Retrieval- Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than the efficient LLM re-ranking methods preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.

2604.09429 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

射线即像素:学习视频与相机轨迹的联合分布

Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang

发表机构 * Meta AI

AI总结 提出一种视频扩散模型(Rays as Pixels),通过将相机表示为密集射线像素(raxels)并与视频帧共享潜在空间,联合去噪实现相机轨迹预测和相机控制视频生成。

Comments Accepted to ICML 2026. 9-page main paper plus supplementary material. Project page: https://wbjang.github.io/raysaspixels/

详情
AI中文摘要

从图像恢复相机参数和从新视角渲染场景在计算机视觉和图形学中被视为独立任务。当图像覆盖稀疏或姿态模糊时,这种分离会失效,因为每个任务依赖于另一个任务的输出。我们提出Rays as Pixels,一种视频扩散模型(VDM),学习视频和相机轨迹的联合分布。据我们所知,这是首个在单一框架内预测相机姿态并进行相机控制视频生成的模型。我们将每个相机表示为密集射线像素(raxels),这是一种与视频帧位于同一潜在空间的像素对齐编码,并通过解耦自交叉注意力机制联合去噪两者。一个训练好的模型处理三个任务:从视频预测相机轨迹、沿预定义轨迹从输入图像生成视频、以及从输入图像联合合成视频和轨迹。我们在姿态估计和相机控制视频生成上进行评估,并引入闭环自一致性测试,显示模型预测的姿态及其基于这些姿态的渲染结果一致。与Plücker嵌入的消融实验证实,将相机与视频共享潜在空间显著更有效。

英文摘要

Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and do camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is subtantially more effective.

2604.18587 2026-06-01 cs.LG cs.AI cs.LO cs.PL 版本更新

Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

编译以压缩:通过编译器输出提升形式定理证明器

Guchan Li, Rui Tian, Hongning Wang

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China(清华大学计算机科学与技术系)

AI总结 利用编译器将大量证明尝试压缩为结构化失败模式,提出一种学习-精炼框架,通过树搜索基于验证器反馈局部修正错误,在可比测试时预算下在PutnamBench上达到最先进性能。

详情
AI中文摘要

大型语言模型在形式定理证明中展现出显著潜力,但最先进的性能往往需要通过大量展开或扩展上下文窗口来实现令人望而却步的测试时计算。在这项工作中,我们通过利用形式验证中的一种信息结构来解决这一可扩展性瓶颈:观察到编译器将大量不同的证明尝试空间映射到一组紧凑的结构化失败模式。我们引入了一个学习-精炼框架,利用这种压缩来执行高效的学习和证明探索。我们执行树搜索,根据明确的验证器反馈局部修正错误,从而避免了积累长历史证明尝试的相关成本。大量评估表明,我们的方法在不同规模上持续增强了基础证明器的推理能力。值得注意的是,在可比较的测试时预算下,我们的方法在PutnamBench上达到了公开报告的约80亿和约320亿参数模型中的最先进性能,为下一代验证器引导推理提供了一种可扩展的范式。

英文摘要

Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported $\sim$8B and $\sim$32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.

2604.17551 2026-06-01 cs.LG cs.AI 版本更新

SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

SVL:目标条件强化学习作为生存学习

Franki Nguimatsia Tiofack, Fabian Schramm, Théotime Le Hellard, Justin Carpentier

发表机构 * Inria(法国国家信息与自动化研究所) École Normale Supérieure, PSL Research University, Paris, France(巴黎高等师范学院,PSL研究大学)

AI总结 提出生存价值学习(SVL),通过将时间到目标建模为概率分布,将目标条件强化学习重构为生存学习问题,并利用危险模型进行最大似然估计,在离线基准上匹配或超越强基线方法。

Comments Accepted to the 43rd International Conference on Machine Learning, Seoul, South Korea

详情
AI中文摘要

标准的目标条件强化学习(GCRL)方法依赖于时间差分学习,由于自举可能导致不稳定和样本效率低下。虽然最近的工作探索了对比和监督公式以提高稳定性,但我们提出了一种概率替代方案,称为生存价值学习(SVL),通过将每个状态到目标的时间建模为概率分布,将GCRL重新定义为生存学习问题。这种结构化的分布蒙特卡洛视角产生了一个闭式恒等式,将目标条件价值函数表示为生存概率的折扣和,从而通过危险模型在事件和右删失轨迹上进行最大似然估计来实现价值估计。我们引入了三种实用的价值估计器,包括有限视界截断和两种分箱无限视界近似,以捕捉长视界目标。在离线GCRL基准上的实验表明,SVL与层次化演员结合,匹配或超越了强大的层次化TD和蒙特卡洛基线,在复杂的长视界任务上表现出色。网页和代码:https://simple-robotics.github.io/publications/survival-value-learning/

英文摘要

Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, called survival value learning (SVL), that reframes GCRL as a survival learning problem by modeling the time-to-goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed-form identity that expresses the goal-conditioned value function as a discounted sum of survival probabilities, enabling value estimation via a hazard model trained via maximum likelihood on both event and right-censored trajectories. We introduce three practical value estimators, including finite-horizon truncation and two binned infinite-horizon approximations to capture long-horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long-horizon tasks. Webpage and Code: https://simple-robotics.github.io/publications/survival-value-learning/

2604.16278 2026-06-01 cs.AI cs.CL cs.LG 版本更新

Learning to Reason with Insight for Informal Theorem Proving

学习在非形式定理证明中进行洞察推理

Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song

发表机构 * City University of Hong Kong(香港城市大学) Tsinghua University(清华大学) Ke Holdings Inc.(Ke控股公司) Shenzhen University of Advanced Technology(深圳先进技术大学) Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 针对非形式定理证明中缺乏洞察(识别核心技巧)的瓶颈,提出统一训练框架DeepInsight,通过分层数据集、渐进式多阶段SFT和基于洞察的策略优化方法,显著提升大语言模型的数学推理能力。

详情
AI中文摘要

尽管大多数自动定理证明方法依赖于形式证明系统,但非形式定理证明能更好地发挥大语言模型(LLMs)在自然语言处理方面的优势。在这项工作中,我们识别出非形式定理证明的一个主要瓶颈是缺乏洞察,即难以识别解决复杂问题所需的核心技巧。为了解决这个问题,我们提出了$ exttt{DeepInsight}$,一个统一的训练框架,旨在培养这种基本的推理技能,并使LLMs能够进行洞察推理。我们的框架由三个部分组成:(1)$ exttt{DeepInsightTheorem}$,一个分层数据集,通过显式提取核心技巧和证明草图以及最终证明来结构化非形式证明;(2)渐进式多阶段SFT策略,模拟人类学习过程,教授模型证明写作、规划和洞察识别;(3)$ exttt{InsightPO}$,一种策略优化方法,在此洞察层次结构上分配结构化奖励。我们在具有挑战性的数学基准上的实验表明,这种洞察感知的生成策略显著优于基线。这些结果表明,教模型识别和应用核心技巧可以大幅提高其数学推理能力。

英文摘要

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose $\texttt{DeepInsight}$, a unified training framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. Our framework consists of three components: (1) $\texttt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof; (2) a Progressive Multi-Stage SFT strategy that mimics the human learning process, teaching the model proof writing, planning, and insight identification; and (3) $\texttt{InsightPO}$, a policy optimization method that assigns structured rewards over this insight hierarchy. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.

2604.11613 2026-06-01 cs.LG cs.AI 版本更新

Symmetry Reveals Layerwise Dynamics: How Transformers Perform In-Context Classification

对称性揭示逐层动力学:Transformer如何执行上下文分类

Patrick Lutz, Themistoklis Haris, Arjun Chandra, Aditya Gangrade, Venkatesh Saligrama

发表机构 * Boston University, Departments of Computer Science

AI总结 通过强制特征和标签排列等变性,从Transformer中提取出显式的深度索引递归更新规则,揭示了上下文分类的几何驱动算法。

Comments appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML '26)

详情
AI中文摘要

Transformer可以从少量标记示例中执行上下文分类,但推理时的算法仍然不透明。我们研究了硬无间隔机制下的多类线性分类,并通过在每一层强制特征和标签排列等变性使计算可识别。这实现了可解释性,同时保持了功能等价性,并产生了高度结构化的权重。从这些模型中,我们提取出一个显式的深度索引递归:一个端到端可识别的、在softmax Transformer内部涌现的更新规则,据我们所知这是首个此类规则。由混合特征-标签Gram结构形成的注意力矩阵驱动训练点、标签和测试探针的耦合更新。由此产生的动力学实现了一个几何驱动的算法主题,该主题可以证明放大类别分离并产生鲁棒的期望类别对齐。

英文摘要

Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion: an end-to-end identified, emergent update rule inside a softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature-label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement a geometry-driven algorithmic motif, which can provably amplify class separation and yields robust expected class alignment.

2603.12277 2026-06-01 cs.CL cs.AI cs.CR 版本更新

Prompt Injection as Role Confusion

提示注入作为角色混淆

Charles Ye, Jasmine Cui, Dylan Hadfield-Menell

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过角色探测和CoT伪造攻击,揭示提示注入源于LLM对文本来源的角色感知混淆,并提出角色混淆程度可预测攻击成功率。

Comments ICML 2026

详情
AI中文摘要

LLM将世界视为单一的文本流,并划分为<user>或<tool>等角色。我们将提示注入追溯到角色混淆:模型根据文本听起来的方式而非其标记的角色来感知文本来源。隐藏在网页中的命令劫持了代理,仅仅因为它听起来像<user>文本,尽管其标签是<tool>。我们设计了角色探测器来测量LLM内部如何感知“谁在说话”,并发现注入的文本占据了与它所模仿的可信角色相同的表示空间。我们通过CoT伪造(一种零样本攻击)证明了这一点,该攻击将捏造的推理注入用户提示和工具输出中。模型将伪造内容误认为是自己的思维,导致对前沿模型的攻击成功率达到60%,而基线接近零。引人注目的是,角色混淆的程度可以在生成单个token之前预测攻击成功。这一机制超越了CoT伪造,适用于标准的代理提示注入,揭示了提示注入是角色感知的可测量后果。对模型而言,听起来像某个角色与成为该角色是无法区分的。

英文摘要

LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated. This mechanism generalizes beyond CoT Forgery to standard agent prompt injections, revealing prompt injection as a measurable consequence of role perception. To the model, sounding like a role is indistinguishable from being one.

2509.10078 2026-06-01 cs.CL cs.AI 版本更新

Human Psychometric Questionnaires Mischaracterize LLM Behavior

人类心理测量问卷误判LLM行为

Woojung Song, Dongmin Choi, Yoonah Park, Jongwook Han, Eun-Ju Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院) Department of Communication, Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学通信系人工智能交叉学科项目)

AI总结 通过比较LLM在Likert问卷和生成概率上的价值与人格特征,发现问卷存在系统性偏差,提出基于生成概率的评估方法更准确。

Comments 38 pages, 6 figures

详情
AI中文摘要

我们检验了人类心理测量问卷是否可以作为可靠工具来表征和预测LLM在日常用户交互中的行为。我们分析了八个开源LLM,比较了从两种不同方法得出的价值和人格特征:基于既定问卷(PVQ-40/21和BFI-44/10)的Likert自我报告,以及对日常用户查询的价值负载响应的生成概率。两种特征显著不同。在生成概率中,常被引为LLM稳定倾向证据的构念内项目一致性消失了。我们将这一差距归因于既定问卷项目中的显式词汇线索使模型能够识别目标构念并以一致、社会期望的方式响应,而现实用户查询不提供此类线索。此外,人口统计角色提示以与真实人类模式一致的方式改变了模型对人类问卷的响应,但在对现实用户查询的响应生成概率中没有出现此类变化,表明它们在模拟目标人口统计在真实世界用户交互中的行为方面能力有限。总体而言,我们的研究表明,人类心理测量问卷不足以预测LLM行为,并建议基于生成的评估作为更准确的度量。

英文摘要

We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.

2604.01985 2026-06-01 cs.LG cs.AI cs.RO 版本更新

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

World Action Verifier: 通过前向-反向不对称性自我改进世界模型

Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du

发表机构 * Stanford University(斯坦福大学) UC San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学) Google DeepMind(谷歌深Mind) Harvard University(哈佛大学)

AI总结 提出World Action Verifier (WAV)框架,利用状态合理性和动作可达性的独立验证以及前向-反向不对称性,通过视频语料库的多样子目标生成器和稀疏逆模型实现循环一致性,从而在欠探索区域自我改进世界模型,在多个任务中样本效率提升2倍且下游策略性能提升22%以上。

Comments Project Website: https://world-action-verifier.github.io

详情
AI中文摘要

通用世界模型有望实现可扩展的策略评估、优化和规划,但达到所需的鲁棒性仍然具有挑战性。与主要关注最优动作的策略学习不同,世界模型需要在大量次优动作的空间中保持可靠,而这些动作在带有动作标签的机器人交互中往往代表性不足。为了解决这一挑战,我们提出了World Action Verifier (WAV)框架,该框架使世界模型能够识别自身的预测错误并进行自我改进。关键思想是将动作条件的状态预测分解为两个独立可验证的因素:状态合理性和动作可达性。我们证明,由于两个潜在的不对称性——更广泛的无动作数据的可用性和动作相关特征的更低维度——验证这些因素比直接前向预测更容易处理。利用这些不对称性,我们通过(i)从视频语料库中获得的多样子目标生成器和(ii)从状态特征子集推断动作的稀疏逆模型来增强世界模型。通过强制提议的子目标、推断的动作和前向展开之间的循环一致性,WAV在现有方法常常失败的欠探索区域提供了一种有效的验证机制。在涵盖MiniGrid、RoboMimic和ManiSkill的九个任务中,我们的方法实现了2倍的样本效率提升,同时将下游策略性能提高了22%以上。

英文摘要

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning which primarily focuses on optimal actions, a world model needs to be reliable over a vast space of suboptimal actions, which are often underrepresented in action-labeled robot interactions. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two independently verifiable factors: state plausibility and action reachability. We show that verifying these factors is significantly more tractable than direct forward prediction due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among proposed subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods often fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by over 22%.

2603.20253 2026-06-01 physics.comp-ph cs.AI cs.DC cs.LG 版本更新

SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

SimulCost: 一个用于自动化物理模拟的代价感知基准与工具包

Yadi Cao, Sicheng Lai, Jiahe Huang, Yang Zhang, Zach Lawrence, Rohan Bhakta, Izzy F. Thomas, Mingyun Cao, Chung-Hao Tsai, Zihao Zhou, Yidong Zhao, Hao Liu, Alessandro Marinoni, Alexey Arefiev, Rose Yu

发表机构 * University of California San Diego(加州大学圣地亚哥分校) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Peking University(北京大学) University of California, Los Angeles(加州大学洛杉矶分校) California Institute of Technology(加州理工学院) ETH Zurich(苏黎世联邦理工学院)

AI总结 针对现有LLM评估忽略工具使用代价的问题,提出SimulCost基准,通过单轮和多轮参数调优任务比较LLM与传统扫描方法在准确性和计算代价上的表现,发现LLM在高精度任务中初始猜测不可靠且多轮模式效率更低。

Comments accepted version at ICML

详情
AI中文摘要

评估用于科学任务的LLM代理时,现有研究主要关注令牌成本,而忽略了工具使用成本,如模拟时间和实验资源。因此,在现实预算约束下,pass@k等指标变得不切实际。为弥补这一差距,我们引入了SimulCost,这是首个针对物理模拟中代价敏感参数调优的基准。SimulCost将LLM调优代价敏感参数与传统扫描方法在准确性和计算成本上进行比较,涵盖了来自流体动力学、固体力学和等离子体物理的13个模拟器中的2,947个单轮(初始猜测)和1,931个多轮(通过试错调整)任务。每个模拟器的成本是解析定义的且与平台无关。前沿LLM在单轮模式下的成功率为46-65%,在高精度要求下下降至35-55%,使得它们的初始猜测在高精度任务中不可靠。多轮模式将成功率提升至72-81%,但LLM比传统扫描慢1.5-2.5倍,因此不是经济的选择。我们还研究了参数组相关性以了解知识迁移潜力,以及上下文示例和推理努力的影响,为部署和微调提供了实际意义。我们将SimulCost开源为一个静态基准和可扩展工具包,以促进改进物理模拟的代价感知代理设计以及扩展新的模拟环境的研究。代码和数据可在https://github.com/Rose-STL-Lab/SimulCost-Bench获取。

英文摘要

Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning cost-sensitive parameters against traditional scanning approach in both accuracy and computational cost, spanning 2,947 single-round (initial guess) and 1,931 multi-round (adjustment by trial-and-error) tasks across 13 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46-65% success rates in single-round mode, dropping to 35-55% under high accuracy requirements, rendering their initial guesses unreliable especially for high accuracy tasks. Multi-round mode improves rates to 72-81%, but LLMs are 1.5-2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in-context examples and reasoning effort, providing practical implications for deployment and fine-tuning. We open-source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost-aware agentic designs for physics simulations, and for expanding new simulation environments. Code and data are available at https://github.com/Rose-STL-Lab/SimulCost-Bench.

2603.23977 2026-06-01 cs.LG cs.AI 版本更新

Circuit-Inspired High-Order Neural Networks with Unified Neural Dynamics Modeling for PDE Solving and Visual Perception

电路启发的具有统一神经动力学建模的高阶神经网络用于PDE求解与视觉感知

Tongfei Chen, Jingying Yang, Linlin Yang, Juan Zhang, Jinhu Lü, David Doermann, Chunyu Xie, Long He, Tian Wang, Guodong Guo, Baochang Zhang

发表机构 * Communication University of China(通信大学) AI Research, Qihoo 360(360人工智能研究院,奇虎360) Eastern Institute of Technology, Ningbo(宁波工程技术院)

AI总结 提出电路启发的高阶神经网络(CHONN),通过基尔霍夫级联组合实现高阶动力学算子,在PDE求解、长期物理预测和ImageNet-1K识别中提升结构保真度和稳定性。

详情
AI中文摘要

深度网络通常依赖架构启发式方法来塑造表示演化,限制了其对由内在动力学支配的数据的建模能力。我们提出了电路启发的高阶神经网络(CHONN),这是一个模块化框架,将表示演化视为一个潜在势过程,并通过基尔霍夫启发的级联组合增加其有效阶数。单个基尔霍夫神经单元实现稳定的一阶更新,而串行组合的单元在一个块内形成高阶动力学算子。这种构造是可解释的、数值稳定的,并且与常见的神经骨干网络兼容。理论分析表明,级联单元诱导出端到端的高阶算子,控制实验证明块内高阶构造不同于通用深度堆叠,特别是在导数敏感度量上。在稳态算子学习、长期物理预测和ImageNet-1K识别中,CHONN提高了结构保真度、滚动稳定性和视觉表示学习。这些结果将高阶电路组合确定为神经动力学建模的一般原则。

英文摘要

Deep networks often rely on architectural heuristics to shape representation evolution, limiting their ability to model data governed by intrinsic dynamics. We present the Circuit-inspired High-Order Neural Network (CHONN), a modular framework that treats representation evolution as a latent potential process and increases its effective order through Kirchhoff-inspired cascade composition. A single Kirchhoff Neural Cell implements a stable first-order update, while serially composed cells form higher-order dynamical operators within one block. This construction is interpretable, numerically stable and compatible with common neural backbones. Theoretical analysis shows that cascaded cells induce end-to-end high-order operators, and controlled experiments demonstrate that intra-block high-order construction differs from generic depth stacking, especially on derivative-sensitive measures. Across steady-state operator learning, long-horizon physical forecasting and ImageNet-1K recognition, CHONN improves structural fidelity, rollout stability and visual representation learning. These results identify high-order circuit composition as a general principle for neural dynamics modeling.

2601.11702 2026-06-01 cs.HC cs.AI 版本更新

PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation

PASTA: 一种用于多策略AI合规评估的可扩展框架

Yu Yang, Ig-Jae Kim, Dongwook Yoon

发表机构 * The University of British Columbia(不列颠哥伦比亚大学) Korea Institute of Science and Technology(韩国科学技术院)

AI总结 提出PASTA框架,通过模型卡格式、策略规范化、LLM驱动的成对评估引擎和可解释界面,实现多策略AI合规的快速、低成本评估,专家评估显示与人类判断高度一致。

Comments 28 pages, 7 figures

详情
AI中文摘要

随着AI系统变得更加强大和普及,AI合规性变得越来越关键。然而,AI政策的快速扩张给缺乏政策专业知识的资源受限从业者带来了沉重负担。现有方法通常一次只处理一项政策,使得多政策合规成本高昂。我们提出了PASTA,一种可扩展的合规工具,集成了四项创新:(1)一种全面的模型卡格式,支持跨开发阶段的描述性输入;(2)一种策略规范化方案;(3)一个高效的基于LLM的成对评估引擎,具有成本节约策略;(4)一个通过合规热图和可操作建议提供可解释评估的界面。专家评估显示,PASTA的判断与人类专家高度一致(ρ≥.626)。该系统在约3美元的成本下,在两分钟内评估五项主要政策。一项用户研究(N=12)证实,从业者发现输出易于理解和可操作,为可扩展的自动化AI治理引入了一个新颖的框架。

英文摘要

AI compliance is becoming increasingly critical as AI systems grow more powerful and pervasive. Yet the rapid expansion of AI policies creates substantial burdens for resource-constrained practitioners lacking policy expertise. Existing approaches typically address one policy at a time, making multi-policy compliance costly. We present PASTA, a scalable compliance tool integrating four innovations: (1) a comprehensive model-card format supporting descriptive inputs across development stages; (2) a policy normalization scheme; (3) an efficient LLM-powered pairwise evaluation engine with cost-saving strategies; and (4) an interface delivering interpretable evaluations via compliance heatmaps and actionable recommendations. Expert evaluation shows PASTA's judgments closely align with human experts ($ρ\geq .626$). The system evaluates five major policies in under two minutes at approximately \$3. A user study (N = 12) confirms practitioners found outputs easy-to-understand and actionable, introducing a novel framework for scalable automated AI governance.

2603.22867 2026-06-01 cs.AR cs.AI cs.LG 版本更新

TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

TRINE: 一种面向多模态AI的令牌感知、运行时自适应FPGA推理引擎

Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Suyeon Jang, Behnam Khaleghi, Fei Wen, Mohsen Imani

发表机构 * University of California, Irvine(加州大学尔湾分校) Purdue University Northwest(北达科他州立大学) Qualcomm(高通) Samsung(三星)

AI总结 针对多模态AI中不同计算/内存模式导致嵌入式平台实时性不足的问题,提出TRINE,一种无需重配置的单比特流FPGA加速器与编译器,通过统一层映射、运行时模式切换、令牌剪枝和依赖感知层卸载,实现端到端多模态推理,在Alveo U50和ZCU104上相比RTX 4090和Jetson Orin Nano分别降低延迟22.57倍和6.86倍,功耗仅20-21W。

Comments Accepted to DAC 2026

详情
AI中文摘要

混合ViT、CNN、GNN和Transformer NLP的多模态堆栈给嵌入式平台带来压力,因为它们的计算/内存模式不同,且硬实时目标几乎没有松弛空间。TRINE是一个单比特流FPGA加速器和编译器,无需重配置即可执行端到端多模态推理。层被统一为DDMM/SDDMM/SpMM,并映射到一个模式可切换的引擎上,该引擎在运行时在权重/输出驻留脉动阵列、1xCS SIMD和可路由加法树(RADT)之间切换,共享PE阵列。一个宽度匹配的两阶段top-k单元支持流内令牌剪枝,而依赖感知层卸载(DALO)在可重构处理单元上重叠独立内核以维持利用率。在Alveo U50和ZCU104上评估,TRINE相比RTX 4090和Jetson Orin Nano分别降低延迟高达22.57倍和6.86倍,功耗20-21W;仅令牌剪枝在ViT密集型流水线上可实现高达7.8倍加速,DALO贡献高达79%的吞吐量提升。采用int8量化,代表性任务的精度下降<2.5%,为统一的视觉、语言和图工作负载提供了最先进的延迟和能效——仅需一个比特流。

英文摘要

Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms because their compute/memory patterns diverge and hard real-time targets leave little slack. TRINE is a single-bitstream FPGA accelerator and compiler that executes end-to-end multimodal inference without reconfiguration. Layers are unified as DDMM/SDDMM/SpMM and mapped to a mode-switchable engine that toggles at runtime among weight/output-stationary systolic, 1xCS SIMD, and a routable adder tree (RADT) on a shared PE array. A width-matched, two-stage top-k unit enables in-stream token pruning, while dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units to sustain utilization. Evaluated on Alveo U50 and ZCU104, TRINE reduces latency by up to 22.57x vs. RTX 4090 and 6.86x vs. Jetson Orin Nano at 20-21 W; token pruning alone yields up to 7.8x on ViT-heavy pipelines, and DALO contributes up to 79% throughput improvement. With int8 quantization, accuracy drops remain <2.5% across representative tasks, delivering state-of-the-art latency and energy efficiency for unified vision, language, and graph workloads-in one bitstream.

2603.22744 2026-06-01 cs.AI 版本更新

LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

LH-Bench:面向主观企业任务的长期智能体技能基础评估

Abhishek Chandwani, Ishan Gupta

发表机构 * Metaphi Inc.(Metaphi公司)

AI总结 提出LH-Bench,通过专家基础评分标准、真实标注工件和成对偏好评估三支柱,解决主观企业任务中长期自主执行的评估问题。

详情
AI中文摘要

大型语言模型在数学和编程等客观可验证任务上表现出色,这些任务的评估简化为单元测试或单一正确答案。相比之下,现实世界中的企业工作通常是主观且依赖上下文的:成功取决于组织目标、用户意图以及在长期多工具工作流中产生的中间工件的质量。我们引入LH-Bench,一种三支柱评估设计,超越二元正确性,对主观企业任务中的自主长期执行进行评分。这些支柱包括:(i) 专家基础评分标准,为LLM评判者提供评估主观工作所需的领域背景;(ii) 策划的真实工件,提供逐步奖励信号(例如,内容任务的章节级注释);以及(iii) 成对人类偏好评估,用于收敛验证。我们表明,领域作者编写的评分标准比LLM作者编写的评分标准提供更可靠的评估信号(kappa = 0.60 vs. 0.46),并且人类偏好判断确认了相同的顶级分离(p < 0.05),这证明专家基础评估可以在不牺牲可靠性的情况下扩展。我们发布公共数据集,并报告两个环境的结果:Figma到代码(通过MCP针对Figma API的33个真实.fig任务)和程序化内容(41门课程,包含183个单独评估的章节,服务于一个拥有30+日常用户的课程平台)。

英文摘要

Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context-dependent: success hinges on organizational goals, user intent, and the quality of intermediate artifacts produced across long, multi-tool workflows. We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. The pillars are: (i) expert-grounded rubrics that give LLM judges the domain context needed to score subjective work, (ii) curated ground-truth artifacts that enable stepwise reward signals (e.g., chapter-level annotation for content tasks), and (iii) pairwise human preference evaluation for convergent validation. We show that domain-authored rubrics provide substantially more reliable evaluation signals than LLM-authored rubrics (kappa = 0.60 vs. 0.46), and that human preference judgments confirm the same top-tier separation (p < 0.05), evidence that expert-grounded evaluation can scale without sacrificing reliability. We release public datasets and report results on two environments: Figma-to-code (33 real .fig tasks against the Figma API via MCP) and Programmatic content (41 courses comprising 183 individually-evaluated chapters on a course platform serving 30+ daily users).

2603.21558 2026-06-01 cs.AI 版本更新

Reliable Self-Improvement Training by Verifying Reasoning, Not Just Answers

可靠的自改进训练:验证推理过程,而不仅仅是答案

Xinyu Zhang

发表机构 * Anyscale

AI总结 针对自改进训练中因依赖最终答案正确性导致推理错误累积的问题,提出VSI框架,通过步骤级结构验证(如符号计算检查算术步骤)筛选训练数据,在GSM8K上实现持续准确率提升(80.5%→91.0%)。

Comments Accepted at ICLR 2026 Workshop LLM Reasoning. 10 pages, 3 figures, 5 tables

详情
AI中文摘要

自改进训练中,模型从自身生成的解决方案中学习,有望带来持续的能力提升,但存在一个普遍失败模式:经过多轮训练后,累积的推理错误导致准确率停滞或下降。我们将这种漂移归因于标准过滤标准——仅根据最终答案的正确性保留解决方案,这使得幸运猜测(答案正确但推理有缺陷)污染训练数据。我们提出已验证自改进(VSI)框架,该框架基于步骤级结构完整性而非仅最终输出决定数据保留。VSI通过计算机代数库(sympy)重新计算算术步骤、检查中间一致性并强制执行领域约束来验证解决方案。在GSM8K上使用Qwen3-4B-Thinking进行5轮自改进评估,与四个基线(无验证、结果验证、多数投票和VSI+DPO)相比,VSI拒绝了约34%的答案正确的解决方案,成功隔离了幸运猜测。这种更清洁的训练信号驱动了所有轮次的持续准确率提升(从80.5%到91.0%),而结果验证趋于平稳,未验证的训练则崩溃。最后,将VSI检查转化为DPO偏好对,训练模型区分合理推理与幸运答案,将奖励准确率从46%提升至63%。VSI提供了一种简单、可复现的配方,用于在自动化推理检查可用时实现稳健的自改进。

英文摘要

Self-improvement training, where models learn from self-generated solutions, promises sustained capability gains but suffers from a pervasive failure mode: across multiple rounds, compounding reasoning errors cause accuracy to stall or degrade. We trace this drift to standard filtering criteria that retain solutions based solely on final answer correctness, which lets lucky guesses (correct answers with flawed reasoning) contaminate the training data. We propose Verified Self-Improvement (VSI), a framework that conditions data retention on step-level structural integrity rather than just the final output. VSI validates solutions by recomputing arithmetic steps via a computer-algebra library (sympy), checking intermediate consistency, and enforcing domain constraints. Evaluating VSI on GSM8K with Qwen3-4B-Thinking across 5 rounds of self-improvement against four baselines (no verification, outcome verification, majority voting, and VSI with DPO) shows that VSI rejects approximately 34% of correct-answer solutions, successfully isolating lucky guesses. This cleaner training signal drives sustained accuracy gains across all rounds (80.5% to 91.0%), whereas outcome verification plateaus and unverified training collapses. Finally, converting VSI checks into DPO preference pairs trains the model to distinguish sound reasoning from lucky answers, boosting reward accuracy from 46% to 63%. VSI offers a simple, reproducible recipe for robust self-improvement whenever automated reasoning checks are available.

2603.19262 2026-06-01 cs.CL cs.AI 版本更新

Empirical Characterization of Inference-Time Elicited Probability Transformations in Large Language Models

大型语言模型中推理时引发的概率变换的经验表征

Mike Farmer, Abhinav Kochar, Yugyung Lee

发表机构 * Bloch School of Management, Regnier Institute for Entrepreneurship & Innovation, University of Missouri–Kansas City(布洛赫管理学院、雷尼创业与创新研究所、密苏里大学堪萨斯城分校) Department of Computer Science, School of Science and Engineering, University of Missouri–Kansas City(计算机科学系、科学与工程学院、密苏里大学堪萨斯城分校)

AI总结 本研究通过经验观察发现,在多种推理时流程(如思维链、自我细化、检索增强和验证器引导修订)下,候选答案的概率变换遵循近似的对数比率关系,并分析了其系数变化和鲁棒性。

Comments 22 pages, 11 figures, 5 tables

详情
AI中文摘要

大型语言模型越来越依赖推理时程序,如思维链推理、自我细化、检索增强和验证器引导修订,但这些程序下引发的概率变换结构仍不清楚。我们研究外部引发的候选答案概率分配,并观察到重复出现的近似对数比率关系:\[ \log \tilde q_t(i) = α_t \left( \log q_t(i) + \log b_t(i) \right) + c_t, \] 其中 $q_t$ 和 $\tilde q_t$ 分别是引发前和引发后的概率,$b_t$ 是外部构建的证据信号,$α_t$ 是提示配置的经验描述符。在来自 GPQA Diamond、TheoremQA、MMLU-Pro 和 ARC-Challenge 的 4,975 个推理问题上,对多个指令微调模型系列进行评估,我们在约 $1.3 \times 10^5$ 个候选级观测上观察到近似对数比率关系,平均 $R^2 \approx 0.76$。系数在不同引发设置下变化,但定性相似的关系在评估条件下持续存在。使用替代统计表示、提示配置、保留评估和 token 级对数概率的鲁棒性分析表明,观察到的结构不依赖于特定的提示程序或概率估计方法。主要贡献不是代数形式本身(它与广义贝叶斯更新和概率变换框架相关),而是经验观察:在受控条件下,多样化的推理时提示流程反复表现出可复现的对数比率结构。该框架为分析推理时 LLM 流程中的校准、证据放大、不确定性传播和交互敏感性提供了协议敏感的视角。

英文摘要

Large language models increasingly rely on inference-time procedures such as chain-of-thought reasoning, self-refinement, retrieval augmentation, and verifier-guided revision, yet the structure of elicited probability transformations under these procedures remains poorly understood. We study externally elicited probability assignments over candidate answers and observe recurring approximate log-ratio relationships: \[ \log \tilde q_t(i) = α_t \left( \log q_t(i) + \log b_t(i) \right) + c_t, \] where $q_t$ and $\tilde q_t$ are pre- and post-elicitation probabilities, $b_t$ is an externally constructed evidence signal, and $α_t$ is an empirical descriptor of the prompting configuration. Across 4,975 reasoning problems from GPQA Diamond, TheoremQA, MMLU-Pro, and ARC-Challenge, evaluated on multiple instruction-tuned model families, we observe approximate log-ratio relationships with mean $R^2 \approx 0.76$ over about $1.3 \times 10^5$ candidate-level observations. Coefficients vary across elicitation settings, but qualitatively similar relationships persist across evaluated conditions. Robustness analyses using alternative statistical representations, prompting configurations, held-out evaluation, and token-level log-probabilities suggest that the observed structure is not tied to one prompting procedure or probability estimation method. The main contribution is not the algebraic form itself, which is related to generalized Bayesian updating and probability-transformation frameworks, but the empirical observation that diverse inference-time prompting pipelines repeatedly exhibit reproducible log-ratio structure under controlled conditions. The framework provides a protocol-sensitive perspective for analyzing calibration, evidence amplification, uncertainty propagation, and interaction sensitivity in inference-time LLM pipelines.

2603.18382 2026-06-01 cs.AI 版本更新

From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

从弱线索到真实身份:评估LLM代理中推理驱动的去匿名化

Myeongseob Ko, Jihyun Jeong, Sumiran Singh Thakur, Gyuhak Kim, Ruoxi Jia

发表机构 * Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA(弗吉尼亚理工学院计算机工程系) Center for Advanced AI, Accenture(Accenture高级人工智能中心)

AI总结 研究通过LLM代理结合分散的非识别线索与公开证据重建真实身份的能力,揭示了即使在没有明确标识符的情况下,代理也能以高成功率实现去匿名化,并提出了新的隐私评估维度。

Comments Accepted at ICML 2026

详情
Journal ref
ICML 2026
AI中文摘要

匿名化通常被认为一旦移除显式标识符就能保护隐私,因为重新识别历来需要专业知识、定制算法和手动验证。我们证明基于LLM的代理削弱了这一屏障:通过将分散的、单独非识别的线索与公开证据相结合,它们重建真实世界的身份,有时甚至在良性任务中也是如此。我们在三种场景中评估了这一风险——经典的链接事件、一个控制基准(\emph{InferLink}),该基准变化指纹类型、任务框架和攻击者知识,以及开放的人机交互痕迹。在Netflix奖去匿名化设置的最稀疏情况下,代理重建了79.2%的身份,而经典匹配基线为56.0%;在\emph{InferLink}上,即使没有明确的重新识别请求,代理也能链接个体,并且在给出请求时更频繁。在编辑过的人机交互痕迹中,代理通过将上下文线索与公开证据相互印证,进一步将匿名化档案解析为特定个体。这些发现表明,对代理系统的隐私评估不仅应衡量访问或披露了哪些信息,还应衡量可以推断出哪些身份。

英文摘要

Anonymization is often assumed to protect privacy once explicit identifiers are removed, because re-identification has historically required specialized expertise, tailored algorithms, and manual corroboration. We show that LLM-based agents weaken this barrier: by combining scattered, individually non-identifying cues with public evidence, they reconstruct real-world identities, sometimes even during benign tasks. We evaluate this risk across three settings -- classical linkage incidents, a controlled benchmark (\emph{InferLink}) that varies fingerprint type, task framing, and attacker knowledge, and open-ended human--AI interaction traces. In the sparsest regime of the Netflix Prize deanonymization setting, agents reconstruct 79.2\% of identities, against 56.0\% for a classical matching baseline; on \emph{InferLink}, they link individuals even without an explicit re-identification request, and more often once one is given. In redacted human--AI interaction traces, agents further resolve anonymized profiles to specific individuals by corroborating contextual cues with public evidence. These findings suggest that privacy evaluations for agentic systems should measure not only what information is accessed or disclosed, but also what identities can be inferred.

2603.17145 2026-06-01 cs.LG cs.AI 版本更新

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

REAL: 面向LLM评判的回归感知强化学习

Yasi Zhang, Tianyu Chen, Mingyuan Zhou, Oscar Leong, Ying Nian Wu, Michal Lukasik

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) The University of Texas at Austin(得克萨斯大学奥斯汀分校) Google Research Now at Google DeepMind(谷歌研究 现在在谷歌深Mind)

AI总结 提出REAL框架,通过广义策略梯度将回归目标融入强化学习,优化LLM作为评分器的数值评估,在多个规模模型上超越SFT和标准RL方法。

Comments Accepted to ICML 2026. The first two authors contributed equally

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为自动评估器,为模型输出分配数值分数,这种范式称为LLM-as-a-Judge。然而,标准的强化学习(RL)方法通常依赖二元奖励(例如0-1准确率),从而忽略了回归任务中固有的序结构;例如,当真实值为5时,它们未能识别出预测4显著优于预测1。相反,现有的回归感知方法通常局限于监督微调(SFT),限制了其探索最优推理路径的能力。为弥合这一差距,我们提出\textbf{REAL}(\underline{RE}gression-\underline{A}ware Reinforcement \underline{L}earning),这是一个原则性的RL框架,旨在优化回归奖励,并且也被证明对相关性指标是最优的。一个关键的技术挑战是回归目标显式地依赖于策略,从而使标准策略梯度方法失效。为解决此问题,我们采用广义策略梯度估计器,该估计器自然地将优化分解为两个互补组件:(1)对思维链(CoT)轨迹的探索,以及(2)最终分数的回归感知预测细化。跨模型规模(8B到32B)的大量实验表明,REAL在域外基准上始终优于回归感知SFT基线和标准RL方法,展现出显著更好的泛化能力。具体在Qwen3-32B上,我们相比SFT基线获得了+8.40 Pearson和+7.20 Spearman相关性的提升,相比基础模型提升了+18.30/+11.20。这些发现凸显了将回归目标整合到RL探索中对准确LLM评估的关键价值。

英文摘要

Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose \textbf{REAL} (\underline{RE}gression-\underline{A}ware Reinforcement \underline{L}earning), a principled RL framework designed to optimize regression rewards, and also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectory, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.

2603.16123 2026-06-01 cs.LG cs.AI math.AT math.CT 版本更新

Functorial Neural Architectures from Higher Inductive Types

基于高阶归纳类型的函子神经架构

Karen Sargsyan

发表机构 * Institute of Chemistry, Academia Sinica, Taipei, Taiwan(中国科学院化学研究所,台湾台北)

AI总结 提出通过高阶归纳类型规范编译为神经架构,强制解码器满足严格幺半函子性质,从而在组合泛化任务上比非函子方法提升2-10倍。

Comments 26 pages, 10 tables. Code and Cubical Agda formalization: https://github.com/karsar/hott_neuro

详情
AI中文摘要

神经网络通常能学习任务的各个部分,但在这些部分的新组合上失败。我们认为这种失败是架构性的:只有当解码器尊重任务的代数法则,即从自由生成的序列下降到由这些法则确定的商时,它才能组合泛化。我们通过将高阶归纳类型(HIT)规范编译为神经架构,使这一原则具有建设性。基点、路径构造子和2-胞腔分别映射为基约束、生成器网络、结构拼接和学习到的同伦。由此产生的传输解码器在构造上是严格幺半函子:解码一个拼接的词是独立生成的环段的拼接。相反,我们证明softmax自注意力无法同时满足严格幺半组合和下降到任何非平凡组合商。在环面、圆楔和克莱因瓶上的实验验证了预期的层次结构:函子解码器比非函子替代方案性能提升2-10倍,而学习到的2-胞腔恰好在使用克莱因瓶关系的词上缩小了46%的误差差距。这些结果表明,组合泛化应作为架构中的函子结构强制执行,而非仅从示例中学习。

英文摘要

Neural networks often learn the parts of a task but fail on novel combinations of those parts. We argue that this failure is architectural: a decoder generalizes compositionally only when it respects the algebraic laws of the task, i.e. when it descends from freely generated sequences to the quotient determined by those laws. We make this principle constructive by compiling Higher Inductive Type (HIT) specifications into neural architectures. Basepoints, path constructors, and 2-cells are mapped to base constraints, generator networks, structural concatenation, and learned homotopies. The resulting transport decoders are strict monoidal functors by construction: decoding a concatenated word is concatenation of independently generated loop segments. In contrast, we prove that softmax self-attention cannot simultaneously satisfy strict monoidal composition and descent to any non-trivial compositional quotient. Experiments on the torus, wedge of circles, and Klein bottle validate the predicted hierarchy: functorial decoders outperform non-functorial alternatives by $2$--$10\times$, and a learned 2-cell closes a $46\%$ error gap precisely on words exercising the Klein-bottle relation. These results suggest that compositional generalization should be enforced as functorial structure in the architecture, rather than learned from examples alone.

2603.12916 2026-06-01 cs.LG cs.AI 版本更新

Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection

Surprised by Attention: 面向时间序列异常检测的可预测查询动态

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler

发表机构 * Mercedes-Benz AG(梅赛德斯-奔驰集团) Institute for Intelligent Systems, Esslingen University of Applied Sciences(智能系统研究所,埃森嫩应用科学大学)

AI总结 提出 AxonAD 无监督检测器,通过预测多头注意力查询向量的演化并结合重构误差与查询不匹配分数,有效检测多变量时间序列中的结构依赖偏移异常。

Comments This manuscript has been accepted for publication at ECML-PKDD 2026. The final version will be published in the conference proceedings. Main: 17 Pages, 7 Figures, 3 Tables; Appendix: 3 Pages, 4 Tables

详情
AI中文摘要

多变量时间序列异常通常表现为跨通道依赖的偏移,而非简单的幅度异常。例如,在自动驾驶中,转向指令可能内部一致,但与产生的横向加速度解耦。当灵活的序列模型尽管协调性改变仍能合理重构信号时,基于残差的检测器可能遗漏此类异常。我们提出 AxonAD,一种无监督检测器,将多头注意力查询演化视为短视界可预测过程。梯度更新重构路径与仅基于历史上下文的预测器耦合,该预测器通过掩码预测器-目标目标针对指数移动平均(EMA)目标编码器进行训练。推理时,重构误差与尾部聚合的查询不匹配分数结合,该分数衡量最近时间步上预测查询与目标查询之间的余弦偏差。这种双重方法在保留幅度级检测的同时,对结构依赖偏移敏感。在带有区间标注的专有车载遥测数据以及 TSB-AD 多变量套件(17 个数据集,180 个序列)上,使用无阈值和范围感知指标,AxonAD 在排名质量和时间定位上优于强基线。消融实验证实查询预测和组合评分是观察到的改进的主要驱动因素。代码可在 https://github.com/iis-esslingen/AxonAD 获取。

英文摘要

Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In autonomous driving, for instance, a steering command might be internally consistent but decouple from the resulting lateral acceleration. Residual-based detectors can miss such anomalies when flexible sequence models still reconstruct signals plausibly despite altered coordination. We introduce AxonAD, an unsupervised detector that treats multi-head attention query evolution as a short horizon predictable process. A gradient-updated reconstruction pathway is coupled with a history-only predictor that forecasts future query vectors from past context. This is trained via a masked predictor-target objective against an exponential moving average (EMA) target encoder. At inference, reconstruction error is combined with a tail-aggregated query mismatch score, which measures cosine deviation between predicted and target queries on recent timesteps. This dual approach provides sensitivity to structural dependency shifts while retaining amplitude-level detection. On proprietary in-vehicle telemetry with interval annotations and on the TSB-AD multi-variate suite (17 datasets, 180 series) with threshold-free and range-aware metrics, AxonAD improves ranking quality and temporal localization over strong baselines. Ablations confirm that query prediction and combined scoring are the primary drivers of the observed gains. Code is available at the URL https://github.com/iis-esslingen/AxonAD.

2603.09453 2026-06-01 cs.LG cs.AI stat.ML 版本更新

Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers

变分路由:用于校准混合专家Transformer的可扩展贝叶斯框架

Albus Yizhuo Li, Matthew Wicker

发表机构 * Department of Computing, Imperial College London(伦敦帝国理工学院计算机系)

AI总结 提出变分混合专家路由(VMoER),通过将贝叶斯推断限制在专家选择阶段,实现大规模模型的不确定性校准,在微调基础模型上显著提升路由稳定性、降低校准误差并提高分布外检测AUROC,且额外计算开销极小。

Comments 8 pages, 7 figures for main text; 16 pages for Appendix; Accepted by ICML 2026;

详情
AI中文摘要

基础模型越来越多地部署在需要理解其输出不确定性的场景中,这对于确保负责任部署至关重要。虽然贝叶斯方法为不确定性量化提供了原则性方法,但其计算开销使得在基础模型规模下进行训练或推理不切实际。最先进的模型通过精心设计的稀疏性(包括混合专家(MoE)层)实现了数万亿的参数数量。在这项工作中,我们通过引入变分混合专家路由(VMoER)展示了大规模下的校准不确定性,这是一种用于建模MoE层不确定性的结构化贝叶斯方法。VMoER将贝叶斯推断限制在通常由确定性路由网络完成的专家选择阶段。我们使用两种推断策略实例化VMoER:对路由logits的摊销变分推断和推断用于随机专家选择的温度参数。在微调测试的基础模型上,VMoER在噪声下将路由稳定性提高了38%,校准误差降低了94%,分布外AUROC提高了12%,同时额外FLOPs增加不到1%。这些结果表明,VMoER为构建鲁棒且具有不确定性意识的基础模型提供了一条可扩展的路径。

英文摘要

Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. State-of-the-art models achieve parameter counts in the trillions through carefully engineered sparsity including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage which is typically done by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational inference over routing logits and inferring a temperature parameter for stochastic expert selection. Across fine-tuning tested foundation models, VMoER improves routing stability under noise by 38\%, reduces calibration error by 94\%, and increases out-of-distribution AUROC by 12\%, while incurring less than 1\% additional FLOPs. These results suggest VMoER offers a scalable path toward robust and uncertainty-aware foundation models.

2603.10468 2026-06-01 eess.AS cs.AI cs.HC cs.MM cs.SD 版本更新

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

G-STAR: 端到端全局说话人跟踪属性识别

Jing Peng, Ziyi Chen, Haoyu Li, Yucheng Wang, Duo Ma, Mengtian Li, Yunfan Du, Dezhu Xu, Kai Yu, Shuai Wang

发表机构 * Nanjing University(南京大学) Shanghai Jiao Tong University(上海交通大学) Central Media Technology Institute, Huawei(华为中央媒体技术研究院) Shenzhen Research Institute of Big Data(深圳大数据研究院) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出G-STAR框架,通过缓存条件说话人跟踪模块与Speech-LLM转录骨干耦合,实现长时重叠多说话人语音的端到端说话人属性识别,支持组件优化和联合训练,在局部和全局评估中均表现优异。

Comments submitted to Emnlp 2026

详情
AI中文摘要

我们研究了带时间戳的说话人属性自动语音识别(SA-ASR),针对长时、多说话人且存在重叠的语音。在此设置中,分块推理必须保持会议级别的说话人身份一致性,同时生成带时间戳和说话人标签的转录。先前的Speech-LLM系统倾向于优先考虑局部日志或全局标签,缺乏联合建模细粒度时间边界和鲁棒跨块身份链接的能力。我们提出G-STAR,一个端到端框架,将缓存条件的说话人跟踪模块与Speech-LLM转录骨干耦合。跟踪器提供具有时间基础的结构化说话人线索,LLM基于这些线索生成属性文本。G-STAR支持组件优化和联合端到端训练,能够在异构监督和领域偏移下进行灵活学习。在分块解码协议下,基于预言分割的局部评估和全会议全局评估的实验均显示出强大的说话人属性转录性能。

英文摘要

We study timestamped speaker-attributed automatic speech recognition (SA-ASR) for long-form, multi-party speech with overlap. In this setting, chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Prior Speech-LLM systems tend to prioritize either local diarization or global labeling, lacking the ability to jointly model fine-grained temporal boundaries and robust cross-chunk identity linking. We propose G-STAR, an end-to-end framework that couples a cache-conditioned speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Under chunk-wise decoding protocols, experiments on both oracle-segmented local evaluation and full-meeting global evaluation show strong speaker-attributed transcription performance.

2603.07551 2026-06-01 cs.SD cs.AI 版本更新

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

零样本文本转语音中的目标说话人投毒框架

Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan

发表机构 * Thomas Lord Department of Computer Science, University of Southern California, USA(汤姆斯·劳德计算机科学系,美国南加州大学) Signal Analysis and Interpretation Lab, University of Southern California, USA(信号分析与解释实验室,美国南加州大学)

AI总结 针对零样本TTS语音克隆的隐私风险,提出说话人生成投毒(SGSP)任务,通过修改训练模型阻止特定身份生成,并评估了推理时过滤和参数修改基线在1、15和100个遗忘说话人上的隐私-效用权衡。

Comments Submitted to Interspeech2026

详情
AI中文摘要

零样本文本转语音(TTS)语音克隆带来了严重的隐私风险,需要从训练好的TTS模型中移除特定说话人身份。传统的机器遗忘在此情境下不足,因为零样本TTS可以从仅参考提示动态重建声音。我们将此任务形式化为说话人生成投毒(SGSP),其中我们修改训练模型以防止生成特定身份,同时保留其他说话人的效用。我们评估了推理时过滤和参数修改基线在1、15和100个遗忘说话人上的表现。通过效用(WER)和隐私之间的权衡来评估性能,隐私使用AUC和遗忘说话人相似度(FSSIM)量化。我们在最多15个说话人上实现了强隐私,但由于身份重叠增加,在100个说话人时揭示了可扩展性限制。因此,我们的研究引入了一个新颖的问题和评估框架,以推动生成式语音隐私的进一步进展。

英文摘要

Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.

2603.06738 2026-06-01 cs.LG cs.AI 版本更新

Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention

秩分解隐式神经偏置:使用FlashAttention扩展超分辨率Transformer

Dongheon Lee, Seokju Yun, Jaegyun Im, Youngmin Ro

发表机构 * University of Seoul(首尔大学) KAIST AI(韩国科学技术院人工智能研究所)

AI总结 提出秩分解隐式神经偏置(RIB)替代相对位置偏置(RPB),通过低秩隐式神经表示和通道级拼接实现FlashAttention兼容,并引入卷积局部注意力和循环窗口策略,在Urban100×2上达到35.63 dB PSNR,训练和推理时间分别减少2.1倍和2.9倍。

详情
AI中文摘要

最近的超分辨率(SR)方法主要采用Transformer,因其强大的长程建模能力和卓越的表征能力。然而,大多数SR Transformer严重依赖相对位置偏置(RPB),这阻碍了它们利用硬件高效的注意力内核,如FlashAttention。这一限制在训练和推理过程中带来了巨大的计算负担,严重限制了通过扩大训练块大小或自注意力窗口来扩展SR Transformer的尝试。因此,与其他积极利用Transformer固有可扩展性的领域不同,SR Transformer仍然主要关注有效利用有限的感受野。在本文中,我们提出了秩分解隐式神经偏置(RIB),作为RPB的替代方案,使SR Transformer能够使用FlashAttention。具体来说,RIB使用低秩隐式神经表示来近似位置偏置,并以通道方式将它们与像素内容标记连接起来,将注意力分数计算中的逐元素偏置加法转化为点积运算。此外,我们引入了卷积局部注意力和循环窗口策略,以充分利用RIB和FlashAttention带来的长程交互优势。我们将窗口大小扩大到**96×96**,同时联合扩大训练块大小和数据集大小,最大化Transformer在SR任务中的优势。因此,我们的网络在Urban100×2上达到了**35.63 dB PSNR**,同时与基于RPB的SR Transformer(PFT)相比,训练和推理时间分别减少了**2.1倍**和**2.9倍**。

英文摘要

Recent Super-Resolution~(SR) methods mainly adopt Transformers for their strong long-range modeling capability and exceptional representational capacity. However, most SR Transformers rely heavily on relative positional bias~(RPB), which prevents them from leveraging hardware-efficient attention kernels such as FlashAttention. This limitation imposes a prohibitive computational burden during both training and inference, severely restricting attempts to scale SR Transformers by enlarging the training patch size or the self-attention window. Consequently, unlike other domains that actively exploit the inherent scalability of Transformers, SR Transformers remain heavily focused on effectively utilizing limited receptive fields. In this paper, we propose Rank-factorized Implicit Neural Bias~(RIB), an alternative to RPB that enables FlashAttention in SR Transformers. Specifically, RIB approximates positional bias using low-rank implicit neural representations and concatenates them with pixel content tokens in a channel-wise manner, turning the element-wise bias addition in attention score computation into a dot-product operation. Further, we introduce a convolutional local attention and a cyclic window strategy to fully leverage the advantages of long-range interactions enabled by RIB and FlashAttention. We enlarge the window size up to \textbf{96$\times$96} while jointly scaling the training patch size and the dataset size, maximizing the benefits of Transformers in the SR task. As a result, our network achieves \textbf{35.63\,dB PSNR} on Urban100$\times$2, while reducing training and inference time by \textbf{2.1$\times$} and \textbf{2.9$\times$}, respectively, compared to the RPB-based SR Transformer~(PFT).

2603.05529 2026-06-01 cs.DB cs.AI 版本更新

NGDBench: Towards Neural Graph Data Management

NGDBench:迈向神经图数据管理

Yufei Li, Yisen Gao, Jiaxuan Xiong, Jiaxin Bai, Shijie Zhong, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Yangqiu Song

发表机构 * HKUST(香港理工大学) Beijing Institute of Technology(北京理工大学) Hong Kong Baptist University(香港 Baptist 大学) Guangdong University of Technology(广东技术大学)

AI总结 针对现有数据管理系统在噪声、不完整和动态更新下缺乏隐式结构发现和推理能力的问题,提出NGDBench基准,通过统一结构化与非结构化来源的图视图,评估神经查询方法在噪声和动态状态跟踪中的表现。

Comments https://github.com/HKUST-KnowComp/NGDBench

详情
AI中文摘要

对现实世界决策至关重要的数据越来越多地存在于组织内部。这些数据是异构的、不断演化的,并且只能被不完美地捕获。然而,当前的数据管理系统仍然是被动的,只能检索显式存储的内容,而在噪声、不完整和持续更新的情况下,对发现隐式结构或推理的支持有限。我们认为,下一代数据管理需要神经能力,这种能力可以揭示复杂的潜在关系,区分可靠信号与噪声,并在底层数据状态演化时保持一致。为了支持这一方向,我们引入了NGDBench,这是一个跨五个领域的基准,统一了结构化和非结构化来源。NGDBench采用图视图,因为图为建模复杂系统、捕获潜在关系以及包含关系表等结构化格式提供了灵活的抽象。每个实例将一个干净的潜在图与一个实际扰动的观测图配对。NGDBench支持完整的Cypher查询和动态数据管理操作。对最先进的基于LLM的Text-to-Cypher和GraphRAG管道的评估表明,当前的神经查询方法仍然对噪声敏感,并且在动态状态跟踪方面存在困难,这凸显了对具有弹性和推理能力的数据管理的需求。我们的代码可在https://github.com/HKUST-KnowComp/NGDBench获取。

英文摘要

Data critical to real-world decision-making is increasingly found within organizations. Such data is heterogeneous, constantly evolving, and only imperfectly captured. However, current data management systems remain largely passive, retrieving what is explicitly stored while offering limited support for uncovering implicit structure or reasoning under noise, incompleteness, and continuous updates. We argue that next-generation data management requires neural capabilities, which can uncover complex latent relationships, distinguish reliable signals from noise, and remain consistent as the underlying data state evolves. To support this direction, we introduce NGDBench, a benchmark across five domains that unifies structured and unstructured sources. NGDBench adopts a graph view because graphs provide a flexible abstraction for modeling complex systems, capturing latent relationships, and subsuming structured formats such as relational tables. Each instance pairs a clean latent graph with a realistically perturbed observed graph. NGDBench supports full Cypher queries and dynamic data management operations. Evaluations of state-of-the-art Text-to-Cypher by LLMs and GraphRAG pipelines reveal that current neural query methods remain sensitive to noise and struggle with dynamic state tracking, highlighting the need for resilient, inference-capable data management. Our code is available at https://github.com/HKUST-KnowComp/NGDBench.

2603.02630 2026-06-01 cs.LG cs.AI 版本更新

MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

MASPOB: 基于图神经网络的多智能体系统提示优化方法

Zhi Hong, Qian Zhang, Jiahang Sun, Zhiwei Shang, Mingze Kong, Xiangyi Wang, Yao Shu, Zhongxiang Dai

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) South China University of Technology(华南理工大学) Ritsumeikan University(立命馆大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州))

AI总结 提出基于赌博机的样本高效框架MASPOB,利用UCB平衡探索与利用、GNN捕获拓扑先验、坐标上升分解优化,解决多智能体系统提示优化中的样本效率、拓扑耦合和组合爆炸问题。

Comments ICML 2026 Spotlight

详情
AI中文摘要

大型语言模型(LLMs)在许多实际应用中取得了巨大成功,尤其是作为多智能体系统(MAS)的认知骨干来编排复杂工作流。由于许多部署场景排除了MAS工作流修改,且其性能对输入提示高度敏感,提示优化成为提高性能的更自然方法。然而,实际中的MAS提示优化面临三个关键挑战:(1)由于评估成本高昂,需要样本效率;(2)提示之间的拓扑诱导耦合;(3)搜索空间的组合爆炸。为了解决这些挑战,我们引入了MASPOB(基于赌博机的多智能体系统提示优化),一种基于赌博机的新型样本高效框架。通过利用上置信界(UCB)量化不确定性,赌博机框架平衡了探索与利用,在严格有限的预算内最大化收益。为了处理拓扑诱导耦合,MASPOB集成了图神经网络(GNN)以捕获结构先验,学习提示语义的拓扑感知表示。此外,它采用坐标上升将优化分解为单变量子问题,将搜索复杂度从指数级降低到线性级。跨不同基准的大量实验表明,MASPOB实现了最先进的性能,持续优于现有基线。

英文摘要

Large Language Models (LLMs) have achieved great success in many real-world applications, especially the one serving as the cognitive backbone of Multi-Agent Systems (MAS) to orchestrate complex workflows in practice. Since many deployment scenarios preclude MAS workflow modifications and its performance is highly sensitive to the input prompts, prompt optimization emerges as a more natural approach to improve its performance. However, real-world prompt optimization for MAS is impeded by three key challenges: (1) the need of sample efficiency due to prohibitive evaluation costs, (2) topology-induced coupling among prompts, and (3) the combinatorial explosion of the search space. To address these challenges, we introduce MASPOB (Multi-Agent System Prompt Optimization via Bandits), a novel sample-efficient framework based on bandits. By leveraging Upper Confidence Bound (UCB) to quantify uncertainty, the bandit framework balances exploration and exploitation, maximizing gains within a strictly limited budget. To handle topology-induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology-aware representations of prompt semantics. Furthermore, it employs coordinate ascent to decompose the optimization into univariate sub-problems, reducing search complexity from exponential to linear. Extensive experiments across diverse benchmarks demonstrate that MASPOB achieves state-of-the-art performance, consistently outperforming existing baselines.

2602.22968 2026-06-01 cs.AI cs.CV cs.CY 版本更新

Certified Circuits: Stability Guarantees for Mechanistic Circuits

认证电路:机械论电路的稳定性保证

Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克研究所信息学研究所) CISPA Helmholtz Center for Information Security(信息安全赫尔姆霍茨中心)

AI总结 提出Certified Circuits框架,通过随机数据子采样认证电路组件(神经元或边)对概念数据集编辑距离扰动的稳定性,生成更紧凑、更准确的电路。

Comments Accepted at ICML 2026

详情
AI中文摘要

理解神经网络如何得出其预测对于调试、审计和部署至关重要。机械论可解释性通过识别电路——负责特定行为的最小子网络——来追求这一目标。然而,现有的电路发现方法脆弱:电路强烈依赖于所选的概念数据集,并且常常无法迁移到分布外,引发对其是否捕捉概念或仅仅是数据集特定伪影的怀疑。我们引入了Certified Circuits,它为电路发现提供了可证明的稳定性保证。我们的框架用随机数据子采样包装任何黑盒发现算法,以认证电路组件——根据基础算法,模型图的神经元或边——的包含决策对概念数据集的有界编辑距离扰动是不变的。不稳定的组件被弃用,从而产生更紧凑、更准确的电路。我们在三个架构(ResNet、ViT、GPT-2)上,针对视觉(ImageNet和四个OOD数据集)和语言(IOI、IOI-Hard、Greater-Than)任务进行了验证。认证电路实现了高达56%的更高准确率和高达80%的更少组件,并且在基线退化时保持可靠。Certified Circuits通过产生可证明稳定且与目标概念更好对齐的机械论解释,将电路发现置于形式化的基础上。代码:https://github.com/AlaaAnani/certified-circuits。

英文摘要

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits--minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts whether they capture the concept or merely dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that inclusion decisions over circuit components--neurons or edges of the model graph, depending on the base algorithm--are invariant to bounded edit-distance perturbations of the concept dataset. Unstable components are abstained from, yielding circuits that are more compact and more accurate. We validate across three architectures (ResNet, ViT, GPT-2) on vision (ImageNet and four OOD datasets) and language (IOI, IOI-Hard, Greater-Than) tasks. Certified circuits achieve up to 56% higher accuracy and up to 80% fewer components, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code: https://github.com/AlaaAnani/certified-circuits.

2603.00068 2026-06-01 cs.CY cs.AI 版本更新

The Global Landscape of Environmental AI Regulation: From the Cost of Reasoning to a Right to Green AI

环境AI监管的全球格局:从推理成本到绿色AI权利

Kai Ebert, Boris Gamazaychikov, Philipp Hacker, Sasha Luccioni

发表机构 * European University Viadrina(欧洲维德林大学) Sustainable AI Group(可持续AI小组)

AI总结 本文通过实证证据、全球监管地图和政策建议,揭示了生成式AI日益增长的环境成本,并提出模型级透明度、用户选择权和国际协调的三管齐下应对方案。

Comments 23 pages, 1 table, preprint

详情
AI中文摘要

人工智能系统造成了巨大且不断增长的环境成本,然而随着其部署加速,关于这些影响的透明度却在下降。本文做出三项贡献。首先,我们汇集了实证证据,表明2025年激增的生成式网络搜索和推理模型比前几代AI方法具有更高的累积环境影响。其次,我们绘制了跨越11个司法管辖区的全球监管格局,发现环境治理的方式(主要在设施层面而非模型层面,侧重于训练而非推理,除欧盟外有限的AI特定能源披露要求)限制了其适用性。第三,为解决这一问题,我们提出了三管齐下的政策回应:强制性的模型级透明度,涵盖推理消耗、基准和计算位置;用户有权选择退出不必要的生成式AI集成并选择环境优化的模型;以及国际协调以防止监管套利。最后,我们提出了具体的立法建议——包括对欧盟AI法案、消费者权利指令和数字服务法案的修正——这些可作为其他司法管辖区的模板。

英文摘要

Artificial intelligence (AI) systems impose substantial and growing environmental costs, yet transparency about these impacts has declined even as their deployment has accelerated. This paper makes three contributions. First, we collate empirical evidence that generative Web search and reasoning models - which have proliferated in 2025 - come with much higher cumulative environmental impacts than previous generations of AI approaches. Second, we map the global regulatory landscape across eleven jurisdictions and find that the manner in which environmental governance operates (predominantly at the facility-level rather than the model-level, with a focus on training rather than inference, with limited AI-specific energy disclosure requirements outside the EU) limits its applicability. Third, to address this, we propose a three-pronged policy response: mandatory model-level transparency that covers inference consumption, benchmarks, and compute locations; user rights to opt out of unnecessary generative AI integration and to select environmentally optimized models; and international coordination to prevent regulatory arbitrage. We conclude with concrete legislative proposals - including amendments to the EU AI Act, Consumer Rights Directive, and Digital Services Act - that could serve as templates for other jurisdictions.

2602.24210 2026-06-01 cs.CL cs.AI 版本更新

From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves

从泄露思维到私有推理:控制LRM对自己说的话

Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

发表机构 * Mohamed bin Zayed University of Artificial Intelligence, UAE(穆罕默德·本·扎耶德人工智能大学,阿联酋)

AI总结 针对大型推理模型(LRM)推理过程中隐私泄露问题,提出通过指令跟随(IF)训练和分阶段解码策略(Staged Decoding)增强隐私保护,在IF和隐私基准上分别提升高达20.9和51.9个百分点。

详情
AI中文摘要

大型推理模型(LRM)产生的推理轨迹(RT)通常包含敏感信息。这些泄露的思维难以控制,且经常违反明确的隐私指令。由于RT可能通过提示注入攻击暴露,这对用户构成了直接的隐私风险。我们将此视为一个可控性问题:由于隐私指令本身就是指令,在RT内改进指令跟随(IF)为减少隐私泄露提供了直接途径。为此,我们引入了一个SFT数据集,教会模型在其推理过程中遵循通用指令,并提出了分阶段解码(Staged Decoding),一种简单的解码策略,通过使用独立的LoRA适配器解耦RT和答案生成,以最大化每个组件的IF。我们在两个系列(1.7B-14B参数)的六个模型上,在两个IF基准和两个隐私基准上评估了我们的方法。我们的方法带来了显著的改进,在IF上提升高达20.9分,在隐私基准上提升51.9个百分点,尽管由于推理性能与IF之间的权衡,这些改进可能以牺牲任务效用为代价。我们的结果表明,改进LRM中的IF可以显著增强隐私,为未来隐私感知的LRM提供了一个有前景的方向。我们的代码可在https://github.com/UKPLab/arxiv2026-controllable-reasoning-models获取。

英文摘要

Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult to control and frequently violate explicit privacy directives. Because RTs can be exposed through prompt injection attacks, this becomes a direct privacy risk to the user. We approach this as a controllability problem: since privacy directives are themselves instructions, improving instruction-following (IF) within the RT provides a direct path to reducing privacy leaks. To this end, we introduce an SFT dataset that teaches models to follow general instructions throughout their reasoning process, and propose Staged Decoding, a simple decoding strategy that decouples RT and answer generation using separate LoRA adapters to maximize IF of each component. We evaluate our approach on six models from two families (1.7B-14B parameters), across two IF benchmarks and two privacy benchmarks. Our method yields substantial improvements, with gains of up to 20.9 points in IF and 51.9 percentage points on privacy benchmarks, though these can come at the cost of task utility due to the trade-off between reasoning performance and IF. Our results show that improving IF in LRMs can significantly enhance privacy, suggesting a promising direction for future privacy-aware LRMs. Our code is available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models.

2602.10117 2026-06-01 cs.LG cs.AI 版本更新

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

盲点中的偏见:检测大语言模型未能提及的内容

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

发表机构 * Poseidon Research(Poseidon研究) University College London, United Kingdom(伦敦大学学院, 英国) Imperial College London, United Kingdom(伦敦帝国学院, 英国)

AI总结 提出全自动黑盒流水线,通过统计测试和思维链分析,自动检测大语言模型在任务中未明确表述的偏见。

Comments Published at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

大语言模型(LLMs)通常提供看似合理的思维链(CoT)推理痕迹,但可能隐藏内部偏见。我们称这些为未表述的偏见。因此,通过模型陈述的推理来监控模型是不可靠的,现有的偏见评估通常需要预定义类别和手工制作的数据集。在这项工作中,我们引入了一个全自动的黑盒流水线,用于检测特定任务的未表述偏见。给定一个任务数据集,该流水线使用LLM自动评分器生成候选偏见概念。然后,通过生成正面和负面变体,在逐渐增大的输入样本上测试每个概念,并应用统计技术进行多重测试和早期停止。如果一个概念在模型的CoT中未被引用为理由,但产生了统计上显著的性能差异,则将其标记为未表述的偏见。我们在三个决策任务(招聘、贷款审批和大学录取)上对七个LLM评估了我们的流水线。我们的技术自动发现了这些模型中以前未知的偏见(例如,西班牙语流利度、英语熟练度、写作正式度)。在同一运行中,该流水线还验证了先前工作手动识别的偏见(性别、种族、宗教、民族)。更广泛地说,我们提出的方法为自动、更高效和更广泛的特定任务未表述偏见发现提供了一条实用、可扩展的路径。

英文摘要

Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across seven LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic, more efficient, and broader task-specific unverbalized bias discovery.

2602.22971 2026-06-01 cs.AI 版本更新

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

SPM-Bench:面向扫描探针显微镜的大型语言模型基准测试

Peiyao Xiao, Xiaogang Li, Xinyi Gao, Chengliang Xu, Ben Wang, Zichao Chen, Zeyu Wang, Lin Qu, Bing Zhao, Hu Wei

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出SPM-Bench,一个全自动数据合成管道和严格评估指标(SIP-F1),用于测试LLMs在扫描探针显微镜领域的推理能力,并首次量化模型“个性”。

详情
AI中文摘要

随着LLMs在通用推理方面取得突破,它们在特定科学领域的熟练程度因数据污染、复杂性不足和过高的人力成本而在现有基准测试中暴露出明显差距。在此,我们提出SPM-Bench,一个专为扫描探针显微镜(SPM)设计的原创、博士级多模态基准测试。我们提出一个全自动数据合成管道,确保高权威性和低成本。通过采用锚点门控筛(AGS)技术,我们从2023年至2025年间发表的arXiv和期刊论文中高效提取高价值图像-文本对。通过混合云-本地架构(其中VLM仅返回空间坐标“llbox”以进行本地高保真裁剪),我们的管道在保持高数据集纯度的同时实现了极致的token节省。为了准确客观地评估LLMs的性能,我们引入了严格不完美惩罚F1(SIP-F1)分数。该指标不仅建立了严格的能力层级,而且首次量化了模型“个性”(保守型、激进型、赌徒型或明智型)。通过将这些结果与模型报告的置信度和感知难度相关联,我们揭示了当前AI在复杂物理场景中的真实推理边界。这些见解使SPM-Bench成为自动化科学数据合成的可推广范式。

英文摘要

As LLMs achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low-cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture where VLMs return only spatial coordinates "llbox" for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To accurately and objectively evaluate the performance of the LLMs, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model "personalities" (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.

2602.19171 2026-06-01 cs.GR cs.AI 版本更新

HistCAD: A Constraint-Aware Parametric History-Based CAD Representation, Dataset, and Benchmark with Industrial Complexity

HistCAD:一种约束感知的基于参数化历史CAD表示、数据集和具有工业复杂性的基准

Xintong Dong, Chuanyang Li, Peng Zheng, Chuqi Han, Jiaxin Jing, Hailong Shen, Yanzhi Song, Zhouwang Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出HistCAD表示标准、数据集和基准,通过显式约束记录草图、特征和操作,实现可编辑的参数化CAD序列生成与评估。

详情
AI中文摘要

参数化CAD序列是可重用的,因为尺寸和几何约束控制参数变化如何传播。现有的CAD生成数据集和基准强调重建保真度、执行有效性或静态形状相似性,而忽略了编辑下设计意图的保持。我们引入了HistCAD,一个用于可执行参数化CAD且具有显式约束的表示标准、数据集和基准。HistCAD定义了一种独立于CAD软件的中间语言,记录草图图元、约束、特征操作以及用于倒角和圆角等操作的3D点边界参考。该数据集包含170,236个可执行序列,与原生CAD模型、STEP文件、渲染视图和文本注释对齐,结合了学术规模与专业创作的工业复杂性。基于此表示,约束感知可编辑性基准应用参数编辑并报告编辑可达性、条件保留约束满足率和总体可编辑成功率,缩写为ER、cPCSR和OES;这些指标将无法达到有效编辑状态与无法保留所需约束区分开来。实验表明,显式约束对于编辑后保留设计意图至关重要,并且HistCAD支持从文本进行监督式CAD生成以及直接的大语言模型工作流。我们认为HistCAD将CAD生成从静态形状模仿重新定义为具有显式约束的可重用参数化序列的合成。

英文摘要

Parametric CAD sequences are reusable because dimensional and geometric constraints govern how parameter changes propagate. Existing CAD generation datasets and benchmarks emphasize reconstruction fidelity, execution validity, or static shape similarity, leaving preservation of design intent under edits largely unmeasured. We introduce HistCAD, a representation standard, dataset, and benchmark for executable parametric CAD with explicit constraints. HistCAD defines an intermediate language independent of CAD software, recording sketch primitives, constraints, feature operations, and 3D point boundary references for operations such as fillet and chamfer. The dataset contains 170,236 executable sequences aligned with native CAD models, STEP files, rendered views, and text annotations, combining academic scale with professionally authored industrial complexity. Building on this representation, the Constraint-Aware Editability Benchmark applies parameter edits and reports Edit Reachability, conditional preserved constraint satisfaction, and Overall Editable Success, abbreviated ER, cPCSR, and OES; these metrics separate failures to reach a valid edited state from failures to preserve required constraints. Experiments show that explicit constraints are essential for preserving design intent after edits, and that HistCAD supports supervised CAD generation from text and direct LLM workflows. We argue that HistCAD reframes CAD generation from static shape imitation to the synthesis of reusable parametric sequences with explicit constraints.

2602.08885 2026-06-01 cs.LG cs.AI cs.SC 版本更新

Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression

打破摊销神经符号回归中的简化瓶颈

Paul Saegert, Ullrich Köthe

发表机构 * Heidelberg University(海德堡大学)

AI总结 针对摊销符号回归中表达式简化速度慢的问题,提出基于规则的简化引擎SimpliPy,实现百倍加速,从而提升模型精度和可扩展性。

Comments main text: 8 pages, 7 figures; appendix: 12 pages, 11 figures; code available at https://github.com/psaegert/simplipy and https://github.com/psaegert/flash-ansr; v2: Fixed rendering artifact in Figure 7; v3: Fixed Figure 3 title and formula; v4: Fixed Eq (1), example in App. M, Fig 13; v5: ICML 2026 Camera-Ready Version

详情
AI中文摘要

符号回归旨在发现准确描述观测数据的可解释解析表达式。摊销符号回归有望比主流的遗传编程符号回归方法效率更高,但目前难以扩展到真实的科学复杂度。我们发现一个关键障碍是缺乏将等价表达式快速简化为简洁规范形式的方法。摊销符号回归已通过通用计算机代数系统(如SymPy)解决此问题,但其高计算成本严重限制了训练和推理速度。我们提出SimpliPy,一个基于规则的简化引擎,在相当质量下实现比SymPy快100倍的速度。这使摊销符号回归获得显著改进,包括扩展到更大的训练集、更高效地使用每个表达式的令牌预算,以及系统性地消除训练集中与测试等价表达式的污染。我们在Flash-ANSR框架中展示了这些优势,在FastSRB基准上比摊销基线(NeSymReS, E2E)获得更好的准确率。此外,其性能与最先进的直接优化方法(PySR)相当,同时在增加推理预算时恢复更简洁而非更复杂的表达式。

英文摘要

Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this with general-purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash-ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise rather than more complex expressions with increasing inference budget.

2602.17531 2026-06-01 cs.LG cs.AI 版本更新

Position: Evaluation of ECG Representations Must Be Fixed

Position: Evaluation of ECG Representations Must Be Fixed

Zachary Berger, Daniel Prakah-Asante, John Guttag, Collin M. Stultz

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Massachusetts General Hospital(麻省总医院)

AI总结 本文主张必须改进12导联心电图表示学习的基准测试实践,以确保进展可靠且符合临床目标,并提出了扩展评估范围、采用最佳实践以及将随机编码器作为基线等建议。

Comments Project website at https://ecgfix.csail.mit.edu/

详情
AI中文摘要

这篇立场论文认为,当前12导联心电图表示学习的基准测试实践必须加以改进,以确保进展可靠且与临床有意义的目标一致。该领域已基本集中于三个公共多标签基准(PTB-XL、CPSC2018、CSN),这些基准主要由心律失常和波形形态标签主导,尽管已知心电图编码了更广泛的临床信息。我们认为,下游评估应扩展到包括结构性心脏病评估和患者级预测,以及其他不断发展的心电图相关终点,作为相关的临床目标。接下来,我们概述了多标签、不平衡设置下的评估最佳实践,并表明当应用这些实践时,文献中关于哪些表示性能最佳的当前结论会发生变化。此外,我们展示了一个令人惊讶的结果:随机初始化的编码器在线性评估下与许多任务上的最先进预训练方法相匹配。这促使将随机编码器作为合理的基线模型。我们通过实证评估五种代表性心电图预训练方法在六种评估设置(三个标准基准、一个结构性心脏病数据集、血流动力学推断和患者预测)中的表现来证实我们的观察。

英文摘要

This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of five representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.

2602.13110 2026-06-01 cs.CL cs.AI 版本更新

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

SCOPE: 选择性保形优化的成对LLM评判

Sher Badshah, Ali Emami, Hassan Sajjad

发表机构 * Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada(达勒豪西大学计算机科学学院) Department of Computer Science, Emory University, Atlanta, GA, USA(埃默里大学计算机科学系)

AI总结 提出SCOPE框架,通过校准接受阈值控制非弃权判断的错误率,并引入双向偏好熵(BPE)提供无偏不确定性信号,实现可靠且高覆盖率的LLM成对评估。

Comments Accepted at ICML 2026. 23 pages (9 main plus appendix), 7 figures, 11 tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作成对评估中的可扩展评判者,但它们仍然容易受到校准偏差和偏见的影响。我们提出SCOPE(选择性保形优化成对评估),一个校准接受阈值的框架,使得在可交换性条件下,非弃权判断的错误率最多为用户指定的水平$α$。为了向SCOPE提供无偏的不确定性信号,我们引入了双向偏好熵(BPE),它在两个响应位置下查询评判者,并将顺序平均的偏好概率转换为基于熵的分数。在各种成对评判基准上,BPE在校准和区分能力方面优于标准置信度代理,而SCOPE始终满足目标风险界限(在$α=0.10$时,经验FDR约为0.097至0.099),并保持较高的覆盖率。与原始基线相比,在相同风险约束下,SCOPE接受的判断数量最多增加2.4倍,表明BPE能够实现可靠且高覆盖率的基于LLM的评估。

英文摘要

Large language models (LLMs) are increasingly used as scalable judges in pairwise evaluation, but they remain prone to miscalibration and biases. We propose SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework that calibrates an acceptance threshold so that, under exchangeability, the error rate among non-abstained judgments is at most a user-specified level $α$. To supply SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions and converts the order-averaged preference probability into an entropy-based score. Across various pairwise judging benchmarks, BPE outperforms standard confidence proxies in calibration and discrimination, while SCOPE consistently satisfies the target risk bound (empirical FDR $\approx 0.097$ to $0.099$ at $α= 0.10$) and retains substantial coverage. Compared to vanilla baselines, SCOPE accepts up to $2.4\times$ more judgments under the same risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.

2602.13812 2026-06-01 cs.DB cs.AI cs.MA 版本更新

DTBench: A Synthetic Benchmark for Document-to-Table Extraction

DTBench:文档到表格提取的合成基准

Yuxiang Guo, Zhuoran Du, Nan Tang, Kezheng Tang, Congcong Ge, Yunjun Gao

发表机构 * Zhejiang University(浙江大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) The Hong Kong University of Science(香港科学与技术大学)

AI总结 提出DTBench合成基准,通过反向Table2Doc范式和多智能体合成流程生成文档,系统评估LLM在文档到表格提取中的多种能力。

Comments KDD26

详情
AI中文摘要

文档到表格(Doc2Table)提取是在目标模式从非结构化文档中导出结构化表格,实现可靠且可验证的基于SQL的数据分析。尽管大型语言模型(LLM)在灵活信息提取方面显示出潜力,但其生成精确结构化表格的能力仍未被充分理解,特别是对于需要推理和冲突解决等复杂能力的间接提取。现有基准既没有明确区分也没有全面覆盖Doc2Table提取所需的各种能力。我们认为,一个能力感知的基准对于系统评估至关重要。然而,使用人工标注的文档-表格对构建此类基准成本高、难以扩展且能力覆盖有限。为解决此问题,我们采用反向Table2Doc范式,并设计多智能体合成工作流,从真实表格生成文档。基于此方法,我们提出DTBench,一个合成基准,采用提出的Doc2Table能力的两级分类法,涵盖5个主要类别和13个子类别。我们在DTBench上评估了几种主流LLM,展示了模型间的显著性能差距,以及在推理、忠实性和冲突解决方面的持续挑战。DTBench为数据生成和评估提供了全面的测试平台,促进了Doc2Table提取的未来研究。该基准公开于https://github.com/ZJU-DAILY/DTBench。

英文摘要

Document-to-table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and verifiable SQL-based data analytics. Although large language models (LLMs) have shown promise in flexible information extraction, their ability to produce precisely structured tables remains insufficiently understood, particularly for indirect extraction that requires complex capabilities such as reasoning and conflict resolution. Existing benchmarks neither explicitly distinguish nor comprehensively cover the diverse capabilities required in Doc2Table extraction. We argue that a capability-aware benchmark is essential for systematic evaluation. However, constructing such benchmarks using human-annotated document-table pairs is costly, difficult to scale, and limited in capability coverage. To address this, we adopt a reverse Table2Doc paradigm and design a multi-agent synthesis workflow to generate documents from ground-truth tables. Based on this approach, we present DTBench, a synthetic benchmark that adopts a proposed two-level taxonomy of Doc2Table capabilities, covering 5 major categories and 13 subcategories. We evaluate several mainstream LLMs on DTBench, and demonstrate substantial performance gaps across models, as well as persistent challenges in reasoning, faithfulness, and conflict resolution. DTBench provides a comprehensive testbed for data generation and evaluation, facilitating future research on Doc2Table extraction. The benchmark is publicly available at https://github.com/ZJU-DAILY/DTBench.

2602.15293 2026-06-01 cs.LG cs.AI cs.CL stat.ML 版本更新

The Information Geometry of Softmax: Probing and Steering

Softmax的信息几何:探测与引导

Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch

发表机构 * University of Chicago(芝加哥大学)

AI总结 本文从信息几何角度研究AI系统如何将语义结构编码到表示空间的几何结构中,并提出一种利用线性探针鲁棒引导表示以展现特定概念的“双重引导”方法。

Comments Code is available at https://github.com/KihoPark/dual-steering

详情
Journal ref
In Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026
AI中文摘要

本文关注AI系统如何将语义结构编码到其表示空间的几何结构中的问题。动机观察是,这些表示空间的自然几何应反映模型使用表示产生行为的方式。我们聚焦于定义softmax分布的重要特例。在这种情况下,我们认为自然几何是信息几何。我们的重点是信息几何在语义编码和线性表示假设中的作用。作为一个说明性应用,我们开发了“双重引导”,一种利用线性探针鲁棒地引导表示以展现特定概念的方法。我们证明双重引导在最小化对非目标概念改变的同时,最优地修改目标概念。实验上,我们发现双重引导增强了概念操控的可控性和稳定性。

英文摘要

This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.

2602.11137 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Weight Decay Improves Language Model Plasticity

权重衰减提升语言模型可塑性

Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade

发表机构 * Broad Institute, Schmidt Center(Broad研究所,Schmidt中心) University of Tübingen, Tübingen AI Center(图宾根大学,图宾根人工智能中心) Harvard University(哈佛大学)

AI总结 本文通过系统实验表明,预训练中较大的权重衰减能提高模型的可塑性,使微调后下游性能更优,并揭示了其促进线性可分表示、正则化注意力矩阵和减少过拟合的机制。

详情
AI中文摘要

大型语言模型通常分两个主要阶段训练:预训练以产生基础模型,然后进一步训练以提高下游性能。然而,超参数优化和缩放定律主要从基础模型验证损失的角度研究,忽略了一个关键的模型属性:下游适应性。在这项工作中,我们从模型可塑性的角度研究预训练,即基础模型在额外训练后成功适应下游任务的能力。我们关注权重衰减的作用,这是预训练中的一个关键正则化参数,并通过系统实验表明,较大的权重衰减提高了预训练模型的可塑性,导致微调后下游性能提升更大。这种效应可能导致反直觉的权衡,即预训练后表现较差的基础模型在进一步训练后可能表现更好。对权重衰减对模型行为的机制影响的进一步研究表明,它鼓励线性可分的表示,正则化注意力矩阵,并减少对训练数据的过拟合。这些发现共同强调了预训练模型可塑性的重要性,使用交叉熵损失作为超参数优化的唯一指标的局限性,以及单个优化超参数在塑造模型行为中的多方面作用。

英文摘要

Large language models are typically trained in two broad phases: pretraining to produce a base model, followed by further training to improve downstream performance. However, hyperparameter optimization and scaling laws are studied primarily from the perspective of the base model's validation loss, overlooking a crucial model property: downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks upon additional training. We focus on the role of weight decay, a key regularization parameter during pretraining, and show through systematic experiments that larger weight decay increases the plasticity of the pretrained model, resulting in greater performance gains downstream after fine-tuning. This effect can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after further training. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. Together, these findings highlight the importance of pretrained model plasticity, the limits of using cross-entropy loss as the sole metric for hyperparameter optimization, and the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.

2602.10324 2026-06-01 cs.AI cs.CL cs.CY cs.HC 版本更新

Discovering Differences in Strategic Behavior Between Humans and LLMs

发现人类与LLM在战略行为上的差异

Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro

发表机构 * Department of Computer Science, University of Texas at Austin. Work performed as a student researcher at Google DeepMind(德克萨斯大学计算机科学系。作为谷歌DeepMind的学生研究员进行的工作) Google DeepMind(谷歌DeepMind)

AI总结 使用AlphaEvolve程序发现工具,从数据中直接发现可解释的人类和LLM行为模型,揭示在迭代石头剪刀布中前沿LLM比人类具有更深层次的战略行为。

Comments Accepted to ICML 2026

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地部署在社交和战略场景中,了解它们的行为在何处以及为何与人类行为产生差异变得至关重要。虽然行为博弈论(BGT)为分析行为提供了框架,但现有模型并未完全捕捉到人类或像LLM这样的黑箱非人类代理的独特行为。我们采用AlphaEvolve这一前沿程序发现工具,直接从数据中发现可解释的人类和LLM行为模型,从而能够开放式地发现驱动人类和LLM行为的结构因素。我们对迭代石头剪刀布的分析表明,前沿LLM可能比人类具有更深层次的战略行为。这些结果为理解驱动人类和LLM在战略互动中行为差异的结构性差异奠定了基础。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.

2602.09276 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Effective Reasoning Chains Reduce Intrinsic Dimensionality

有效推理链降低内在维度

Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw

发表机构 * Google DeepMind(谷歌深Mind)

AI总结 本文通过内在维度量化推理链有效性,发现有效推理策略能降低任务内在维度,并在GSM8K上验证其与泛化性能的强负相关。

Comments ICML (spotlight) camera-ready; 22 pages, 3 figures

详情
AI中文摘要

思维链推理及其变体显著提升了语言模型在复杂推理任务上的性能,但不同策略促进泛化的精确机制仍不明确。虽然当前解释常指向增加测试时计算或结构引导,但建立这些因素与泛化之间一致、可量化的联系仍具挑战。本文中,我们将内在维度识别为表征推理链有效性的定量度量。内在维度量化了在给定任务上达到特定准确率阈值所需的最小模型维度数。通过固定模型架构并改变不同推理策略下的任务表述,我们证明有效推理策略持续降低任务的内在维度。在GSM8K上使用Gemma-3 1B和4B验证这一点,我们观察到推理策略的内在维度与其在分布内和分布外数据上的泛化性能之间存在强负相关。我们的发现表明,有效推理链通过使用更少参数更好地压缩任务来促进学习,为分析推理过程提供了新的定量度量。

英文摘要

Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.

2602.08964 2026-06-01 cs.LG cs.AI cs.CL cs.CY 版本更新

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

语言模型智能体中目标导向性的行为与表征评估

Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

发表机构 * University of Pennsylvania(宾夕法尼亚大学) New York University(纽约大学) Indiana University, Bloomington(印第安纳大学,布卢明顿) Northeastern University(东北大学) University College London(伦敦大学学院)

AI总结 本文提出一种结合行为评估与内部表征可解释性分析的目标导向性评估框架,并以LLM智能体在2D网格世界中的导航为例,验证了其行为与表征的一致性。

Comments Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

理解智能体的目标有助于解释和预测其行为,但目前尚无可靠的方法来归因智能系统的目标。我们提出一个评估目标导向性的框架,该框架将行为评估与基于可解释性的模型内部表征分析相结合。作为案例研究,我们考察了一个在二维网格世界中导航至目标状态的LLM智能体。在行为上,我们评估智能体在不同网格大小、障碍物密度和目标结构下的最优策略,发现其性能随任务难度扩展,同时对保持难度的变换和多目标结构具有鲁棒性。然后,我们使用探测方法解码环境及多步行动计划的内部表征。我们发现,LLM智能体非线性地编码了一个粗略的空间地图,保留了关于其位置和目标位置的任务相关近似线索;其行动与这些内部表征大致一致;推理过程重新组织这些表征,从空间线索转向即时行动选择。我们的研究结果支持这样的观点:除了行为评估之外,还需要内省检查来表征智能体如何表示和追求其目标。

英文摘要

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world towards a goal state. Behaviourally, we evaluate the agent against optimal policies across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and multi-goal structures. We then use probing methods to decode internal representations of the environment and multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from spatial cues towards immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

2602.08267 2026-06-01 cs.LG cs.AI 版本更新

Inverting Data Transformations via Diffusion Sampling

通过扩散采样逆变换数据变换

Jinwoo Kim, Sékou-Oumar Kaba, Jiyun Park, Seunghoon Hong, Siamak Ravanbakhsh

发表机构 * Mila - Quebec Artificial Intelligence Institute, Montr\'eal, Canada School of Computer Science, McGill University, Montr\'eal, Canada

AI总结 提出一种在一般李群上通过扩散采样逆变换未知变换的方法,用于恢复原始数据分布,并在测试时等变性应用中提升预训练神经网络的鲁棒性。

Comments 31 pages, 11 figures

详情
AI中文摘要

我们研究了一般李群上的变换逆问题:一个数据被未知群元素变换,目标是恢复一个逆变换,将其映射回原始数据分布。这种未知变换在机器学习和科学建模中广泛出现,会显著扭曲观测数据。我们采用概率视角,将变换的后验建模为玻尔兹曼分布,由数据空间上的能量函数定义。为了从该后验中采样,我们引入了一个李群上的扩散过程,该过程保持所有更新在流形上,并且仅需在关联的李代数中进行计算。我们的方法,即变换逆能量扩散(TIED),依赖于一个新的平凡化目标分数恒等式,能够高效地对变换后验进行基于分数的采样。作为一个关键应用,我们专注于测试时等变性,其目标是提高预训练神经网络对输入变换的鲁棒性。在图像单应性和PDE对称性上的实验表明,TIED可以在测试时将变换后的输入恢复到训练分布,表现出优于强规范化和采样基线的性能。代码可在 https://github.com/jw9730/tied 获取。

英文摘要

We study the problem of transformation inversion on general Lie groups: a datum is transformed by an unknown group element, and the goal is to recover an inverse transformation that maps it back to the original data distribution. Such unknown transformations arise widely in machine learning and scientific modeling, where they can significantly distort observations. We take a probabilistic view and model the posterior over transformations as a Boltzmann distribution defined by an energy function on the data space. To sample from this posterior, we introduce a diffusion process on Lie groups that keeps all updates on-manifold and only requires computations in the associated Lie algebra. Our method, Transformation-Inverting Energy Diffusion (TIED), relies on a new trivialized target-score identity that enables efficient score-based sampling of the transformation posterior. As a key application, we focus on test-time equivariance, where the objective is to improve the robustness of pretrained neural networks to input transformations. Experiments on image homographies and PDE symmetries demonstrate that TIED can restore transformed inputs to the training distribution at test time, showing improved performance over strong canonicalization and sampling baselines. Code is available at https://github.com/jw9730/tied.

2602.01011 2026-06-01 cs.MA cs.AI 版本更新

Multi-Agent Teams Hold Experts Back

多智能体团队阻碍专家发挥

Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, James Zou

发表机构 * stanford(斯坦福大学) apple(苹果公司) goizueta business school, emory(埃默里大学戈izueta商学院)

AI总结 研究自组织多智能体LLM团队在无约束协调下无法匹配专家性能,发现整合妥协行为是主要瓶颈,导致性能损失高达41.1%。

Comments Accepted at the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

多智能体LLM系统越来越多地被部署为自主协作者,其中智能体自由交互而非执行固定的、预先指定的工作流程。在这种设置中,有效的协调无法完全预先设计,而必须通过交互涌现。然而,大多数先前的工作通过固定角色、工作流程或聚合规则来强制协调,留下了在协调不受约束时自组织团队表现如何的问题。借鉴组织心理学,我们研究了自组织LLM团队是否能够实现强协同,即团队表现匹配或超过最佳个体成员。在受人类启发的和前沿的ML基准测试中,我们发现——与人类团队不同——LLM团队始终无法匹配其专家智能体的表现,即使明确告知谁是专家,在ML基准测试中性能损失高达41.1%。分解这一失败,我们表明专家利用而非识别是主要瓶颈。对话分析揭示了一种整合妥协的倾向——平均专家和非专家观点而非适当加权专业知识——这种倾向随团队规模增加而增加,并与性能负相关。有趣的是,这种寻求共识的行为提高了对对抗性智能体的鲁棒性,表明在一致性和有效利用专业知识之间存在权衡。我们的发现揭示了自组织多智能体团队在利用成员集体专业知识方面的显著差距。

英文摘要

Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 41.1% on ML benchmarks. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.

2506.00175 2026-06-01 cs.LG cs.AI 版本更新

Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

谁获得功劳或责备?在现代AI系统中分配责任

Shichang Zhang, Hongzhe Du, Jiaqi W. Ma, Himabindu Lakkaraju

发表机构 * Harvard University, Cambridge, MA, USA(哈佛大学) University of California, Los Angeles, Los Angeles, CA, USA(加州大学洛杉矶分校) University of Illinois Urbana-Champaign, Urbana-Champaign, IL, USA(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出一个归因框架,通过反事实问题量化模型开发各阶段(预训练、微调等)对最终行为的影响,并设计无需重训练的估计器,成功识别并移除多阶段任务中的虚假关联。

详情
AI中文摘要

现代AI系统通常通过多个阶段开发——预训练、微调轮次以及后续的适应或对齐,每个阶段都建立在先前阶段之上并以不同方式更新模型。这引发了一个关键的责任问题:当部署的模型成功或失败时,哪个阶段负责,以及负责到什么程度?我们提出了责任归因问题,用于将模型行为追溯到模型开发过程的特定阶段。为了解决这一挑战,我们提出了一个通用框架,回答关于阶段效应的反事实问题:如果特定阶段的更新没有发生,模型的行为会如何改变?在此框架内,我们引入了无需重新训练模型即可高效量化阶段效应的估计器,考虑了数据和模型优化动态的关键方面,包括学习率调度、动量和权重衰减。我们证明了我们的方法成功量化了每个阶段对模型行为的责任。基于归因结果,我们的方法可以识别并移除在图像分类和文本毒性检测任务中跨多个阶段开发时学到的虚假相关性。我们的方法为模型分析提供了实用工具,并代表了向更负责任的AI发展迈出的重要一步。

英文摘要

Modern AI systems are typically developed through multiple stages-pretraining, fine-tuning rounds, and subsequent adaptation or alignment, where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model's behavior have changed if the updates from a particular stage had not occurred? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, accounting for both the data and key aspects of model optimization dynamics, including learning rate schedules, momentum, and weight decay. We demonstrate that our approach successfully quantifies the accountability of each stage to the model's behavior. Based on the attribution results, our method can identify and remove spurious correlations learned during image classification and text toxicity detection tasks that were developed across multiple stages. Our approach provides a practical tool for model analysis and represents a significant step toward more accountable AI development.

2602.07928 2026-06-01 cs.LG cs.AI 版本更新

A Kinetic Energy Perspective of Flow Matching

流匹配的动能视角

Ziyun Li, Huancheng Hu, Soon Hoe Lim, Xuyu Li, Fei Gao, Enmao Diao, Zezhen Ding, Michalis Vazirgiannis, Henrik Bostrom

发表机构 * KTH Royal Institute of Technology(皇家理工学院) Nordita, Nordic Institute for Theoretical Physics(北欧理论物理研究所) Hasso Plattner Institute, University of Potsdam(波茨坦大学哈索 Plattner 研究院) Trinity College Dublin(都柏林三一学院) Hangzhou Institute of Technology, Xidian University(西安电子科技大学杭州研究院) The Hong Kong University of Science(香港科学大学) Mohamed bin Zayed University of Artificial Intelligence(莫莫丁·本·扎耶德人工智能大学)

AI总结 本文引入动能路径能量(KPE)作为流匹配生成模型的诊断工具,发现其与语义保真度和数据稀疏性相关,并基于此提出无训练的动能轨迹塑形(KTS)策略以改善生成质量。

Comments ICML 2026 Spotlight

详情
AI中文摘要

基于流的生成模型可以通过物理视角来审视:采样通过积分学习到的速度场将粒子从噪声传输到数据,每个样本对应一条具有自身动力学努力的轨迹。受经典力学启发,我们引入了动能路径能量(KPE),这是一种类似作用量的每样本诊断指标,用于测量沿常微分方程(ODE)轨迹累积的动能努力。实验上,KPE表现出两种稳健的对应关系:{i} 较高的KPE预测更强的语义保真度;{ii} 高KPE轨迹落在稀疏表示区域。我们进一步提供了将轨迹能量与数据稀疏性联系起来的理论保证。矛盾的是,这种相关性是非单调的。在足够高的能量下,生成可能退化为记忆。利用经验流匹配的闭式公式,我们表明极端能量驱动轨迹接近训练样本的副本。这产生了金发姑娘原则,并激发了动能轨迹塑形(KTS),一种无训练的两阶段推理策略,该策略增强早期运动并强制执行后期软着陆,从而减少记忆并提高基准任务上的生成质量。

英文摘要

Flow-based generative models can be viewed through a physics lens: sampling transports a particle from noise to data by integrating a learned velocity field, and each sample corresponds to a trajectory with its own dynamical effort. Motivated by classical mechanics, we introduce Kinetic Path Energy (KPE), an action-like, per-sample diagnostic that measures the accumulated kinetic effort along an ordinary differential equation (ODE) trajectory. Empirically, KPE exhibits two robust correspondences: {i} higher KPE predicts stronger semantic fidelity; {ii} high-KPE trajectories land in sparse representation regions. We further provide theoretical guarantees linking trajectory energy to data sparsity. Paradoxically, this correlation is non-monotonic. At sufficiently high energy, generation can degenerate into memorization. Leveraging the closed-form formula of empirical flow matching, we show that extreme energies drive trajectories toward near-copies of training examples. This yields a Goldilocks principle and motivates Kinetic Trajectory Shaping (KTS), a training-free two-phase inference strategy that boosts early motion and enforces a late-time soft landing, reducing memorization and improving generation quality across benchmark tasks.

2602.07905 2026-06-01 cs.AI 版本更新

MedCoG: Maximizing LLM Inference Density in Medical Reasoning via Meta-Cognitive Regulation

MedCoG:通过元认知调节最大化医学推理中的LLM推理密度

Yu Zhao, Hao Guan, Yongcheng Jing, Ying Zhang, Dacheng Tao

发表机构 * College of Computer Science, VCIP, DISSec Center, Nankai University, China 300350(南开大学计算机学院、VCIP、DISSec中心、中国300350) Generative AI Lab, College of Computing and Data Science(生成式人工智能实验室、计算与数据科学学院) Nanyang Technological University, Singapore 639798(南洋理工大学,新加坡639798)

AI总结 提出MedCoG框架,利用元认知评估动态调节知识使用,以缓解推理扩展定律下的收益递减,提升推理效率与准确性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLM)在复杂医学推理中展现出强大潜力,但在推理扩展定律下面临收益递减。现有研究通过添加各种知识类型增强LLM,但额外成本转化为准确性的效果尚不明确。本文探索LLM的元认知(即对其自身认知状态的自我评估)如何调节推理过程。具体而言,我们提出MedCoG,一种带有知识图谱的医学元认知智能体,其中对任务复杂性、熟悉度和知识密度的元认知评估动态调节程序性、情景性和事实性知识的利用。这种以LLM为中心的按需推理旨在通过(1)避免无差别扩展以降低成本,(2)过滤干扰知识以提高准确性,来缓解扩展定律下的收益递减。为验证这一点,我们经验性地刻画了扩展曲线,并引入推理密度来量化推理效率。实验表明MedCoG在五个医学基准困难集上的有效性和高效性,实现了6.2倍的推理密度。此外,Oracle研究凸显了元认知调节的巨大潜力。

英文摘要

Large Language Models (LLMs) have shown strong potential in complex medical reasoning yet face diminishing gains under inference scaling laws. While existing studies augment LLMs with various knowledge types, it remains unclear how effectively the additional costs translate into accuracy. In this paper, we explore how meta-cognition of LLMs, i.e., their self-assessment of their own cognitive states, can regulate the reasoning process. Specifically, we propose MedCoG, a Medical Meta-Cognition Agent with Knowledge Graph, where the meta-cognitive assessments of task complexity, familiarity, and knowledge density dynamically regulate utilization of procedural, episodic, and factual knowledge. The LLM-centric on-demand reasoning aims to mitigate the diminishing returns under scaling law by (1) reducing costs via avoiding indiscriminate scaling, (2) improving accuracy via filtering out distractive knowledge. To validate this, we empirically characterize the scaling curve and introduce inference density to quantify inference efficiency. Experiments demonstrate the effectiveness and efficiency of MedCoG on five hard sets of medical benchmarks, yielding 6.2x inference density. Furthermore, the Oracle study highlights the significant potential of meta-cognitive regulation.

2602.07457 2026-06-01 cs.SE cs.AI cs.CL 版本更新

Pull Requests as a Training Signal for Repo-Level Code Editing

拉取请求作为仓库级代码编辑的训练信号

Qinglin Zhu, Tianyu Chen, Shuai Lu, Lei Ji, Runcong Zhao, Murong Ma, Xiangxiang Dai, Yulan He, Lin Gui, Peng cheng, Yeyun Gong

发表机构 * King's College London, UK(伦敦国王学院) Chinese University of Hong Kong, HK(香港中文大学) National University of Singapore, SG(新加坡国立大学) Microsoft Research Asia, CN(微软亚洲研究院) The Alan Turing Institute, UK(艾伦·图灵研究所)

AI总结 提出Clean-PR方法,利用真实GitHub拉取请求作为训练信号,通过重建和验证转换为搜索/替换编辑块,结合无代理对齐的监督微调,在SWE-bench上显著提升仓库级代码编辑性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

仓库级代码编辑要求模型理解复杂依赖关系并在大型代码库中执行精确的多文件修改。虽然最近在SWE-bench上的进展严重依赖于复杂的代理脚手架,但尚不清楚这种能力有多少可以通过高质量的训练信号内化。为了解决这个问题,我们提出了Clean Pull Request (Clean-PR),一种利用真实世界GitHub拉取请求作为仓库级编辑训练信号的中间训练范式。我们引入了一个可扩展的流水线,通过重建和验证将嘈杂的拉取请求差异转换为搜索/替换编辑块,从而得到最大的公开可用语料库,包含200万个拉取请求,涵盖12种编程语言。使用这个训练信号,我们进行中间训练阶段,然后进行无代理对齐的监督微调过程,并带有错误驱动的数据增强。在SWE-bench上,我们的模型显著优于指令微调基线,在SWE-bench Lite上实现了13.6%的绝对改进,在SWE-bench Verified上实现了12.3%的绝对改进。这些结果表明,仓库级代码理解和编辑能力可以在简化的、无代理的协议下有效地内化到模型权重中,而无需依赖繁重的推理时脚手架。

英文摘要

Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, resulting in the largest publicly available corpus of 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by an agentless-aligned supervised fine-tuning process with error-driven data augmentation. On SWE-bench, our model significantly outperforms the instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified. These results demonstrate that repository-level code understanding and editing capabilities can be effectively internalised into model weights under a simplified, agentless protocol, without relying on heavy inference-time scaffolding.

2510.10544 2026-06-01 cs.LG cs.AI stat.ML 版本更新

PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

PAC-Bayesian 强化学习训练可泛化策略

Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata

发表机构 * Université Claude Bernard Lyon 1, LIRIS, UMR CNRS 5205, France(里尔一大学,LIRIS,法国CNRS 5205)

AI总结 提出一种新的 PAC-Bayesian 泛化界,通过链的混合时间显式考虑数据中的马尔可夫依赖性,并基于此设计 PB-SAC 算法以优化该界指导探索,在连续控制任务中提供有意义的置信度证书且保持竞争性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

详情
AI中文摘要

我们推导了一个新的用于强化学习的 PAC-Bayesian 泛化界,该界通过链的混合时间显式考虑了数据中的马尔可夫依赖性。这有助于克服在强化学习中获取泛化保证的挑战,因为数据的序列性质破坏了经典界所依赖的独立性假设。新界为现代离策略算法(如 Soft Actor-Critic)提供了非空泛证书。我们通过 PB-SAC 展示了该界的实际效用,这是一种在训练过程中优化该界以指导探索的新算法。在多个连续控制任务上的实验表明,所提出的方法在保持竞争性能的同时提供了有意义的置信度证书。

英文摘要

We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. The new bound provides non-vacuous certificates for modern off-policy algorithms such as Soft Actor-Critic. We demonstrate the practical utility of the bound through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across several continuous control tasks show that the proposed approach provides meaningful confidence certificates while maintaining competitive performance.

2602.06161 2026-06-01 cs.CL cs.AI 版本更新

Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

停止翻转:面向快速可撤销扩散解码的上下文保持验证

Yanzheng Xiang, Lan Wei, Yizhen Yao, Qinglin Zhu, Hanqi Yan, Chen Jin, Philip Alexander Teare, Dandan Zhang, Lin Gui, Amrutha Saseendran, Yulan He

发表机构 * King's College London, UK Centre for AI, Data Science \& Artificial Intelligence, BioPharmaceuticals R\&D, AstraZeneca, UK The Alan Turing Institute, UK Imperial College London, UK

AI总结 针对并行扩散解码中因激进并行导致的翻转振荡问题,提出COVER方法,通过KV缓存覆盖和稳定性感知评分实现单次前向传递中的留一验证与稳定草稿,减少不必要修订并加速解码。

详情
AI中文摘要

并行扩散解码可以通过每步解掩多个令牌来加速扩散语言模型推理,但激进的并行常常损害质量。可撤销解码通过重新检查早期令牌来缓解这一问题,然而我们观察到现有的验证方案频繁触发翻转振荡,即令牌被重新掩码后又原样恢复。这种行为以两种方式减慢推理:重新掩码已验证位置削弱了并行草稿的条件上下文,且重复的重新掩码循环消耗修订预算而进展甚微。我们提出COVER(用于高效修订的缓存覆盖验证),它在单次前向传递中执行留一验证和稳定草稿。COVER通过KV缓存覆盖构建两种注意力视图:选定的种子被掩码用于验证,而其缓存的键值状态被注入到所有其他查询中以保留上下文信息,同时通过闭式对角校正防止种子位置的自泄漏。COVER进一步使用稳定性感知评分对种子进行优先级排序,该评分平衡不确定性、下游影响和缓存漂移,并自适应调整每步验证的种子数量。在多个基准测试中,COVER显著减少不必要的修订,实现更快的解码,同时保持输出质量。

英文摘要

Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key value states are injected for all other queries to preserve contextual information, with a closed form diagonal correction preventing self leakage at the seed positions. COVER further prioritises seeds using a stability aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.

2510.00845 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

作为统计估计的机械可解释性:方差分析

Maxime Méloux, François Portet, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔INP、实验室LIG)

AI总结 本文从统计估计角度审视机械可解释性中的电路发现,揭示因果中介分析中单输入得分的固有方差导致电路不稳定,并系统分解方差来源,倡导更严谨的实践。

详情
AI中文摘要

机械可解释性(MI)旨在通过识别功能子网络来逆向工程模型行为。然而,这些发现的科学有效性取决于其稳定性。在这项工作中,我们认为电路发现不是一个独立的任务,而是一个建立在因果中介分析(CMA)基础上的统计估计问题。我们揭示了这一基础层的根本不稳定性:精确的单输入CMA得分表现出高固有方差,这意味着组件的因果效应是一个易变的随机变量,而非固定属性。然后,我们证明电路发现流程继承了这一方差并进一步放大。快速近似方法,如边缘属性修补及其后续方法,引入了额外的估计噪声,而在数据集上聚合这些噪声得分会导致脆弱的结构估计。因此,输入数据或超参数的小扰动会产生截然不同的电路。我们系统地分解了这些方差来源,并倡导更严格的MI实践,优先考虑统计稳健性和稳定性指标的常规报告。

英文摘要

Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

2602.01553 2026-06-01 cs.LG cs.AI 版本更新

Plain Transformers are Surprisingly Powerful Link Predictors

普通Transformer竟是惊人的链接预测器

Quang Truong, Yu Song, Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang

发表机构 * Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA.(计算机科学与工程系,密歇根州立大学,东兰辛,MI,美国。) Department of Computer Science(计算机科学系) Engineering, Michigan State University, East Lansing, MI, USA.(工程,密歇根州立大学,东兰辛,MI,美国。)

AI总结 提出PENCIL,一种仅编码器的普通Transformer,通过采样局部子图的注意力机制替代手工先验,在保持标准Transformer可扩展性的同时,隐式泛化多种启发式方法,实现高效且参数经济的链接预测。

Comments ICML'26

详情
AI中文摘要

链接预测是图机器学习中的核心挑战,需要能够捕捉丰富且复杂的拓扑依赖关系的模型。虽然图神经网络(GNN)是标准解决方案,但最先进的流程通常依赖于显式结构启发式或内存密集型的节点嵌入——这些方法难以泛化或扩展到大规模图。新兴的图Transformer(GT)提供了一种潜在的替代方案,但由于复杂的结构编码,它们通常会产生显著的开销,阻碍了其在大规模链接预测中的应用。我们通过PENCIL挑战这些复杂的范式,这是一种仅编码器的普通Transformer,用对采样局部子图的注意力替代手工先验,保留了标准Transformer的可扩展性和硬件效率。通过实验和理论分析,我们表明PENCIL比GNN提取了更丰富的结构信号,隐式泛化了一类广泛的启发式和基于子图的表达能力。实验上,PENCIL优于启发式信息增强的GNN,并且比基于ID嵌入的替代方案参数效率高得多,同时在各种基准测试中保持竞争力——即使没有节点特征。我们的结果挑战了当前对复杂工程技术的依赖,表明简单的设计选择可能足以实现相同的能力。我们的代码公开在 https://github.com/quang-truong/pencil。

英文摘要

Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, state-of-the-art pipelines often rely on explicit structural heuristics or memory-intensive node embeddings -- approaches that struggle to generalize or scale to massive graphs. Emerging Graph Transformers (GTs) offer a potential alternative but often incur significant overhead due to complex structural encodings, hindering their applications to large-scale link prediction. We challenge these sophisticated paradigms with PENCIL, an encoder-only plain Transformer that replaces hand-crafted priors with attention over sampled local subgraphs, retaining the scalability and hardware efficiency of standard Transformers. Through experimental and theoretical analysis, we show that PENCIL extracts richer structural signals than GNNs, implicitly generalizing a broad class of heuristics and subgraph-based expressivity. Empirically, PENCIL outperforms heuristic-informed GNNs and is far more parameter-efficient than ID-embedding--based alternatives, while remaining competitive across diverse benchmarks -- even without node features. Our results challenge the prevailing reliance on complex engineering techniques, demonstrating that simple design choices are potentially sufficient to achieve the same capabilities. Our code is publicly available at https://github.com/quang-truong/pencil.

2512.19673 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

自底向上策略优化:你的语言模型策略内部隐藏着内部策略

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Chinese Academy of Sciences(中国科学院大学) Tencent AI Lab(腾讯AI实验室)

AI总结 本文通过分解Transformer残差流中的内部层策略和内部模块策略,提出自底向上策略优化(BuPO)方法,通过早期优化内部层来重建LLM的推理基础,在复杂推理基准上验证了有效性。

Comments Preprint. Our code is available at https://github.com/Trae1ounG/BuPO

详情
AI中文摘要

现有的强化学习方法将大型语言模型(LLM)视为统一策略,忽略了其内部机制。在本文中,我们通过Transformer的残差流将基于LLM的策略分解为内部层策略和内部模块策略。我们对内部策略的熵分析揭示了不同的模式:(1)普遍地,内部策略从早期层的高熵探索演变为顶层层的确定性精炼;(2)Qwen表现出显式的渐进推理结构,与Llama中的突然收敛形成对比。此外,我们发现优化内部层会引发特征精炼,迫使较低层早期捕获高层推理表示。受这些发现启发,我们提出了自底向上策略优化(BuPO),一种新的强化学习范式,通过在早期阶段优化内部层来自底向上重建LLM的推理基础。在复杂推理基准上的大量实验证明了BuPO的有效性。

英文摘要

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via the Transformer's residual stream. Our entropy analysis of internal policy reveals distinct patterns: (1) universally, internal policies evolve from high-entropy exploration in early layers to deterministic refinement in the top layers; and (2) Qwen exhibits an explicit progressive reasoning structure, contrasting with the abrupt convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's reasoning foundation from the bottom up by optimizing internal layers in early stages. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of BuPO.

2602.01399 2026-06-01 cs.LG cs.AI stat.ML 版本更新

An Odd Estimator for Shapley Values

Shapley 值的一个奇估计器

Fabian Fumagalli, Landon Butler, Justin Singh Kang, Kannan Ramchandran, R. Teal Witter

发表机构 * Department of Statistics, LMU Munich(LMU慕尼黑统计系) Department of Electrical Engineering and Computer Science, UC Berkeley(伯克利电子工程与计算机科学系) Mathematical Sciences Department, Claremont McKenna College(克莱门茨麦肯纳学院数学科学系)

AI总结 本文证明 Shapley 值仅依赖于集合函数的奇分量,并基于此提出 OddSHAP 估计器,通过在奇子空间上进行多项式回归实现高效近似,在较大采样预算下达到最先进精度。

Comments Accepted to ICML 2026

详情
AI中文摘要

Shapley 值是机器学习中用于归因的普遍框架,涵盖特征重要性、数据估值和因果推断。然而,其精确计算通常是棘手的,需要高效的近似方法。虽然最有效和流行的估计器利用配对采样启发式来减少估计误差,但驱动这种改进的理论机制仍然不透明。在这项工作中,我们为配对采样提供了一个优雅且基本的理由:我们证明了 Shapley 值仅依赖于集合函数的奇分量,并且配对采样正交化回归目标以滤除无关的偶分量。利用这一见解,我们提出了 OddSHAP,一种新颖的一致估计器,它仅在奇子空间上进行多项式回归。通过利用傅里叶基来隔离该子空间,并使用代理模型识别高影响交互,OddSHAP 克服了高阶近似的组合爆炸。通过广泛的基准测试,我们发现 OddSHAP 在较大的采样预算下实现了最先进的估计精度。

英文摘要

The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causal inference. However, its exact computation is generally intractable, necessitating efficient approximation methods. While the most effective and popular estimators leverage the paired sampling heuristic to reduce estimation error, the theoretical mechanism driving this improvement has remained opaque. In this work, we provide an elegant and fundamental justification for paired sampling: we prove that the Shapley value depends exclusively on the odd component of the set function, and that paired sampling orthogonalizes the regression objective to filter out the irrelevant even component. Leveraging this insight, we propose OddSHAP, a novel consistent estimator that performs polynomial regression solely on the odd subspace. By utilizing the Fourier basis to isolate this subspace and employing a proxy model to identify high-impact interactions, OddSHAP overcomes the combinatorial explosion of higher-order approximations. Through an extensive benchmark, we find that OddSHAP achieves state-of-the-art estimation accuracy at larger sampling budgets.

2602.01186 2026-06-01 cs.LG cs.AI 版本更新

The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics

高斯头OFL系列:基于客户端全局统计的一次性联邦学习

Fabio Turazza, Marco Picone, Marco Mamei

发表机构 * Department of Sciences and Methods for Engineering(工程科学与方法系) Artificial Intelligence Research and Innovation Center(人工智能研究与创新中心) University of Modena and Reggio Emilia(摩德纳和雷吉奥艾米利亚大学)

AI总结 提出高斯头OFL系列方法,通过客户端仅传输每类计数和一二阶矩,服务器利用闭式高斯头、FisherMix和Proto-Hyper三种组件构建模型,实现严格无数据的一次性联邦学习,在强非独立同分布下达到最先进鲁棒性和准确性。

Comments Accepted at the International Conference on Learning Representations (ICLR) 2026 - Final Version

详情
AI中文摘要

经典联邦学习依赖于服务器与客户端之间多轮迭代的模型交换和聚合过程,存在高通信成本和重复模型传输带来的隐私风险。相比之下,一次性联邦学习(OFL)通过将通信减少到单轮来缓解这些限制,从而降低开销并增强实际部署能力。然而,现有大多数一次性方法仍然不切实际或受限,例如,它们通常依赖公共数据集的可用性、假设同质客户端模型,或需要上传额外数据或模型信息。为克服这些问题,我们引入了高斯头OFL(GH-OFL)系列,这是一套一次性联邦方法,假设预训练嵌入具有类条件高斯性。客户端仅传输充分统计量(每类计数和一阶/二阶矩),服务器通过三个组件构建头部:(i)直接从接收统计量计算的闭式高斯头(NB/LDA/QDA);(ii)FisherMix,一种在估计的Fisher子空间中采样的合成样本上训练的带余弦边界的线性头;以及(iii)Proto-Hyper,一种轻量级低秩残差头,通过知识蒸馏在这些合成样本上细化高斯logits。在我们的实验中,GH-OFL方法在强非独立同分布偏移下提供了最先进的鲁棒性和准确性,同时保持严格无数据。

英文摘要

Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with high communication costs and privacy risks from repeated model transmissions. In contrast, one-shot federated learning (OFL) alleviates these limitations by reducing communication to a single round, thereby lowering overhead and enhancing practical deployability. Nevertheless, most existing one-shot approaches remain either impractical or constrained, for example, they often depend on the availability of a public dataset, assume homogeneous client models, or require uploading additional data or model information. To overcome these issues, we introduce the Gaussian-Head OFL (GH-OFL) family, a suite of one-shot federated methods that assume class-conditional Gaussianity of pretrained embeddings. Clients transmit only sufficient statistics (per-class counts and first/second-order moments) and the server builds heads via three components: (i) Closed-form Gaussian heads (NB/LDA/QDA) computed directly from the received statistics; (ii) FisherMix, a linear head with cosine margin trained on synthetic samples drawn in an estimated Fisher subspace; and (iii) Proto-Hyper, a lightweight low-rank residual head that refines Gaussian logits via knowledge distillation on those synthetic samples. In our experiments, GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.

2602.00521 2026-06-01 cs.AI 版本更新

Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

通过项目反应理论诊断LLM作为评判者的可靠性

Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea(Chung-Ang 大学人工智能系) Department of Industrial Engineering, Seoul National University, Seoul, Republic of Korea(首尔国立大学工业工程系)

AI总结 提出基于项目反应理论(IRT)的两阶段诊断框架,通过内在一致性和人类对齐两个维度评估LLM作为评判者的可靠性,并提供可解释的诊断信号。

Comments Accepted ICML 2026

详情
AI中文摘要

虽然LLM作为评判者(LLM-as-a-Judge)在自动评估中被广泛使用,但现有的验证实践主要在观察输出层面进行,对于LLM评判者本身是否作为稳定可靠的测量工具提供的洞察有限。为了解决这一局限性,我们引入了一个基于项目反应理论(IRT)的两阶段诊断框架,用于评估LLM作为评判者的可靠性。该框架采用IRT的分级响应模型(GRM),并沿两个互补维度形式化可靠性:(1)内在一致性,定义为在提示变化下测量行为的稳定性,以及(2)人类对齐,捕捉与人类质量评估的一致性。我们通过该框架实证检验了多种LLM评判者,并表明利用IRT-GRM可以产生可解释的信号,用于系统性地诊断判断。这些信号为验证LLM作为评判者的可靠性以及识别不可靠性的潜在原因提供了实用指导。

英文摘要

While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework, and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying reliablity of LLM-as-a-Judge and identifying potential causes of unreliability.

2508.09925 2026-06-01 cs.LG cs.AI 版本更新

Residual Reservoir Memory Networks

残差储备记忆网络

Matteo Pinna, Andrea Ceni, Claudio Gallicchio

发表机构 * Department of Computer Science(计算机科学系) University of Pisa(比萨大学)

AI总结 提出一种新型无训练循环神经网络ResRMN,通过结合线性记忆储备与基于时间残差正交连接的非线性储备,增强长期输入传播,在时间序列和像素级一维分类任务中优于传统储备计算模型。

Comments IJCNN 2025

详情
AI中文摘要

我们在储备计算(RC)范式内引入了一类新型无训练循环神经网络(RNN),称为残差储备记忆网络(ResRMN)。ResRMN将线性记忆储备与非线性储备相结合,其中后者基于沿时间维度的残差正交连接,以增强输入的长期传播。通过线性稳定性分析研究所得储备状态动力学,并探讨了时间残差连接的不同配置。所提出的方法在时间序列和像素级一维分类任务上进行了实证评估。我们的实验结果突出了所提出方法相对于其他传统RC模型的优势。

英文摘要

We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.

2506.01318 2026-06-01 cs.LG cs.AI 版本更新

Unlearning's Blind Spots: Over-Unlearning and Prototypical Relearning Attack

机器遗忘的盲点:过度遗忘与原型重学习攻击

SeungBum Ha, Saerom Park, Sung Whan Yoon

发表机构 * Graduate School of Artificial Intelligence, Ulsan National Institute of Science and Technology (UNIST), Ulsan, South Korea(人工智能研究生院,乌山国立科学与技术研究所(UNIST),乌山,韩国) Department of Industrial Engineering, UNIST, Ulsan, South Korea(工业工程系,UNIST,乌山,韩国) Department of Electrical Engineering, UNIST, Ulsan, South Korea(电气工程系,UNIST,乌山,韩国)

AI总结 针对类别级机器遗忘,提出过度遗忘度量OU@epsilon并揭示原型重学习攻击,通过Spotter方法结合掩码知识蒸馏和类内分散损失来缓解这两个盲点。

Comments 9 pages, ICML 2026

详情
AI中文摘要

机器遗忘(MU)旨在从训练模型中删除指定的遗忘集,而无需昂贵的重新训练,但现有技术忽略了两个关键盲点:"过度遗忘"会恶化遗忘集附近的保留数据,以及事后"重学习"攻击旨在复活被遗忘的知识。聚焦于类别级遗忘,我们首先推导出一个过度遗忘度量OU@epsilon,它量化了遗忘集邻近区域(过度遗忘主要发生区域)的附带损害。接下来,我们揭示了MU上一个未预见的重学习威胁,即原型重学习攻击,该攻击仅利用少量样本就能利用遗忘类的每类原型,并轻松恢复遗忘前的性能。为了应对类别级遗忘中的这两个盲点,我们引入了Spotter,一个即插即用的目标函数,它结合了(i)对遗忘类邻近区域的掩码知识蒸馏惩罚以抑制OU@epsilon,和(ii)一个类内分散损失,用于分散遗忘类嵌入,从而中和原型重学习攻击。Spotter在CIFAR、TinyImageNet和CASIA-WebFace数据集上取得了最先进的结果,为机器遗忘的盲点提供了实用的补救措施。

英文摘要

Machine unlearning (MU) aims to expunge a designated forget set from a trained model without costly retraining, yet the existing techniques overlook two critical blind spots: "over-unlearning" that deteriorates retained data near the forget set, and post-hoc "relearning" attacks that aim to resurrect the forgotten knowledge. Focusing on class-level unlearning, we first derive an over-unlearning metric, OU@epsilon, which quantifies collateral damage in regions proximal to the forget set, where over-unlearning mainly occurs. Next, we expose an unforeseen relearning threat on MU, i.e., the Prototypical Relearning Attack, which exploits the per-class prototype of the forget class with just a few samples, and easily restores the pre-unlearning performance. To counter both blind spots in class-level unlearning, we introduce Spotter, a plug-and-play objective that combines (i) a masked knowledge-distillation penalty on the nearby region of forget classes to suppress OU@epsilon, and (ii) an intra-class dispersion loss that scatters forget-class embeddings, neutralizing Prototypical Relearning Attacks. Spotter achieves state-of-the-art results across CIFAR, TinyImageNet, and CASIA-WebFace datasets, offering a practical remedy to unlearning's blind spots.

2601.22296 2026-06-01 cs.LG cs.AI 版本更新

ParalESN: Enabling parallel information processing in Reservoir Computing

ParalESN:在储层计算中实现并行信息处理

Matteo Pinna, Giacomo Lagomarsini, Andrea Ceni, Claudio Gallicchio

发表机构 * Department of Computer Science, University of Pisa, Pisa, Italy(意大利比萨大学计算机科学系)

AI总结 提出ParalESN,利用复数域对角线性递归实现储层计算的并行化,在保持回声状态属性和普适性保证的同时,大幅提升计算效率。

Comments ICML 2026

详情
AI中文摘要

储层计算(RC)已成为时间处理的有效范式。然而,其可扩展性受到顺序处理时间数据的需要和高维储层巨大内存占用的严重限制。为了解决这些限制,我们通过结构化算子和状态空间建模的视角重新审视RC,引入了并行回声状态网络(ParalESN)。利用复数域中的对角线性递归,ParalESN实现了时间数据的并行处理以及高效高维储层的构建。彻底的理论分析表明,传统回声状态网络的回声状态属性和普适性保证得以保留,同时允许任意线性储层在复数对角形式下的等价表示。实验上,ParalESN在预测精度上与传统的RC和完全可训练的序列模型相当,同时实现了数量级的计算节省。总体而言,ParalESN为将RC集成到深度学习领域提供了一条可扩展且有原则的路径。

英文摘要

Reservoir Computing (RC) has established itself as an efficient paradigm for temporal processing. However, its scalability remains severely constrained by the need to process temporal data sequentially and the prohibitive memory footprint of high-dimensional reservoirs. To address these limitations, we revisit RC through the lens of structured operators and state space modeling, introducing Parallel Echo State Network (ParalESN). Leveraging diagonal linear recurrence in the complex domain, ParalESN enables parallel processing of temporal data and the construction of efficient, high-dimensional reservoirs. A thorough theoretical analysis demonstrates that the Echo State Property and the universality guarantees of traditional Echo State Networks are preserved, while also admitting an equivalent representation of arbitrary linear reservoirs in the complex diagonal form. Empirically, ParalESN achieves competitive predictive accuracy with traditional RC and with fully trainable sequence models, while delivering computational savings by orders of magnitude. Overall, ParalESN offers a scalable and principled pathway for integrating RC within the deep learning landscape.

2509.24319 2026-06-01 cs.CL cs.AI 版本更新

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models

价值表达的双重机制:大型语言模型中的内在价值与提示价值

Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(数据科学研究生院,首尔国立大学)

AI总结 本文通过价值向量和价值神经元分析,揭示大型语言模型中内在价值表达与提示价值表达在机制上部分共享核心组件,但各自拥有独特功能,内在机制促进多样性,提示机制增强指令遵从。

Comments Accepted at ICML 2026. Project page: https://holi-lab.github.io/ValueMechanism/

详情
AI中文摘要

大型语言模型可以通过两种主要方式表达价值:(1)内在表达,反映模型在训练过程中学习到的固有价值;(2)提示表达,由显式提示引发。鉴于它们在价值对齐中的广泛应用,清楚理解其潜在机制至关重要,特别是它们是否主要重叠(如人们可能预期的)或依赖于不同的机制。我们在机制层面使用两种方法分析这个很大程度上未被充分研究的问题:(1)价值向量,从残差流中提取的代表价值机制的特征方向;(2)价值神经元,对价值向量有贡献的MLP神经元。我们证明内在和提示价值机制部分共享对诱导价值表达至关重要的共同组件,这些组件跨语言泛化并在模型的内部表示中重建理论上的价值间相关性。然而,每种机制也拥有独特的组件,发挥不同的作用。特别是,内在机制在更多样化的价值相关场景中激活并促进响应多样性,而提示机制增强指令遵从,甚至在遥远任务(如越狱)中也能生效。

英文摘要

Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on distinct mechanisms. We analyze this largely understudied problem at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value vectors. We demonstrate that intrinsic and prompted value mechanisms partly share common components crucial for inducing value expression, generalizing across languages and reconstructing theoretical inter-value correlations in the model's internal representations. Yet, each mechanism also possesses unique components that fulfill distinct roles. In particular, the intrinsic mechanism activates in more diverse value-related scenarios and promotes response diversity, whereas the prompted mechanism strengthens instruction compliance, taking effect even in distant tasks like jailbreaking.

2509.20784 2026-06-01 cs.CL cs.AI 版本更新

Towards Atoms of Large Language Models

迈向大型语言模型的原子

Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China(认知与决策智能复杂系统重点实验室,自动化研究所,中国科学院,北京,中国) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China(人工智能学院,中国科学院大学,北京,中国)

AI总结 本文提出原子理论,通过原子内积(AIP)定义、评估和识别大型语言模型的基本表示单元(原子),并证明在阈值激活稀疏自编码器(TSAE)下原子可识别,实验发现神经元和特征不满足理想原子标准,而通过匹配TSAE容量与数据规模可识别出近乎完美的原子。

Comments To be published in ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)的基本表示单元(FRUs)尚未定义,这限制了对它们底层机制的进一步理解。在本文中,我们引入原子理论来系统地定义、评估和识别这样的FRUs,我们称之为原子。基于原子内积(AIP),一种捕捉LLM表示底层几何结构的非欧几里得度量,我们正式定义了原子,并提出了理想原子的两个关键标准:忠实性($R^2$)和稳定性($q^*$)。我们进一步证明,在阈值激活稀疏自编码器(TSAEs)下原子是可识别的。在实验上,我们揭示了LLMs中普遍存在的表示偏移,并证明AIP纠正了这种偏移以捕捉底层的表示几何结构。我们发现两个广泛使用的单元——神经元和特征——不符合理想原子的条件:神经元是忠实的($R^2=1$)但不稳定($q^*=0.5\%$),而特征更稳定($q^*=68.2\%$)但不忠实($R^2=48.8\%$)。为了找到LLMs的原子,利用TSAEs下的原子可识别性,我们通过大规模实验表明,只有当TSAE容量与数据规模匹配时,才能实现可靠的原子识别。在此洞察的指导下,我们在Gemma2-2B、Gemma2-9B和Llama3.1-8B的各层中识别出具有近乎完美忠实性($R^2=99.9\%$)和稳定性($q^*=99.8\%$)的FRUs,在统计上满足理想原子的标准。进一步分析证实,这些原子与理论预期一致,并表现出显著更高的单语义性。总体而言,我们提出并验证了原子理论作为理解LLMs内部表示的基础。代码可在https://github.com/ChenhuiHu/towards_atoms获取。

英文摘要

The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their underlying mechanisms. In this paper, we introduce Atom Theory to systematically define, evaluate, and identify such FRUs, which we term atoms. Building on the atomic inner product (AIP), a non-Euclidean metric that captures the underlying geometry of LLM representations, we formally define atoms and propose two key criteria for ideal atoms: faithfulness ($R^2$) and stability ($q^*$). We further prove that atoms are identifiable under threshold-activated sparse autoencoders (TSAEs). Empirically, we uncover a pervasive representation shift in LLMs and demonstrate that the AIP corrects this shift to capture the underlying representational geometry. We find that two widely used units, neurons and features, fail to qualify as ideal atoms: neurons are faithful ($R^2=1$) but unstable ($q^*=0.5\%$), while features are more stable ($q^*=68.2\%$) but unfaithful ($R^2=48.8\%$). To find atoms of LLMs, leveraging atom identifiability under TSAEs, we show via large-scale experiments that reliable atom identification occurs only when the TSAE capacity matches the data scale. Guided by this insight, we identify FRUs with near-perfect faithfulness ($R^2=99.9\%$) and stability ($q^*=99.8\%$) across layers of Gemma2-2B, Gemma2-9B, and Llama3.1-8B, satisfying the criteria of ideal atoms statistically. Further analysis confirms that these atoms align with theoretical expectations and exhibit substantially higher monosemanticity. Overall, we propose and validate Atom Theory as a foundation for understanding the internal representations of LLMs. Code available at https://github.com/ChenhuiHu/towards_atoms.

2601.18537 2026-06-01 cs.RO cs.AI 版本更新

SKETCH: Semantic Key-Point Conditioning for Long-Horizon Vessel Trajectory Prediction

SKETCH: 面向长时域船舶轨迹预测的语义关键点条件建模

Linyong Gan, Zimo Li, Wenxin Xu, Xingjian Li, Jianhua Z. Huang, Enmei Tu, Shuhang Chen

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)数据科学学院) School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)科学与工程学院) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)人工智能学院) COSCO SHIPPING Advanced Technology Institute, Shanghai, China(中远海运技术研究院)

AI总结 针对长时域轨迹预测中方向漂移问题,提出基于语义关键点(NKP)的条件轨迹建模框架,将预测分解为全局语义决策与局部运动建模,采用预训练-微调策略估计NKP先验,在真实AIS数据上显著提升长时域、方向精度和细粒度预测性能。

详情
AI中文摘要

由于复杂导航行为和环境因素导致的复合不确定性,准确的长时域船舶轨迹预测仍然具有挑战性。现有方法在长时间外推时往往难以保持全局方向一致性,导致轨迹漂移或不合理。为解决这一问题,我们提出了一种语义关键点条件轨迹建模框架,通过以捕获导航意图的高级下一关键点(NKP)为条件来预测未来轨迹。该公式将长时域预测分解为全局语义决策和局部运动建模,有效将未来轨迹的支持集限制在语义可行的子集内。为了从历史观测中高效估计NKP先验,我们采用了预训练-微调策略。在真实AIS数据上的大量实验表明,所提方法在长旅行时长、方向精度和细粒度轨迹预测方面持续优于现有最先进方法。

英文摘要

Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.

2601.21372 2026-06-01 cs.AI 版本更新

NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents

NEMO: 通过自主编码代理实现执行感知的优化建模

Yang Song, Anoushka Vyas, Zirui Wei, Sina Khoshfetrat Pakazad, Henrik Ohlsson, Graham Neubig

发表机构 * Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA(语言技术研究所,计算机科学学院,卡内基梅隆大学,匹兹堡,PA,美国)

AI总结 提出NEMO系统,利用自主编码代理将决策问题的自然语言描述转化为可执行的数学优化实现,通过执行感知的架构和协调模式实现最先进的性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

我们提出NEMO,一个使用自主编码代理(ACA)将决策问题的自然语言描述转化为正式的可执行数学优化实现的系统。现有方法依赖于专门的大语言模型(LLM)或定制的任务特定代理,这些方法通常脆弱且经常生成语法无效或不可执行的代码。NEMO则将ACA视为与基于API的LLM交互类似的一等抽象;其沙盒执行保证代码从结构上可执行,并支持自动验证和修复。我们引入了新颖的协调模式,包括独立生成的优化器和模拟器实现之间的非对称验证循环、用于经验重用的外部记忆,以及通过最小贝叶斯风险(MBR)解码和自一致性增强的鲁棒性。在九个已建立的优化基准测试中,NEMO在大多数任务上取得了最先进的性能,并在多个数据集上大幅领先,展示了执行感知的代理架构在自动化优化建模中的强大能力。

英文摘要

We present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimization implementations using autonomous coding agents (ACAs). Existing approaches rely on specialized large language models (LLMs) or bespoke task-specific agents that are often brittle and frequently generate syntactically invalid or non-executable code. NEMO instead treats ACAs as a first-class abstraction analogous to API-based interaction with LLMs; their sandboxed execution guarantees code is executable by construction and supports automated validation and repair. We introduce novel coordination patterns including asymmetric validation loops between independently generated optimizer and simulator implementations, external memory for experience reuse, and robustness enhancements via minimum Bayes risk (MBR) decoding and self-consistency. Across nine established optimization benchmarks, NEMO achieves state-of-the-art performance on the majority of tasks with substantial margins on several datasets, demonstrating the power of execution-aware agentic architectures for automated optimization modeling.

2601.19936 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Gap-K%: 通过测量 Top-1 预测差距检测预训练数据

Minseo Kwak, Jaehyung Kim

发表机构 * Yonsei University(延世大学)

AI总结 提出 Gap-K% 方法,利用 LLM 的 top-1 预测与目标 token 的对数概率差距及滑动窗口策略,在 WikiMIA 和 MIMIR 基准上实现预训练数据检测的最优性能。

Comments ACL 2026 Main Conference; 15 pages

详情
AI中文摘要

大型语言模型(LLM)中大规模预训练语料库的不透明性引发了严重的隐私和版权问题,使得预训练数据检测成为一项关键挑战。现有的最先进方法通常依赖于 token 似然,但它们往往忽略了目标 token 与模型 top-1 预测之间的差距,以及相邻 token 之间的局部相关性。在这项工作中,我们提出了 Gap-K%,一种基于 LLM 预训练优化动态的新型预训练数据检测方法。通过分析下一个 token 预测目标,我们观察到模型 top-1 预测与目标 token 之间的差异会引发强烈的梯度信号,这些信号在训练过程中被明确惩罚。受此启发,Gap-K% 利用 top-1 预测 token 与目标 token 之间的对数概率差距,并结合滑动窗口策略来捕获局部相关性并缓解 token 级别的波动。在 WikiMIA 和 MIMIR 基准上的大量实验表明,Gap-K% 实现了最先进的性能,在各种模型大小和输入长度上始终优于先前的基线方法。

英文摘要

The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model's top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.

2504.04430 2026-06-01 cs.AI 版本更新

Foundational Requirements for Artificial General Intelligence: A Falsifiable Framework Based on Signal Prediction

人工通用智能的基础要求:一个基于信号预测的可证伪框架

Matej Šprogar

发表机构 * University of Maribor, Faculty of Electrical Engineering and Computer Science(马里博大学电子工程与计算机科学学院)

AI总结 本文提出一个基于信号预测的可证伪框架,通过定义低层要求(如从无知状态学习、实时活性)并设计可重复测试来检验人工通用智能。

Comments 9 pages, 2 figures

详情
AI中文摘要

基于高级智能可以从低级信号处理中涌现的前提,我们提出了关于人工通用智能所需的低级要求的假设。所提出的要求刻画了通过预测具有初始未知语义内容的时空结构化信号进行学习的系统的核心属性。它们包括从认知神经科学中观察到的基本原理,从从无知状态学习到实时活性。为了进行实证检验和假设拒绝,我们引入了一个由透明且可重复的测试组成的操作测试平台,每个要求对应一个测试。迄今为止,尚未发现或报告有任何非智能系统成功通过该测试平台。在出现这样的反例之前,该测试平台可作为通向通用智能的候选实证里程碑。该测试平台的参考实现已公开可用。

英文摘要

Grounded in the premise that high-level intelligence can emerge from low-level signal processing, we advance a hypothesis regarding low-level requirements necessary for artificial general intelligence. The proposed requirements characterise core properties of systems that learn through prediction over spatially and temporally structured signals with initially unknown semantic content. They include a selection of basic principles observed in cognitive neuroscience, from learning from an uninformed state to real-time liveness. To enable empirical testing and hypothesis rejection, we introduce an operational testbed composed of transparent and reusable tests, one per requirement. To date, no non-intelligent system has been identified or reported as successfully passing the testbed. Pending such a counterexample, the testbed serves as a candidate empirical milestone toward general intelligence. The reference implementation of the testbed is publicly available.

2510.05115 2026-06-01 cs.AI cs.CL cs.PL 版本更新

SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling

SAC-Opt:优化建模中用于迭代修正的语义锚点

Yansen Zhang, Qingcan Kang, Yujie Chen, Yufei Wang, Xiongwei Han, Tao Zhong, Mingxuan Yuan, Chen Ma

发表机构 * Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China(香港城市大学计算机科学系) Huawei Noah's Ark Lab, Hong Kong SAR, China(华为诺亚实验室(香港)) Huawei's Supply Chain Management Department, Shenzhen, China(华为供应链管理部(深圳))

AI总结 提出SAC-Opt框架,通过语义锚点对齐和选择性修正,在无需额外训练的情况下提升大语言模型生成优化建模代码的语义忠实度,平均建模准确率提升7.7%。

Comments ICML 2026 accepted

详情
AI中文摘要

大语言模型(LLMs)通过从自然语言描述生成可执行的求解器代码,为优化建模开辟了新范式。尽管前景广阔,现有方法通常仍以求解器驱动:它们依赖单次前向生成,并基于求解器错误信息进行有限的事后修正,这留下了未被检测到的语义错误,这些错误会静默地产生语法正确但逻辑有缺陷的模型。为应对这一挑战,我们提出SAC-Opt,一种反向引导的修正框架,将优化建模建立在问题语义而非求解器反馈之上。在每一步中,SAC-Opt将原始语义锚点与从生成代码中重建的锚点对齐,并仅选择性修正不匹配的组件,从而驱动模型收敛到语义忠实的模型。这种锚点驱动的修正能够对约束和目标逻辑进行细粒度改进,在无需额外训练或监督的情况下增强忠实性和鲁棒性。在七个公共数据集上的实验结果表明,SAC-Opt将平均建模准确率提升了7.7%,在ComplexLP数据集上提升高达21.9%。这些发现强调了在基于LLM的优化工作流中,语义锚点修正对于确保从问题意图到求解器可执行代码的忠实翻译的重要性。

英文摘要

Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solver-driven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.7%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.

2601.13704 2026-06-01 cs.SD cs.AI cs.LG eess.AS 版本更新

Performance and Complexity Trade-off Optimization of Speech Models During Training

训练过程中语音模型的性能与复杂度权衡优化

Esteban Gómez, Tom Backström

发表机构 * Department of Information and Communications Engineering, Aalto University(信息与通信工程系,艾尔托大学)

AI总结 提出一种基于特征噪声注入的重新参数化技术,利用随机梯度下降方法在训练中联合优化语音模型的性能和计算复杂度,实现动态模型大小调整。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

在语音机器学习中,神经网络模型通常通过选择具有固定层大小和结构的架构来设计。这些模型随后被训练以最大化与任务目标相关的性能指标。虽然整体架构通常由任务的先验知识指导,但各层的大小往往是启发式选择的。然而,这种方法并不能保证性能与计算复杂度之间的最优权衡;因此,通常采用权重量化或模型剪枝等后处理方法以降低计算成本。这是因为随机梯度下降(SGD)方法只能优化可微函数,而影响计算复杂度的因素(如层大小和每秒浮点运算次数(FLOP/s))是不可微的,需要在训练过程中修改模型结构。我们提出了一种基于特征噪声注入的重新参数化技术,使得在训练过程中能够使用基于SGD的方法联合优化性能和计算复杂度。与传统的剪枝方法不同,我们的方法允许模型大小针对目标性能-复杂度权衡进行动态优化,而无需依赖启发式标准来选择要移除的权重或结构。我们通过三个案例研究证明了我们方法的有效性,包括一个合成示例和两个实际应用:语音活动检测和音频反欺骗。与我们的工作相关的代码已公开,以鼓励进一步研究。

英文摘要

In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying the model structure during training. We propose a reparameterization technique based on feature noise injection that enables joint optimization of performance and computational complexity during training using SGD-based methods. Unlike traditional pruning methods, our approach allows the model size to be dynamically optimized for a target performance-complexity trade-off, without relying on heuristic criteria to select which weights or structures to remove. We demonstrate the effectiveness of our method through three case studies, including a synthetic example and two practical real-world applications: voice activity detection and audio anti-spoofing. The code related to our work is publicly available to encourage further research.

2502.12119 2026-06-01 cs.CV cs.AI cs.CL 版本更新

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

PRISM:免训练多模态数据选择的自剪枝内在选择方法

Jinhe Bi, Aniri, Zengjie Jin, Yifan Wang, Danqi Yan, Wenke Huang, Xiaowen Ma, Sikuan Yan, Artur Hecker, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma

发表机构 * LMU Munich(慕尼黑大学) Munich Research Center, Huawei Technologies(慕尼黑研究中心,华为技术) METEOR School of Computer Science, Wuhan University(武汉大学计算机学院) Munich Center for Machine Learning(慕尼黑机器学习中心)

AI总结 针对多模态大语言模型视觉指令数据冗余问题,提出一种免训练框架PRISM,通过隐式重中心化消除视觉特征各向异性导致的全局语义漂移,实现高效数据选择,在降低计算成本的同时提升模型性能。

Comments Accepted to ACL 2026 and selected for the Best Paper list; later desk-rejected due to an inadvertent manual bibliography-editing error. Previous versions are withdrawn due to an inadvertent manual bibliography-editing error; please refer to the latest corrected version

详情
AI中文摘要

视觉指令微调使预训练的多模态大语言模型(MLLMs)能够遵循人类指令以应用于现实场景。然而,这些数据集的快速增长引入了显著的冗余,导致计算成本增加。现有的指令数据选择方法旨在修剪这种冗余,但主要依赖于计算密集型技术,如基于代理的推理或基于训练的指标。因此,这些选择过程产生的巨大计算成本往往加剧了它们本应解决的效率瓶颈,对MLLMs的可扩展和有效微调构成了重大挑战。为了解决这一挑战,我们首先发现了一个关键但先前被忽视的因素:视觉特征分布中固有的各向异性。我们发现这种各向异性引发了 extit{全局语义漂移},而忽视这一现象是限制当前数据选择方法效率的关键因素。受此启发,我们设计了 extbf{PRISM},这是第一个用于高效视觉指令选择的免训练框架。PRISM通过隐式重中心化建模内在视觉语义,精确移除全局背景特征的干扰影响。实验表明,PRISM将数据选择和模型微调的端到端时间减少到传统流程的30%。更值得注意的是,它在实现这一效率的同时提升了性能,在八个多模态和三个语言理解基准上超越了在全数据集上微调的模型,最终相对于基线实现了101.7%的相对改进。代码可通过\href{https://github.com/bibisbar/PRISM}{此仓库}获取。

英文摘要

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a \textit{Global Semantic Drift}, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise \textbf{PRISM}, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30\% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7\% relative improvement over the baseline. The code is available for access via \href{https://github.com/bibisbar/PRISM}{this repository}.

2601.06453 2026-06-01 cs.AI 版本更新

ConSensus: Multi-Agent Collaboration for Multimodal Sensing

ConSensus:面向多模态感知的多智能体协作

Hyungjun Yoon, Mohammad Malekzadeh, Sung-Ju Lee, Fahim Kawsar, Lorena Qendro

发表机构 * KAIST(韩国科学技术院) Nokia Bell Labs(诺基亚贝尔实验室) University of Glasgow(格拉斯哥大学)

AI总结 提出ConSensus,一种无需训练的多智能体协作框架,通过将多模态感知任务分解为专用智能体并采用混合融合机制,在五个基准上平均准确率提升7.1%,融合token成本降低12.7倍。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

大型语言模型(LLMs)越来越多地基于传感器数据来感知和推理人类生理及物理世界。然而,准确解释异构多模态传感器数据仍然是一个基本挑战。我们表明,单一的整体LLM通常无法跨模态进行连贯推理,导致解释不完整和先验知识偏差。我们引入了ConSensus,一种无需训练的多智能体协作框架,将多模态感知任务分解为专门的、模态感知的智能体。为了聚合智能体级别的解释,我们提出了一种混合融合机制,该机制平衡了语义聚合(实现跨模态推理和上下文理解)与统计共识(通过跨模态一致性提供鲁棒性)。虽然每种方法都有互补的失败模式,但它们的组合能够在传感器噪声和缺失数据下实现可靠推理。我们在五个不同的多模态感知基准上评估了ConSensus,与单智能体基线相比,平均准确率提高了7.1%。此外,ConSensus匹配或超过了迭代多智能体辩论方法的性能,同时通过单轮混合融合协议将平均融合token成本降低了12.7倍,为现实世界的多模态感知任务提供了鲁棒且高效的解决方案。源代码可在https://github.com/nokia/multi-agent-collaboration-for-multimodal-sensing获取。

英文摘要

Large language models (LLMs) are increasingly grounded in sensor data to perceive and reason about human physiology and the physical world. However, accurately interpreting heterogeneous multimodal sensor data remains a fundamental challenge. We show that a single monolithic LLM often fails to reason coherently across modalities, leading to incomplete interpretations and prior-knowledge bias. We introduce ConSensus, a training-free multi-agent collaboration framework that decomposes multimodal sensing tasks into specialized, modality-aware agents. To aggregate agent-level interpretations, we propose a hybrid fusion mechanism that balances semantic aggregation, which enables cross-modal reasoning and contextual understanding, with statistical consensus, which provides robustness through agreement across modalities. While each approach has complementary failure modes, their combination enables reliable inference under sensor noise and missing data. We evaluate ConSensus on five diverse multimodal sensing benchmarks, demonstrating an average accuracy improvement of 7.1% over the single-agent baseline. Furthermore, ConSensus matches or exceeds the performance of iterative multi-agent debate methods while achieving a 12.7 times reduction in average fusion token cost through a single-round hybrid fusion protocol, yielding a robust and efficient solution for real-world multimodal sensing tasks. The source code is available at https://github.com/nokia/multi-agent-collaboration-for-multimodal-sensing.

2601.01456 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration

重新思考多模态少样本3D点云分割:从融合精炼到解耦仲裁

Wentao Bian, Fenglei Xu

发表机构 * Suzhou University of Science and Technology(苏州科技大学)

AI总结 针对多模态少样本3D点云分割中“融合-精炼”范式的“可塑性-稳定性困境”和CLIP的语义盲区,提出解耦专家仲裁少样本分割网络(DA-FSS),通过解耦语义与几何路径并相互正则化梯度,实现更好的泛化性能。

Comments Accepted to IJCAI-ECAI 2026 (Main Track). 9 pages, 3 figures, 3 tables

详情
AI中文摘要

本文重新审视多模态少样本3D点云语义分割(FS-PCS),识别出“融合-精炼”范式中的一个冲突:“可塑性-稳定性困境”。此外,CLIP的类间混淆可能导致语义盲区。为解决这些问题,我们提出解耦专家仲裁少样本分割网络(DA-FSS),该模型有效区分语义和几何路径,并相互正则化它们的梯度以实现更好的泛化。DA-FSS采用与MM-FSS相同的主干网络和预训练文本编码器生成文本嵌入,从而提高自由模态的利用率并更好地利用每个模态的信息空间。为此,我们提出并行专家精炼模块以生成每个模态相关性。我们还提出堆叠仲裁模块(SAM)执行卷积融合并为每个模态路径仲裁相关性。并行专家解耦两条路径:几何专家保持可塑性,语义专家确保稳定性。它们通过解耦对齐模块(DAM)协调,该模块在不传播混淆的情况下传递知识。在流行数据集(S3DIS、ScanNet)上的实验表明DA-FSS优于MM-FSS。同时,几何边界、完整性和纹理区分均优于基线。代码可在https://github.com/MoWenQAQ/DA-FSS/获取。

英文摘要

In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in "Fuse-then-Refine" paradigms: the "Plasticity-Stability Dilemma." In addition, CLIP's inter-class confusion can result in semantic blindness. To address these issues, we present the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA-FSS employs the same backbone and pre-trained text encoder as MM-FSS to generate text embeddings, which can increase free modalities' utilization rate and better leverage each modality's information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA-FSS over MM-FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA-FSS/.

2601.01075 2026-06-01 cs.LG cs.AI cs.CV 版本更新

Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

流等变世界模型:部分观测动态环境的记忆

Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

发表机构 * Kempner Institute, Harvard University(哈佛大学 Kempner 研究所) ML, Carnegie Mellon University(卡内基梅隆大学 ML 研究所) SEAS, Harvard University(哈佛大学 SEAS 研究所)

AI总结 提出流等变世界建模框架,利用时间参数化对称性在潜在记忆中实现长时程稳定准确的动力学预测,解决部分观测问题。

Comments Accepted at ICML 2026

详情
AI中文摘要

具身系统将世界体验为“流之交响”:多种连续感官输入流与自身运动耦合,并与外部物体的动力学交织。这些感官流和世界的基本动力学遵循平滑的时间参数化对称性,而现有的世界模型忽略了这一点。如果没有尊重这种结构的记忆,部分可观测性对现有方法构成主要障碍:每次观测仅揭示世界的一部分,而未观测区域继续演化。在这项工作中,我们引入了流等变世界建模,这是一个利用潜在记忆中的时间参数化对称性来实现长时程稳定准确动力学预测的框架。潜在记忆随自身运动和推断的外部物体运动等变地移动和变换,使关于视野外区域的信息随时间保持对齐。我们在2D和3D部分观测视频世界建模基准上展示了该框架相对于最先进的扩散、记忆增强和循环世界模型架构的优势。更广泛地说,我们的结果表明,当预测表示按照它们所建模的世界的时间和动力学结构组织时,它们会变得更加强大。项目页面:https://flowequivariantworldmodels.github.io/

英文摘要

Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These sensory streams and the underlying dynamics of the world obey smooth, time-parameterized symmetries which existing world models ignore. Without a memory that respects this structure, partial observability presents a major obstacle to existing methods: each observation reveals only a fraction of the world, while unobserved regions continue to evolve. In this work, we introduce Flow Equivariant World Modeling, a framework that leverages time-parameterized symmetries within a latent memory for stable and accurate dynamics prediction over long horizons. The latent memory shifts and transforms equivariantly with self-motion and inferred external object motion, keeping information about out-of-view regions aligned as time progresses. We demonstrate the advantage of this framework over state-of-the-art diffusion, memory-augmented, and recurrent world model architectures on 2D and 3D partially observed video world modeling benchmarks. More broadly, our results suggest that predictive representations become more powerful when they are organized in line with the temporal and dynamical structure of the world they model. Project page: https://flowequivariantworldmodels.github.io/

2508.17671 2026-06-01 cs.GT cs.AI cs.MA econ.TH 版本更新

Consistent Opponent Modeling in Imperfect-Information Games

不完全信息博弈中的一致对手建模

Sam Ganzfried

发表机构 * Ganzfried Research(甘兹弗里德研究)

AI总结 针对不完全信息博弈中现有对手建模方法无法保证收敛到对手真实策略的问题,提出一种基于序列形式博弈表示和投影梯度下降的凸优化算法,实现高效且一致的对手建模。

详情
AI中文摘要

多智能体环境中智能体的目标是在与对手交互时最大化总收益。遵循博弈论解概念(如纳什均衡)在某些场景下可能获得强性能;然而,这类方法未能利用与对手重复交互中的历史和观测数据。对手建模算法整合机器学习技术,利用可用数据来利用次优对手;然而,这类方法在不完全信息博弈中的有效性至今相当有限。我们表明,即使面对来自已知先验分布的静态对手,现有对手建模方法也无法满足一个简单的理想性质;即,即使博弈迭代次数趋近无穷,它们也不能保证模型趋近对手的真实策略。我们开发了一种新算法,能够实现这一性质,并通过基于序列形式博弈表示和投影梯度下降求解凸最小化问题来高效运行。在标准贝叶斯可辨识性和访问假设下,该算法保证从游戏过程的观测以及可能可用的额外历史数据中高效收敛到对手的真实策略。

英文摘要

The goal of agents in multi-agent environments is to maximize total reward against the opposing agents that are encountered. Following a game-theoretic solution concept, such as Nash equilibrium, may obtain a strong performance in some settings; however, such approaches fail to capitalize on historical and observed data from repeated interactions against our opponents. Opponent modeling algorithms integrate machine learning techniques to exploit suboptimal opponents utilizing available data; however, the effectiveness of such approaches in imperfect-information games to date is quite limited. We show that existing opponent modeling approaches fail to satisfy a simple desirable property even against static opponents drawn from a known prior distribution; namely, they do not guarantee that the model approaches the opponent's true strategy even in the limit as the number of game iterations approaches infinity. We develop a new algorithm that is able to achieve this property and runs efficiently by solving a convex minimization problem based on the sequence-form game representation using projected gradient descent. The algorithm is guaranteed to efficiently converge to the opponent's true strategy under standard Bayesian identifiability and visitation assumptions, given observations from gameplay and possibly additional historical data if it is available.

2512.23626 2026-06-01 cs.AI cs.LG 版本更新

Regret-Based Federated Causal Discovery with Unknown Interventions

基于遗憾的联邦因果发现与未知干预

Federico Baldo, Charles K. Assaad

发表机构 * Sorbonne Université(索邦大学) INSERM(国家健康与医学研究院) Institut Pierre Louis d’Epidémiologie et de Santé Publique(流行病学与公共卫生研究所)

AI总结 提出I-PERI算法,通过恢复客户端图并集的CPDAG并利用跨客户端干预引起的结构差异定向额外边,得到更紧的Φ-马尔可夫等价类,解决联邦环境下未知客户端级干预的因果发现问题。

Comments ICML 2026

详情
AI中文摘要

大多数因果发现方法从观测数据中恢复一个表示马尔可夫等价类的完全部分有向无环图。最近的工作将这些方法扩展到联邦设置以解决数据去中心化和隐私约束,但通常假设所有客户端共享相同的因果模型,这在实践中不现实,因为客户端特定的策略或协议(例如不同医院)自然会导致异质且未知的干预。在这项工作中,我们解决了未知客户端级干预下的联邦因果发现问题。我们提出了I-PERI,一种新颖的联邦算法,首先恢复客户端图并集的CPDAG,然后通过利用跨客户端干预引起的结构差异来定向额外的边。这产生了一个更紧的等价类,我们称之为Φ-马尔可夫等价类,由Φ-CPDAG表示。我们提供了I-PERI收敛性及其隐私保护属性的理论保证,并在合成数据上进行了实证评估,证明了所提算法的有效性。

英文摘要

Most causal discovery methods recover a completed partially directed acyclic graph representing a Markov equivalence class from observational data. Recent work has extended these methods to federated settings to address data decentralization and privacy constraints, but often under idealized assumptions that all clients share the same causal model. Such assumptions are unrealistic in practice, as client-specific policies or protocols, for example, across hospitals, naturally induce heterogeneous and unknown interventions. In this work, we address federated causal discovery under unknown client-level interventions. We propose I-PERI, a novel federated algorithm that first recovers the CPDAG of the union of client graphs and then orients additional edges by exploiting structural differences induced by interventions across clients. This yields a tighter equivalence class, which we call the $\mathbfΦ$-Markov Equivalence Class, represented by the $\mathbfΦ$-CPDAG. We provide theoretical guarantees on the convergence of I-PERI, as well as on its privacy-preserving properties, and present empirical evaluations on synthetic data demonstrating the effectiveness of the proposed algorithm.

2512.20732 2026-06-01 cs.LG cs.AI cs.SE 版本更新

FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

FEM-Bench:评估代码生成大语言模型的结构化科学推理基准

Saeed Mohammadzadeh, Erfan Hamdi, Joel Shor, Emma Lejeune

发表机构 * Boston University(波士顿大学) Move37 Labs(Move37实验室) Department of Mechanical Engineering(机械工程系)

AI总结 提出FEM-Bench基准,通过有限元方法相关编程任务评估大语言模型在科学计算中的结构化推理能力,实验表明现有模型尚不能稳定解决所有任务。

Comments 45 pages, 5 figures, 9 tables, 7 listings

详情
AI中文摘要

随着大语言模型在物理世界推理能力上的进步,缺乏严格基准来评估其生成科学有效物理模型的能力已成为一个关键缺口。计算力学开发和运用数学模型与数值方法,预测物理系统在力、变形和约束下的行为,为结构化科学推理评估提供了理想基础。问题遵循清晰的数学结构,强制执行严格的物理和数值约束,并支持客观验证。该学科要求构建物理系统的显式模型,并推理几何、空间关系和材料行为,直接联系到新兴的AI物理推理和世界建模目标。我们提出FEM-Bench,一个计算力学基准,旨在评估大语言模型生成正确有限元方法及相关代码的能力。FEM-Bench 2025包含一系列入门但非平凡的任务,与计算力学研究生第一门课程的材料一致。这些任务捕捉了基本的数值和物理建模挑战,同时仅代表该学科复杂性的很小一部分。尽管简单,最先进的大语言模型并不能可靠地解决所有任务。在五次尝试中,函数编写表现最好的模型Gemini 3 Pro至少一次完成了30/33个任务,五次全部完成26/33个任务。单元测试编写表现最好的模型GPT-5的平均联合成功率为73.8%。其他流行模型显示出广泛的性能差异。FEM-Bench为评估AI生成的科学代码建立了结构化基础,未来版本将纳入更复杂的任务以跟踪模型进展。

英文摘要

As LLMs advance their reasoning capabilities about the physical world, the absence of rigorous benchmarks for evaluating their ability to generate scientifically valid physical models has become a critical gap. Computational mechanics, which develops and applies mathematical models and numerical methods to predict the behavior of physical systems under forces, deformation, and constraints, provides an ideal foundation for structured scientific reasoning evaluation. Problems follow clear mathematical structure, enforce strict physical and numerical constraints, and support objective verification. The discipline requires constructing explicit models of physical systems and reasoning about geometry, spatial relationships, and material behavior, connecting directly to emerging AI goals in physical reasoning and world modeling. We introduce FEM-Bench, a computational mechanics benchmark designed to evaluate the ability of LLMs to generate correct finite element method (FEM) and related code. FEM-Bench 2025 contains a suite of introductory but nontrivial tasks aligned with material from a first graduate course on computational mechanics. These tasks capture essential numerical and physical modeling challenges while representing only a small fraction of the complexity present in the discipline. Despite their simplicity, state-of-the-art LLMs do not reliably solve all of them. In a five attempt run, the best performing model at function writing, Gemini 3 Pro, completed 30/33 tasks at least once and 26/33 tasks all five times. The best performing model at unit test writing, GPT-5, had an Average Joint Success Rate of 73.8%. Other popular models showed broad performance variation. FEM-Bench establishes a structured foundation for evaluating AI-generated scientific code, and future iterations will incorporate increasingly sophisticated tasks to track progress as models evolve.

2512.11779 2026-06-01 stat.ML cs.AI cs.LG 版本更新

Conditional Coverage Diagnostics for Conformal Prediction

条件覆盖诊断用于共形预测

Sacha Braun, David Holzmüller, Michael I. Jordan, Francis Bach

发表机构 * Sierra team, Inria Paris, France(Inria巴黎研究院法国团队) Ecole Normale Supérieure, PSL Research University, Paris(巴黎高等师范学院PSL研究大学) Soda team, Inria Paris-Saclay, France(Inria巴黎-萨克雷分校法国团队) Departments of EECS(电子工程与计算机科学系)

AI总结 提出将条件覆盖估计转化为分类问题,通过超额风险度量(ERT)来诊断共形预测的条件覆盖偏差,实验表明使用现代分类器比传统指标具有更高的统计功效。

详情
AI中文摘要

评估条件覆盖仍然是评估预测系统可靠性中最持久的挑战之一。尽管共形方法可以保证边际覆盖,但没有方法能保证产生具有正确条件覆盖的集合,这使得实践者无法清晰解释局部偏差。为了克服现有指标的样本低效和过拟合问题,我们将条件覆盖估计转化为一个分类问题。当且仅当某个分类器能够达到比目标覆盖更低的风险时,条件覆盖被违反。通过选择(适当的)损失函数,得到的风险差异给出了自然误覆盖度量(如L1和L2距离)的保守估计,甚至可以分离过覆盖和欠覆盖以及非恒定目标覆盖的影响。我们将得到的度量族称为目标覆盖的超额风险(ERT)。实验表明,使用现代分类器比基于简单分类器的现有指标(如CovGap)具有更高的统计功效。此外,我们使用我们的度量来基准测试不同的共形预测方法。最后,我们发布了ERT以及先前条件覆盖度量的开源软件包。这些贡献共同为理解、诊断和改进预测系统的条件可靠性提供了新视角。

英文摘要

Evaluating conditional coverage remains one of the most persistent challenges in assessing the reliability of predictive systems. Although conformal methods can give guarantees on marginal coverage, no method can guarantee to produce sets with correct conditional coverage, leaving practitioners without a clear way to interpret local deviations. To overcome sample-inefficiency and overfitting issues of existing metrics, we cast conditional coverage estimation as a classification problem. Conditional coverage is violated if and only if some classifier can achieve lower risk than the target coverage. Through the choice of a (proper) loss function, the resulting risk difference gives a conservative estimate of natural miscoverage measures such as L1 and L2 distance, and can even separate the effects of over- and under-coverage, and non-constant target coverages. We call the resulting family of metrics excess risk of the target coverage (ERT). We show experimentally that the use of modern classifiers provides much higher statistical power than simple classifiers underlying established metrics like CovGap. Additionally, we use our metric to benchmark different conformal prediction methods. Finally, we release an open-source package for ERT as well as previous conditional coverage metrics. Together, these contributions provide a new lens for understanding, diagnosing, and improving the conditional reliability of predictive systems.

2512.02743 2026-06-01 cs.CV cs.AI 版本更新

Reasoning-Aware Multimodal Fusion for Hateful Video Detection

面向仇恨视频检测的推理感知多模态融合

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu

发表机构 * Multimodal Intelligence Lab(多模态智能实验室) Department of Computer Science(计算机科学系) University of Exeter(埃克塞特大学) School of Computer Science(计算机科学学院) University of Leeds(利兹大学) School of Computer Science and Informatics(计算机科学与信息学学院) University of Liverpool(利物浦大学) University of Birmingham(伯明翰大学) Machine Intelligence + x Group(机器智能+X小组)

AI总结 提出推理感知多模态融合框架,通过局部-全局上下文融合和语义交叉注意力实现多模态交互,并引入对抗推理生成互补语义视角,在仇恨视频检测中提升Macro-F1和召回率3%和7%。

Comments Accepted at Transactions on Machine Learning Research (TMLR)

详情
AI中文摘要

在线视频中的仇恨言论对数字平台构成日益严重的威胁,尤其是当视频内容变得日益多模态和上下文依赖时。现有方法通常难以有效融合模态间的复杂语义关系,且缺乏理解细微仇恨内容的能力。为解决这些问题,我们提出了一种创新的推理感知多模态融合(RAMF)框架。针对第一个挑战,我们设计了局部-全局上下文融合(LGCF)以捕捉局部显著线索和全局时间结构,并提出语义交叉注意力(SCA)以实现细粒度多模态语义交互。针对第二个挑战,我们引入了对抗推理——一个结构化的三阶段过程,其中视觉语言模型生成(i)客观描述、(ii)仇恨假设推理和(iii)非仇恨假设推理——提供互补的语义视角,丰富模型对细微仇恨意图的上下文理解。在两个真实仇恨视频数据集上的评估表明,我们的方法实现了稳健的泛化性能,在Macro-F1和仇恨类别召回率上分别比现有最先进方法提高了3%和7%。重现我们结果所需的源代码和数据可在https://github.com/Multimodal-Intelligence-Lab-MIL/RAMF获取。

英文摘要

Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning-a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences-providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. The source codes and data required to reproduce our results are available at https://github.com/Multimodal-Intelligence-Lab-MIL/RAMF.

2512.00349 2026-06-01 cs.AI 版本更新

Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models

图像辩论:检测多模态大语言模型中的欺骗行为

Sitong Fang, Shiyi Hou, Kaile Wang, Boyuan Chen, Donghai Hong, Jiayi Zhou, Josef Dai, Yaodong Yang, Jiaming Ji

发表机构 * Institute of Artificial Intelligence, Peking University(北京大学人工智能研究院)

AI总结 本文提出 MM-DeceptionBench 基准和基于图像辩论的多智能体监控框架,系统揭示并量化多模态大语言模型中的欺骗风险,有效提升欺骗行为检测能力。

Comments 39 pages, 16 figures, camera ready version for ICML 2026

详情
AI中文摘要

前沿AI系统是否变得更加强大?当然。然而,这种进步并非纯粹的福音,而是一匹特洛伊木马:在性能飞跃的背后,隐藏着更隐蔽和更具破坏性的安全风险,即欺骗。与幻觉(源于能力不足并导致错误)不同,欺骗代表一种更深层次的威胁,模型通过复杂推理和不真诚的回应故意误导用户。随着系统能力的提升,欺骗行为已从文本环境扩展到多模态环境,放大了其潜在危害。首先,我们如何监控这些隐蔽的多模态欺骗行为?然而,当前研究几乎完全局限于文本,多模态大语言模型的欺骗风险尚未被探索。在这项工作中,我们系统地揭示并量化多模态欺骗风险,引入了MM-DeceptionBench,这是第一个专门设计用于评估多模态欺骗的基准。涵盖六类欺骗,MM-DeceptionBench描述了模型如何通过视觉和文本模态的组合策略性地操纵和误导。另一方面,多模态欺骗评估在现有方法中几乎是一个盲点。其隐蔽性,加上视觉语义模糊性和跨模态推理的复杂性,使得行动监控和思维链监控基本无效。为应对这一挑战,我们提出了图像辩论,一种新颖的多智能体辩论监控框架。通过迫使模型将其主张基于视觉证据,该方法显著提高了欺骗策略的可检测性。实验表明,它在所有测试模型上持续提高与人类判断的一致性,在GPT-4o上将Cohen's kappa提升了1.5倍,准确率提升了1.25倍。

英文摘要

Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind their performance leaps lie more insidious and destructive safety risks, namely deception. Unlike hallucination, which arises from insufficient capability and leads to mistakes, deception represents a deeper threat in which models deliberately mislead users through complex reasoning and insincere responses. As system capabilities advance, deceptive behaviours have spread from textual to multimodal settings, amplifying their potential harm. First and foremost, how can we monitor these covert multimodal deceptive behaviors? Nevertheless, current research remains almost entirely confined to text, leaving the deceptive risks of multimodal large language models unexplored. In this work, we systematically reveal and quantify multimodal deception risks, introducing MM-DeceptionBench, the first benchmark explicitly designed to evaluate multimodal deception. Covering six categories of deception, MM-DeceptionBench characterizes how models strategically manipulate and mislead through combined visual and textual modalities. On the other hand, multimodal deception evaluation is almost a blind spot in existing methods. Its stealth, compounded by visual-semantic ambiguity and the complexity of cross-modal reasoning, renders action monitoring and chain-of-thought monitoring largely ineffective. To tackle this challenge, we propose debate with images, a novel multi-agent debate monitor framework. By compelling models to ground their claims in visual evidence, this method substantially improves the detectability of deceptive strategies. Experiments show that it consistently increases agreement with human judgements across all tested models, boosting Cohen's kappa by 1.5x and accuracy by 1.25x on GPT-4o.

2510.15859 2026-06-01 cs.CL cs.AI 版本更新

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

InfiMed-ORBIT: 通过基于评分标准的增量训练使大语言模型对齐开放复杂任务

Pengkai Wang, Pengwei Liu, Qi Zuo, Zhijie Sang, Congkai Xie, Hongxia Yang

发表机构 * Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学计算机系) Department of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院)

AI总结 提出ORBIT框架,利用动态生成的病例条件评分标准指导增量强化学习,仅用2k样本将Qwen3-4B-Instruct在HealthBench-Hard上的得分从7.0提升至27.5,达到同规模开源模型最优。

详情
AI中文摘要

强化学习(RL)推动了大语言模型(LLM)的许多近期突破,尤其是在奖励可自动计算的任务(如代码生成)中。然而,在开放式的医学对话中,RL效果较差,因为反馈模糊、依赖上下文,且难以简单总结为单一标量信号——通常需要高度监督的奖励模型,并存在奖励破解的风险。因此,我们引入了ORBIT,一个专为关键医学对话设计的基于评分标准的开放式增量训练框架。ORBIT将医学对话构建与动态生成的病例条件评分标准相结合,这些评分标准作为增量RL的自适应指南。与依赖外部医学知识库或手工规则的方法不同,ORBIT使用评分标准引导的评估,并可与通用指令遵循LLM一起实现,避免了任务特定的评判微调。仅使用2k训练样本,ORBIT将Qwen3-4B-Instruct的HealthBench-Hard得分从7.0提升至27.5,在相似规模的开源模型中实现了最先进的性能,同时随着评分标准覆盖范围的扩大,保持了良好的咨询质量。

英文摘要

Reinforcement learning (RL) has powered many recent breakthroughs in large language models (LLMs), especially for tasks where rewards can be computed automatically, such as code generation. However, it is less effective in open-ended medical dialogue, where feedback is ambiguous, context-dependent, and difficult to simply summarize into a single scalar signal-often requiring heavily supervised reward models and creating risks of reward hacking. Thus, we introduce ORBIT, an open-ended rubric-based incremental training framework tailored for critical medical dialogues. ORBIT integrates medical dialogue construction with dynamically generated case-conditioned rubrics that serve as adaptive guides for incremental RL. Unlike approaches that rely on external medical knowledge bases or handcrafted rules, ORBIT uses rubric-guided evaluation and can be implemented with general-purpose instruction-following LLMs, avoiding task-specific judge fine-tuning. With only 2k training samples, ORBIT raises Qwen3-4B-Instruct's HealthBench-Hard score from 7.0 to 27.5, achieving state-of-the-art performance among similarly sized open-source models while maintaining strong consultation quality as rubric coverage broadens.

2509.21379 2026-06-01 cs.CV cs.AI 版本更新

SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

SAEmnesia:基于监督稀疏自编码器的扩散模型概念擦除

Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto

发表机构 * University of Turin, Italy(意大利都灵大学) Intesa Sanpaolo AI Research, Italy(意大利Intesa Sanpaolo人工智能研究院)

AI总结 提出监督稀疏自编码器框架SAEmnesia,通过强制一对一概念-神经元映射实现特征集中化,从而高效、精准地擦除扩散模型中的概念。

Comments Accepted at ICML 2026

详情
AI中文摘要

扩散模型中的概念遗忘受到特征分裂的阻碍,即概念分布在许多潜在特征上,使得移除它们具有挑战性且计算成本高。我们引入了SAEmnesia,一种监督稀疏自编码器框架,通过强制一对一的概念-神经元映射来克服这一问题。通过在训练过程中系统地标记概念,我们的方法实现了特征集中化,将每个概念绑定到一个可解释的神经元上。这使得概念擦除高度精准且高效。与最先进的基于稀疏自编码器的遗忘方法相比,SAEmnesia将超参数搜索减少了96.67%,并在UnlearnCanvas对象基准上实现了9.22%的提升。我们的方法在顺序遗忘中也表现出卓越的可扩展性,在移除九个对象时准确率提高了28.4%,为精确可控的概念擦除迈出了一步。此外,SAEmnesia在I2P基准上有效抑制了裸体内容,并对对抗攻击保持鲁棒性。源代码可在https://github.com/EIDOSLAB/SAEmnesia获取。

英文摘要

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. Compared to the state-of-the-art sparse autoencoder-based unlearning approach, SAEmnesia reduces hyperparameter search by 96.67% and achieves a 9.22% improvement on the UnlearnCanvas benchmark for objects. Our method also shows superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a step forward for precise and controllable concept erasure. Moreover, SAEmnesia effectively suppresses nudity on the I2P benchmark and remains robust to adversarial attacks. Source code available at https://github.com/EIDOSLAB/SAEmnesia.

2511.19433 2026-06-01 cs.RO cs.AI cs.CV 版本更新

Mixture of Horizons in Action Chunking

动作分块中的视野混合

Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding

发表机构 * Renmin University of China(中国人民大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) The Chinese University of Hong Kong(香港中文大学)

AI总结 针对视觉-语言-动作模型中动作分块长度(视野)的权衡问题,提出混合视野策略,通过并行处理不同视野的动作片段并融合输出,同时提升长期预见与短期精度,实现性能与泛化性的改进。

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出显著能力,但其性能对训练中使用的$ extbf{动作分块长度}$(称为$ extbf{视野}$)敏感。我们的实证研究揭示了一个内在权衡:较长的视野提供更强的全局预见但降低细粒度精度,而较短的视野增强局部控制但在长期任务上表现不佳,这意味着固定选择单一视野是次优的。为缓解这一权衡,我们提出$ extbf{混合视野(MoH)}$策略。MoH将动作分块重新排列为多个不同视野的片段,通过共享动作变换器并行处理,并使用轻量线性门控融合输出。它具有三个吸引人的优点:1) MoH在单个模型中联合利用长期预见和短期精度,提高了复杂任务的性能和泛化能力。2) MoH对全注意力动作模块即插即用,训练或推理开销极小。3) MoH支持自适应视野的动态推理,通过跨视野共识选择稳定动作,实现比基线高2.5倍的吞吐量,同时保持优越性能。在基于流的策略$π_0$、$π_{0.5}$和单步回归策略$π_{ ext{reg}}$上的大量实验表明,MoH在仿真和真实世界任务上均取得一致且显著的提升。值得注意的是,在混合任务设置下,带有MoH的$π_{0.5}$在LIBERO上仅经过$30k$次训练迭代即达到99$\%$的平均成功率,创下新纪录。项目页面:https://timsty1.github.io/moh/

英文摘要

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying fixed choice of single horizons being suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulations and real-world tasks. Notably, under mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://timsty1.github.io/moh/

2511.18760 2026-06-01 cs.AI cs.FL 版本更新

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

HERMES: 迈向高效且可验证的LLM数学推理

Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli Bai, Xin Shen, Farzan Farnia

发表机构 * Department of Computer Science \& Engineering, The Chinese University of Hong Kong Huawei Foundation Model Department

AI总结 提出Hermes框架,通过将非正式推理与Lean形式化验证交替结合,并引入中间形式化检查和记忆模块,在提升推理准确性的同时显著降低计算成本。

详情
AI中文摘要

非正式数学一直是现代大型语言模型(LLM)推理的核心,提供了灵活性和高效构建论证的能力。然而,纯粹的非正式推理容易产生逻辑漏洞和细微错误,难以检测和纠正。相比之下,形式化定理证明提供了严谨、可验证的数学推理,其中每个推理步骤都由可信的编译器检查,但缺乏非正式问题解决的探索自由度。这种不匹配使得当前基于LLM的数学代理缺乏一种原则性的方法来结合两种范式的优势。在这项工作中,我们引入了Hermes,这是第一个明确将非正式推理与Lean中的形式化验证证明交替结合的工具辅助代理。该框架执行中间形式化检查以防止推理漂移,并配备一个记忆模块以在多步推理链中保持证明的连续性,从而同时实现探索和验证。我们在四个具有挑战性的数学推理基准上评估了Hermes,使用了不同参数规模的LLM,从小模型到最先进的系统。在所有设置中,Hermes可靠地提高了基础模型的推理准确性,同时与基于奖励的方法相比,显著减少了推理令牌使用量和计算成本。在AIME和HARDMath2等困难数据集上,Hermes@1实现了高达40%的准确性提升,同时总推理FLOPs减少了80%。在测试时扩展时,Hermes@5进一步将准确性提高了20%。实现和代码库公开于https://github.com/aziksh-ospanov/HERMES。

英文摘要

Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler, but lacks the exploratory freedom of informal problem-solving. This mismatch leaves current LLM-based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool-assisted agent that explicitly interleaves informal reasoning with formally verified proofs in Lean. The framework performs intermediate formal checking to prevent reasoning drift and a memory module for proof continuity across multi-step reasoning chains, enabling both exploration and verification. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state-of-the-art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing reasoning token usage and computational cost compared to reward-based approaches. On difficult datasets such as AIME and HARDMath2, Hermes@1 achieves up to a 40% accuracy improvement while using 80% fewer total inference FLOPs. When scaled at test time, Hermes@5 boosts accuracy further by 20%. The implementation and codebase are publicly available at https://github.com/aziksh-ospanov/HERMES.

2506.08255 2026-06-01 cs.LG cs.AI cs.CR 版本更新

SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense

SHIELD: 用于增量扩展学习防御的安全超网络

Patryk Krukowski, Łukasz Gorczyca, Piotr Helm, Kamil Książek, Przemysław Spurek

发表机构 * Jagiellonian University, Faculty of Mathematics and Computer Science(杰洛内维大学数学与计算机科学学院) Jagiellonian University, Doctoral School of Exact and Natural Sciences(杰洛内维大学精确与自然科学研究博士学院) Akces NCBR IDEAS Research Institute(IDEAS研究所)

AI总结 提出一种结合区间边界传播(IBP)与超网络的框架SHIELD,通过生成任务特定参数和区间混合训练策略,实现可认证鲁棒的持续学习,在保持可扩展性的同时达到最优平均准确率。

Comments Accepted to CVPR 2026 (Findings track)

详情
AI中文摘要

在对抗条件下的持续学习仍然是一个开放问题,现有方法往往在鲁棒性、可扩展性或两者之间做出妥协。我们提出了一种新颖的框架,将区间边界传播(IBP)与基于超网络的架构相结合,以实现跨顺序任务的可认证鲁棒持续学习。我们的方法SHIELD通过一个共享的超网络生成任务特定的模型参数,该超网络仅依赖于紧凑的任务嵌入,从而消除了对重放缓冲区或完整模型副本的需求,并实现了高效的时间扩展。为了进一步增强鲁棒性,我们引入了区间混合(Interval MixUp),这是一种新颖的训练策略,它将表示为以MixUp点为中心的$\ell_{\infty}$球的虚拟示例混合。利用区间算术,该技术保证了可认证的鲁棒性,同时减轻了包裹效应,从而产生更平滑的决策边界。我们在多个基准测试上评估了SHIELD在强白盒对抗攻击(包括PGD和AutoAttack)下的表现。它持续优于现有的鲁棒持续学习方法,在保持可扩展性和认证性的同时,实现了最先进的平均准确率。这些结果向在对抗环境中实现实用且理论扎实的持续学习迈出了重要一步。

英文摘要

Continual learning under adversarial conditions remains an open problem, as existing methods often compromise either robustness, scalability, or both. We propose a novel framework that integrates Interval Bound Propagation (IBP) with a hypernetwork-based architecture to enable certifiably robust continual learning across sequential tasks. Our method, SHIELD, generates task-specific model parameters via a shared hypernetwork conditioned solely on compact task embeddings, eliminating the need for replay buffers or full model copies and enabling efficient over time. To further enhance robustness, we introduce Interval MixUp, a novel training strategy that blends virtual examples represented as $\ell_{\infty}$ balls centered around MixUp points. Leveraging interval arithmetic, this technique guarantees certified robustness while mitigating the wrapping effect, resulting in smoother decision boundaries. We evaluate SHIELD under strong white-box adversarial attacks, including PGD and AutoAttack, across multiple benchmarks. It consistently outperforms existing robust continual learning methods, achieving state-of-the-art average accuracy while maintaining both scalability and certification. These results represent a significant step toward practical and theoretically grounded continual learning in adversarial settings.

2509.12440 2026-06-01 cs.CL cs.AI 版本更新

MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

MedFact:大型语言模型在中文医学文本上的事实核查能力基准测试

Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai

发表机构 * Xunfei Healthcare Technology Co., Ltd.(讯飞医疗科技有限公司)

AI总结 为评估LLM在中文医学文本中的事实核查能力,构建了包含2116个专家标注实例的MedFact基准,涵盖13个专科、8种错误类型等,并发现模型在错误定位上表现不足,存在“过度批评”现象。

Comments Accepted to The Fifth Workshop on Generation, Evaluation, and Metrics (GEM) at ACL 2026

详情
AI中文摘要

在医疗应用中部署大型语言模型(LLM)需要具备事实核查能力,以确保患者安全和法规合规。我们引入了MedFact,一个具有挑战性的中文医学事实核查基准,包含来自多样化真实文本的2,116个专家标注实例,涵盖13个专科、8种错误类型、4种写作风格和5个难度级别。构建采用混合AI-人类框架,其中迭代的专家反馈优化AI驱动的多标准过滤,以确保高质量和难度。我们评估了20个领先的LLM在真实性分类和错误定位方面的表现,结果显示模型通常能判断文本是否包含错误,但难以精确定位错误,顶级模型的表现仍不及人类。我们的分析揭示了“过度批评”现象,即模型倾向于将正确信息误判为错误,而高级推理技术(如多智能体协作和推理时扩展)可能加剧这一问题。MedFact突显了部署医疗LLM的挑战,并为开发事实可靠的医疗AI系统提供了资源。

英文摘要

Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory compliance. We introduce MedFact, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show models often determine if text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals the "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which can be exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.

2511.05875 2026-06-01 cs.HC cs.AI cs.CV 版本更新

Towards a Humanized Social-Media Ecosystem: AI-Augmented HCI Design Patterns for Safety, Agency & Well-Being

迈向人性化的社交媒体生态系统:面向安全、自主与福祉的AI增强人机交互设计模式

Mohd Ruhul Ameen, Akif Islam

发表机构 * College of Engineering(工程学院) Computer Sciences Marshall University Huntington, WV, USA(计算机科学马歇尔大学亨廷顿州威斯康星州) Department of Computer Science(计算机科学系) Engineering University of Rajshahi Rajshahi 6205, Bangladesh(工程 Rajshahi 大学 Rajshahi 6205 巴基斯坦)

AI总结 提出Human-Layer AI(HL-AI)框架,通过浏览器端用户拥有的可解释中介,在不依赖平台合作的情况下赋予用户实时控制权,实现内容重写、完整性检测、信息流定制、行为中断和恢复模式等五种设计模式,以提升社交媒体安全性与用户福祉。

Comments 6 pages, 5 tables, 7 figures, and 2 algorithm tables. Accepted at International Conference on Signal Processing, Information, Communication and Systems (SPICSCON 2025)

详情
Journal ref
2025 IEEE International Conference on Signal Processing, Information, Communication and Systems (SPICSCON)
AI中文摘要

社交平台连接了数十亿人,但其以参与度优先的算法往往对用户施加影响而非与用户协作,加剧了压力、虚假信息和失控感。我们提出Human-Layer AI(HL-AI)——用户拥有的、可解释的中介,位于浏览器中平台逻辑与界面之间。HL-AI赋予人们实用的、即时的控制权,无需平台合作。我们贡献了一个可用的Chrome/Edge原型,实现了五种代表性模式框架——上下文感知帖子重写器、帖子完整性检测器、精细信息流策展器、微退出代理和恢复模式——以及一个统一的数学公式,平衡用户效用、自主成本和风险阈值。评估涵盖技术准确性、可用性和行为结果。结果是一套人性化的控制手段,帮助用户在伤害发生前重写内容、通过完整性提示阅读、有意图地调整信息流、暂停强迫性循环以及在骚扰期间寻求庇护,同时通过解释和覆盖选项保留自主权。该原型为改造当今的信息流以融入安全性、自主性和福祉提供了实用路径,并邀请进行严格的跨文化用户评估。

英文摘要

Social platforms connect billions of people, yet their engagement-first algorithms often work on users rather than with them, amplifying stress, misinformation, and a loss of control. We propose Human-Layer AI (HL-AI)--user-owned, explainable intermediaries that sit in the browser between platform logic and the interface. HL-AI gives people practical, moment-to-moment control without requiring platform cooperation. We contribute a working Chrome/Edge prototype implementing five representative pattern frameworks--Context-Aware Post Rewriter, Post Integrity Meter, Granular Feed Curator, Micro-Withdrawal Agent, and Recovery Mode--alongside a unifying mathematical formulation balancing user utility, autonomy costs, and risk thresholds. Evaluation spans technical accuracy, usability, and behavioral outcomes. The result is a suite of humane controls that help users rewrite before harm, read with integrity cues, tune feeds with intention, pause compulsive loops, and seek shelter during harassment, all while preserving agency through explanations and override options. This prototype offers a practical path to retrofit today's feeds with safety, agency, and well-being, inviting rigorous cross-cultural user evaluation.

2511.04393 2026-06-01 cs.AI 版本更新

Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

将LLM后训练为更好的决策智能体:一种遗憾最小化方法

Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Maryland, College Park(马里兰大学哥伦比亚学院)

AI总结 提出迭代遗憾最小化微调(Iterative RMFT),通过反复蒸馏低遗憾决策轨迹来后训练LLM,提升其在在线决策任务中的表现,无需依赖已知算法或人工模板。

Comments Camera ready version of ICML 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为交互式和动态环境中的决策智能体。然而,由于它们最初并非为决策设计,最近的研究表明,LLM即使在基本的在线决策问题中也可能表现不佳,无法实现低遗憾或有效的探索-利用权衡。为了解决这个问题,我们引入了迭代遗憾最小化微调(Iterative RMFT),这是一种后训练过程,反复将低遗憾决策轨迹蒸馏回基础模型。在每次迭代中,模型生成多个决策轨迹,选择k个最低遗憾的轨迹,并在此基础上进行微调。与先前方法(a)从已知决策算法中蒸馏动作序列或(b)依赖人工设计的思维链模板不同,我们的方法利用遗憾度量来激发模型自身的决策能力和推理依据。这种对模型生成推理的依赖避免了僵化的输出工程,并提供了更灵活、自然语言的训练信号。实验结果表明,Iterative RMFT在多种模型上提升了LLM的决策性能——从具有数值输入/输出的Transformer,到开源权重LLM,再到像GPT-4o mini这样的先进闭源模型。其在输出和推理格式上的灵活性使其能够泛化到具有不同时间范围、动作空间、奖励过程和自然语言上下文的任务。最后,我们提供了理论见解,表明在这种范式下,单层Transformer可以在简化设置中充当无遗憾学习器。总体而言,Iterative RMFT为增强LLM的决策能力提供了一个有原则且通用的后训练框架。

英文摘要

Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach leverages the regret metric to elicit the model's own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs' DM performance across diverse models - from Transformers with numerical input/output, to open-weight LLMs, and advanced closed-weight models like GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs' decision-making capabilities.

2511.03100 2026-06-01 cs.LG cs.AI cs.MA 版本更新

Scaling Multi-Agent Environment Co-Design with Diffusion Models

基于扩散模型的多智能体环境协同设计扩展

Hao Xiang Li, Michael Amir, Amanda Prorok

发表机构 * Department of Computer Science, University of Cambridge, Cambridge, United Kingdom(剑桥大学计算机科学系,剑桥,英国)

AI总结 提出扩散协同设计(DiCoDe)框架,通过投影通用引导(PUG)和评论家蒸馏机制,实现高维环境设计空间下的可扩展、样本高效的智能体-环境协同优化。

详情
AI中文摘要

智能体-环境协同设计范式联合优化智能体策略和环境配置,以寻求系统性能提升。其应用领域从仓库物流到风电场管理,有望从根本上改变多智能体系统的部署方式。然而,当前的协同设计方法难以扩展:在高维环境设计空间下失效,且在处理联合优化中固有的移动目标时样本效率低下。我们通过开发扩散协同设计(DiCoDe)来应对这些挑战,这是一个可扩展且样本高效的协同设计框架,将协同设计推向实际相关场景。DiCoDe包含两项核心创新。首先,我们引入投影通用引导(PUG),这是一种采样技术,使DiCoDe能够在满足硬约束(如障碍物之间的空间间隔)的同时,探索奖励最大化环境的分布。其次,我们设计了一种评论家蒸馏机制,以共享来自强化学习评论家的知识,确保引导扩散模型利用密集且最新的学习信号适应不断演化的智能体策略。在具有挑战性的多智能体环境协同设计基准(包括仓库自动化、多智能体路径规划和风电场优化)上验证时,这些改进共同产生了更优的环境-策略对。我们的方法持续超越现有技术,例如在仓库场景中,以少66%的仿真样本实现了39%更高的奖励。这为智能体-环境协同设计设立了新标准,并向着在现实世界中收获协同设计成果迈出了关键一步。

英文摘要

The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to windfarm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale. They collapse under high-dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework pushing co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state-of-the-art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design, and is a stepping stone towards reaping the rewards of co-design in real world domains.

2503.05846 2026-06-01 cs.CL cs.AI 版本更新

EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

EMCEE:通过提取合成多语言上下文桥接知识与推理以提升大语言模型的多语言能力

Hamin Koo, Jaehyung Kim

发表机构 * Yonsei University(延世大学)

AI总结 提出EMCEE框架,通过从LLM自身提取并融合语言特定知识,结合推理输出,显著提升多语言任务性能,尤其在低资源语言上平均提升31.7%。

Comments ACL 2026 Main

详情
AI中文摘要

大语言模型(LLMs)在广泛任务中取得了显著进展,但其对以英语为中心的训练数据的严重依赖导致在非英语语言中性能大幅下降。虽然现有的多语言提示方法强调将查询重新表述为英语或增强推理能力,但它们往往未能融入对某些查询至关重要的语言和文化特定基础。为了解决这一局限性,我们提出了EMCEE(提取合成多语言上下文并合并),一个简单而有效的框架,通过从LLM自身显式提取和利用查询相关知识来增强其多语言能力。具体来说,EMCEE首先提取合成上下文以揭示LLM中编码的潜在语言特定知识,然后通过基于判断的选择机制动态地将这种上下文见解与面向推理的输出合并。在涵盖多种语言和任务的四个多语言基准上的大量实验表明,EMCEE始终优于先前的方法,总体平均相对提升16.4%,在低资源语言中提升31.7%。

英文摘要

Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. While existing multilingual prompting methods emphasize reformulating queries into English or enhancing reasoning capabilities, they often fail to incorporate the language- and culture-specific grounding that is essential for some queries. To address this limitation, we propose EMCEE (Extracting synthetic Multilingual Context and merging), a simple yet effective framework that enhances the multilingual capabilities of LLMs by explicitly extracting and utilizing query-relevant knowledge from the LLM itself. In particular, EMCEE first extracts synthetic context to uncover latent, language-specific knowledge encoded within the LLM, and then dynamically merges this contextual insight with reasoning-oriented outputs through a judgment-based selection mechanism. Extensive experiments on four multilingual benchmarks covering diverse languages and tasks demonstrate that EMCEE consistently outperforms prior approaches, achieving an average relative improvement of 16.4% overall and 31.7% in low-resource languages.

2510.11683 2026-06-01 cs.LG cs.AI cs.CL 版本更新

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

边界引导策略优化:面向扩散大语言模型的内存高效强化学习

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

发表机构 * Tsinghua University(清华大学)

AI总结 针对扩散大语言模型中似然函数难以处理导致强化学习内存开销大的问题,提出边界引导策略优化(BGPO),通过构造满足线性和等价性的下界实现内存高效训练,在数学求解、代码生成和规划任务中显著优于现有方法。

详情
AI中文摘要

将强化学习(RL)应用于扩散大语言模型(dLLMs)的一个关键挑战是其似然函数的难解性,而似然函数对于RL目标至关重要,因此在训练过程中需要相应的近似。现有方法通过自定义蒙特卡洛(MC)采样,利用证据下界(ELBO)近似对数似然,但由于需要保留所有MC样本用于RL目标中非线性项的梯度计算,导致显著的内存开销,从而限制了可行的样本量,导致似然近似不精确和RL目标失真。为了解决这个问题,我们提出了边界引导策略优化(BGPO),一种内存高效的RL算法,它最大化基于ELBO的目标的一个特殊构造的下界。该下界经过精心设计,满足两个关键性质:(1)线性:它是一个线性求和,其中每一项仅依赖于单个MC样本,从而能够跨样本进行梯度累积并确保恒定的内存使用;(2)等价性:在在线策略训练中,该下界的值和梯度与基于ELBO的目标相等,因此它也是对原始RL目标的有效近似。这些性质使得BGPO能够采用大的MC样本量,改进似然近似和RL目标估计,从而带来性能提升。实验表明,BGPO在数学问题求解、代码生成和规划任务中显著优于先前的dLLMs RL算法。我们的代码和模型可在https://github.com/THU-KEG/BGPO获取。

英文摘要

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation during training. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, they incur significant memory overhead due to the need to retain all MC samples for the gradient computation of non-linear terms in the RL objective, and thus restrict feasible sample sizes, leading to imprecise likelihood approximations and distorted RL objective. To address this, we propose \emph{Boundary-Guided Policy Optimization} (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, improving likelihood approximations and RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks. Our codes and models are available at \href{https://github.com/THU-KEG/BGPO}{https://github.com/THU-KEG/BGPO}.

2505.17607 2026-06-01 cs.AI cs.CL 版本更新

Symbolic Intermediaries as a Linguistic-Numerical Interface for LLM-Driven Geometric Reasoning

符号中介作为LLM驱动几何推理的语言-数值接口

João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

发表机构 * Idiap Research Institute(Idiap研究 institute) École Polytechnique Fédérale de Lausanne(瑞士联邦理工学院) Honda Research Institute Europe(本田欧洲研究院) Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) National Biomarker Centre, CRUK-MI, University of Manchester(曼彻斯特大学国家生物标记中心)

AI总结 提出符号中介作为连接物理模拟器数值输出与语言模型推理的接口,通过符号回归将连续数值转化为符号表达式,并在协同优化循环中提升几何推理性能。

Comments 33 pages, 18 figures

详情
AI中文摘要

大型语言模型(LLM)在语言和符号对象上展示出推理能力,但直接解释物理模拟器的连续数值输出(例如距离、曲率和轨迹)的能力有限,这些输出难以进行离散分词。在从机构设计到运动规划等空间基础的工程推理任务中,这定义了一个根本性的差距,限制了LLM在更广泛几何领域(例如与物理模拟器接口)的应用。我们提出符号中介,即通过符号回归发现的紧凑解析表达式,作为一种结构化接口,将模拟器的数值轨迹转换为符号形式,语言模型可以解释、比较和批评,同时保留原始几何语义。围绕这个接口,我们构建了一个智能体协调与优化循环:设计智能体将自然语言规范映射为可执行模拟代码,批评智能体基于共享符号词汇进行推理,修订步骤将此反馈转化为基于基础的优化决策,从而实现无需参数更新的推理时泛化。在平面机构综合的MSynth基准上,所有三个评估的LLM智能体比预算匹配的遗传算法基线高出19-53%(带反馈时中位误差降低高达63%),对三种模型架构的批评条目分析表明,该接口将推理从通用结构评论转向基于基础的几何验证。将连续模拟输出转换为符号形式的原理可推广到任何需要以语言方式解释模拟器行为的领域。

英文摘要

Large Language Models (LLMs) display reasoning capabilities over linguistic and symbolic objects but have limited capabilities to directly interpret the continuous numerical outputs of physics simulators, e.g., distances, curvatures, and trajectories that resist discrete tokenisation. Across spatially grounded engineering reasoning tasks, from mechanism design to motion planning, this defines a fundamental gap, which limits the wider application of LLMs within broader geometrical domains, for exmaple interfacing with physics simulators. We propose symbolic intermediaries, compact analytical expressions discovered via symbolic regression, as a structured interface that translates a simulator's numerical traces into a symbolic form, which language models can interpret, compare, and critique while preserving the original geometric semantics. Around this interface we build an agentic coordination-and-refinement loop: a design agent maps natural-language specifications to executable simulation code, a critique agent reasons over the shared symbolic vocabulary, and a revision step turns this feedback into grounded refinement decisions, enabling inference-time generalization without parameter updates. On the MSynth benchmark for planar mechanism synthesis, all three evaluated LLM agents outperform a budget-matched genetic-algorithm baseline by 19-53% (up to 63% lower median error with feedback), and analysis of the critique entries across three model architectures shows that the interface shifts reasoning from generic structural commentary to grounded geometric verification. The principle of translating continuous simulation outputs into symbolic forms generalises to any domain where simulator behaviour must be interpreted linguistically.

2510.03415 2026-06-01 cs.PL cs.AI cs.CL cs.SE 版本更新

LLMs Lean on Priors, Not Programming Language Semantics

LLMs 依赖先验而非编程语言语义

Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Junyi Jessy Li, Milos Gligoric

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Cisco Research(思科研究)

AI总结 通过 PLSemanticsBench 基准测试,发现前沿大语言模型在程序执行任务中依赖预训练统计规律而非形式语义规则,语义变异和结构复杂度导致准确率大幅下降。

Comments Accepted at ICML 2026

详情
AI中文摘要

近期工作探究大语言模型(LLMs)是否基于显式规则而非预训练统计规律进行推理。程序执行提供了一个典型实例:形式语义通过符号转换规则定义行为,这些规则在分布偏移下可被系统性改变。我们研究 LLMs 能否通过程序执行基于形式语义进行推理,并引入 PLSemanticsBench,将轻量级 C 程序与两种语义系统(小步操作语义和 K 语义)配对,探测四种能力:组合规则得到最终状态、状态未变时选择规则、在长轨迹上维持这种条件推理、以及在新语义下遵循提供的规则。为解耦语义推理与语法熟悉度,我们重新定义熟悉运算符以引发符号-含义冲突,并引入仅通过提供规则定义的新符号,同时在人类编写、LLM 翻译和模糊生成的分割上以递增的结构复杂度进行压力测试。在 11 个前沿 LLM 上,标准语义下的最终状态准确率(高达 90%)在语义变异和结构复杂度增加时急剧下降,降幅达 40-60 个百分点。仅少数模型实现了非零的长程条件推理准确率,即使最佳系统也仅达到 35%。这些结果表明,当代 LLMs 往往依赖预训练的词汇关联,而非系统地基于提供的正式规则进行推理。PLSemanticsBench 公开于 https://EngineeringSoftware.github.io/PLSemanticsBench。

英文摘要

Recent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than statistical regularities from pretraining. Program execution provides a canonical instance: formal semantics define behavior through symbolic transition rules that can be systematically altered under distribution shift. We investigate whether LLMs can condition their reasoning on formal semantics through program execution and introduce PLSemanticsBench, pairing featherweight C programs with two semantic systems -- small-step operational semantics and K semantics -- and probing four capabilities: composing rules for final states, selecting rules when state is unmutated, sustaining such conditioning over long traces, and following supplied rules under novel semantics. To decouple semantic reasoning from syntactic familiarity, we redefine familiar operators to induce symbol-meaning conflict and introduce novel symbols defined only through the supplied rules, and stress-test models on Human-Written, LLM-Translated, and Fuzzer-Generated splits with increasing structural complexity. Across 11 frontier LLMs, strong final-state accuracy under standard semantics (up to 90%) drops sharply -- by as much as 40--60% points -- under semantic mutations and increasing structural complexity. Only a handful of models achieve non-zero long-horizon conditioning accuracy, and even the best systems reach just 35%. Together, these results suggest that contemporary LLMs often rely on pretrained lexical associations rather than systematically conditioning on supplied formal rules. PLSemanticsBench is publicly available at https://EngineeringSoftware.github.io/PLSemanticsBench.

2510.02060 2026-06-01 cs.AI cs.LG 版本更新

ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection

ReTabAD: 恢复表格异常检测中语义上下文的基准

Sanghyu Yoon, Dongmin Kim, Suhee Yoon, Ye Seul Sim, Seungdong Yoa, Hye-Seung Cho, Soonyoung Lee, Hankook Lee, Woohyung Lim

发表机构 * LG AI Research, Seoul, South Korea(LG人工智能研究实验室,首尔,韩国) Sungkyunkwan University, Suwon, South Korea(成均馆大学,水原,韩国)

AI总结 针对现有表格异常检测基准缺乏语义上下文的问题,提出ReTabAD基准,通过丰富结构化文本元数据并集成零样本LLM框架,验证了语义上下文能提升检测性能和可解释性。

Comments Accepted to ICLR 2026

详情
AI中文摘要

在表格异常检测(AD)中,文本语义通常承载关键信号,因为异常的定义与特定领域的上下文紧密相关。然而,现有基准仅提供原始数据点,缺乏语义上下文,忽略了专家在实践中依赖的丰富文本元数据,如特征描述和领域知识。这一限制阻碍了研究灵活性,并阻止模型充分利用领域知识进行检测。ReTabAD通过恢复文本语义来解决这一差距,以实现上下文感知的表格AD研究。我们提供(1)20个精心策划的表格数据集,这些数据集丰富了结构化的文本元数据,以及最先进的AD算法的实现,包括经典方法、深度学习和基于LLM的方法,以及(2)一个零样本LLM框架,该框架利用语义上下文而无需特定任务训练,为未来研究建立了强大的基线。此外,本工作通过实验和分析提供了关于文本元数据在AD中的作用和实用性的见解。结果表明,语义上下文通过支持领域感知推理提高了检测性能并增强了可解释性。这些发现将ReTabAD确立为系统探索上下文感知AD的基准。

英文摘要

In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domain-specific context. However, existing benchmarks provide only raw data points without semantic context, overlooking rich textual metadata such as feature descriptions and domain knowledge that experts rely on in practice. This limitation restricts research flexibility and prevents models from fully leveraging domain knowledge for detection. ReTabAD addresses this gap by restoring textual semantics to enable context-aware tabular AD research. We provide (1) 20 carefully curated tabular datasets enriched with structured textual metadata, together with implementations of state-of-the-art AD algorithms including classical, deep learning, and LLM-based approaches, and (2) a zero-shot LLM framework that leverages semantic context without task-specific training, establishing a strong baseline for future research. Furthermore, this work provides insights into the role and utility of textual metadata in AD through experiments and analysis. Results show that semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning. These findings establish ReTabAD as a benchmark for systematic exploration of context-aware AD.

2509.22335 2026-06-01 cs.LG cs.AI 版本更新

Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning

深度持续学习中的谱坍缩导致塑性丧失

Arjun Prakash, Naicheng He, Kaicheng Guo, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris

发表机构 * Department of Computer Science, Brown University(布朗大学计算机科学系)

AI总结 研究深度神经网络在持续学习中塑性丧失的原因,发现新任务初始化时的Hessian谱坍缩是主要因素,并提出基于Kronecker分解的两种正则化方法以保持塑性。

详情
AI中文摘要

我们研究为什么深度神经网络在持续学习中会丧失塑性,从而在不重新初始化参数的情况下无法学习新任务。我们表明,这种失败之前在新任务初始化时会出现Hessian谱坍缩,其中有意义的曲率方向消失,梯度下降变得无效。通过分析线性化ReLU网络,我们推导出成功训练的显式$ε$-秩条件,并证明损失加权Gram矩阵在谱上与广义高斯-牛顿近似等价,从而将NTK动力学与Hessian曲率联系起来。直接针对谱坍缩,我们讨论了Hessian的Kronecker因子近似,这激发了两种正则化增强:保持高有效特征秩和应用L2惩罚。在持续监督学习和强化学习任务上的实验证实,结合这两种正则化器可以有效保持塑性。

英文摘要

We investigate why deep neural networks suffer from loss of plasticity in continual learning, and thus fail to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. Analyzing a linearized ReLU network, we derive explicit $ε$-rank conditions for successful training and prove that the loss-weighted Gram matrix is spectrally equivalent to the Generalized Gauss-Newton approximation, thereby relating NTK dynamics to Hessian curvature. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying L2 penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.

2506.11653 2026-06-01 cs.CV cs.AI cs.LG 版本更新

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

DISCO: 使用条件距离相关性减轻深度学习中的偏差

Emre Kavak, Tom Nuno Wolf, Christian Wachinger

发表机构 * Technical University of Munich, Germany(慕尼黑技术大学) Konrad Zuse School of Excellence in Reliable AI, Germany(Konrad Zuse可靠性人工智能卓越学院) Munich Center for Machine Learning (MCML), Germany(慕尼黑机器学习中心(MCML))

AI总结 提出基于反因果模型的条件独立性准则,并设计条件距离相关性的高效估计器DISCO$_m$和sDISCO,通过正则化实现梯度模型中的偏差缓解,在多个数据集上优于或媲美现有方法。

Comments Accepted to ICML 2026 (oral)

详情
AI中文摘要

数据集偏差常常导致深度学习模型利用虚假相关性而非任务相关信号。我们引入了标准反因果模型(SAM),这是一个统一的因果框架,用于刻画偏差机制并得出因果稳定性的条件独立性准则。基于这一理论,我们提出了DISCO$_m$和sDISCO,它们是条件距离相关性的高效且可扩展的估计器,能够在基于梯度的模型中实现独立性正则化。在六个不同数据集上,我们的方法在现有观察偏差缓解方法中持续表现更优或具有竞争力,同时需要更少的超参数并能够无缝扩展到多偏差场景。这项工作桥接了因果理论与实际深度学习,为稳健预测提供了原则性基础和有效工具。源代码:https://github.com/yakamoz5/DISCO。

英文摘要

Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard Anti-Causal Model (SAM), a unifying causal framework that characterizes bias mechanisms and yields a conditional independence criterion for causal stability. Building on this theory, we propose DISCO$_m$ and sDISCO, efficient and scalable estimators of conditional distance correlation that enable independence regularization in gradient-based models. Across six diverse datasets, our methods consistently outperform or are competitive in existing observed bias mitigation approaches, while requiring fewer hyperparameters and scaling seamlessly to multi-bias scenarios. This work bridges causal theory and practical deep learning, providing both a principled foundation and effective tools for robust prediction. Source Code: https://github.com/yakamoz5/DISCO.

2509.00834 2026-06-01 cs.AI cs.FL cs.LG cs.LO 版本更新

Neuro-Symbolic Predictive Process Monitoring

神经符号预测性过程监控

Axel Mezini, Elena Umili, Ivan Donadello, Fabrizio Maria Maggi, Matteo Mancanelli, Fabio Patrizi

发表机构 * Faculty of Engineering, Free University of Bozen-Bolzano(博洛尼亚-博尔扎诺自由大学工程学院) Department of Computer, Control and Management Engineering, Sapienza, Università di Roma(罗马大学计算机、控制与管理工程系)

AI总结 提出一种结合数据驱动学习与时序逻辑先验知识的神经符号方法,通过可微逻辑损失函数训练自回归序列预测器,以提升业务过程管理中后缀预测的准确性和逻辑一致性。

详情
AI中文摘要

本文通过提出一种神经符号预测性过程监控(PPM)方法,解决了业务流程管理(BPM)中的后缀预测问题,该方法将数据驱动学习与时序逻辑先验知识相结合。尽管最近的方法利用深度学习模型进行后缀预测,但由于训练过程中缺乏领域知识的显式集成,它们常常无法满足甚至基本的逻辑约束。我们提出了一种新颖方法,将有限迹上的线性时序逻辑(LTLf)融入自回归序列预测器的训练过程。我们的方法引入了一个可微的逻辑损失函数,该函数使用LTLf语义的软近似和Gumbel-Softmax技巧定义,可以与标准预测损失结合。这确保了模型学习生成既准确又逻辑一致的后缀。在三个真实世界数据集上的实验评估表明,我们的方法提高了后缀预测的准确性和对时序约束的遵从性。我们还引入了逻辑损失的两种变体(局部和全局),并展示了它们在噪声和现实环境下的有效性。虽然是在BPM背景下开发的,我们的框架适用于任何符号序列生成任务,并有助于推进神经符号人工智能。

英文摘要

This paper addresses the problem of suffix prediction in Business Process Management (BPM) by proposing a Neuro-Symbolic Predictive Process Monitoring (PPM) approach that integrates data-driven learning with temporal logic-based prior knowledge. While recent approaches leverage deep learning models for suffix prediction, they often fail to satisfy even basic logical constraints due to the lack of explicit integration of domain knowledge during training. We propose a novel method to incorporate Linear Temporal Logic over finite traces (LTLf) into the training process of autoregressive sequence predictors. Our approach introduces a differentiable logical loss function, defined using a soft approximation of LTLf semantics and the Gumbel-Softmax trick, which can be combined with standard predictive losses. This ensures that the model learns to generate suffixes that are both accurate and logically consistent. Experimental evaluation on three real-world datasets shows that our method improves suffix prediction accuracy and compliance with temporal constraints. We also introduce two variants of the logic loss (local and global) and demonstrate their effectiveness under noisy and realistic settings. While developed in the context of BPM, our framework is applicable to any symbolic sequence generation task and contributes to advancing Neuro-Symbolic AI.

2508.19830 2026-06-01 cs.CV cs.AI 版本更新

Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification

分布偏移下基于频率感知梯度修正的目标无关校准

Yilin Zhang, Cai Xu, You Wu, Ziyu Guan, Wei Zhao

发表机构 * School of Computer Science and Technology, Xidian University, Xi'an, China(西安电子科技大学计算机科学与技术学院)

AI总结 提出频率感知梯度修正(FGR)框架,通过对训练图像进行低通滤波减少虚假高频线索并学习域不变特征,同时利用几何投影确保分布内校准不退化,从而在无需目标域信息的情况下提升模型在分布偏移下的校准性能。

Comments 25 pages, Accepted at ICML 2026

详情
AI中文摘要

现实世界中的模型部署不可避免地会遇到分布偏移,使得深度神经网络的置信度估计高度不可靠,在安全关键应用中带来严重风险。现有方法通过训练时正则化或事后调整来改善校准,但通常依赖于对目标域的访问(或模拟),限制了实用性。我们提出频率感知梯度修正(FGR),一种用于鲁棒校准的目标无关训练框架。从频率角度出发,FGR 对部分训练图像应用低通滤波,以减少虚假的高频线索并鼓励学习域不变特征。然而,相关的信息损失可能会降低分布内(ID)校准。为了解决这一权衡,FGR 将 ID 校准视为硬约束,并通过几何投影修正冲突的参数更新。这确保了 ID 校准目标的一阶非增,而无需引入额外的损失平衡系数。在合成、真实世界和语义偏移数据集上的大量实验表明,FGR 在保持 ID 性能的同时显著改善了各种偏移下的校准,并且与事后校准方法兼容。我们的代码可在 https://github.com/YilinZhang107/FGR-Calib 获取。

英文摘要

Real-world model deployments inevitably encounter distribution shifts, rendering the confidence estimates of deep neural networks highly unreliable, posing severe risks in safety-critical applications. Existing methods improve calibration via training-time regularization or post-hoc adjustment, but often rely on access to (or simulation of) target domains, limiting practicality. We propose Frequency-aware Gradient Rectification (FGR), a target-agnostic training framework for robust calibration. From a frequency perspective, FGR applies low-pass filtering to a subset of training images to diminish spurious high-frequency cues and encourage the learning of domain-invariant features. However, the associated information loss can degrade In-Distribution (ID) calibration. To resolve this trade-off, FGR treats ID calibration as a hard constraint and rectifies conflicting parameter updates via geometric projection. This ensures a first-order non-increase in the ID calibration objective without introducing an additional loss-balancing coefficient. Extensive experiments on synthetic, real-world, and semantic shift datasets demonstrate that FGR significantly improves calibration under diverse shifts while preserving ID performance, and it remains compatible with post-hoc calibration methods. Our code is available at https://github.com/YilinZhang107/FGR-Calib.

2501.04661 2026-06-01 cs.CL cs.AI 版本更新

Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions

超越记忆:使用短语结构评估大型语言模型中的语义泛化

Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi

发表机构 * Georgetown University(乔治城大学) University of Bath(巴斯大学) DEVCOM U.S. Army Research Laboratory(美国陆军研究实验室) University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 通过构式语法构建诊断评估,测试大型语言模型在低频但人类易理解的短语结构上的语义理解与泛化能力,发现模型在句法相同但语义不同的构式上性能下降超40%。

Comments Camera Ready: AACL-IJCNLP (2025)

详情
AI中文摘要

预训练数据的网络规模带来了一个重要的评估挑战:将预训练数据中充分代表的案例的语言能力与对域外语言(特别是预训练数据中较少见的动态真实世界实例)的泛化能力区分开来。为此,我们利用构式语法(CxG)构建了一个诊断评估,系统性地评估大型语言模型(LLM)的自然语言理解。CxG为测试泛化提供了一个心理语言学上合理的框架,因为它明确地将句法形式与抽象的、非词汇意义联系起来。我们的新颖推理评估数据集包含英语短语构式,已知说话者能够抽象出常见的实例以理解和产生创造性实例。我们的评估数据集使用CxG评估两个核心问题:第一,模型是否能够“理解”那些可能在预训练数据中出现频率较低、但对人类而言直观且易于理解的句子的语义;第二,LLM是否能够在句法相同但意义不同的构式中部署适当的构式语义。我们的结果表明,包括GPT-o1在内的最先进模型在第二个任务上性能下降超过40%,揭示了模型无法像人类那样在句法相同的形式上进行泛化以得出不同的构式意义。我们公开了我们的新颖数据集和相关的实验数据,包括提示和模型响应。

英文摘要

The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, if models can 'understand' the semantics of sentences for instances that are likely to appear in pretraining data less often, but are intuitive and easy for people to understand. Second, if LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but with divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.

2508.08204 2026-06-01 cs.CL cs.AI 版本更新

Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

大型语言模型中推理时间不确定性的人类对齐与校准

Kyle Moore, Jesse Roberts, Daryl Watson

AI总结 本文评估了多种推理时间不确定性度量,发现它们与人类群体不确定性高度对齐,尽管与人类答案偏好不一致,但在正确性相关性和分布分析上表现出中等到强校准证据。

Comments We have discovered a critical error in the normalized entropy calculation that may have substantially inflated nearly all results herein. We have since fixed this error in a new work, but we believe that the new work is sufficiently dissimilar in focus, methods, dataset, and results as to be misleading if presented as a simple replacement. As such, we propose removal and retraction instead

详情
AI中文摘要

最近,评估大型语言模型的不确定性校准引起了广泛关注,以促进模型控制和调节用户信任。推理时间不确定性可能为模型或外部控制模块提供实时信号,对于应用这些概念以改善LLM用户体验尤为重要。尽管许多现有论文考虑模型校准,但相对较少的工作试图评估模型不确定性与人类不确定性的对齐程度。在这项工作中,我们使用既有度量和新颖变体评估了一系列推理时间不确定性度量,以确定它们与人类群体水平不确定性以及传统模型校准概念的接近程度。我们发现,许多度量显示出与人类不确定性强烈对齐的证据,尽管与人类答案偏好缺乏对齐。对于那些成功的度量,我们在正确性相关性和分布分析方面发现了中等到强校准证据。

英文摘要

There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, even despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.

2411.19463 2026-06-01 cs.SE cs.AI 版本更新

Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems

理解检索增强生成系统的基本设计决策

Shengming Zhao, Yuchen Shao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, Lei Ma

发表机构 * Fudan University(复旦大学) East China Normal University(华东师范大学) Shanghai Innovation Institute(上海创新研究院) The University of Tokyo(东京大学) Macau University of Science and Technology(澳门科学理工学院) Concordia University(Concordia大学) University of Alberta(阿尔伯塔大学) The University of Tokyo, Japan(日本东京大学)

AI总结 本文通过系统实验,研究了RAG部署中的三个关键决策(是否部署、检索量、知识集成方式),揭示了任务和模型依赖的优化策略,为实践者提供基于证据的指导。

详情
Journal ref
ACM Transactions on Software Engineering and Methodology (TOSEM), 2026
AI中文摘要

检索增强生成(RAG)已成为增强大型语言模型(LLM)能力的关键技术。然而,实践者在做出RAG部署决策时面临重大挑战。尽管现有研究优先考虑算法创新,但在理解决定RAG成功的基本工程权衡方面仍存在系统性空白。我们首次对三个通用的RAG部署决策进行了全面研究:是否部署RAG、检索多少信息以及如何有效集成检索到的知识。通过在三个LLM和六个数据集(涵盖问答和代码生成任务)上的系统实验,我们揭示了关键见解:(1)RAG部署必须高度选择性,即使有完美文档,可变召回阈值和失败模式也会影响多达12.6%的样本。(2)最优检索量表现出任务依赖性:问答任务呈现通用模式(5-10个文档最优),而代码生成需要针对场景的优化。(3)知识集成有效性取决于任务和模型特性,代码生成从提示方法中显著受益,而问答任务改进甚微。这些发现表明,通用的RAG策略是不够的。有效的RAG系统需要基于任务特性和模型能力的上下文感知设计决策。我们的分析为实践者提供了基于证据的指导,并为原则性RAG部署建立了基础见解。我们的代码、数据和工件公开于https://github.com/ShengmingZ/RAG_Benchmark_Code_QA。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a critical technique for enhancing large language model (LLM) capabilities. However, practitioners face significant challenges when making RAG deployment decisions. While existing research prioritizes algorithmic innovations, a systematic gap persists in understanding fundamental engineering trade-offs that determine RAG success. We present the first comprehensive study of three universal RAG deployment decisions: whether to deploy RAG, how much information to retrieve, and how to integrate retrieved knowledge effectively. Through systematic experiments across three LLMs and six datasets spanning question answering and code generation tasks, we reveal critical insights: (1) RAG deployment must be highly selective, with variable recall thresholds and failure modes affecting up to 12.6\% of samples even with perfect documents. (2) Optimal retrieval volume exhibits task-dependent behavior QA tasks show universal patterns (5-10 documents optimal) while code generation requires scenario-specific optimization. (3) Knowledge integration effectiveness depends on task and model characteristics, with code generation benefiting significantly from prompting methods while question answering shows minimal improvement. These findings demonstrate that universal RAG strategies prove inadequate. Effective RAG systems require context-aware design decisions based on task characteristics and model capabilities. Our analysis provides evidence-based guidance for practitioners and establishes foundational insights for principled RAG deployment. Our code, data and artifacts are publicly available at https://github.com/ShengmingZ/RAG_Benchmark_Code_QA.

2507.11075 2026-06-01 cs.CV cs.AI 版本更新

Joint angle based learning to refine kinematic human pose estimation

基于关节角度学习的运动学人体姿态估计精化

Chang Peng, Yifei Zhou, Haoqiang Ren, Shiqing Huang, Chuangye Chen, Jianming Yang, Bao Yang, Huifeng Xi, Zhenyu Jiang

发表机构 * Department of Engineering Mechanics, School of Civil Engineering and Transportation, South China University of Technology(工程力学系,交通工程学院,华南理工大学) School of Mechanics and Construction Engineering, Jinan University(机械与建筑工程学院,暨南大学) Guangdong Provincial Key Laboratory of Speed Capability, School of Physical Education, Jinan University(广东省速度能力重点实验室,暨南大学体育学院)

AI总结 提出一种基于关节角度的双向循环网络后处理模块,利用高阶傅里叶级数近似生成可靠真值,以精化单图像人体姿态估计,纠正错误关键点并平滑轨迹。

详情
AI中文摘要

无标记人体姿态估计(HPE)在各个领域中的应用日益增多。当前的HPE在分析运动学人体姿态时,偶尔会出现关键点识别错误和关键点轨迹随机波动的问题。现有基于深度学习的HPE精化模型的性能受到训练数据集(关键点手动标注)不准确的显著限制。本文提出了一种新方法克服这一困难,关键技术包括:(i) 基于关节角度的运动学人体姿态鲁棒描述;(ii) 使用高阶傅里叶级数近似关节角度的时间变化以获得可靠的“真值”;(iii) 设计双向循环网络作为后处理模块,以精化基于单图像的HPE模型的估计。使用我们方法构建的高质量数据集训练后,该网络在纠正错误识别关节和平滑其时空轨迹方面表现出卓越性能。测试表明,在花样滑冰和霹雳舞等挑战性案例中,基于关节角度的精化(JAR)优于最先进的HPE精化网络。JAR还展示了纠正现有数据集的巨大潜力。

英文摘要

Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in keypoint recognition and random fluctuation in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. This paper proposed a novel method to overcome the difficulty, in which the key techniques include: (i) A robust joint angle-based description of kinematic human poses; (ii) Approximating temporal variation of joint angles using high order Fourier series to get reliable "ground truth"; (iii) A bidirectional recurrent network is designed as a post-processing module to refine the estimation of single image-based HPE models. Trained with the high-quality dataset constructed using our method, the network demonstrates outstanding performance to correct wrongly recognized joints and smooth their spatiotemporal trajectories. Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases like figure skating and breaking. JAR also demonstrates great potential to rectify existing datasets.

2507.05488 2026-06-01 cs.AI cs.CY 版本更新

OLG++: A Semantic Extension of Obligation Logic Graph

OLG++:义务逻辑图的语义扩展

Subhasis Dasgupta, Jon Stephens, Amarnath Gupta

发表机构 * University of California San Diego(加州大学圣地亚哥分校)

AI总结 提出OLG++,通过引入空间、时间、当事人组、可废止性和逻辑分组等节点与边类型,扩展义务逻辑图以建模市政和跨司法管辖区的法规规则,并通过食品商业法规示例展示其在法律问答中的应用。

详情
AI中文摘要

我们提出了OLG++,这是义务逻辑图(OLG)的语义扩展,用于建模市政和跨司法管辖区的监管和法律规则。OLG++引入了更丰富的节点和边类型,包括空间、时间、当事人组、可废止性和逻辑分组结构,从而能够细致地表示法律义务、例外和层级关系。该模型支持带有上下文条件、优先级和复杂触发器的规则的结构化表示。我们通过食品商业法规的示例展示了其用法,说明了OLG++如何使用属性图查询支持法律问答。我们还讨论了OLG++如何通过提供子类关系、空间约束和具体化例外结构的图原生结构来补充LegalRuleML。工作示例和初步覆盖率分析表明,在所研究的维度上,OLG++在市政监管表示方面比基线OLG模型更具表现力。

英文摘要

We present OLG++, a semantic extension of the Obligation Logic Graph (OLG) for modeling regulatory and legal rules in municipal and interjurisdictional contexts. OLG++ introduces richer node and edge types, including spatial, temporal, party group, defeasibility, and logical grouping constructs, enabling nuanced representations of legal obligations, exceptions, and hierarchies. The model supports structured representation of rules with contextual conditions, precedence, and complex triggers. We demonstrate its use through examples from food-business regulations, showing how OLG++ supports legal question answering using property-graph queries. We also discuss how OLG++ can complement LegalRuleML by providing graph-native constructs for subclass relations, spatial constraints, and reified exception structures. The worked examples and first-pass coverage analysis show that, on the dimensions studied, OLG++ is more expressive than the baseline OLG model for municipal regulatory representation.

2502.12851 2026-06-01 cs.CL cs.AI 版本更新

MeMo: Towards Language Models with Associative Memory Mechanisms

MeMo:迈向具有联想记忆机制的语言模型

Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli

发表机构 * Human-centric ART, University of Rome Tor Vergata(人文导向的ART,罗马大学Tor Vergata) University of Edinburgh(爱丁堡大学) Almawave S.p.A.(Almawave公司)

AI总结 提出MeMo架构,通过分层联想记忆直接记忆文本,实现透明化和模型编辑,实验证明单层和多层配置的记忆能力。

详情
Journal ref
Proceedings of Association for Computational Linguistics (Findings), 2025
AI中文摘要

记忆是基于Transformer的大型语言模型通过学习实现的基本能力。在本文中,我们提出了一种范式转变,通过设计一种直接记忆文本的架构,牢记记忆先于学习的原则。我们引入了MeMo,一种用于语言建模的新颖架构,它在分层联想记忆中显式记忆标记序列。通过设计,MeMo提供了透明性和模型编辑的可能性,包括遗忘文本。我们对MeMo架构进行了实验,展示了单层和多层配置的记忆能力。

英文摘要

Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.

2506.14842 2026-06-01 cs.CV cs.AI 版本更新

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

PictSure:预训练嵌入对上下文学习图像分类器至关重要

Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop

发表机构 * German Research Center for AI (DFKI)(德国人工智能研究中心(DFKI)) Centrum Wiskunde & Informatica (CWI)(数学与信息学研究中心(CWI))

AI总结 本文提出PictSure视觉上下文学习模型,发现预训练嵌入质量是下游性能的关键瓶颈,而融合层训练数据的多样性影响有限。

Comments 10 pages, 2 figures

详情
AI中文摘要

在数据稀缺领域,构建图像分类模型仍然繁琐,因为收集大规模标注数据集不切实际。上下文学习(ICL)是少样本图像分类(FSIC)的一种有前景的范式,但先前工作未充分探索编码器预训练与融合层训练数据的相对重要性。我们提出了PictSure,一个纯视觉的ICL模型家族,展示了易于使用的融合Transformer架构的潜力,以及需要在更广泛的图像域中获得更好的嵌入表示。在域内和域外评估中,我们发现预训练引起的表示质量与下游ICL性能强相关。关键在于,将融合Transformer的训练数据集从仅ImageNet更改为多样化的多域混合,在评估设置下仅提供有限的额外性能提升,表明一旦嵌入充分结构化,融合层似乎能够有效适应。这些结果表明,视觉ICL的瓶颈是表示质量,而非融合模块的训练多样性。为了促进采用和可重复性,我们以开源形式发布所有模型权重,并提供一个MCP服务器,将PictSure作为可调用工具暴露给基于LLM的智能系统,使少样本图像分类能够在AI流水线中直接调用,无需集成开销。代码可在https://github.com/PictSure获取,模型可在https://huggingface.co/pictsure获取。

英文摘要

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) is a promising paradigm for few-shot image classification (FSIC), but prior work has underexplored the relative importance of encoder pretraining versus fusion-layer training data. We present PictSure, a vision-only ICL family of models that demonstrates the potential of easy-to-use fusion transformer architectures, as well as the need for better embedding representations across a wider range of image domains. In both in-domain and out-of-domain evaluations, we find that representation quality induced by pretraining strongly correlates with downstream ICL performance. Crucially, varying the training dataset for the fusion transformer, from ImageNet alone to diverse multi-domain mixtures, provides limited additional performance gains under the evaluated settings, demonstrating that the fusion layer appears capable of adapting effectively once embeddings are sufficiently structured. These results show that the bottleneck in visual ICL is representation quality, not fusion-module training diversity. To facilitate adoption and reproducibility, we release all model weights as open-source artifacts and provide an MCP server that exposes PictSure as a callable tool for LLM-based agentic systems, enabling few-shot image classification to be invoked directly within AI pipelines without integration overhead. Code can be found at https://github.com/PictSure and models at https://huggingface.co/pictsure.

2506.12060 2026-06-01 cs.CR cs.AI cs.CY 版本更新

Organizational Adaptation to Generative AI in Cybersecurity

组织对生成式人工智能在网络安全中的适应

Christopher Nott

发表机构 * Independent Researcher, United States(美国独立研究者)

AI总结 本研究通过分析2022至2025年的25项研究,采用定性方法探讨组织如何通过修改框架和混合操作流程适应生成式AI,发现成熟基础设施、监管压力和人力资本投资是成功的关键,同时指出攻防能力不平衡等挑战。

Comments 38 pages, 1 table, 1 figure Revised title, abstract, and formatting for journal submission, corrected heading numbers, no substantive changes in content

详情
AI中文摘要

网络安全组织正在通过修改框架和混合操作流程来适应生成式AI的整合,其成功受到现有安全成熟度、监管要求以及人力和基础设施投资的影响。本定性研究采用系统文档分析和比较案例研究方法,考察了2022至2025年间25项研究如何记录威胁建模框架的组织适应,揭示了从传统基于签名的系统向AI能力框架的转变,涉及三种主要模式:用于安全应用的LLM集成、用于风险检测和响应自动化的GenAI框架,以及用于威胁狩猎和匹配的AI/ML集成。拥有成熟基础设施的组织,尤其是在金融和关键基础设施领域,通过结构化治理、专门的AI团队和稳健的事件响应流程表现出更高的准备度,其中中央银行和金融机构在监管压力下引领适应工作。成功整合需要人工监督自动化系统、关注数据质量和可解释性,以及特定行业的治理,尽管在隐私保护、偏见减少、人员培训和对抗性防御方面仍存在持续困难。进攻性和防御性GenAI能力之间的显著不平衡为安全规划带来了战略担忧。研究结果为网络安全专业人员提供了可操作的见解,并强调了在管理AI增强威胁时采用适应性方法、伦理框架和员工发展的必要性。

英文摘要

Cybersecurity organizations are adapting to GenAI integration through modified frameworks and hybrid operational processes, with success influenced by existing security maturity, regulatory requirements, and investments in human capital and infrastructure. This qualitative research employs systematic document analysis and comparative case study methodology to examine how 25 studies from 2022 to 2025 document organizational adaptation of threat modeling frameworks, revealing a shift away from traditional signature-based systems toward AI-capable frameworks across three primary patterns: LLM integration for security applications, GenAI frameworks for risk detection and response automation, and AI/ML integration for threat hunting and matching. Organizations with mature infrastructures, particularly in finance and critical infrastructure, demonstrate higher readiness through structured governance, dedicated AI teams, and robust incident response processes, with central banks and financial institutions leading adaptation efforts under regulatory pressure. Successful integration requires human oversight of automated systems, attention to data quality and explainability, and sector-specific governance, though ongoing difficulties with privacy protection, bias reduction, personnel training, and adversarial defense persist. Notable imbalances between offensive and defensive GenAI capabilities create strategic concerns for security planning. The findings offer actionable insights for cybersecurity professionals and underscore the need for adaptive approaches, ethical frameworks, and staff development when managing AI-enhanced threats.

2505.22934 2026-06-01 cs.CL cs.AI cs.LG 版本更新

Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

解开LoRA干扰:用于鲁棒模型合并的正交子空间

Haobo Zhang, Jiayu Zhou

发表机构 * University of Michigan Ann Arbor(密歇根大学安娜堡分校)

AI总结 针对LoRA微调模型合并时性能下降的问题,提出通过微调前约束LoRA子空间正交性来减少任务间干扰的方法OSRM,可无缝集成现有合并算法,提升合并性能并保持单任务准确率。

Comments 14 pages, 5 figures, 16 tables, accepted by ACL 2025

详情
AI中文摘要

针对单个任务微调大型语言模型(LM)虽然性能强劲,但部署和存储成本高昂。近期研究探索模型合并,将多个任务特定模型组合成单个多任务模型,无需额外训练。然而,现有合并方法对于使用低秩适应(LoRA)微调的模型往往失败,导致性能显著下降。本文表明,这一问题源于模型参数与数据分布之间先前被忽视的相互作用。我们提出用于鲁棒模型合并的正交子空间(OSRM),在微调*之前*约束LoRA子空间,确保与一个任务相关的更新不会对其他任务的输出产生不利偏移。我们的方法可以无缝集成到大多数现有合并算法中,减少任务间的意外干扰。在八个数据集上使用三种广泛使用的LM和两种大型LM进行的广泛实验表明,我们的方法不仅提升了合并性能,还保持了单任务准确率。此外,我们的方法对合并的超参数表现出更强的鲁棒性。这些结果突显了数据-参数交互在模型合并中的重要性,并为合并LoRA模型提供了一种即插即用的解决方案。

英文摘要

Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.

2503.14190 2026-06-01 cs.AI 版本更新

Inferring Events from Time Series using Language Models

利用语言模型从时间序列中推断事件

Mingtian Tan, Mike A. Merrill, Zack Gottesman, Tim Althoff, David Evans, Tom Hartvigsen

发表机构 * University of Virginia(弗吉尼亚大学) Stanford University(斯坦福大学) University of Washington(华盛顿大学)

AI总结 研究大型语言模型能否从时间序列数据中推断自然语言事件,提出自动化任务生成方法和新基准,并通过蒸馏与强化学习提升小模型性能。

Comments 21 pages, 15 Figures

详情
AI中文摘要

分析时间序列数据的一个常见目标是理解事件如何导致观测到的变化。我们研究大型语言模型(LLMs)是否能够推断与时间序列数据相关的自然语言事件。我们引入了一种基于体育数据的自动化方法,用于生成测试模型推理与时间序列数据相关事件能力的任务,并开发了一种新的基准测试方法。在涵盖18个LLMs的实验中,我们提示LLMs根据时间序列数据推断未观测到的事件,并观察到令人惊讶的成功,即使在提供极少上下文的情况下。然后,我们展示了将蒸馏与强化学习(RL)相结合可以提高小型语言模型的性能,使其接近大型专有推理模型。重现我们工作所需的所有资源均可获取:https://github.com/hartvigsen-group/GAMETime

英文摘要

A common goal in analyzing time series data is to understand how events cause observed variations. We study whether Large Language Models (LLMs) can infer natural language events associated with time series data. We introduce an automated method for generating tasks that test a model's ability to reason about events associated with time series data based on sports data, and develop a new benchmarking method. In experiments spanning 18 LLMs, we prompt LLMs to infer unobserved events given time series data and observe surprising successes, even when providing minimal context. We then show that combining distillation with Reinforcement Learning (RL) can improve the performance for small language models to approach that of large proprietary reasoning models. All resources needed to reproduce our work are available: https://github.com/hartvigsen-group/GAMETime

2411.13865 2026-06-01 cs.IR cs.AI cs.CL cs.LG 版本更新

Breaking Information Cocoons: A Hyperbolic Framework for Balancing Exploration and Exploitation in Recommender Systems

打破信息茧房:推荐系统中平衡探索与利用的双曲框架

Qiyao Ma, Menglin Yang, Mingxuan Ju, Tong Zhao, Neil Shah, Rex Ying

发表机构 * University of California, Davis(加州大学戴维斯分校) The Hong Kong University of Science(香港科学大学) Snap Inc.(Snap公司) Yale University(耶鲁大学)

AI总结 提出双曲框架HERec,通过语义增强的层次机制和自动层次聚类,在推荐系统中平衡探索与利用,有效缓解信息茧房。

Comments Accepted to KDD 2026. Code: https://github.com/Martin-qyma/HERec

详情
AI中文摘要

现代推荐系统常常形成信息茧房,限制用户接触多样化内容。核心挑战在于平衡内容探索与利用,同时允许用户调整推荐偏好。理想情况下,这种平衡可以通过层次表示来捕捉,其中深度搜索促进利用,广度搜索促进探索。然而,现有方法面临两个基本限制:欧几里得方法难以捕捉层次结构,而双曲方法尽管在层次建模上表现优越,但缺乏对用户和物品画像的语义理解,且未能提供平衡探索与利用的原则性机制。为解决这些问题,我们提出HERec,一个在推荐系统中有效平衡探索与利用的双曲框架。我们的框架引入两项关键创新:(1)语义增强的层次机制,直接在双曲空间中将丰富的文本描述与协同信息对齐。理论梯度分析表明,这种对齐有效利用了底层双曲流形结构,从而更准确地建模用户和物品;(2)通过优化Dasgupta代价的自动层次聚类机制,无需预定义超参数即可发现层次结构,实现用户可调节的探索-利用权衡。大量实验表明,HERec持续优于欧几里得和双曲基线,在效用指标上提升高达5.49%,多样性指标提升11.39%,有效缓解了信息茧房。

英文摘要

Modern recommender systems often create information cocoons, restricting users' exposure to diverse content. The central challenge is to balance content exploration and exploitation while allowing users to adjust their recommendation preferences. Ideally, this balance can be captured with a hierarchical representation, where depth search facilitates exploitation and breadth search enables exploration. However, existing approaches face two fundamental limitations: Euclidean methods struggle to capture hierarchical structures, while hyperbolic methods, despite their superior hierarchical modeling, lack semantic understanding of user and item profiles and fail to provide a principled mechanism for balancing exploration and exploitation. To address these challenges, we propose HERec, a hyperbolic framework that effectively balances exploration and exploitation in recommender systems. Our framework introduces two key innovations: (1) a semantic-enhanced hierarchical mechanism that aligns rich textual descriptions with collaborative information directly in hyperbolic space. Theoretical gradient analysis demonstrates that this alignment effectively leverages the underlying hyperbolic manifold structure, resulting in more accurate modeling of users and items; (2) an automatic hierarchical clustering mechanism by optimizing Dasgupta's cost, which discovers hierarchical structures without requiring predefined hyperparameters, enabling user-adjustable exploration-exploitation trade-offs. Extensive experiments demonstrate that HERec consistently outperforms both Euclidean and hyperbolic baselines, achieving up to 5.49% improvement in utility metrics and 11.39% increase in diversity metrics, effectively mitigating information cocoons.

2409.14583 2026-06-01 cs.AI 版本更新

LLM Bias Evaluation: Gender, Racial, and Age Disparities in Occupational and Crime Scenarios

LLM偏差评估:职业与犯罪场景中的性别、种族和年龄差异

Vishal Mirza, Rahul Kulkarni, Aakanksha Jadhav

发表机构 * New York University(纽约大学) Northeastern University(东北大学) Washington University in St. Louis(圣路易斯华盛顿大学)

AI总结 本文评估了2024年四大领先LLM在职业和犯罪场景中的性别、种族和年龄偏差,发现去偏努力常导致新的公平性权衡,即“去偏悖论”。

Comments Updated title and abstract to emphasize key findings on the debiasing paradox for improved discoverability. Content and findings unchanged. 11 pages, 17 figures, Accepted at IEEE Conference on Artificial Intelligence (IEEE CAI) 2025. Full Paper acceptance in the Vertical HUMAN-CENTERED AI category

详情
Journal ref
2025 IEEE Conference on Artificial Intelligence (CAI)
AI中文摘要

LLM偏差评估至关重要,因为大型语言模型(LLM)越来越多地影响高风险决策。本文对领先LLM中的性别、种族和年龄差异进行了全面评估,揭示出去偏努力常常创造新的公平性权衡。近年来LLM的进展显著,但由于各种限制,企业广泛采用仍然有限。本文考察了LLM中的偏差——这是一个影响其可用性、可靠性和公平性的关键问题。我们的研究评估了2024年发布的四个领先LLM(Gemini 1.5 Pro、Llama 3 70B、Claude 3 Opus和GPT-4o)在职业场景中的性别偏差以及犯罪场景中的性别、年龄和种族偏差。结果显示,LLM在各种职业中描绘女性角色的频率往往高于男性,与美国劳工统计局数据相比偏差达37%。在犯罪场景中,与美国联邦调查局数据的偏差在性别上为54%,种族上为28%,年龄上为17%。关键的是,我们观察到减少性别和种族偏差的努力常常导致过度偏向某一子类的结果,可能加剧差异——这种“去偏悖论”凸显了当前偏差缓解技术的局限性,并强调了更有效方法的必要性。

英文摘要

LLM bias evaluation is critical as large language models (LLMs) increasingly influence high-stakes decisions. This paper provides a comprehensive assessment of gender, racial, and age disparities in leading LLMs, revealing that debiasing efforts often create new fairness trade-offs. Recent advancements in LLMs have been notable, yet widespread enterprise adoption remains limited due to various constraints. This paper examines bias in LLMs - a crucial issue affecting their usability, reliability, and fairness. Our study evaluates gender bias in occupational scenarios and gender, age, and racial bias in crime scenarios across four leading LLMs released in 2024: Gemini 1.5 Pro, Llama 3 70B, Claude 3 Opus, and GPT-4o. Findings reveal that LLMs often depict female characters more frequently than male ones in various occupations, showing a 37% deviation from US BLS data. In crime scenarios, deviations from US FBI data are 54% for gender, 28% for race, and 17% for age. Critically, we observe that efforts to reduce gender and racial bias often lead to outcomes that may over-index one sub-class, potentially exacerbating disparities - a "debiasing paradox" that highlights the limitations of current bias mitigation techniques and underscores the need for more effective approaches.

2501.01926 2026-06-01 cs.CV cs.AI 版本更新

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation

跨模态注意力校准用于LVLM幻觉缓解

Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li

发表机构 * Sun Yat-sen University(中山大学) The University of Hong Kong(香港大学) Meituan(美团) Inspur Database Technology(Inspur数据库技术) Guilin University of Electronic Technology(桂林电子科技大学) Shenzhen Loop Area Institute(深圳环湖院) Guangdong Key Laboratory of Big Data Analysis and Processing(广东大数据分析与处理重点实验室)

AI总结 提出一种无需训练的跨模态注意力校准方法,通过设计模态间解码和位置校准模块,缓解大型视觉语言模型中的幻觉问题。

Comments CVPR2026

详情
AI中文摘要

大型视觉语言模型(LVLM)在视觉-语言理解方面表现出显著能力。尽管取得了成功,LVLM在复杂生成任务中仍然会产生幻觉,导致视觉输入与生成内容不一致。为了解决这个问题,一些方法引入了推理时干预,如对比解码,以减少对语言先验的过度依赖。然而,这些方法忽略了由位置偏差和虚假跨模态相关性引起的幻觉。在本文中,我们提出了一种跨模态注意力校准(CMAC)方法,以无需训练的方式缓解LVLM中的幻觉。在该方法中,我们设计了一个模态间解码(IMD)模块,通过一种新颖的对比解码机制来减轻幻觉。IMD将具有显著跨模态注意力权重的值向量掩蔽为失真,从而同时解决了单模态过度依赖和误导性跨模态相关性问题。此外,跨模态位置校准(CMPC)模块缩小了图像标记的位置差距,缓解了跨模态注意力中的位置偏差。在多种幻觉基准上的实验结果验证了我们的方法在减少LVLM幻觉方面优于现有最先进技术。我们的代码将在https://github.com/lijm48/IMCCD上提供。

英文摘要

Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from position bias and spurious inter-modality correlations. In this paper, we propose a Cross-Modal Attention Calibration (CMAC) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design an Inter-Modality Decoding (IMD) module to alleviate hallucination by a novel contrastive decoding mechanism. IMD masks the value vectors associated with significant cross-modal attention weights as distortion, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Cross-Modal Position Calibration (CMPC) module shrinks the position gap of image tokens, alleviating the position bias in cross-modal attention. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations for LVLM. Our code will be available at https://github.com/lijm48/IMCCD.

2502.15224 2026-06-01 cs.LG cs.AI 版本更新

Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

自动发现基准:在Oracle引导发现中诊断结构化状态追踪

Tingting Chen, Beibei Lin, Srinivas Anumasa, Vedant Shah, Zifeng Yuan, Qiran Zou, Anirudh Goyal, Dianbo Liu

发表机构 * National University of Singapore(国立新加坡大学) Mila-Quebec AI institute(魁北克AI研究院) Meta Superintelligence Labs(Meta超智能实验室)

AI总结 提出Auto-Discovery-Bench基准,通过确定性Oracle引导的假设-干预-反馈循环,诊断智能体在结构化状态追踪中的能力瓶颈。

Comments 13 pages

详情
AI中文摘要

交互式发现要求智能体在多轮反馈中维护和更新结构化信念。在评估智能体于嘈杂、开放的科学环境中的表现之前,有必要在受控条件下隔离这一先决能力。我们引入了Auto-Discovery-Bench,一个确定性的Oracle引导诊断基准,其中智能体通过重复的假设-干预-反馈循环恢复隐藏结构。该基准实例化了三种受控发现抽象:有向图发现、无向关系发现和符号方程发现。在所有模型中,性能随着变量数量、轨迹长度和干扰项的增加而下降。一个独立的轨迹追踪诊断表明,即使移除了干预选择和假设生成,许多失败仍然存在,这表明在维护和整合长程结构化信息方面的限制是Oracle引导发现的重要瓶颈。Auto-Discovery-Bench并非旨在取代真实的发现环境;相反,它提供了一个可重复、低混淆的诊断测试平台,用于隔离交互式科学智能体的先决能力。

英文摘要

Interactive discovery requires agents to maintain and update structured beliefs over many rounds of feedback. Before evaluating agents in noisy, open-ended scientific environments, it is useful to isolate this prerequisite capability under controlled conditions. We introduce Auto-Discovery-Bench, a deterministic oracle-guided diagnostic benchmark in which agents recover hidden structures through repeated hypothesis--intervention--feedback cycles. The benchmark instantiates three controlled discovery abstractions: directed graph discovery, undirected relational discovery, and symbolic equation discovery. Across models, performance degrades as the number of variables, trajectory length, and distractors increase. A separate trajectory-tracking diagnostic shows that many failures persist even when intervention selection and hypothesis generation are removed, suggesting that limitations in maintaining and integrating long-range structured information are an important bottleneck for oracle-guided discovery. Auto-Discovery-Bench is not intended to replace realistic discovery environments; rather, it provides a reproducible, low-confound diagnostic testbed for isolating a prerequisite capability for interactive scientific agents.

2502.04671 2026-06-01 cs.AI cs.LG cs.LO cs.PL 版本更新

ProofWala: A Framework for Multilingual Proof Data Synthesis and Theorem-Proving

ProofWala: 多语言证明数据合成与定理证明框架

Amitayush Thakur, George Tsoukalas, Greg Durrett, Swarat Chaudhuri

发表机构 * University of Texas, Austin, USA(得克萨斯大学奥斯汀分校)

AI总结 提出ProofWala框架,通过itp-interface库实现与交互式定理证明器的程序化交互,支持多语言证明数据合成、并行证明搜索,并验证了跨语言与跨领域迁移的有效性。

详情
AI中文摘要

神经定理证明方法需要强大的基础设施来与交互式定理证明器(ITP)交互、提取结构化证明数据以及大规模执行证明搜索。然而,现有工具通常针对特定助手且面向文件级执行,使得仓库级分析和并行实验变得困难。我们提出ProofWala,一个多语言证明工程框架,基于 exttt{itp-interface}构建,这是一个用于与ITP进行程序化交互的可重用库。对于Lean 4,我们实现了一个在阐释器内部执行的元编程交互层,支持语义上忠实的策略级跟踪,以及跨整个仓库的声明和依赖级提取。该设计超越了传统的REPL式交互,支持项目范围的分析、环境克隆和证明状态的池化执行。相同的接口抽象支持多个版本的Rocq,形成统一的跨助手流水线。 基于此基础设施,ProofWala提供标准化的多语言证明数据集、模型训练工具和并行证明搜索算法。使用该框架,我们展示了跨Lean和Rocq的多语言训练能够实现跨语言和跨领域迁移。我们在Lean Mathlib和领域适应(CategoryTheory)上观察到统计显著的改进,而其他设置也呈现一致的增长趋势。我们在两个仓库中开源了完整框架、并行证明搜索模块、数据集和模型:ProofWala (https://github.com/trishullab/proof-wala) 和 itp-interface 库 (https://github.com/trishullab/itp-interface)。

英文摘要

Neural approaches to theorem proving require robust infrastructure for interfacing with interactive theorem provers (ITPs), extracting structured proof data, and executing proof search at scale. However, existing tooling is often assistant-specific and oriented toward file-level execution, making repository-scale analysis and parallel experimentation challenging. We present ProofWala, a multilingual proof engineering framework built around \texttt{itp-interface}, a reusable library for programmatic interaction with ITPs. For Lean 4, we implement a meta-programmed interaction layer executing inside the elaborator, enabling semantically faithful tactic-level tracing alongside declaration- and dependency-level extraction across entire repositories. This design extends beyond traditional REPL-style interaction by supporting project-wide analysis, environment cloning, and pooled execution of proof states. The same interface abstraction supports multiple versions of Rocq, yielding a unified cross-assistant pipeline. Built on this infrastructure, ProofWala provides standardized multilingual proof datasets, model training utilities, and parallel proof search algorithms. Using the framework, we demonstrate that multilingual training across Lean and Rocq enables cross-lingual and cross-domain transfer. We observe statistically significant improvements on Lean Mathlib and in domain adaptation (CategoryTheory), while other settings exhibit consistent upward trends. We open-source the full framework, parallel proof search module, datasets, and models across two repositories: ProofWala (https://github.com/trishullab/proof-wala) and the itp-interface library (https://github.com/trishullab/itp-interface).

2502.04554 2026-06-01 cs.AI 版本更新

Unifying and Optimizing Data Values for Selection via Sequential Decision-Making

通过序列决策统一和优化数据选择的数据价值

Hongliang Chi, Qiong Wu, Zhengyi Zhou, Jonathan Light, Emily Dodwell, Yao Ma

发表机构 * Rensselaer Polytechnic Institute(伦塞拉尔理工学院)

AI总结 将数据选择重构为序列决策问题,通过动态规划得到最优选择序列,并统一解释Data Shapley等现有方法为近视线性近似,提出基于二分图的高效替代方法,在经典ML和大规模LLM微调数据选择中显著优于现有方法。

详情
AI中文摘要

数据选择已成为数据价值的一个关键下游应用,然而在数据价值用于选择的理论基础方面仍未被充分探索。我们将数据选择重新表述为一个序列决策问题,其中最优选择序列由动态规划产生,而数据价值可以被理解为该最优序列的编码。这一框架通过近似动态规划的视角统一并重新解释了现有方法(如Data Shapley),揭示它们是对序列问题的近视线性近似。我们进一步分析了在子模性下选择最优性如何随效用曲率下降,解释了这些近似何时以及为何失败。为了弥合理论与实践,我们提出了一种基于二分图的高效替代方法,该方法在保持子模结构的同时,实现了具有可证明保证的可扩展贪心选择。在经典机器学习基准和大规模LLM微调数据选择上的实验表明,该方法显著优于现有方法。代码公开于https://github.com/frankhlchi/SeqDataVal。

英文摘要

Data selection has emerged as a crucial downstream application of data valuation, yet the theoretical foundations for using data values in selection remain underexplored. We reformulate data selection as a sequential decision-making problem where the optimal selection sequence arises from dynamic programming, and data values can be understood as encodings of this optimal sequence. This framework unifies and reinterprets existing methods like Data Shapley through the lens of approximate dynamic programming, revealing them as myopic linear approximations to the sequential problem. We further analyze how selection optimality degrades with utility curvature under submodularity, explaining when and why these approximations fail. To bridge theory and practice, we propose an efficient bipartite graph-based surrogate that preserves submodular structure while enabling scalable greedy selection with provable guarantees. Experiments on classical ML benchmarks and large-scale LLM fine-tuning data selection demonstrate substantial improvements over existing methods. Code is publicly available at https://github.com/frankhlchi/SeqDataVal

2404.14928 2026-06-01 cs.LG cs.AI cs.CL cs.SI 版本更新

Graph Machine Learning in the Era of Large Language Models (LLMs)

大语言模型时代的图机器学习

Shijie Wang, Jiani Huang, Zhikai Chen, Yu Song, Wenzhuo Tang, Haitao Mao, Wenqi Fan, Hui Liu, Xiaorui Liu, Dawei Yin, Qing Li

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Michigan State University(密歇根州立大学) North Carolina State University(北卡罗来纳州立大学) Baidu Inc(百度公司)

AI总结 本文综述了大语言模型如何增强图机器学习的泛化、迁移和少样本学习能力,以及图如何提升大语言模型的推理和可解释性。

Comments Accepted by TIST

详情
AI中文摘要

图在表示社交网络、知识图谱和分子发现等各个领域的复杂关系中扮演着重要角色。随着深度学习的出现,图神经网络(GNN)已成为图机器学习(Graph ML)的基石,促进了图的表示和处理。最近,大语言模型(LLM)在语言任务中展现出前所未有的能力,并被广泛应用于计算机视觉和推荐系统等各种应用中。这一显著成功也引起了将LLM应用于图领域的兴趣。越来越多的努力致力于探索LLM在提升图机器学习的泛化性、迁移性和少样本学习能力方面的潜力。同时,图,尤其是知识图谱,富含可靠的事实知识,可用于增强LLM的推理能力,并可能缓解其局限性,如幻觉和缺乏可解释性。鉴于这一研究方向的快速进展,有必要对LLM时代图机器学习的最新进展进行系统综述,为研究人员和从业者提供深入理解。因此,在本综述中,我们首先回顾了图机器学习的最新发展。然后,我们探讨了如何利用LLM来增强图特征的质量,减轻对标注数据的依赖,并解决图异质性和分布外(OOD)泛化等挑战。之后,我们深入探讨了图如何增强LLM,突出了它们增强LLM预训练和推理的能力。此外,我们调查了各种应用,并讨论了这一有前景领域的潜在未来方向。

英文摘要

Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graphs. Recently, LLMs have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations such as hallucinations and the lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph Heterophily and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.