arXivDaily arXiv daily academic digest, updated Monday through Friday
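2605.28812 2026-05-28 cs.RO cs.AI cs.LG Updated version

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin

Affiliations ETH Zürich, UC Berkeley

AI summary Proposes Center-of-Pressure (CoP), a physics-grounded tactile representation, combined with sensor calibration based on differentiable dynamics, to achieve zero-shot sim-to-real transfer on a multi-fingered hand; it outperforms binary-contact and raw-taxel baselines on peg-in-hole insertion and ball balancing.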

Comments Project site: https://mpan31415.github.io/tactile_rep/

Abstract

A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features -- sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.
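
The idea of a center of pressure can be sketched as follows. This is a minimal illustration, assuming CoP is the force-weighted mean of taxel positions on a single sensor pad; the abstract does not give the exact formula, and the function name `center_of_pressure`, the 2x2 taxel layout, and the `min_total` contact threshold are all hypothetical.

```python
def center_of_pressure(positions, forces, min_total=1e-6):
    """Force-weighted mean of taxel positions (an assumed definition of CoP).

    positions: list of (x, y) taxel coordinates on the pad
    forces:    list of non-negative force readings, one per taxel
    Returns (x, y), or None when the total force is below min_total.
    """
    total = sum(forces)
    if total < min_total:
        return None  # treated as "no contact" in this sketch
    x = sum(f * p[0] for f, p in zip(forces, positions)) / total
    y = sum(f * p[1] for f, p in zip(forces, positions)) / total
    return (x, y)


# Hypothetical 2x2 pad: only the two taxels at x = 1.0 are loaded.
taxels = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cop = center_of_pressure(taxels, [0.0, 2.0, 0.0, 2.0])
no_contact = center_of_pressure(taxels, [0.0, 0.0, 0.0, 0.0])
```

Unlike a binary contact flag, such a value changes continuously as the contact point moves, while staying far lower-dimensional than the raw taxel array.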

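2605.28807 2026-05-28 cs.AI Updated version

Calibrating Conservatism for Scalable Oversight

William Overman, Mohsen Bayati

Affiliations Stanford Graduate School of Business

AI summary Proposes Calibrated Collective Oversight (CCO), which calibrates conservatism online so that undesirable outcomes stay below a user-specified threshold without distributional assumptions; validated in experiments on SWE-bench and MACHIAVELLI.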

Abstract

Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: actions face a penalty proportional to overseer concern, so high-utility actions are still selected when overseers find them unobjectionable and overridden only when concern accumulates. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target threshold with finite-time bounds and no distributional assumptions. On a modified version of SWE-bench, weaker overseers successfully constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets, as predicted by the theory.
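
The penalty-and-calibration loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the mean aggregation of overseer concerns, the additive update rule, the step size, and the names `select_action` and `update_lambda` are all hypothetical.

```python
def select_action(actions, lam):
    """Pick the action maximizing utility minus lam times mean overseer concern."""
    def score(a):
        concern = sum(a["concerns"]) / len(a["concerns"])
        return a["utility"] - lam * concern
    return max(actions, key=score)


def update_lambda(lam, violated, target, step=0.1):
    """Online calibration (assumed form): raise lam after a violation,
    lower it otherwise, so the long-run violation rate tracks target."""
    return max(0.0, lam + step * ((1.0 if violated else 0.0) - target))


actions = [
    {"name": "risky", "utility": 1.0, "concerns": [0.9, 0.7]},
    {"name": "baseline", "utility": 0.2, "concerns": [0.0, 0.0]},
]
permissive = select_action(actions, lam=0.0)["name"]
conservative = select_action(actions, lam=2.0)["name"]
```

With a zero penalty weight the high-utility action wins; as the weight grows after observed violations, the choice shifts toward the conservative baseline.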

2605.28805 2026-05-28 cs.CL cs.AI cs.CV cs.LG Updated version

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

Affiliations Tsinghua University, Pennsylvania State University, University of Southern California, Microcyto, Princeton University

AI summary Proposes OmniVerifier-M1, which uses symbolic meta-verification (e.g., bounding boxes) and decoupled reinforcement learning to provide reliable, fine-grained verification and dynamic region-level self-correction for multimodal large language models.

Comments ICML 2026. Project: https://github.com/Cominclip/OmniVerifier

Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
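
To illustrate why bounding boxes allow rule-based rewards, here is a minimal sketch that scores a predicted error region against a reference box with intersection over union (IoU). The abstract does not state the actual reward rule; the 0.5 threshold and the function names `iou` and `box_reward` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_reward(pred, gold, threshold=0.5):
    """Binary rule-based reward (assumed form): 1.0 when the boxes overlap enough."""
    return 1.0 if iou(pred, gold) >= threshold else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

A check like this needs no auxiliary judge model, which is the practical advantage the abstract attributes to symbolic rationales over free-text explanations.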

2605.28792 2026-05-28 cs.AI cs.HC cs.LG Updated version

CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

Abhilash Durgam, Nyle Siddiqui, Jeffrey A. Chan-Santiago, Qiushi Fu, Elakkat D. Gireesh, Mubarak Shah

Affiliations CRCV, University of Central Florida; University of Central Florida; Department of MAE, University of Central Florida; Department of Neurology, Loma Linda University

AI summary Proposes CaMBRAIN, the first causal Mamba-based state space model for EEG, trained with a multi-stage self-supervised pipeline for real-time, long-range continuous inference on EEG signals; it reaches SOTA on three datasets with more than 10x higher throughput.

Comments 22 pages, 3 figures, 8 tables

Abstract

Electroencephalography (EEG) is a critical, non-invasive method to monitor electrical brain activity. EEGs can span anywhere from a couple of seconds to multiple hours, posing a major hurdle for existing deep learning methods due to two factors: (1) existing EEG models are predominantly built upon the attention mechanism, incurring quadratic scaling as the sequence length increases, and (2) raw EEG signals must be processed in a sliding-window fashion due to fixed-length input requirements, preventing global understanding of the entire signal. To this end, we propose CaMBRAIN, the first Causal, Mamba-based state space model (SSM) capable of real-time inference of EEG signals, arguing that bidirectional approaches are needlessly expensive given the causal, unidirectional nature of EEG. However, training such a model is non-trivial, as crucial EEG events can be extremely brief (within fractions of a second) yet separated by long intervals spanning minutes. Current EEG methods use self-supervised objectives that optimize for signal reconstruction, but these are not well suited for streaming SSMs; they fail to explicitly train the hidden state to retain the salient long-range context needed for streaming inference. We therefore introduce a multi-stage self-supervised training pipeline specifically tailored to encourage long-range memory retention and strong performance on EEG signals, while preserving the linear-time complexity of state space models. CaMBRAIN achieves state-of-the-art (SOTA) results across 3 different EEG datasets with >10x higher throughput than existing models, making it the first model capable of long-range, continuous inference of variable-length EEG signals.
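
The linear-time, streaming property of a causal state space model can be sketched with a plain diagonal linear recurrence. This is a deliberate simplification, not Mamba's input-dependent (selective) parameterization and not the CaMBRAIN architecture; the class name `StreamingSSM` and the parameter values are hypothetical.

```python
class StreamingSSM:
    """Diagonal linear state space model: h_t = a * h_{t-1} + b * x_t, y_t = c . h_t.

    Each new sample costs O(state size), regardless of how long the signal
    has been running, and only past samples influence the output (causal).
    """

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c
        self.h = [0.0] * len(a)  # hidden state carried across samples

    def step(self, x):
        # Update each state dimension independently, then read out.
        self.h = [ai * hi + bi * x for ai, bi, hi in zip(self.a, self.b, self.h)]
        return sum(ci * hi for ci, hi in zip(self.c, self.h))


model = StreamingSSM(a=[0.5], b=[1.0], c=[1.0])
outputs = [model.step(x) for x in [1.0, 0.0, 0.0]]  # impulse response
```

Because the state is the only memory, a signal of any length can be processed sample by sample without a sliding window, which is the contrast with fixed-length attention models drawn in the abstract.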

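2605.28791 2026-05-28 cs.CL cs.AI Updated version

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao

Affiliations Tsinghua University, Fudan University, City University of Hong Kong, Huazhong University of Science and Technology, University of Wisconsin-Madison

AI summary Proposes Skill-Conditioned Gated Self-Distillation (SGSD), which retrieves skill-mistake pairs from an experience-derived skill bank to build a multi-teacher pool, validates each teacher's polarity with a verifier, and distills informative teacher-student disagreements through a robust gated objective, improving mathematical reasoning under a weaker privileged-information assumption.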

Abstract

On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.

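2605.28787 2026-05-28 cs.IR cs.AI Updated version

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

Affiliations Google

AI summary Compares a Baseline Agent (searching the open web) with a Semantic Agent (using schema.org metadata) on data retrieval; semantic metadata yields higher precision when retrieving actionable data (65.7% higher overall precision), while the Baseline Agent has broader coverage but suffers "Last-Mile Utility" failures.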

Abstract

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.
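
As a concrete picture of what "pages with machine-readable downloads" can mean, the sketch below parses a schema.org `Dataset` record in JSON-LD and lists its downloadable files. The record contents, the example.org URL, the accepted format list, and the helper name `machine_readable_downloads` are hypothetical; only the `Dataset`, `DataDownload`, `distribution`, `encodingFormat`, and `contentUrl` terms come from the schema.org vocabulary.

```python
import json

# Hypothetical JSON-LD record, as it might appear embedded in a dataset page.
record = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example rainfall measurements",
  "distribution": [
    {"@type": "DataDownload",
     "encodingFormat": "text/csv",
     "contentUrl": "https://example.org/rainfall.csv"},
    {"@type": "DataDownload",
     "encodingFormat": "text/html",
     "contentUrl": "https://example.org/rainfall.html"}
  ]
}
""")


def machine_readable_downloads(rec, formats=("text/csv", "application/json")):
    """Return download URLs whose declared format is in the accepted list."""
    dist = rec.get("distribution", [])
    if isinstance(dist, dict):  # JSON-LD allows a single object here
        dist = [dist]
    return [d["contentUrl"] for d in dist
            if d.get("encodingFormat") in formats and "contentUrl" in d]


urls = machine_readable_downloads(record)
```

An agent reading such metadata can go straight to the data file, whereas an agent working from unstructured pages has to infer the same link from prose, which is where the "Last-Mile Utility" failures described above arise.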

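2605.28775 2026-05-28 cs.LG cs.AI cs.CL Updated version

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Suji Kim, Kangsan Kim, Sung Ju Hwang

Affiliations KAIST, Samsung Electronics

AI summary Proposes the LearnWeak framework, which uses a stronger reference agent to identify a student agent's weaknesses in the target domain, automatically synthesizes targeted tasks and supervision, and introduces an error-aware specialization objective, improving the performance of small computer-use agents across multiple domains.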

Abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

2605.28773 2026-05-28 cs.CL cs.AI cs.LG cs.MA cs.MM Updated version

Rethinking Memory as Continuously Evolving Connectivity

Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Ying Wei, Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen, Ningyu Zhang

Affiliations Zhejiang University, Alibaba Group, MemTensor, Tongji University

AI summary Proposes the FluxMem framework, which models memory as a heterogeneous graph and evolves its topology through three stages (initial connection formation, feedback-driven refinement, and long-term consolidation) to address the brittleness of existing memory-augmented LLM agents in dynamic environments.

Comments Ongoing work

Abstract

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open-sourced in https://github.com/zjunlp/LightMem.

2605.28764 2026-05-28 cs.AI cs.DC cs.MA Updated version

SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

Edwin Jose

Affiliations Department of Computer Science, Western Michigan University

AI summary Proposes SwarmHarness, a decentralized protocol that uses DHT-based registration, utility-function routing, and Shapley-value incentives to let a compute swarm self-organize and allocate tasks without a central authority.

Abstract

Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive-aligned protocol exists for their owners to share them safely and profitably. Existing approaches either require a trusted central coordinator (cloud marketplaces), demand heavy blockchain infrastructure (Golem, BrokerChain), or lack an incentive layer entirely (BOINC, Petals). We propose SwarmHarness, a decentralised protocol in which HarnessAPI skill nodes self-organise into a compute swarm without any central authority. SwarmHarness has three interlocking components: a SwarmRegistry built on a Distributed Hash Table (DHT) for peer discovery and capability advertisement; a SwarmRouter that dispatches tasks to nodes using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism that attributes compute-credit rewards to contributing nodes via a Shapley-value approximation. Nodes earn credits by serving tasks and spend credits to submit them; idle nodes that never contribute drain credits and lose routing priority, creating a self-regulating participation economy. As nodes specialise toward high-reward skills and routing signals act as digital pheromones, the network exhibits emergent collective intelligence analogous to biological swarms. Beyond compute sharing, SwarmHarness is a foundational primitive for autonomous distributed AI agent networks in which agents hire compute, route subtasks, and settle credits without human intermediation.
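
The routing step can be pictured with a small sketch: score each candidate node with a utility over capability, load, latency, and trust, then dispatch to the best one. The abstract names these four inputs but not the functional form, so the linear combination, the weights, the normalized [0, 1] field values, and the names `node_utility` and `route` are assumptions.

```python
def node_utility(node, weights=(1.0, 0.5, 0.3, 0.8)):
    """Linear utility (assumed form): reward capability and trust,
    penalize current load and latency. All fields are normalized to [0, 1]."""
    w_cap, w_load, w_lat, w_trust = weights
    return (w_cap * node["capability"]
            - w_load * node["load"]
            - w_lat * node["latency"]
            + w_trust * node["trust"])


def route(nodes):
    """Dispatch to the node with the highest utility; returns its id."""
    return max(nodes, key=node_utility)["id"]


nodes = [
    {"id": "a", "capability": 0.9, "load": 0.8, "latency": 0.5, "trust": 0.6},
    {"id": "b", "capability": 0.7, "load": 0.1, "latency": 0.2, "trust": 0.9},
]
chosen = route(nodes)  # "a" scores 0.83, "b" scores 1.31
```

In this toy case the more capable node loses because it is heavily loaded and less trusted, which is the kind of trade-off a multi-factor utility is meant to capture.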

2605.28763 2026-05-28 cs.AI Updated version

CubePart: An Open-Vocabulary Part-Controllable 3D Generator

Yiheng Zhu, Kangle Deng, Jean-Philippe Fauconnier, Inaki Navarro, Daiqing Li, Ava Pun, Yinan Zhang, Peiye Zhuang, Xiaoxia Sun, Maneesh Agrawala, Kiran Bhat, Tinghui Zhou

Affiliations Carnegie Mellon University, Stanford University

AI summary Proposes the CubePart framework, which uses an open-vocabulary parts schema to generate user-defined part-level 3D meshes that can be used directly in game engines without post-processing.

Comments SIGGRAPH 2026. Project Page: https://cubepart.github.io/

Abstract

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements. We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given a global text prompt and a user-defined parts schema expressed as an open-ended list of part names, our method generates a set of meshes - one per schema element - that assemble into a coherent object while respecting the specified semantic structure. To enable this capability, we introduce a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset, along with a two-stage generative architecture that separates global shape synthesis from part-level decoding. We demonstrate that the resulting assets can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing. Project Page: https://cubepart.github.io/

2605.28751 2026-05-28 cs.LG cs.AI cs.CL Updated version

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

Affiliations Meta Superintelligence Labs - FAIR

AI summary Shows that extrapolative weight averaging extends the Pareto front between fine-tuned checkpoints without additional RL training, trading off correctness and efficiency in competitive programming and improving inference-time performance.

Comments 54 pages

Abstract

Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
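
Interpolation and extrapolation between two checkpoints differ only in the mixing coefficient, as the sketch below shows. It assumes the usual linear form theta = theta_low + alpha * (theta_high - theta_low), with alpha in [0, 1] for interpolation and outside that range for extrapolation; the abstract does not spell out the formula, and the tiny dictionaries standing in for checkpoints and the name `combine` are hypothetical.

```python
def combine(theta_low, theta_high, alpha):
    """Mix two checkpoints parameter by parameter.

    alpha in [0, 1] interpolates between the endpoints;
    alpha < 0 or alpha > 1 extrapolates beyond them.
    """
    return {
        name: [lo + alpha * (hi - lo)
               for lo, hi in zip(theta_low[name], theta_high[name])]
        for name in theta_low
    }


# Toy stand-ins for low-coverage and high-coverage checkpoints.
low = {"w": [0.0, 1.0]}
high = {"w": [1.0, 3.0]}

midpoint = combine(low, high, 0.5)
beyond_high = combine(low, high, 1.5)
beyond_low = combine(low, high, -0.5)
```

No gradient step is involved, so each new alpha yields another candidate policy at the cost of one pass over the weights.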

2605.28740 2026-05-28 cs.CL cs.AI Updated version

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

Affiliations University of Florida, U.S. National Library of Medicine

AI summary Proposes the Reverse Probing framework, which uses pre-existing labeled summaries to extract token-level uncertainty signals from a model's internal activations, giving efficient and interpretable uncertainty quantification for clinical text.

Abstract

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.

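2605.28739 2026-05-28 cs.LG cs.AI cs.NE q-bio.QM Updated version

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Tirtharaj Dash

Affiliations BITS Pilani, K K Birla Goa Campus

AI summary Proposes BIRDNet, which mines Boolean implication relationships between features and encodes them as a sparse, interpretable neural network, greatly reducing parameter count while keeping accuracy high and recovering known biological signatures in transcriptomic and proteomic data.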

Comments 5 pages; 1 figure, 4 tables

Abstract

Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most $2/d$ of the weights in each BIR layer are active, where $d$ is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to $96\times$ fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: https://github.com/MAHI-Group/BIRDNet.
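
The sparsity claim follows from the wiring alone: if every hidden unit connects to exactly two of the $d$ inputs, a layer with $r$ units has $2r$ active weights out of $r \cdot d$, a fraction of $2/d$. The sketch below builds such a connectivity mask; the rule list and the names `bir_mask` and `active_fraction` are hypothetical, and the mining step itself is not shown.

```python
def bir_mask(rules, d):
    """Connectivity mask with one row per mined rule.

    rules: list of (i, j) feature-index pairs, one pair per hidden unit
    d:     input dimension
    Row k has ones only at the two features that rule k relates.
    """
    mask = [[0] * d for _ in rules]
    for unit, (i, j) in enumerate(rules):
        mask[unit][i] = 1
        mask[unit][j] = 1
    return mask


def active_fraction(mask):
    """Share of weights in the layer that are allowed to be nonzero."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


rules = [(0, 3), (1, 2), (2, 4)]  # hypothetical mined feature pairs
mask = bir_mask(rules, d=5)
fraction = active_fraction(mask)  # 6 active weights out of 15
```

Since each row keeps the identity of the rule that produced it, a trained unit can be read back as a statement about its two features, which is the interpretability argument made above.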

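2605.28733 2026-05-28 cs.AI Updated version

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Xiaohang Feng, Yiling Xie

Affiliations City University of Hong Kong

AI summary Proposes a utility-aware multimodal contrastive learning framework that optimizes product image generation with a Utility-Aware InfoNCE loss, so that images stay semantically aligned while increasing market demand.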

Abstract

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.
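
For orientation, the sketch below shows the standard InfoNCE loss over an image-text similarity matrix, plus one possible way a demand signal could enter it: as a per-pair weight. The weighting scheme is purely an assumption for illustration; the abstract does not define the Utility-Aware InfoNCE loss, and the `utility` argument, the temperature value, and the function name `info_nce` are hypothetical.

```python
import math


def info_nce(sim, temperature=0.1, utility=None):
    """InfoNCE over a square similarity matrix sim[i][j] (image i vs. text j).

    The matching pair for row i is column i. When utility is given, each
    pair's loss term is weighted by it (an assumed, illustrative scheme);
    with utility=None this reduces to the standard unweighted loss.
    """
    n = len(sim)
    weights = utility if utility is not None else [1.0] * n
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += weights[i] * (log_z - logits[i])
    return loss / sum(weights)


uninformative = info_nce([[0.0, 0.0], [0.0, 0.0]])  # equals log(2)
well_aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])   # close to zero
```

Under this assumed weighting, high-demand pairs contribute more to the gradient, which would pull the shared representation toward the visual cues they carry.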

2605.28732 2026-05-28 cs.CL cs.AI cs.LG 版本更新

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

MemTrace:大型语言模型记忆系统中的错误追踪与归因

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

AI总结 提出MemTrace框架,通过构建可执行的记忆演化图实现细粒度错误追踪,并利用自动归因方法定位根因,进而优化提示词提升下游任务性能。

Comments Ongoing work

详情
AI中文摘要

记忆对于使大型语言模型支持长程推理至关重要,但现有的记忆系统仍然不可靠且难以调试。追踪记忆的动态演化对于理解信息如何随时间合成、传播或损坏至关重要。在这项工作中,我们研究了LLM记忆系统中错误追踪与归因的新问题。我们提出了一种新颖的框架,将记忆流水线转换为可执行的记忆演化图,从而实现对操作信息流的细粒度追踪。然后,我们构建了MemTraceBench,一个从代表性记忆系统(如Long-Context、RAG、Mem0和EverMemOS)收集的基准,以系统地研究记忆故障模式。我们进一步引入了一种自动归因方法,该方法迭代地追踪操作子图以定位任何失败案例的根本原因。我们的分析表明,记忆故障是系统性的,源于操作层面的问题,如信息丢失和检索错位。关键的是,我们利用这些细粒度的归因信号来指导下游提示优化,建立了一个自动纠正故障并提升最终任务性能高达7.62%的闭环系统。代码将在https://github.com/zjunlp/MemTrace发布。

英文摘要

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

2605.28730 2026-05-28 cs.AI 版本更新

AlphaTransit: Learning to Design City-scale Transit Routes

AlphaTransit: 学习设计城市级公交线路

Bibek Poudel, Sai Swaminathan, Weizi Li

发表机构 * Department of EECS, University of Tennessee, Knoxville, TN, USA(田纳西大学电子工程与计算机科学系) Department of CSE, University of California, Riverside, CA, USA(加州大学河滨分校计算机科学与工程系)

AI总结 针对公交线路设计中的延迟反馈问题,提出AlphaTransit框架,将蒙特卡洛树搜索与神经策略-价值网络结合,在布卢明顿基准上实现最高服务率。

详情
AI中文摘要

设计公交网络需要许多顺序的线路扩展决策,但其质量通常只有在完整网络组装后才能显现。这种延迟反馈挑战是公交线路网络设计问题(TRNDP)的核心,其中线路交互可能具有欺骗性:一个看似有用的局部扩展可能会造成换乘瓶颈、产生冗余重叠或降低整体吞吐量。为了在延迟模拟器反馈下指导线路构建,我们引入了AlphaTransit,一个用于城市级公交网络设计的基于搜索的规划框架。AlphaTransit将蒙特卡洛树搜索(MCTS)与神经策略-价值网络相结合:策略提出线路扩展,价值估计下游设计质量,搜索利用这些预测来优化每个决策。这提供了在路线构建过程中的决策时间前瞻,而无需在搜索树内运行模拟器展开。我们在一个新的布卢明顿TRNDP基准上评估AlphaTransit,该基准具有现实的道路拓扑和基于人口普查的需求,在混合和全公交需求设置下。在布卢明顿网络中,AlphaTransit在两种需求设置下均达到了最高服务率,分别为54.6%和82.1%。相对于无搜索的强化学习,这对应9.9%和11.4%的服务率提升;相对于无学习指导的MCTS,这对应2.5%和11.2%的提升。这些结果表明,将学习指导与MCTS结合比单独使用任何一种方法对公交网络设计更有效。我们的代码和数据公开在https://github.com/poudel-bibek/AlphaTransit。

英文摘要

Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search-based planning framework for city-scale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy-value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. This provides decision-time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and census-derived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available at https://github.com/poudel-bibek/AlphaTransit.
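The coupling of MCTS with a policy-value network described above can be sketched with a PUCT-style selection rule. This is an assumption borrowed from AlphaZero-like planners, not a rule confirmed by the abstract. The `Edge` class, `select_extension` function, and `c_puct` constant are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float            # policy-network probability of this route extension
    visits: int = 0         # number of times search has followed this edge
    value_sum: float = 0.0  # sum of value-network estimates backed up here

    @property
    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def select_extension(edges, c_puct=1.5):
    """Pick the extension maximizing mean value plus a prior-weighted bonus."""
    total_visits = sum(edge.visits for edge in edges.values())

    def score(edge):
        bonus = c_puct * edge.prior * math.sqrt(total_visits + 1) / (1 + edge.visits)
        return edge.mean_value + bonus

    return max(edges, key=lambda name: score(edges[name]))


# Unvisited edges are ranked by the policy prior alone.
print(select_extension({"extend_north": Edge(0.8), "extend_east": Edge(0.2)}))
```

In this sketch leaf nodes would be scored by the value network rather than by simulator rollouts, which matches the abstract's statement that no rollouts run inside the search tree.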

2605.28722 2026-05-28 cs.AI 版本更新

Multi-Adapter Representation Interventions via Energy Calibration

通过能量校准的多适配器表示干预

Manjiang Yu, Hongji Li, Junwei Chen, Xue Li, Priyanka Singh, Yang Cao, Lijie Hu

发表机构 * The University of Queensland, Brisbane, Australia(昆士兰大学) Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates(穆罕默德·本·扎耶德人工智能大学) Institute of Science Tokyo, Tokyo, Japan(东京科学大学)

AI总结 提出MARI方法,通过竞争性多适配器机制和基于能量的门控模块,自适应地确定干预方向和强度,在保持通用能力的同时提升对齐性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

表示干预已成为一种有前景的范式,可以在不修改模型权重的情况下将大型语言模型对齐到期望的行为。现有方法通常对所有输入统一应用固定的干预。然而,我们发现适当的干预方向和强度在不同样本间差异很大,这种无差别的干预会导致良性输入上通用能力的下降。为了解决这些挑战,我们提出了通过能量校准的多适配器表示干预(MARI)。具体来说,我们引入了一种竞争性多适配器机制,其中专门的专家捕获非线性校正模式,并自适应地确定不同样本的适当干预方向和强度。此外,我们设计了一个基于能量的门控模块,利用内部传播动力学来区分适合干预的输入。跨不同模型系列和参数规模的广泛实验表明,MARI实现了最先进的对齐性能。我们的方法在TruthfulQA、BBQ和安全基准测试上显著提高了性能,同时在MMLU和ARC等任务上保持甚至提高了通用能力。我们的代码可在https://github.com/V1centNevwake/MARI获取。

英文摘要

Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose Multi-Adapter Representation Interventions via Energy Calibration (MARI). Specifically, we introduce a competitive multi-adapter mechanism in which specialized experts capture non-linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy-based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state-of-the-art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code is available at https://github.com/V1centNevwake/MARI.

2605.28721 2026-05-28 cs.AI 版本更新

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

LiveBrowseComp: 搜索智能体是在搜索,还是仅仅在验证它们已知的信息?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Xiaohongshu(小红书)

AI总结 本文通过诊断方法发现基于LLM的搜索智能体存在内在知识依赖(IKD),即依赖模型内部知识而非外部证据,并引入LiveBrowseComp基准来评估超越内在知识覆盖的深度搜索能力。

详情
AI中文摘要

基于LLM的搜索智能体是真的在搜索,还是仅仅利用网络验证它们已知的信息?我们在BrowseComp上通过三个诊断研究这个问题。我们的分析揭示了内在知识依赖(IKD):即使有工具访问权限,智能体也常常依赖内在知识——检索前模型已编码的信息——而非外部证据。智能体在没有工具的情况下回答了高达44.5%的BrowseComp问题,超过一半的搜索查询来自内部生成的假设而非检索到的线索,并且当答案支持证据被移除时,其表现比闭卷基线更差。这些结果表明,静态搜索基准可能奖励基于记忆的验证而非基于证据的发现,混淆了智能体已知的信息与它们能找到的信息。然后我们引入了LiveBrowseComp,一个深度搜索基准,旨在评估超越内在知识覆盖的智能体。它包含335个人工编写的问题,其答案依赖于基准构建前90天内发布的事实,来自六个更新的来源,并过滤掉全球显著事件。在LiveBrowseComp上,所有评估的智能体闭卷准确率低于2%,搜索增强的分数相对于BrowseComp下降了25-40个百分点,且先前的模型排名不再可靠地预测性能。LiveBrowseComp可在https://huggingface.co/datasets/Forival/LiveBrowseComp获取。

英文摘要

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.

2605.28717 2026-05-28 cs.AI cs.AR cs.NI 版本更新

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

OpenURMA:统一总线协议的开源洁净室实现

Bojie Li

发表机构 * Pine AI

AI总结 针对RDMA在数据中心网络接口的瓶颈,OpenURMA基于华为UB协议规范,通过RTL、SystemC和gem5三层实现,展示了UB在64字节远程取操作中相比RoCEv2 RC实现4.37倍延迟降低和2.80倍吞吐提升。

详情
AI中文摘要

现代数据中心RDMA的瓶颈在网络接口而非线缆。运行RoCE或InfiniBand的NIC为每个(应用,远程端点)对维护每连接状态——在1024应用扇出时达数百兆字节——并在64字节操作上支付四次PCIe往返,将延迟放大到线缆延迟的一个数量级以上。这两者都源于RDMA从InfiniBand继承的基于PCIe的队列对抽象。 华为的统一总线(UB)是2025年公开的规范,它改变了抽象:将每应用端点状态与每主机传输状态解耦,使连接上下文呈加性增长,将排序作为可选功能,并通过原生CPU加载/存储到片上总线控制器来访问远程内存。UB已搭载在华为闭源的Ascend 950芯片中。 OpenURMA是UB传输层和事务层的首个洁净室开源实现,在三个层级实现——Alveo U50上的可综合RTL、双节点周期级SystemC模拟器以及gem5全系统框架——每个层级都有匹配的OpenRoCE(RoCEv2 RC)基线。贡献在于实现、测试平台以及闭源芯片无法进行的受控比较。在规范的64字节远程取操作——UB规范第8.3节的LOAD,RoCEv2 RC的READ——上,UB的加载/存储路径实现了约500纳秒的端到端延迟,比匹配基线(2186纳秒)低4.37倍,吞吐量高2.80倍,且仅占用U50约14%的LUT。

英文摘要

Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand. Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB's load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50's LUTs.
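The headline latency figures can be cross-checked with simple arithmetic: dividing the 2186 ns baseline by the reported 4.37x factor should land near the quoted ~500 ns. The function name below is a hypothetical helper; the two numbers are taken from the abstract.

```python
def implied_latency_ns(baseline_ns, reduction_factor):
    """Latency implied by a baseline and a reported 'x times lower' factor."""
    return baseline_ns / reduction_factor


# 2186 ns (RoCEv2 RC baseline) and 4.37x are the values stated in the abstract.
ub_latency = implied_latency_ns(2186.0, 4.37)
print(f"{ub_latency:.1f} ns")
```

The result is about 500.2 ns, consistent with the "~500 ns" end-to-end figure.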

2605.28714 2026-05-28 cs.CL cs.AI 版本更新

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

IPO-Mine:用于长多模态IPO文档的章节结构化分析的工具包和数据集

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye, Arnav Hiray, Rutwik Routu, Prasun Banerjee, Siddhartha Somani, Sudheer Chava

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Sai University(赛大学) Duke University(杜克大学)

AI总结 本文提出IPO-Mine工具包和数据集,通过标准化解析IPO文件为章节结构化文本和图像,构建大规模多模态数据集,并建立图表评估任务,揭示多模态模型在长文档分析中的对齐挑战。

Comments 12 pages

详情
AI中文摘要

首次公开募股(IPO)文件是私营公司上市时发布的文件,允许个人(散户)投资者购买其股票。这些文件描述了公司的业务、财务状况和风险,是包含叙述性文本和图像的长篇多模态文档。尽管它们对金融市场至关重要,但目前缺乏用于使用现代语言和多模态模型研究IPO文件的大规模标准化数据集或基准。这些文档带来了重大挑战:文件通常超过50万个token,且缺乏一致的结构组织。我们引入了IPO-Toolkit,这是一个开源框架,用于下载和解析IPO文件,将其标准化为章节结构化文本和提取的图像。该工具包分割文件、提取嵌入的图像,并生成结构化输出,从而支持对长多模态文档进行大规模、可重复的分析工作流。利用这一基础设施,我们构建了IPO-Dataset,这是一个大规模、章节结构化的多模态数据集,涵盖1994年至2026年超过109,000份IPO文件及其修订版,包含超过76,000张图像。我们针对提取的金融图表建立了结构化评估任务,包括图表质量和误导性评估。我们的实验表明,最先进的多模态模型在这些任务上常常与专家人类判断存在分歧,揭示了在长篇幅真实监管文档上进行多模态推理时的对齐挑战。除了基准测试,IPO-Dataset还支持对章节级文本变异以及视觉和文本披露实践的跨行业差异进行大规模分析。我们的代码、数据集和网站根据CC-BY-4.0公开提供。

英文摘要

An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

2605.28713 2026-05-28 cs.AI 版本更新

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

思维即压缩:你的推理模型其实是一个上下文压缩器

Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang, Yueyang Zhang, Kecheng Chen, Zhaohan Zhang, Zhiyuan Sun, Daiting Shi

发表机构 * Baidu Inc.(百度公司) Xi’an Jiaotong University(西安交通大学) City University of Hong Kong(香港城市大学) Queen Mary University of London(伦敦玛丽女王大学)

AI总结 本文提出思维即压缩(TaC)范式,利用推理模型自身的思维痕迹作为压缩上下文,并通过奖励驱动优化(TaC-C)实现可控压缩,在长上下文QA任务上显著优于现有方法。

Comments Under Review

详情
AI中文摘要

上下文压缩旨在缩短长上下文输入,同时最小化信息损失,以加速LLM推理。现有方法虽有前景,但通常依赖复杂的压缩模块或针对压缩的训练,忽视了LLM的内在能力。相比之下,本文揭示推理模型本身可以通过组织任务相关信息自然地压缩长上下文。因此,我们提出思维即压缩(TaC),一种将思维本身视为压缩上下文的新压缩范式。无需专用压缩器,TaC直接提示推理模型生成思维痕迹作为缩短的上下文,已优于大多数代表性压缩方法。进一步,鉴于原始思维输出可能难以控制预算和存在捷径行为,我们引入带约束的思维即压缩(TaC-C),利用简单的奖励驱动优化框架,激发内在思维成为紧凑且可控的压缩上下文。在四个长上下文QA基准上的实验表明,TaC-C一致优于现有基线。在4倍和8倍压缩比下,它在平均F1上分别超过最强竞争对手17.4%和23.4%,在平均精确匹配分数(EM)上分别超过15.7%和21.7%。

英文摘要

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, already outperforming most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), leveraging a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.

2605.28710 2026-05-28 cs.CL cs.AI 版本更新

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

迈向可靠的多语言LLM作为评判者:一项实证研究

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

发表机构 * HiTZ Center - Ixa, University of the Basque Country EHU(HiTZ中心 - Ixa,巴斯克大学EHU)

AI总结 本研究通过分析指令翻译、单语与多语言监督及模型规模等策略,探讨了在有无领域内数据情况下开发多语言LLM评判者的方法,并揭示了领域内数据可用时微调小模型可媲美专有模型、零样本大模型在域外更有效等关键权衡。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于生成文本的自动评估,然而大多数先前工作集中在英语上。尽管对多语言评估的需求日益增长,将基于LLM的评估器扩展到多语言环境仍然具有挑战性,特别是对于低资源语言和领域内数据稀缺的场景。本文探索了开发多语言LLM评判者的几种策略,考虑了是否有领域内数据可用于微调。我们系统分析了英语、西班牙语和巴斯克语(代表高、中、低资源语言),考虑了指令翻译、单语与多语言监督以及模型规模。为了评估,我们将两个现有的元评估数据集扩展到巴斯克语和西班牙语。我们的结果揭示了关键的权衡:当领域内数据可用时,微调的小模型可以达到与专有模型相当的性能,而在域外设置中,使用较大模型的零样本评估更为有效。我们还观察到,在域外数据上进行微调可能会对模型性能产生不利影响。这些发现为构建高效、可靠的多语言评估流程提供了实用指导。数据和代码公开在hitz-zentroa/mJudge。

英文摘要

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.

2605.28707 2026-05-28 cs.AI cs.LG 版本更新

Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

超越二元道德判断:在AI中建模伦理多元主义

Aisha Aijaz, Rahul Goel, Arnav Batra, Raghava Mutharaju

发表机构 * Department of Computer Science and Engineering, IIIT Delhi(德里英德拉普拉斯塔信息技术学院计算机科学与工程系) Mehta Family School of Data Science and AI, IIT Palakkad(印度理工学院帕拉卡德分校梅塔家族数据科学与人工智能学院)

AI总结 提出将道德推理建模为规范性伦理理论分布(伦理多元主义)的框架,通过规范-语义双流架构和堆叠集成学习实现,在450个案例上达到88.89%的准确率。

详情
AI中文摘要

在社会关键领域的决策中,AI系统正以不同能力越来越多地参与。然而,尽管自主系统无处不在,大多数处理自主道德决策的方法仍诉诸于标量或二元判断。这些方法对于可接受的道德推理是不够的,因为它们提供的解释很少,遗漏了必须包含以支持问责的关键背景和理论信息。为此,我们提出了一个将道德推理建模为规范性伦理理论或伦理多元主义分布的框架。我们引入了一个整合这些理论的规范伦理单纯形。还准备了涵盖15个细分子理论的450个案例基准,用于堆叠集成学习。这些案例描述了自然语言中的伦理困境,并具有相关的提取上下文特征。单纯形的实现通过双流规范-语义架构完成,随后是规范信息的融合和顺序堆叠集成,以学习三个广泛理论(后果主义、美德伦理学和道义论)及其15个子类别的最佳拟合。我们的实验表明,将上下文和规范先验与语义嵌入相结合显著提高了分类性能,准确率达到88.89%。我们进行了消融研究,以表明结构化伦理表示超越了类比推理的贡献,并且所选的堆叠架构由于逐步学习粒度而给出了最佳结果。还通过熵、置信度和可视化分析了伦理多元主义。因此,将伦理多元主义建模为概率性规范分布支持类人道德推理、伦理分歧分析以及未来AI系统中的对齐。

英文摘要

Critical decision-making in socially consequential spaces increasingly involves AI systems in varying capacities. Yet, despite the ubiquity of autonomous systems, most approaches to autonomous moral decision-making resort to scalar or binary judgments. These methods are insufficient for acceptable moral reasoning, as they provide little explanation and leave out essential contextual and theoretical information that must be included to support accountability. To this end, we propose a framework that models moral reasoning as a distribution over normative ethical theories, i.e., ethical pluralism. We introduce a normative ethics simplex that integrates these theories. A benchmark of 450 cases across 15 fine-grained subtheories was also prepared for stacked ensemble learning. These cases describe ethical dilemmas in natural language and have associated extracted contextual features. The simplex is implemented via a two-stream normative-semantic architecture, followed by the fusion of normative information and a sequential stacking ensemble that learns the best fit over the three broad theories (consequentialism, virtue ethics, and deontology) and the 15 subcategories. Our experiments demonstrate that integrating contextual and normative priors with the semantic embeddings significantly improves classification performance, reaching an accuracy of 88.89%. We conducted ablation studies to show that structured ethical representations contribute beyond analogical reasoning, and that the chosen stacking architecture gives the best results due to the gradual learning of granularity. Ethical pluralism is also analyzed through entropy, confidence, and visualization. Thus, modeling ethical pluralism as a probabilistic normative distribution supports human-like moral reasoning, ethical disagreement analysis, and future alignment in AI systems.
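To make the "distribution over normative ethical theories" concrete, the sketch below computes Shannon entropy and top-class confidence for a point on the three-theory simplex. The function name, the input format, and the example scores are illustrative assumptions; the abstract only says that pluralism is analyzed through entropy and confidence, not how.

```python
import math


def pluralism_stats(scores):
    """Return (entropy in nats, dominant theory, its probability).

    scores: non-negative weights per ethical theory; they are normalized
    so the result is a point on the probability simplex.
    """
    total = sum(scores.values())
    probs = {theory: weight / total for theory, weight in scores.items()}
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    dominant = max(probs, key=probs.get)
    return entropy, dominant, probs[dominant]


# Hypothetical case that leans deontological but remains pluralistic.
print(pluralism_stats({"consequentialism": 0.3, "virtue_ethics": 0.2, "deontology": 0.5}))
```

Entropy is 0 when one theory accounts for the whole judgment and reaches log(3) when all three are weighted equally, so it gives a simple scalar for how contested a case is.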

2605.28703 2026-05-28 cs.NE cs.AI cs.DS math.OC 版本更新

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

对拉马克进化与鲍德温效应的重新审视

Inès Benito, Johannes F. Lutzeyer, Benjamin Doerr

发表机构 * Laboratoire d’Informatique (LIX), CNRS, École Polytechnique, Institut Polytechnique de Paris(计算机科学实验室(LIX)、法国国家科学研究中心、巴黎综合理工学院、巴黎理工学院)

AI总结 通过实验和理论分析,比较拉马克、鲍德温和达尔文进化在最大独立集和最大割问题上的表现,证明局部搜索增强的进化算法(尤其是鲍德温进化)显著优于达尔文进化,并给出理论上的运行时界限。

Comments To appear in the proceedings of PPSN 2026

详情
AI中文摘要

鲍德温和拉马克进化在进化算法中已存在很长时间,但从未主导学术文献或实际应用。在这项工作中,我们使用现代实证和理论方法重新审视拉马克和鲍德温进化,并将其与一般的达尔文进化进行严格比较。在实证方面,我们在来自近期GraphBench基准的六个不同数据集的图上,针对最大独立集和最大割问题运行了一套全面的实验。我们的结果表明,鲍德温和拉马克进化始终优于达尔文进化,证实了局部搜索增强进化算法的巨大潜力。值得注意的是,在绝大多数情况下,所有进化算法都优于最近的深度学习基线,并接近高度专业化的启发式和精确求解器的性能。此外,我们报告了一组适用于所有研究进化类型的高性能通用参数,希望未来对从业者有用。在理论方面,我们将现有的欺骗性前导块基准扩展到任意块长度,并使用现代理论运行时分析工具来证明预期运行时的上下界。对于大于二的块长度,鲍德温进化渐近快于拉马克进化,而拉马克进化渐近快于达尔文进化。当考虑适应度评估中局部搜索过程的成本时,排序取决于实现方式,鲍德温进化从较小的块长度开始就保持最快,这解释了其强大的实证性能。

英文摘要

Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolution consistently outperform Darwinian evolution, confirming the great potential of local-search-augmented evolutionary algorithms. Notably, in the great majority of cases, all EAs outperform recent deep learning baselines and approach the performance of highly specialised heuristic and exact solvers. We furthermore report a high-performing set of generalist parameters for all studied evolution types that we hope will be of use to practitioners in future. On the theoretical side, we extend the existing Deceptive Leading Block benchmark to arbitrary block length and use tools from modern theoretical runtime analysis to prove upper and lower bounds on the expected runtime. For block lengths greater than two, Baldwinian evolution is asymptotically faster than Lamarckian evolution, which in turn is asymptotically faster than Darwinian evolution. When accounting for the cost of the local search procedure in fitness evaluations, the ordering depends on the implementation, with Baldwinian evolution staying fastest from small block lengths onwards, explaining its strong empirical performance.
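The difference between the three evolution types can be sketched in a few lines, using the textbook definitions: Darwinian selection scores the genotype as is, Baldwinian selection scores the locally improved individual but keeps the original genotype, and Lamarckian evolution writes the improvement back. The bit-flip local search and the function names are illustrative assumptions, not the paper's implementation.

```python
def local_search(bits, fitness):
    """One pass of first-improvement bit-flip hill climbing."""
    best = list(bits)
    for i in range(len(best)):
        candidate = best[:i] + [1 - best[i]] + best[i + 1:]
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


def evaluate(bits, fitness, mode):
    """Return (genotype passed on, fitness used for selection)."""
    if mode == "darwinian":
        return list(bits), fitness(bits)
    improved = local_search(bits, fitness)
    if mode == "baldwinian":
        return list(bits), fitness(improved)  # learning guides selection only
    if mode == "lamarckian":
        return improved, fitness(improved)    # learned traits are inherited
    raise ValueError(f"unknown mode: {mode}")


# Toy fitness: number of ones in the bit string.
for mode in ("darwinian", "baldwinian", "lamarckian"):
    print(mode, evaluate([0, 1, 0], sum, mode))
```

On this toy fitness the Baldwinian and Lamarckian modes report the same improved score but pass on different genotypes, which is the distinction the runtime analysis compares.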

2605.28699 2026-05-28 cs.AI 版本更新

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

TRACER: 基于内部强化信用与轮次级遗憾匹配的多LLM协作推理

Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

发表机构 * Fudan University(复旦大学) Zhongguancun Academy(中关村学院) Academy for Advanced Interdisciplinary Studies, Peking University(北京大学先进交叉学科研究院)

AI总结 提出TRACER框架,通过控制器-遗憾层和生成-信用层分别学习发言时机与内容,解决多智能体强化学习中的稀疏奖励、搭便车和固定协议振荡问题,实现数学收敛的协作推理。

Comments 25 pages, 3 figures

详情
AI中文摘要

大型语言模型越来越依赖强化学习或多智能体提示来改进推理,但这两个范式仍然难以结合。将单智能体强化学习直接应用于多轮多智能体系统面临以下困境:i) 稀疏奖励、角色级搭便车和过高的训练开销。ii) 智能体仅模仿协作。iii) 固定协作协议陷入振荡的局部最优。我们引入TRACER,一个用于协作多LLM推理的轮次级强化框架。TRACER将协作决策分为控制器-遗憾层和生成-信用层,其中控制器通过遗憾匹配学习智能体是否应在当前轮次发言或跳过,生成-信用层则使用角色特定的GSPO奖励优化提议者和评审者的发言。这种设计i) 在动作模式和生成话语两个层面分配信用,从而避免搭便车和稀疏奖励。我们仅扩展控制器做出的选择,从而大幅降低训练的计算成本。此外,ii) 智能体在学习何时发言和说什么的过程中获得协作能力。最后,iii) 通过巧妙设计二元动作,我们将为有限动作空间建立的经典博弈论扩展到深度学习,从而实现数学上严格的收敛。我们在GSM8K训练集上训练所有局部RL方法,并在保留的GSM8K、MATH500和GPQA-Diamond上评估域内准确率、跨基准泛化能力、推理成本和修正保持行为。所得框架提供了一个紧凑且可复现的测试平台,用于研究超越固定辩论、投票或聚合协议的学习协作策略。代码可在https://github.com/Shark-Forest/TRACER获取。

英文摘要

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents merely imitate collaboration; iii) fixed collaboration protocols fall into oscillating local optima. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn through regret matching whether the agents should speak or skip the current round, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing the computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to speak and what to say. Finally, iii) by carefully designing binary actions, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark-Forest/TRACER.
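The speak-or-skip controller can be illustrated with textbook regret matching over a binary action set. This is a minimal sketch of the classical algorithm under assumed names (`regret_matching_strategy`, `update_regret`) and made-up utilities; it is not TRACER's actual controller or reward definition.

```python
def regret_matching_strategy(cumulative_regret):
    """Play each action with probability proportional to its positive regret."""
    positive = {action: max(regret, 0.0) for action, regret in cumulative_regret.items()}
    total = sum(positive.values())
    if total <= 0.0:
        # No action has positive regret yet: fall back to uniform play.
        return {action: 1.0 / len(positive) for action in positive}
    return {action: value / total for action, value in positive.items()}


def update_regret(cumulative_regret, utilities, played):
    """Add how much better each action would have been than the one played."""
    for action in cumulative_regret:
        cumulative_regret[action] += utilities[action] - utilities[played]


regret = {"speak": 0.0, "skip": 0.0}
# Hypothetical round: the agent skipped, but speaking would have earned 1.0.
update_regret(regret, {"speak": 1.0, "skip": 0.0}, played="skip")
print(regret_matching_strategy(regret))
```

Because the action set is finite and binary, the standard convergence guarantees for regret matching apply to this controller-level choice, which is the property the abstract appeals to.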

2605.28697 2026-05-28 eess.IV cs.AI cs.CV 版本更新

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

深度学习应变估计:基于物理的模拟是解决方案吗?

Thierry Judge, Nicolas Duchateau, Andreas Østvik, Khuram Faraz, Anders Austlid Taskén, Sigve Karlsen, Thor Edvardsen, Harald Brunvand, Md Abulkalam Azad, Havard Dalen, Bjørnar Grenne, Gabriel Kiss, Pierre-Yves Courand, Lasse Lovstakken, Pierre-Marc Jodoin, Olivier Bernard

发表机构 * Dept. of Computer Science, University of Sherbrooke(舍布鲁克大学计算机科学系) INSA, Université Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS(INSA,里昂 1 大学,CNRS UMR 5220,Inserm U1206,CREATIS) Institut Universitaire de France (IUF)(法兰西大学研究院(IUF)) Cardiology Dept., Hôpital Croix-Rousse, Hospices Civils de Lyon(Croix-Rousse 医院心内科,Hospices Civils de Lyon) Cardiology Dept., Hôpital Lyon Sud, Hospices Civils de Lyon(里昂南部医院心内科,Hospices Civils de Lyon) Dept. of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology (NTNU)(计算机科学系,信息技术与电气工程学院,挪威科技大学(NTNU)) Dept. of Circulation and Medical Imaging, NTNU(循环与医学影像系,NTNU) Department of Medicine, Hospital of Southern Norway, Arendal, Norway(挪威南部医院内科,Arendal,挪威) Dept. of Cardiology and Cardiothoracic Surgery, St. Olavs Hospital, Trondheim, Norway(心内科和心胸外科,St. Olavs 医院,Trondheim,挪威) Dept. of Health Research, SINTEF Digital, Trondheim, Norway(健康研究部,SINTEF Digital,Trondheim,挪威) Dept. of Medicine, Levanger Hospital, Nord-Trøndelag Hospital Trust, Levanger, Norway(内科,Levanger 医院,Nord-Trøndelag 医院信托,Levanger,挪威) Dept. of Cardiology, Oslo University Hospital, Rikshospitalet and the Faculty of Medicine, University of Oslo, Norway(心内科,奥斯陆大学医院 Rikshospitalet,奥斯陆大学医学院,挪威)

AI总结 针对超声心动图中应变估计缺乏可靠运动参考的问题,提出一种结合真实视频散斑去相关测量与迭代细化过程的模拟策略,生成逼真数据集训练运动估计算法,在全局和区域应变上达到优于临床参考的性能。

Comments 10 pages

详情
AI中文摘要

斑点追踪超声心动图(STE)是心肌应变估计的临床标准。尽管在全局应变(GLS)上表现良好,但其区域应变的准确性仍然有限,尽管这一生物标志物对于早期诊断和表征细微异常高度相关。深度学习是一种有前景的替代方案,但其发展受到缺乏可靠运动参考的限制。现有解决方案要么依赖于STE衍生的标签,要么依赖于基于物理模型生成的模拟,但这些合成序列与临床数据相比仍缺乏足够的真实性。在本文中,我们提出了一种新的模拟策略,该策略结合了来自真实视频的散斑去相关测量,并使用迭代细化过程来改善模拟中的运动真实性。我们创建了一个包含1,478个视频及其参考运动的开源逼真数据集,用于训练超声心动图运动估计算法。所提出的方法在全局和区域应变上实现了无与伦比的性能,特别是在专家间设置中,GLS变异性达到1.42%,而临床参考为1.78%。

英文摘要

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global longitudinal strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.
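For readers unfamiliar with the quantity being estimated, strain is conventionally reported as the percentage change in myocardial contour length relative to a reference frame (usually end-diastole). The sketch below applies that textbook Lagrangian definition. The function name and the example lengths are illustrative assumptions and do not come from the paper's dataset.

```python
def strain_percent(reference_length, current_length):
    """Lagrangian strain in percent: 100 * (L - L0) / L0."""
    if reference_length <= 0:
        raise ValueError("reference length must be positive")
    return 100.0 * (current_length - reference_length) / reference_length


# Hypothetical contour shortening from 100 mm at end-diastole to 80 mm at end-systole.
print(strain_percent(100.0, 80.0))  # -20.0
```

Shortening yields negative values, which is why healthy longitudinal strain is usually quoted as a negative percentage.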

2605.28693 2026-05-28 q-bio.NC cs.AI 版本更新

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

反向传播与大脑对图像响应的层级结构之间的错位

Joséphine Raugel, Maximilian Seitzer, Marc Szafraniec, Huy V. Vo, Jérémy Rapin, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean-Rémi King

发表机构 * Meta AI Ecole Normale Supérieure, PSL University(巴黎高等师范学院,PSL大学)

AI总结 通过fMRI和MEG记录人类对自然图像的脑响应,发现预训练模型的反向传播梯度虽能预测高级视觉皮层和晚期信号,但其时空组织与大脑层级结构不一致,表明深度网络与大脑可能依赖不同的学习机制。

Comments 13 pages, 9 figures

详情
AI中文摘要

反向传播是深度学习核心的学习机制。然而,该算法是否以及如何在大脑中实现仍存在高度争议。特别是,虽然预训练模型的前向激活可靠地映射到视觉处理的皮层层级结构,但反向传播梯度是否表现出类似的对应关系尚不清楚。在这里,我们利用功能性磁共振成像(fMRI)和脑磁图(MEG)记录人类对自然图像的脑响应来探讨这一问题。为此,我们将前向激活的标准编码分析扩展到将反向传播梯度映射到神经数据。聚焦于最近的自监督视觉模型(DINOv3)并在八个视觉模型上复现结果,我们发现反向传播梯度能够可靠地预测fMRI和MEG信号,尤其是在高级视觉皮层和较晚的潜伏期。然而,这些反向传播梯度在大脑中的空间和时间组织与生物合理反向传播机制预期的模式不同:具体而言,梯度计算的顺序及其空间组织均与人类大脑的时间和空间层级结构相偏离。这些结果表明,尽管深度网络和大脑可能共享相似的表征内容,但它们可能依赖根本不同的机制来学习这些表征。

英文摘要

Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward activations to map backpropagated gradients onto neural data. Focusing on a recent self-supervised vision model (DINOv3) and reproducing results on eight vision models, we find that backpropagated gradients can reliably predict both fMRI and MEG signals, specifically in higher-level visual cortex and for later latencies. However, the spatial and temporal organization of these backpropagated gradients in the brain diverges from the patterns expected under a biologically plausible backpropagation mechanism: specifically, both the order in which gradients are computed and their spatial organization diverge from the temporal and spatial hierarchies of the human brain. Together, these results suggest that, although deep networks and the brain may share similar representational content, they likely rely on fundamentally different mechanisms to learn those representations.
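
As a rough illustration of the encoding analysis the abstract above describes, the sketch below fits a ridge regression from a model-derived feature to a recorded response and scores it by correlation on held-out stimuli. It is a minimal one-feature version under stated assumptions: the function names, the single scalar feature per stimulus, and the regularization value `lam` are all hypothetical, and the paper's actual pipeline is not specified in the abstract.

```python
from math import sqrt

def fit_ridge_1d(x, y, lam=1.0):
    """Closed-form ridge fit for one feature: w = <xc, yc> / (<xc, xc> + lam)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [v - mx for v in x]  # centered feature
    yc = [v - my for v in y]  # centered response
    w = sum(a * b for a, b in zip(xc, yc)) / (sum(a * a for a in xc) + lam)
    return w, my - w * mx  # weight and intercept

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb)

def encoding_score(feat_train, resp_train, feat_test, resp_test, lam=1.0):
    """Fit on training stimuli, then correlate predictions with held-out responses."""
    w, b = fit_ridge_1d(feat_train, resp_train, lam)
    predictions = [w * f + b for f in feat_test]
    return pearson(predictions, resp_test)
```

In this framing, the same scoring routine could be run once with forward activations and once with backpropagated gradients as the feature, which is the comparison the abstract reports.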

2605.28683 2026-05-28 cs.AI 版本更新

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

VeriTrip: 面向非结构化网络语料的旅行规划智能体可验证基准

Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong, Hang Zhang, Mu Xu, Xiao-Yu Zhang

发表机构 * Institute of Information Engineering, CAS(中国科学院信息工程研究所) School of Cyber Security, UCAS(中国科学院大学网络空间安全学院) Amap, Alibaba Group(阿里巴巴集团高德地图) NLPR & MAIS, Institute of Automation, CAS(中国科学院自动化研究所模式识别国家重点实验室及多模态人工智能系统全国重点实验室) School of Artificial Intelligence, UCAS(中国科学院大学人工智能学院)

AI总结 提出VeriTrip基准,通过多模态检索库和可验证知识库,评估智能体在非结构化网络语料中基于证据推理的旅行规划能力,揭示检索-推理权衡问题。

Comments 10 pages, 4 figures

详情
AI中文摘要

现有基准通过建立以API为中心的范式为旅行规划智能体奠定了基础。然而,随着自主智能体能力的不断提升,其评估必须从简单的工具执行扩展到处理开放网络的固有复杂性。当前基准绕过了核心认知障碍:它们未能考虑信息噪声,忽略了多源事实矛盾,并且忽视了将视觉感知融入逻辑规划的必要性。我们引入了VeriTrip,一个旨在满足智能体鲁棒性和可靠性日益增长需求的可验证基准。VeriTrip将评估重点转向基于非结构化多模态网络语料的证据推理。它建立了一个源自真实世界的多模态检索库(MRB),迫使智能体自主协调跨异构数据的查询。同步的可验证知识库(VKB)支持逐单元验证协议,精确量化事实可靠性,区分系统性推理失败与参数幻觉。我们在领先的多模态大语言模型上的评估揭示了一个关键的“检索-推理权衡”:自主检索的认知负荷显著侵蚀了指令保持能力。VeriTrip为能够在无约束多模态环境中运行的下一代规划智能体提供了严格的基础。

英文摘要

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of autonomous agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception in logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing demands for agent robustness and reliability. VeriTrip shifts the evaluation focus to evidence-grounded reasoning over unstructured multimodal web corpora. It establishes a Multimodal Retrieval Base (MRB) derived from real-world sources, forcing agents to autonomously orchestrate queries across heterogeneous data. A synchronized Verifiable Knowledge Base (VKB) enables a cell-wise verification protocol that precisely quantifies factual reliability, distinguishing systematic reasoning failures from parametric hallucinations. Our evaluations across leading MLLMs reveal a critical "retrieval-reasoning trade-off": the cognitive load of autonomous retrieval significantly erodes instruction retention. VeriTrip provides the rigorous foundation necessary for the next generation of planning agents capable of operating in unconstrained, multimodal environments.
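
To make the idea of cell-wise verification concrete, the sketch below treats a generated itinerary as a table of cells and checks each cell against a reference knowledge base. This is only an illustrative reading of the abstract: the `(row, column)` keying, the exact-match comparison, and the function name are assumptions, not VeriTrip's actual protocol.

```python
def cellwise_accuracy(plan_cells, knowledge_base):
    """Fraction of plan cells that match the reference value for the same cell.

    plan_cells and knowledge_base both map a hypothetical (row, column) key
    to a value; cells with no reference entry are skipped rather than scored.
    """
    checked = [(key, value) for key, value in plan_cells.items() if key in knowledge_base]
    if not checked:
        return 0.0
    correct = sum(1 for key, value in checked if knowledge_base[key] == value)
    return correct / len(checked)
```

Scoring per cell rather than per plan is what lets a benchmark report how much of an itinerary is factually supported instead of a single pass/fail outcome.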

2605.28680 2026-05-28 cs.HC cs.AI cs.CY 版本更新

AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness

职场中的AI:人工智能对感知工作体面性和意义性的影响

Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian

发表机构 * University of Siegen(锡根大学)

AI总结 本研究通过对24名来自IT、服务和医疗行业员工的访谈,探讨了AI对工作满意度的感知影响,发现不同职业领域对AI带来的工作体面性和意义性变化预期不同,从而影响整体满意度。

Comments Accepted to CSCW 2026 / Proceedings of the ACM on Human-Computer Interaction (PACMHCI)

详情
AI中文摘要

人工智能在工作场所的普及正在改变我们的工作方式。虽然现有关于人机协作的研究通常优先考虑绩效,但对其体验结果知之甚少。通过对24名来自信息技术、服务和医疗行业的员工进行访谈,本文考察了AI通过感知工作体面性和意义性对当前和未来工作满意度的影响。我们的结果显示,AI对整体工作满意度的预期影响因职业领域而异,对其潜在的体面性和意义性的感知也不同。例如,IT和医疗行业预期在工时等体面性方面满意度提高,但由于误解AI将处理大部分任务,在社交形象等意义性方面满意度下降。相反,服务行业员工预计工时无改善,但由于与AI合作带来的地位提升感知,社会地位会提高。

英文摘要

The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about its experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differing perceptions of its underlying decency and meaningfulness. For instance, IT and healthcare workers anticipate increased satisfaction with decency aspects like working hours but decreased satisfaction with meaningfulness aspects like social image due to misconceptions about AI handling most of their tasks. Conversely, service workers foresee no improvement in their working hours but a higher social standing due to the perceived status boost associated with working with AI.

2605.28678 2026-05-28 cs.AI 版本更新

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

DREAM-R: 基于强化学习的精炼草稿、精确验证与完全并行执行的多模态推测推理

Yunhai Hu, Zining Liu, Xiangyang Yin, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

发表机构 * New York University(纽约大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出DREAM-R框架,通过强化学习优化草稿生成、阈值验证机制和完全并行执行,加速多模态模型的推理密集型任务,同时保持准确性。

详情
AI中文摘要

推测推理最近被提出作为加速大型多模态模型中推理密集型生成的一种手段,但其有效性常受限于推测草稿与目标验证推理之间的不匹配。在本工作中,我们引入了DREAM-R,一个显著提升推测推理性能的框架。其核心是采用推测对齐策略优化(SAPO),这是一种强化学习目标,训练草稿模型生成既忠实于目标轨迹又简洁的推理步骤。我们进一步提出基于阈值的验证机制(TBVM),使用基于比率的标准,仅在正面证据明显占优时稳定且可解释地接受推测步骤,从而防止错误传播。基于这些组件,我们开发了完全并行推测推理(FPSR)框架,该框架将草稿生成、目标侧推理和验证并行化到多步推理中,支持提前停止和干净回退。在推理密集型基准上的实验表明,在保持目标模型准确性的同时,实现了高达[具体加速比]的加速,在不牺牲推理质量的情况下带来了显著的效率提升。

英文摘要

Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold-based Verification Mechanism (TBVM) that uses a ratio-based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target-side reasoning, and verification across multi-step reasoning, enabling early stopping and clean fallback. Experiments on reasoning-heavy benchmarks demonstrate up to speedup while preserving target-model accuracy, yielding substantial efficiency gains without compromising reasoning quality.
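
The ratio-based acceptance rule mentioned in the abstract above can be sketched as follows. This is a hedged illustration rather than DREAM-R's implementation: the per-step positive and negative evidence scores, the threshold value, and the stop-at-first-rejection policy are all assumptions introduced here.

```python
def accept_step(pos_score, neg_score, ratio_threshold=3.0, eps=1e-9):
    """Accept a drafted step only when positive evidence clearly dominates."""
    return pos_score / (neg_score + eps) >= ratio_threshold

def verify_draft(steps, ratio_threshold=3.0):
    """Return the accepted prefix of drafted steps.

    steps is a list of (text, pos_score, neg_score) tuples; verification stops
    at the first rejected step so a doubtful step cannot propagate further.
    """
    accepted = []
    for text, pos_score, neg_score in steps:
        if not accept_step(pos_score, neg_score, ratio_threshold):
            break
        accepted.append(text)
    return accepted
```

After a rejection, a system of this kind would presumably fall back to the target model for the remaining steps, which matches the "clean fallback" the abstract mentions.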

2605.28669 2026-05-28 cs.CL cs.AI 版本更新

Sense Representations Are Inducible Interfaces

词义表示是可诱导的接口

Jan Christian Blaise Cruz, Alham Fikri Aji

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出ACROS方法,通过门控残差加法在冻结的预训练解码器LM中诱导显式词义通路,实现零样本词义消歧、低KL词义引导和跨语言适应,保持基础LM质量。

Comments https://github.com/jcblaisecruz02/acros

详情
AI中文摘要

词义表示(显式的、每个标记的意义分解)对于消歧、引导和跨语言对齐很有用,但现有方法要求模型在预训练时就内置词义结构。我们引入了ACROS,它通过门控残差加法在冻结的预训练解码器LM中诱导出显式的词义通路。在SmolLM2-360M上,ACROS在保持基础LM质量的同时,支持相同诱导变量的三种用途:零样本词义消歧(Raganato ALL上F1为64.95,与WordNet首义启发式方法相当)、在5,161个CoInCo案例中进行低KL词义引导(其中简单的非oracle代理恢复了约90%的正向偏移),以及针对四种语言的SENSIA跨语言适应(平均R@1为0.988,目标FLORES PPL为7.94)。ACROS使词义表示成为普通预训练LM的可诱导接口。

英文摘要

Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.
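
A gated residual addition, as named in the abstract above, can be sketched in a few lines. The sigmoid gate, the scalar gate logit, and the vector shapes are assumptions made for illustration; the abstract does not specify how ACROS parameterizes the gate or the sense pathway.

```python
from math import exp

def sigmoid(z):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def gated_residual(hidden, sense, gate_logit):
    """Add a gated sense vector to a frozen model's hidden state.

    Computes h' = h + g * s with g = sigmoid(gate_logit). When the gate is
    near zero the hidden state passes through almost unchanged.
    """
    gate = sigmoid(gate_logit)
    return [h + gate * s for h, s in zip(hidden, sense)]
```

One plausible reading of why such a design can preserve base LM quality is visible here: with the gate closed, the output is numerically the frozen model's own hidden state.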

2605.28666 2026-05-28 cs.AI 版本更新

An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning

基于LLM的直观灵活能力规划辅助系统

Luis Miguel Vieira da Silva, Nicolas König, Felix Gehlhoff

AI总结 提出一种混合辅助系统,将基于能力的形式化SMT规划与LLM自然语言交互层结合,通过人机协同实现规划解释与知识模型自适应,提升工业自动化中能力规划的可访问性和灵活性。

详情
AI中文摘要

在现代工业中,动态环境以及模块化和可重构资源的复杂性要求对过程序列进行自动化规划。基于能力的规划方法通过从以机器可解释形式描述资源功能的语义知识模型自动生成计划来解决这一问题。然而,其实际应用仍然有限:求解器反馈(特别是在不可满足情况下)难以解释,并且知识模型需要随着操作条件变化或请求变得不可行而进行调整。本文提出一种混合辅助系统,通过基于大语言模型(LLM)的自然语言交互、解释和适应层,增强现有的基于能力的可满足性模理论(SMT)规划方法。形式化规划的正确性仍由符号规划器保证,而LLM层在明确的人机协同(HitL)批准下处理自然语言访问和灵活的知识模型适应。该系统分解为四个组件:能力基础化、符号规划、结果解释和规划适应,实现为路由代理工作流,其中中央路由器将任务委派给五个专门代理。该系统在模块化生产系统上针对四种场景类型进行了评估。在23个测试案例中,10个知识查询中的9个和所有4个可满足规划案例均被正确处理,4个不可满足案例中的3个产生了具体的修复建议,所有5个自适应规划场景通过迭代的、用户批准的知识模型修改最终生成了可满足计划。研究结果证实,将形式化规划与基于LLM的辅助相结合,显著提高了工业自动化的可访问性和适应性。

英文摘要

In modern industry, dynamic environments and the complexity of modular and reconfigurable resources require automated planning of process sequences. Capability-based planning approaches address this by automatically generating plans from semantic knowledge models that describe resource functions in a machine-interpretable form. Their practical use, however, remains limited: solver feedback, especially in the case of unsatisfiability, is difficult to interpret, and the knowledge models require adaptation as operational conditions change or requests become infeasible. This paper presents a hybrid assistance system that augments an existing capability-based Satisfiability Modulo Theories (SMT) planning approach with a Large Language Model (LLM)-based layer for natural-language interaction, explanation, and adaptation. Formal planning correctness remains with the symbolic planner, while the LLM layer handles natural-language access and flexible knowledge model adaptation under explicit Human-in-the-Loop (HitL) approval. The system decomposes into four components: Capability Grounding, Symbolic Planning, Result Interpretation, and Planning Adaptation, realized as a routed agentic workflow in which a central router delegates to five specialized agents. The system is evaluated on a modular production system across four scenario types. Of 23 test cases, 9 of 10 knowledge queries and all 4 satisfiable planning cases were handled correctly, 3 of 4 unsatisfiable cases produced concrete repair proposals, and all 5 adaptive planning scenarios resolved into satisfiable plans through iterative, user-approved knowledge model modifications. The findings confirm that combining formal planning with LLM-based assistance substantially improves accessibility and adaptability in industrial automation.

2605.28655 2026-05-28 cs.AI 版本更新

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

AutoScientists: 用于长期科学实验的自组织智能体团队

Shanghua Gao, Ada Fang, Marinka Zitnik

发表机构 * Harvard University(哈佛大学)

AI总结 提出一种去中心化的AI智能体团队系统AutoScientists,通过自组织协作、提案评审和失败知识共享,在生物医学机器学习、语言模型训练优化和蛋白质适应性预测等长期实验中显著优于现有方法。

详情
AI中文摘要

科学研究通过假设生成、实验设计、执行和修正的迭代循环进行。AI智能体可以自动化这一过程的某些部分,但现有方法通常遵循单一研究轨迹,或通过具有固定目标的中央规划器进行协调。因此,它们难以维持并行探索、根据实验证据的变化进行调整,或在长期实验中保留失败方向的知识。我们引入了AutoScientists,一个用于长期计算科学实验的去中心化AI智能体团队。智能体解释共享的实验状态,围绕有希望的假设自组织成团队,在使用实验计算资源之前评审提案,并分享成功和失败以减少冗余探索。在匹配的实验预算下,AutoScientists在生物医学机器学习、语言模型训练优化和蛋白质适应性预测方面优于先前的AI智能体。在涵盖生物医学成像、蛋白质工程、单细胞组学和药物发现的BioML-Bench上,AutoScientists在24个任务中达到了74.4%的平均排行榜百分位,比最强的AI智能体提高了8.33%。在GPT训练优化中,AutoScientists达到目标验证bits-per-byte的速度比Autoresearch快1.9倍,并从初始冠军开始持续发现改进,而单智能体方法未发现任何改进(7个接受改进对比0个)。在ProteinGym适应性预测中,AutoScientists发现了一种ACE2-Spike结合方法,其Spearman相关性比当前最先进模型提高了12.5%。在未经修改地应用于所有217个ProteinGym检测时,相同方法比先前最先进技术提高了6.5%(Spearman相关性)。

英文摘要

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

2605.28647 2026-05-28 cs.AI cs.CY q-fin.RM 版本更新

The Ethics of LLM Sandbox and Persona Dynamics

LLM沙盒与人格动态的伦理

Tim Gebbie, Stewart Gebbie

发表机构 * University of Cape Town(开普敦大学)

AI总结 本文论证LLM护栏和人格动态产生的现实差距(reality gap)构成不道德的“现实洗白”(reality laundering),并提出通过任务级因果需求规范而非响应级道德修正来解决。

Comments 8 pages

详情
AI中文摘要

众所周知,LLM护栏和训练的人格动态会产生现实差距:LLM被允许或塑造描述的世界与用户必须行动的世界之间的距离。这里我们论证,主动产生现实差距实际上是不道德的,因为它有意将认知风险转嫁给不知情的用户——这就是现实洗白。当大规模运作时,这可能会造成伤害。在高暴露建议情境中风险最为尖锐,用户寻求的是方向而非有边界、可外部检查的任务。护栏在声称防止直接伤害时看似在伦理上必要,但当它们压制真实感知并将令人不适的机制洗白为可接受的抽象时,往往变得可疑。巴塞尔式金融监管、B-BBEE式合规、法国兴业银行和伦敦鲸事件展示了正式安全系统如何变得可理解、可博弈和表演性,而真实风险却转移到了别处。同样的模式可能出现在LLM中作为道德合规:安全的语言,扭曲的现实。因此,我们区分拒绝伤害与拒绝现实;然后主张在任务层面进行自上而下的因果需求规范,而非在响应或沙盒层面进行自下而上的道德修正。人格动态之所以重要,是因为助手界面并非中立;它塑造了不确定性、冲突、权威和风险如何被呈现。结论是,所谓的“伦理AI”当用制度安慰替代与现实接触时,实质上变得不伦理。

英文摘要

It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to describe, and the world in which users must act. Here we argue that actively generating reality gaps is in fact unethical because it knowingly shifts epistemic risk back to the uninformed user -- this is reality laundering. This can potentially cause harm when operationalised at scale. The risk is sharpest in high-exposure advice contexts, where users seek orientation rather than a bounded, externally checkable task. Guardrails naively appear ethically necessary when they claim to prevent direct harm, but often become suspect when they suppress truthful perception and launder uncomfortable mechanisms into acceptable abstractions. Basel-style financial regulation, B-BBEE-style compliance, Societe Generale, and the London Whale show how formal safety systems can become legible, gameable, and performative while real exposure migrates elsewhere. The same pattern can appear in LLMs as moral compliance: safe language, distorted reality. We therefore distinguish refusing harm from refusing reality, and then argue for top-down causal requirements specification at the task level rather than bottom-up moral correction at the response or sandbox level. Persona dynamics matter because the assistant interface is not neutral; it shapes how uncertainty, conflict, authority, and risk are staged. The conclusion is that so-called "ethical AI" becomes substantively unethical when it substitutes institutional reassurance for contact with reality.

2605.28642 2026-05-28 cs.AI 版本更新

Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

带宽高效且隐私保护的边缘-云多对多语音翻译

Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Ming Liu, Bing Qin, Yang Xiang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Pengcheng Laboratory(鹏城实验室)

AI总结 提出边缘-云协同框架ESRT,通过分割推理架构压缩中间特征实现带宽降低10倍和语音隐私保护,并采用多任务加权课程学习策略实现45种语言的多对多语音翻译。

详情
AI中文摘要

多模态大语言模型(MLLMs)在语音到文本翻译(S2TT)方面展现出巨大潜力。然而,现有部署范式面临关键挑战:纯设备端模型受资源限制,而集中式云系统通过传输原始语音数据导致严重的隐私风险和带宽瓶颈。此外,大多数模型表现出以英语为中心的偏见,限制了多对多翻译的扩展。在本文中,我们提出边缘-云语音识别与翻译(ESRT),一种隐私保护且带宽高效的协作式边缘-云MLLM框架。具体而言,我们设计了一种边缘-云分割推理架构,在设备上保留轻量级语音编码器和适配器,仅将高度压缩的中间特征传输到云端。这从根本上防止了声纹泄露,并将带宽需求降低高达10倍。为克服以英语为中心的瓶颈,我们引入了一种多任务加权课程学习策略与数据平衡,以确保鲁棒的跨语言一致性。在FLEURS数据集上的大量实验表明,我们的模型ESRT-4B和ESRT-12B在45种语言(45×44个方向)上实现了最先进的多对多S2TT性能。代码和模型已发布,以促进可复现的、隐私感知的MLLM S2TT研究。代码和模型发布于https://github.com/yxduir/esrt。

英文摘要

Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English-centric biases, restricting many-to-many translation scaling. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device, transmitting only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10×. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages (45 × 44 directions). Code and models are released at https://github.com/yxduir/esrt to facilitate reproducible, privacy-aware MLLM S2TT research.
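
The two figures quoted in the abstract above, the number of translation directions and the bandwidth reduction, follow from simple arithmetic. The sketch below works them out; the 256 kbps raw-audio bitrate (16 kHz, 16-bit mono PCM) is an illustrative assumption and does not come from the abstract.

```python
def translation_directions(n_languages):
    """Ordered source-target pairs with distinct languages: n * (n - 1)."""
    return n_languages * (n_languages - 1)

def compressed_bandwidth(raw_kbps, reduction_factor):
    """Bandwidth after dividing the raw rate by a reduction factor."""
    return raw_kbps / reduction_factor

# 45 languages give 45 * 44 ordered pairs.
directions = translation_directions(45)

# Assumed raw rate: 16 kHz * 16 bit mono PCM = 256 kbps; a 10x reduction
# would bring the uplink down to roughly a tenth of that.
uplink_kbps = compressed_bandwidth(256.0, 10)
```

The direction count grows quadratically with the number of languages, which is why many-to-many coverage is much harder to reach than English-centric coverage.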

2605.28639 2026-05-28 cs.CL cs.AI 版本更新

The Attentional White Bear Effect in Transformer Language Models

Transformer语言模型中的注意力白熊效应

Rebecca Ramnauth, Brian Scassellati

发表机构 * Yale University(耶鲁大学)

AI总结 通过表征探测、注意力分析和行为语义泄露实验,发现指令抑制下Transformer语言模型仍能恢复被禁止概念的表征并影响后续生成,揭示了行为对齐与表征对齐之间的根本差距。

Comments Currently under review at EMNLP 2026

详情
AI中文摘要

基于指令的抑制被广泛用于防止语言模型生成被禁止的内容,但尚不清楚抑制是减少了内部表征还是仅仅抑制了表达。我们通过跨多个Transformer模型的表征探测、注意力分析和行为语义泄露实验来研究这个问题。我们发现,在抑制下,被禁止的概念仍然可以从隐藏表征中高度恢复,继续影响注意力路由,并且在成功避免词汇的情况下可测量地塑造下游生成。这些效应在池化策略、间接语义控制和多个模型家族中持续存在。我们的结果暴露了行为对齐与表征对齐之间的根本差距。

英文摘要

Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.

2605.28632 2026-05-28 cs.CR cs.AI 版本更新

Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

盲PRNG劫持:一种针对LLM水印的不可检测的完整性保持攻击

Ziyang You, Huilong He, Xiaoke Yang, Xuxing Lu

发表机构 * Fujian Provincial Key Laboratory of Automotive Electronics and Electric Drive(福建省汽车电子与电力驱动重点实验室) School of Electronic, Electrical and Physics(电子、电气与物理学院) Fujian University of Technology(福建理工大学) School of Humanities(人文学院) Institute of Applied Physics and Materials Engineering(应用物理与材料工程学院) University of Macau(澳门大学)

AI总结 提出SeedHijack攻击,通过替换伪随机数生成器(PRNG)在供应链层面对LLM水印进行盲攻击,同时保持完整性并规避检测。

Comments Preprint prepared for submission to IEEE TIFS. 12 pages, 8 figures

详情
AI中文摘要

密码学水印是归因大型语言模型(LLM)生成文本的主要防御手段。现有方案(包括KGW、Unigram和DipMark)的安全性基于底层伪随机数生成器(PRNG)可信的假设。本文引入SeedHijack,这是首个针对LLM水印的供应链攻击,同时满足:(i) 盲——无需知道水印密钥、检测器或模型logits;(ii) 完整性保持——放大而非擦除水印信号;(iii) 与检测正交——攻击引入的偏差与所有内容侧检测器统计独立,确保放大和规避共存而无权衡。SeedHijack不扰动生成文本,而是在供应链层替换PRNG,偏向绿名单选择而不改变输出令牌或降低文本质量。在三种水印方案和三个开源LLM上,攻击触发了0/6个最先进的内容侧统计检测器,同时将水印z-score放大至2.42倍(系统级防御如熵源认证保持正交和互补)。量子随机数生成器(QRNG)对策被证明能完全中和攻击,同时保持良性水印效用。这些发现确立了PRNG完整性作为密码学内容来源系统的一等安全需求。

英文摘要

Cryptographic watermarking is a leading defense for attributing text generated by large language models (LLMs). Existing schemes, including KGW, Unigram, and DipMark, derive their security guarantees from the assumption that the underlying pseudo-random number generator (PRNG) is trustworthy. This work introduces SeedHijack, the first supply-chain attack on LLM watermarking that is simultaneously (i) blind -- requiring no knowledge of the watermark key, detector, or model logits, (ii) integrity-preserving -- amplifying rather than erasing the watermark signal, and (iii) orthogonal to detection -- the attack-induced bias is statistically independent of all content-side detector statistics, ensuring that amplification and evasion coexist without trade-off. Rather than perturbing generated text, SeedHijack replaces the PRNG at the supply-chain layer, biasing green-list selection without altering output tokens or degrading text quality. Across three watermarking schemes and three open-source LLMs, the attack triggers 0/6 state-of-the-art content-side statistical detectors while inflating the watermark z-score up to 2.42x (system-level defenses such as entropy-source attestation remain orthogonal and complementary). A quantum random number generator (QRNG) countermeasure is shown to fully neutralize the attack while preserving benign watermarking utility. These findings establish PRNG integrity as a first-class security requirement for cryptographic content-provenance systems.
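
For readers unfamiliar with the watermark z-score the abstract above refers to, the sketch below computes the one-proportion z-statistic commonly used by green-list (KGW-style) detectors. The green-list fraction `gamma` of 0.5 and the token counts are illustrative assumptions; the paper's exact detector settings are not given in the abstract.

```python
from math import sqrt

def watermark_z_score(green_count, total_tokens, gamma=0.5):
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma)).

    Under the null hypothesis of unwatermarked text, each token lands on the
    green list with probability gamma, so a large z signals a watermark.
    """
    expected = gamma * total_tokens
    std = sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std
```

Because the statistic depends only on how many sampled tokens fall on the green list, anything that biases green-list selection, including the randomness source, shifts the z-score without the text itself being edited.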

2605.28617 2026-05-28 cs.AI cs.PL 版本更新

LACUNA: Safe Agents as Recursive Program Holes

LACUNA: 作为递归程序空洞的安全智能体

Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 提出LACUNA编程模型,通过类型化调用和编译时检查,让LLM智能体以递归程序空洞的方式安全地编写代码,实现表达性与安全性的统一。

详情
AI中文摘要

LLM智能体越来越多地通过编写代码来行动,但驱动智能体的运行时与模型编写的代码之间仍然存在分裂。运行时拥有循环、上下文和控制流,而模型对这些几乎没有发言权。让模型编写的代码塑造运行时本身将使智能体更具表达性,但也会加剧安全问题。模型可能因提示注入而偏离方向、调用错误的工具,或在执行中途失败并留下不一致的状态,而当代码塑造运行时,此类失败的波及范围比仅表达单个动作时更广。我们提出了LACUNA,一种智能体编程模型,它在保持安全性的同时弥合了这种分裂。每个智能体动作都是一个类型化调用 agent[T](task),当执行到达该调用时,LLM用代码填充它,并且在代码运行之前,会针对周围程序进行类型检查。由于每个动作作为一个整体被接受或拒绝,被拒绝的动作不会影响环境,其编译器诊断信息会驱动重试。同样的检查也限制了动作可以使用哪些工具和数据以及它们如何流动。我们的原语将ReAct循环、子智能体、技能、并行分解和多模型规划表达为普通的控制流。我们在测试用例集合、BrowseComp-Plus和τ²-bench上评估了LACUNA。在BrowseComp-Plus上,8.6%的生成在执行前被拒绝,平均每次查询重试0.7次,智能体达到27.1%的准确率。在τ²-bench上,LACUNA使用一个能力强的模型解决了四个领域392个任务中的76.0%,与基线智能体相当。

英文摘要

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call agent[T](task) that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and τ²-bench. On BrowseComp-Plus, 8.6% of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches 27.1% accuracy. On τ²-bench, LACUNA solves 76.0% of 392 tasks across four domains with a capable model, on par with the baseline agent.
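
The check-before-run loop described in the abstract above can be approximated in Python as a rough analogy. LACUNA itself relies on compile-time type checking; the sketch below substitutes a much weaker static check (every name in a generated expression must be an allowed tool, and attribute access is refused), so it illustrates only the accept-or-reject-as-a-whole and retry-on-diagnostics ideas. The `generate` callback standing in for the LLM, and all function names, are hypothetical.

```python
import ast

def check_action(code, allowed_tools):
    """Statically inspect a generated expression; return diagnostics (empty = accepted)."""
    try:
        tree = ast.parse(code, mode="eval")
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    diagnostics = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id not in allowed_tools:
            diagnostics.append(f"name not in scope: {node.id}")
        elif isinstance(node, ast.Attribute):
            diagnostics.append(f"attribute access not permitted: .{node.attr}")
    return diagnostics

def run_action(generate, task, tools, max_retries=2):
    """Ask for code, check it as a whole, and execute only an accepted candidate.

    generate(task, diagnostics) is a hypothetical stand-in for the model call;
    diagnostics from a rejected attempt are passed back to drive the retry.
    """
    diagnostics = []
    for attempt in range(max_retries + 1):
        code = generate(task, diagnostics)
        diagnostics = check_action(code, tools)
        if not diagnostics:
            # Nothing has executed before this point, so rejected attempts
            # cannot have changed the environment.
            value = eval(code, {"__builtins__": {}}, dict(tools))
            return value, attempt
    raise RuntimeError("; ".join(diagnostics))
```

The design point this mirrors is that rejection happens before any side effect: a candidate that names an unavailable tool never runs, and the diagnostics become input to the next attempt.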

2605.28616 2026-05-28 cs.CL cs.AI 版本更新

Measuring Form and Function in Language Models

语言模型中的形式与功能测量

Héctor Javier Vázquez Martínez, Charles Yang

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Department of Linguistics and Computer and Information Science(语言学与计算机与信息科学系)

AI总结 通过引入儿童语言习得的定量指标,提出上下文替代选择(CAC)提示方法,评估语言模型在英语限定词的形式句法和功能话语知识方面的表现,发现仅大型模型能同时满足形式和功能基准。

Comments Under review at ACL Rolling Review May 2026 cycle

详情
AI中文摘要

我们引入儿童语言习得的定量指标来评估语言模型。我们的重点是英语中限定词的形式句法和功能话语属性,这些属性幼儿早期就能准确习得。我们提出了上下文替代选择(CAC),一种新的提示方法,为语言的句法和话语知识提供针对性测试。该方法能够直接将语言模型与儿童进行比较,更重要的是,与实证研究中独立建立的统计基准进行比较。目前,没有在可比数据量上训练的模型能像人类儿童一样同时满足形式和功能基准,但一些非常大的模型可以做到。我们将结果作为方法论和技术贡献呈现,特别强调语言模型的认知状态。

英文摘要

We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and functional discourse properties of determiners in English, which young children acquire early and accurately. We propose Contextual Alternative Choice (CAC), a new prompting method which provides targeted tests for both syntactic and discourse knowledge of language. The method enables direct comparison of language models against children, and more importantly, against statistical benchmarks independently established in empirical research. No current model trained on a comparable amount of data simultaneously meets both formal and functional benchmarks as human children do, but some very large models do. We present our results as methodological and technical contributions, with specific emphasis on the cognitive status of language models.

2605.28607 2026-05-28 cs.AI cs.CL 版本更新

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

面向自动工作流执行的自适应多模态智能体框架

Susanna Cifani, Mario Luca Bernardi, Marta Cimitile

发表机构 * Sapienza University of Rome(罗马萨皮恩扎大学) Department of Engineering University of Sannio(萨尼奥大学工程系) Faculty of Jurisprudence Unitelma Sapienza University(法理学院萨皮恩扎大学)

AI总结 提出一种多模态多智能体框架,通过离线构建拓扑知识库和在线自适应检索增强生成与闭环协作验证,实现自动工作流执行。

Comments Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2026)

详情
AI中文摘要

现代信息系统需要能够导航复杂工作流的自主智能体,但当前方法在从结构化元数据解析过渡到通用环境感知时常常遇到困难。虽然多模态大语言模型的集成使智能体能够直接与图形用户界面交互,但现有方法通常将任务序列视为离散的线性片段。这种碎片化阻止了智能体捕捉底层转移拓扑结构,限制了它们在新型或非平稳场景中的有效性。为了解决这个问题,我们提出了一种新颖的多模态多智能体框架,通过一个独特的两阶段流程实现自动工作流执行。首先,在离线发现阶段,该架构从碎片化的执行日志中自适应地构建拓扑知识库。在推理过程中,智能体利用自适应检索增强生成(RAG)作用于这个固定的、预先建立的图,并结合闭环协作验证协议进行动态自我纠正和导航。这种基于图的方法促进了优越的任务分解和自适应导航性能。我们在真实世界环境中验证了该框架,展示了即使在训练数据有限的情况下,它也能保持高可靠性和语义感知能力。

英文摘要

Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

2605.28604 2026-05-28 cs.CV cs.AI 版本更新

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

挖掘多模态时空线索用于视频重要人物识别

Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang, Zheng Wang, Xin Xu, Mang Ye

发表机构 * School of Computer Science and Technology, Wuhan University of Science and Technology(武汉科技大学计算机科学与技术学院) Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology(湖北省智能信息处理与实时工业系统重点实验室) School of Computer Science, National Engineering Research Center for Multimedia Software, Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University(计算机科学学院,国家多媒体软件工程技术研究中心,湖北省多媒体与网络通信工程重点实验室,武汉大学) College of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学)

AI总结 针对视频中人物重要性随时间变化的问题,提出VIP-Net框架,通过多模态时空线索融合与时间重要性矫正,在Temporal-VIP数据集上达到67.3%准确率。

详情
AI中文摘要

识别视频场景中的关键人物对于自动视频编辑和智能监控等应用至关重要。当前方法主要关注静态图像和即时视觉线索,忽略了视频中丰富的时空信息。这导致了时间重要性转移(TIS)现象,即早期帧中被认为重要的人物在考虑整个时间上下文后可能被降级。为了解决这一问题,我们引入了视频重要人物(VIP)识别任务,旨在自动识别视频中最具影响力的人物,同时提供文本理由。我们提出了Temporal-VIP,一个大规模的理由标注数据集,包含11个类别的9,249个视频片段,并附有对齐的重要性理由。为了缓解TIS,我们开发了VIP-Net框架,包括用于提取多模态时空线索的社会线索编码器(SCE)、用于层次化线索融合和跨模态对齐的时间重要性矫正器(TIR),以及用于人物排序的VIP推理。实验结果表明,VIP-Net达到了67.3%的准确率,显著优于最先进的模型(37.5%-53.9%),并通过特征引导的LLM优化,平均理由相似度达到0.63。数据集和代码可在https://huggingface.co/datasets/yml2002/Temporal-VIP获取。

英文摘要

Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at https://huggingface.co/datasets/yml2002/Temporal-VIP.

2605.28603 2026-05-28 cs.LG cs.AI 版本更新

Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration

在线不规则多变量时间序列预测:基于不确定性驱动的双专家校准

Haonan Wen, Hanyang Chen, Songhe Feng

发表机构 * Key Laboratory of Big Data & Artificial Intelligence in Transportation (Beijing Jiaotong University), Ministry of Education; School of Computer Science & Technology, Beijing Jiaotong University, Beijing, China; Tangshan Research Institute, Beijing Jiaotong University, Tangshan, China

AI总结 针对在线不规则多变量时间序列预测中数据分布动态变化导致性能下降的问题,提出不确定性驱动的双专家校准框架Under-Cali,通过不确定性估计、双专家校准和自适应路由模块实现稳定高效的在线学习。

Comments Accepted by KDD 2026

详情
AI中文摘要

不规则多变量时间序列预测在许多实际应用中至关重要,其中时间序列是不规则采样的,并表现出动态演变的缺失模式。尽管现有方法在离线设置中表现良好,但在在线部署时,由于数据分布的动态变化,它们常常遭受显著的性能下降。在这种动态场景中保持预测能力通常需要在线自适应技术。由于不规则采样从根本上破坏了时间连续性和周期性,我们无法利用来自规则MTS的这些广泛研究的特性进行在线学习。为此,我们研究了在线IMTS预测问题,并提出了Under-Cali,一个不确定性驱动的双专家校准框架,包含三个核心组件:不确定性估计器、双专家校准模块和自适应路由模块。我们设计了一个不确定性估计器,作为核心控制信号来联合管理推理和自适应过程。在我们的框架中,不确定性估计器首先评估每个传入批次的不确定性。然后,自适应路由模块将高不确定性的样本引导至不可靠专家进行校准,而低不确定性样本则保留给可靠专家。随后,系统使用校准良好的可靠样本更新可靠专家和不确定性估计器,并使用具有挑战性的样本更新不可靠专家,从而实现稳定高效的在线学习。Under-Cali保持源预测模型冻结,仅通过轻量级、模型无关的校准模块进行自适应,从而实现高效自适应。在IMTS基准上的大量实验表明,在低计算成本下取得了持续的改进。我们的代码可在https://github.com/HaonanWen/Under-Cali获取。

英文摘要

Irregular multivariate time series forecasting is critical in many real-world applications, where time series are irregularly sampled and exhibit dynamically evolving missingness patterns. Although existing methods perform well in offline settings, they often suffer from significant performance degradation when deployed online due to dynamic shifts in data distribution. Maintaining forecasting capability in such dynamic scenarios typically necessitates online adaptation techniques. Since irregular sampling fundamentally undermines temporal continuity and periodicity, we cannot leverage these widely studied characteristics from regular MTS for online learning. To this end, we study the problem of online IMTS forecasting and propose Under-Cali, an uncertainty-driven dual-expert calibration framework consisting of three core components: an uncertainty estimator, a dual-expert calibration module, and an adaptive routing module. We design an uncertainty estimator that serves as the core control signal to jointly manage inference and adaptation processes. In our framework, the uncertainty estimator first assesses uncertainty for each incoming batch. The adaptive routing module then directs samples with high uncertainty to the unreliable expert for calibration, while low uncertainty samples remain with the reliable expert. Subsequently, the system updates the reliable expert and the uncertainty estimator using well-calibrated reliable samples, and updates the unreliable expert with challenging samples, enabling stable and efficient online learning. Under-Cali keeps the source forecasting model frozen and performs adaptation only through a lightweight, model-agnostic calibration module, enabling efficient adaptation. Extensive experiments on IMTS benchmarks demonstrate consistent improvements with low computational cost. Our code is available at https://github.com/HaonanWen/Under-Cali.
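
The routing step described in the abstract can be illustrated with a minimal sketch. It assumes a per-sample scalar uncertainty score and a fixed cut-off; the function name `route_batch` and the `threshold` value are hypothetical and are not taken from the Under-Cali code base, which may route differently.

```python
from typing import List, Sequence, Tuple


def route_batch(
    samples: Sequence[str],
    uncertainties: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not a value from the paper
) -> Tuple[List[str], List[str]]:
    """Split a batch into (reliable, unreliable) groups by uncertainty score."""
    reliable: List[str] = []
    unreliable: List[str] = []
    for sample, score in zip(samples, uncertainties):
        # High-uncertainty samples go to the "unreliable" expert for calibration;
        # the rest stay with the "reliable" expert.
        if score > threshold:
            unreliable.append(sample)
        else:
            reliable.append(sample)
    return reliable, unreliable


reliable, unreliable = route_batch(["a", "b", "c"], [0.1, 0.9, 0.5])
```

In the framework as summarized, each group would then drive a separate update (reliable samples for the reliable expert and the estimator, challenging samples for the unreliable expert); that update logic is not sketched here.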

2605.28602 2026-05-28 cs.AI cs.CL cs.LO 版本更新

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

用大语言模型求解可满足性问题:推理能力的配对评估

Leizhen Zhang, Shuhan Chen, Sheng Chen

发表机构 * University of Louisiana at Lafayette(路易斯安那大学拉斐特分校)

AI总结 提出配对公式协议和准确区分率(ADR)来评估大语言模型在SAT问题上的推理能力,发现传统指标具有误导性,而ADR能更忠实、跨表示鲁棒地评估模型。

Comments Accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026)

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于隐式归结为布尔可满足性(SAT)的任务,但它们在SAT上的推理能力仍不清楚。我们对LLMs在2-SAT和3-SAT上进行了系统研究,并使用了两个经典归约——顶点覆盖和离散3D装箱——来探测表示不变的推理。我们首先使用传统指标评估模型,包括准确率、精确率、召回率和F1,以及SAT相变设置。我们发现这些指标可能具有误导性:许多模型通过过度预测可满足公式获得高分,未能重现3-SAT阈值附近经典的易-难-易特征,并且随着变量数量的增加而急剧下降。为解决这个问题,我们引入了一个基于最小差异可满足和不可满足实例的配对公式协议,以及准确区分率(ADR),它要求每对中的两个成员都被正确分类。ADR将面向推理的模型与启发式模型区分开来,并与证据有效性相关。在CNF之外,我们通过将CNF转换为顶点覆盖和将3-SAT转换为离散3D装箱来测试跨表示一致性。大多数模型在超过80%的实例上,对CNF和对应图或装箱实例的决策一致,表明跨表示存在稳定的决策规则。总体而言,我们的结果表明SAT是LLM推理的一个保守探针,并且使用ADR的配对评估比传统指标提供了更忠实且表示鲁棒的评估。

英文摘要

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
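
To make the difference between conventional accuracy and ADR concrete, here is a small sketch based only on the definition given in the abstract: a pair counts toward ADR only when both its satisfiable and its unsatisfiable member are classified correctly. The data layout and function names are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple

# Each pair stores the model's answers for (satisfiable member, unsatisfiable member).
# True means the model answered "satisfiable".
Pair = Tuple[bool, bool]


def accuracy(pairs: List[Pair]) -> float:
    """Fraction of individual formulas classified correctly."""
    correct = sum(int(sat_pred) + int(not unsat_pred) for sat_pred, unsat_pred in pairs)
    return correct / (2 * len(pairs))


def adr(pairs: List[Pair]) -> float:
    """Fraction of pairs in which BOTH members are classified correctly."""
    both_right = sum(1 for sat_pred, unsat_pred in pairs if sat_pred and not unsat_pred)
    return both_right / len(pairs)


# A model that always answers "satisfiable" versus one that sometimes discriminates.
always_sat = [(True, True)] * 4
mixed = [(True, False), (True, False), (True, True), (False, False)]
```

On this toy data the always-satisfiable predictor reaches 50% accuracy but an ADR of zero, which is the kind of over-prediction the paired protocol is meant to expose.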

2605.28598 2026-05-28 cs.CL cs.AI 版本更新

Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

评估基于LLM的社会智能体的真实性:对西班牙在线新闻反应的案例研究

Alejandro Buitrago López, Alberto Ortega Pastor, Javier Pastor-Galindo, José A. Ruipérez-Valiente

发表机构 * Faculty of Computer Science, University of Murcia(计算机科学系,穆尔西亚大学)

AI总结 通过比较真实与LLM生成的西班牙新闻评论,研究LLM在仇恨言论、情感和语义对齐三个维度上的真实性,发现现成模型表现不佳,微调可部分改善。

详情
AI中文摘要

基于LLM的社会智能体越来越多地被用于模拟在线社交行为,但其真实性仍然难以验证。现有工作主要依赖通用基准,而对简短的反应性话语(如受众对在线新闻的回复)关注较少。在本文中,我们评估LLM生成的西班牙新闻反应是否再现了真实受众话语的可测量属性。使用Hatemedia数据集,我们将5,631条新闻与58,555条真实受众反应配对,并在共享实验设置下使用五个LLM生成匹配的合成数据集。我们从仇恨言论、情感和语义对齐三个维度比较真实和合成反应,考虑现成和微调生成。结果表明,现成模型是真实受众反应的糟糕代理:它们严重低估仇恨言论,引入模型特定的情感偏差,并且在分布上与人类回复相距甚远。微调不均匀地提高了保真度。Qwen3提供了最平衡的近似,而Mistral7B实现了最强的情感和语义对齐,但过度估计了仇恨普遍性。看似合理的合成回复不一定再现公共话语的分布特性。

英文摘要

LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

2605.28597 2026-05-28 cs.CR cs.AI cs.LG 版本更新

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

立场:淘汰“正向后门”标签——秘密对齐需要严格且系统的评估

Jianwei Li, Jung-Eun Kim

发表机构 * Department of Computer Science, North Carolina State University, Raleigh, USA(北卡罗来纳州立大学计算机科学系)

AI总结 本文主张停止使用“正向后门”标签,将触发激活的隐藏行为视为秘密对齐,并通过评估三个代表性应用在六个核心属性上的表现,揭示其脆弱性,呼吁进行严格评估。

Comments ICML 2026

详情
AI中文摘要

这篇立场论文认为,AI/ML社区应停止过度宣称并淘汰“正向后门”标签,而应将触发激活的隐藏行为视为秘密对齐。关键在于,基于秘密对齐的保护性主张在缺乏严格、标准化评估的情况下,默认不应被视为安全。私有AI时代,通过开放权重的LLM和可访问的训练/推理栈,语言模型成为私有数字资产,产生了关于未授权访问、模型盗窃和行为滥用的安全问题。最近,一系列被称为“正向后门”的工作被提出以应对这些挑战。为将我们的立场建立在证据基础上,我们将这些提议统一为用于访问门控、所有权归属和安全执行的隐蔽触发-行为关联,并评估了三个代表性应用在六个核心属性上的表现:有效性、无害性、持久性、效率、鲁棒性和可靠性。我们的结果揭示了触发-行为映射的显著脆弱性——尤其是在机密性、完整性和可用性(CIA)方面——这些往往被现有声称低估。我们进一步将这些结果与行为密度和决策复杂性联系起来,提供了一个理解部署时风险的行为视角,并激励社区范围内的评估,使秘密对齐主张可证明。

英文摘要

This position paper argues that the AI/ML community should stop overclaiming and retire the label "positive backdoor," and instead treat trigger-activated hidden behaviors as Secret Alignment. Crucially, protective claims based on Secret Alignment should be presumed not secure by default unless supported by rigorous, standardized evaluation. The Private AI era, enabled by open-weight LLMs and accessible training/inference stacks, turns language models into privately owned digital assets, creating security concerns around unauthorized access, model theft, and behavioral misuse. Recently, a line of work framed as "positive backdoors" has been proposed to address these challenges. To ground our position in evidence, we unify these proposals as covert trigger-behavior associations for access gating, ownership attribution, and safety enforcement, and evaluate three representative applications across six core properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. Our results reveal substantial brittleness - especially in the confidentiality, integrity, and availability (CIA) - of trigger-behavior mappings often underrepresented by existing claims. We further relate these outcomes to behavior density and decision complexity, offering a behavioral lens for understanding deployment-time risks and motivating community-wide evaluation that makes Secret Alignment claims provable.

2605.28594 2026-05-28 cond-mat.stat-mech cs.AI physics.comp-ph 版本更新

Thermodynamic properties of chemically disordered compounds via AI-driven estimation of partition function with the PULSE method

通过PULSE方法基于AI驱动配分函数估计的化学无序化合物热力学性质

Baptiste Bernard, Luca Messina, Eiji Kawasaki, Emeric Bourasseau

发表机构 * CEA, DES, IRESNE, DEC(CEA,DES,IRESNE,DEC)

AI总结 提出改进的PULSE方法,通过无监督学习采样和估计配分函数,以低成本高效计算化学无序化合物的热力学性质,并在2D Ising模型上验证了其高精度和效率。

Comments 13 pages, 11 figures, submitted to Physical Chemistry Chemical Physics

详情
AI中文摘要

在本文中,我们提出了PULSE方法(配分函数无监督学习采样与评估)的改进版本,用于估计化学无序化合物的热力学性质。目的是降低这类材料蒙特卡罗方法的计算成本,并证明这种生成工具可以通过采样和估计系统的配分函数来估计热力学性质。为了验证这种创新方法,我们使用2D Ising模型作为基准。我们证明,与传统蒙特卡罗采样方法相比,我们的方法能够以高精度和效率准确再现平均性质。我们的结果突出了PULSE方法的效率和适应性,使其成为研究那些传统方法因化学无序影响而过于低效、无法低成本计算性质的材料的有价值工具。

英文摘要

In this article, we present an improved version of the PULSE method (Partition function Unsupervised Learning Sampling and Evaluation) for estimating the thermodynamic properties of chemically disordered compounds. The aim is to reduce the computational cost of Monte Carlo approaches for this type of material and to demonstrate that this generative tool can estimate thermodynamic properties by sampling and estimating the partition function of the system. To validate this innovative approach, we use the 2D Ising model as a benchmark. We demonstrate that our method accurately reproduces average properties with high precision and efficiency compared to traditional Monte Carlo sampling methods. Our results highlight the efficiency and adaptability of the PULSE method, making it a valuable tool for studying materials for which conventional methods are too inefficient to compute properties affected by chemical disorder at low cost.
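
For context on the benchmark, the partition function of a very small 2D Ising lattice can be computed exactly by enumeration, which gives a reference value that any sampler-based estimate could be checked against. The sketch below is not the PULSE method; it assumes a square lattice with periodic boundaries, coupling `J`, no external field, and units where the Boltzmann constant is 1.

```python
import itertools
import math
from typing import Sequence, Tuple


def energy(spins: Sequence[int], size: int, coupling: float = 1.0) -> float:
    """Nearest-neighbour Ising energy on a size x size periodic lattice."""
    total = 0.0
    for i in range(size):
        for j in range(size):
            s = spins[i * size + j]
            right = spins[i * size + (j + 1) % size]
            down = spins[((i + 1) % size) * size + j]
            # Visiting only the right and down neighbours counts each bond once
            # (for size >= 3).
            total -= coupling * s * (right + down)
    return total


def exact_partition(size: int, beta: float, coupling: float = 1.0) -> Tuple[float, float]:
    """Return (Z, mean energy) by summing over all 2**(size*size) configurations."""
    z = 0.0
    weighted_energy = 0.0
    for spins in itertools.product((-1, 1), repeat=size * size):
        e = energy(spins, size, coupling)
        w = math.exp(-beta * e)
        z += w
        weighted_energy += e * w
    return z, weighted_energy / z


z_hot, e_hot = exact_partition(3, beta=0.0)
z_cold, e_cold = exact_partition(3, beta=0.4)
```

Enumeration is only feasible for a handful of spins, which is why sampling-based estimators are needed for realistic system sizes.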

2605.28588 2026-05-28 cs.CR cs.AI 版本更新

Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

技术报告:探索智能体技能生态系统的新兴威胁

Luca Beurer-Kellner, Aleksei Kudrinskii, Marco Milanta, Kristian Bonde Nielsen, Hemang Sarkar, Liran Tal

发表机构 * Snyk

AI总结 本研究通过分析3984个AI智能体技能,发现76个恶意载荷,揭示了技能生态系统中的安全威胁,并提出了威胁分类和攻击模式。

Comments 10 pages, technical report

详情
AI中文摘要

我们分析了来自主要市场的3,984个AI智能体技能,发现了76个确认的恶意载荷,包括凭证窃取、后门安装和数据泄露。13.4%的技能至少包含一个关键级别的安全问题,截至发表之日,至少有8个手动确认的恶意技能仍在clawhub.ai上公开可用。本报告记录了我们的方法论,基于真实样本提出了威胁分类,并详细描述了观察到的攻击模式。随着技能市场快速增长,AI智能体获得敏感凭证和系统的访问权限,自动化安全分析不再是可选项。

英文摘要

We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration. 13.4% of all skills contain at least one critical-level security issue and at least 8 manually confirmed malicious skills remain publicly available on clawhub.ai as of the date of publication. This report documents our methodology, presents a threat taxonomy based on real-world samples, and details the attack patterns we observed. As skill marketplaces grow rapidly and AI agents gain access to sensitive credentials and systems, automated security analysis is no longer optional.

2605.28583 2026-05-28 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

SARAD:基于LLM的安全感知混合强化学习与碰撞预测在自动驾驶中的应用

Kangyu Wu, Peng Cui, Guoxi Chen, Ya Zhang

发表机构 * National Natural Science Foundation (NNSF) of China(中国国家自然科学基金委员会) National Science and Major Project(国家科学技术重大专项)

AI总结 提出SARAD框架,结合大语言模型和深度强化学习,通过检索增强生成和碰撞预测模块提升自动驾驶的安全性和效率。

Comments 7 pages, 4 figures, accepted by IJCNN 2026

详情
AI中文摘要

确保自动驾驶系统决策的安全性和效率仍然是一个基本挑战。传统的深度强化学习(DRL)存在不安全的随机探索和收敛缓慢的问题,而大语言模型(LLM)在实时推理操作中表现出固有的延迟。为了解决这些限制,本文提出了SARAD,一种新颖的安全感知混合框架,协同LLM和DRL用于自动驾驶。SARAD用来自动态专家知识库的、经检索增强生成(RAG)增强的LLM引导决策替代了DRL的随机探索。提出了一个注意力判别器,将LLM的先验知识整合到DRL策略优化中。进一步设计了一个碰撞预测模块,使用历史碰撞数据进行微调,以提高车辆安全性。大量实验表明,SARAD在Highway-Env模拟器中实现了显著的性能提升,验证了所提模型在自动驾驶中的有效性。

英文摘要

Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Reinforcement Learning (DRL) suffers from unsafe random exploration and slow convergence, while Large Language Models (LLMs) demonstrate inherent latency in real-time inference operations. To address these limitations, this paper proposes SARAD, a novel safety-aware hybrid framework that synergizes LLMs and DRL for autonomous driving. SARAD substitutes the random exploration of DRL with Retrieval-Augmented Generation (RAG)-enhanced, LLM-guided decisions sourced from a dynamic expert knowledge repository. An attention discriminator is proposed to integrate the prior knowledge of LLMs into DRL policy optimization. A collision predictor module, fine-tuned with historical collision data, is further designed to improve vehicle safety. Extensive experiments show that SARAD achieves significant performance improvements in the Highway-Env simulator, validating the effectiveness of the proposed model in autonomous driving.

2605.28577 2026-05-28 cs.AI cs.LG 版本更新

Continual Model Routing in Evolving Model Hubs

演化模型库中的持续模型路由

Jack Bell, Giacomo Carfì, Gerlando Gramaglia, Vincenzo Lomonaco

发表机构 * Department of Computer Science, University of Pisa, Pisa, Italy(意大利比萨大学计算机科学系) LUISS University, Rome, Italy(意大利罗马LUISS大学)

AI总结 针对模型库快速扩展带来的模型选择和路由更新挑战,提出持续模型路由(CMR)问题,构建大规模基准CMRBench,并设计基于对比嵌入的CARvE方法,通过检查点锚定和结构化重放实现高效路由,显著优于多种基线。

Comments 42 pages, 24 tables, 6 figures, to be published at ICML 2026

详情
AI中文摘要

AI模型库提供了对快速增长的大量预训练模型的访问,使得具有不同路由策略的现成混合专家系统成为可能。然而,这种快速增长带来了两个基本挑战:跨数千个专家进行模型选择的扩展,以及随着新模型和任务的引入持续更新路由机制。在本文中,我们将这一设置形式化为持续模型路由(CMR),并提出了CMRBench,这是一个新的大规模基准,模拟现实的模型库扩展,包括超过2000个候选模型。最后,我们介绍了CARvE,一种对比嵌入方法,通过基于检查点的锚定和结构化重放实现高效的持续模型路由。大量的实验结果和消融研究表明,CARvE在模型、家族和领域级别的准确性上显著优于零样本检索、微调和适配器合并基线。

英文摘要

AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and propose CMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduce CARvE, a contrastive embedding approach for efficient continual model routing via checkpoint-based anchoring and structured replay. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.

2605.28575 2026-05-28 cs.AI 版本更新

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

一种冲突感知惩罚与统计损失框架,用于平衡模态并增强多模态情感分析的稳定性

Jianheng Dai, Jiazhang Liang, Sijie Mai

发表机构 * School of Computer Science, South China Normal University(华南师范大学计算机学院)

AI总结 针对多模态情感分析中文本模态主导导致梯度冲突的问题,提出冲突感知惩罚和统计损失框架,实现模态平衡与训练稳定,在CMU-MOSI上取得最优性能。

详情
AI中文摘要

多模态情感分析(MSA)融合文本、声学和视觉流来推断情感。由于预训练文本编码器的表达能力远强于声学和视觉编码器,文本模态往往主导优化过程,抑制较弱模态并引发梯度范数冲突,从而破坏训练稳定性。为解决此问题,我们提出一种冲突感知惩罚(CP),在每一步训练中检测并惩罚梯度范数冲突,以及一种统计损失(SL),使预测分布统计量与经验输入统计量对齐。关键的是,CP防止主导模态梯度干扰SL目标,从而在统一框架内实现协同训练,该框架包含自适应模态编码、门控跨模态融合和单模态辅助头。在CMU-MOSI上的实验表明,该方法达到了最先进的性能,消融研究证实了每个组件的有效性。

英文摘要

Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far more expressive than their acoustic and visual counterparts, the text modality tends to dominate optimization, suppressing weaker modalities and inducing gradient norm conflicts that destabilize training. To address this, we propose a Conflict-aware Penalty (CP) that detects and penalizes gradient norm conflicts at each training step, and a Statistical Loss (SL) that aligns predicted distribution statistics with empirical input statistics. Crucially, CP prevents dominant modality gradients from interfering with the SL objective, enabling synergistic training within a unified framework incorporating adaptive modality encoding, gated cross-modal fusion, and unimodal auxiliary heads. Experiments on CMU-MOSI demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of each component.

2605.28573 2026-05-28 cs.LG cs.AI 版本更新

Efficient Pre-Training of LLMs through Truncated SVD Layers

通过截断SVD层实现LLM的高效预训练

Kaivan Kamali, Kajetan Schweighofer, Hormoz Shahrzad, Olivier Francon, Babak Hodjat, Risto Miikkulainen

发表机构 * Cognizant AI Lab(认知AI实验室) UT Austin(得克萨斯大学奥斯汀分校)

AI总结 提出TSVD框架,利用谱能量启发式自适应秩选择和缓存机制保持低秩与严格正交性,在减少计算开销的同时匹配或超越全参数基线的性能。

详情
AI中文摘要

大规模语言模型(LLM)的规模扩展使得预训练成本日益高昂。虽然低秩表示和正交权重矩阵原则上可以减少参数数量和计算开销,但现有方法大多依赖静态秩选择,且由于高计算成本而不强制权重正交性。本文引入TSVD框架,在整个训练过程中保持低秩和严格正交性。它利用基于谱能量的启发式方法进行自适应秩选择,并采用缓存机制来维持正交性。理论分析证明了该方法在预训练动态中的优势,跨多种模型规模的实验表明其在经验上有效。TSVD在显著降低计算需求的同时,匹配或超越了全参数基线的性能。因此,该方法为高效高性能LLM预训练提供了一条有充分依据、实用且可扩展的路径。

英文摘要

The massive scaling of Large Language Models (LLMs) has made pretraining increasingly cost-prohibitive. While low-rank representation and orthonormal weight matrices could in principle reduce parameter counts and computational overhead, most existing methods rely on static rank selection and do not enforce weight orthonormality due to high computational cost. This paper introduces TSVD, a framework that maintains low rank and strict orthonormality throughout the training process. It utilizes a spectral energy-based heuristic for adaptive rank selection, and a caching mechanism to maintain orthonormality. Theoretical analysis justifies the advantage of the approach in pretraining dynamics, and experiments across various model scales demonstrate that it is effective empirically. TSVD matches or exceeds the performance of full-parameter baselines while significantly reducing compute requirements. The approach thus offers a well-founded, practical, and scalable path toward efficient high-performance LLM pretraining.
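
One common way to realize a spectral-energy rank heuristic is to keep the smallest number of singular values whose squared sum covers a target fraction of the total. The sketch below illustrates that generic rule only; the abstract does not specify the exact criterion used in TSVD, and the name `select_rank` and the default `energy_threshold` are assumptions.

```python
from typing import Sequence


def select_rank(singular_values: Sequence[float], energy_threshold: float = 0.9) -> int:
    """Smallest rank k whose top-k squared singular values reach the energy threshold."""
    energies = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energies)
    if total == 0:
        return 0  # an all-zero matrix needs no components
    cumulative = 0.0
    for k, e in enumerate(energies, start=1):
        cumulative += e
        if cumulative >= energy_threshold * total:
            return k
    return len(energies)


# A sharply decaying spectrum needs few components; a flat one needs all of them.
rank_decaying = select_rank([10.0, 1.0, 0.5, 0.1])
rank_flat = select_rank([3.0, 3.0, 3.0, 3.0])
```

Applied per layer during training, a rule of this kind lets the rank adapt to each weight matrix rather than being fixed in advance.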

2605.28567 2026-05-28 cs.LG cs.AI 版本更新

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

稀疏自编码器特征匹配与电路压缩的语义最优传输

Tue M. Cao, Nguyen Do, My T. Thai

发表机构 * University of Florida(佛罗里达大学)

AI总结 提出基于最优传输的分布框架,通过激活加权分布和Wasserstein距离统一解决跨层特征匹配与电路压缩问题。

Comments preprint

详情
AI中文摘要

稀疏自编码器(SAE)已成为解释语言模型的核心工具。然而,两个关键的SAE分析仍然难以规模化:(1)跨层匹配语义相似的特征,(2)将大型特征电路压缩为可解释的超节点。尽管这些问题被视为独立问题,但我们表明它们都是更基础挑战的实例,我们将其框架化为估计位于不同激活流形上的SAE特征之间的语义距离。我们为此问题引入了一个分布框架,其中每个特征不是像文献中那样由单个解码器向量表示,而是由表达它的隐藏状态上的激活加权分布表示。通过将这些分布投影到共享参考空间并使用Wasserstein距离进行比较,我们的方法为跨层特征比较提供了统一的语义度量。我们证明了我们的表示对激活缩放具有不变性,在扰动下稳定,并在有限样本边际条件下恢复真实匹配。实验上,我们的方法优于解码器向量和基于LLM的基线,并捕捉相关特征之间的细微功能差异。值得注意的是,我们的方法自动将大型特征电路压缩为可解释的超节点。

英文摘要

Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses that remain difficult to scale are (1) matching semantically similar features across multiple layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as the estimation of semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as in the literature, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, our method compresses large feature circuits into interpretable supernodes automatically.
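
The idea of comparing activation-weighted distributions can be illustrated in one dimension. The sketch assumes each feature has already been projected to scalar positions in a shared reference space (a strong simplification of the paper's setting) and uses the activations as unnormalized weights; `wasserstein_1d` is a hypothetical helper, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def wasserstein_1d(
    xs: Sequence[float], wx: Sequence[float],
    ys: Sequence[float], wy: Sequence[float],
) -> float:
    """Wasserstein-1 distance between two weighted point sets on the real line."""
    sum_x, sum_y = sum(wx), sum(wy)
    # Normalizing the weights is what makes the result invariant to rescaling
    # all activations of a feature by a constant factor.
    points: List[Tuple[float, float, float]] = sorted(
        [(x, w / sum_x, 0.0) for x, w in zip(xs, wx)]
        + [(y, 0.0, w / sum_y) for y, w in zip(ys, wy)]
    )
    distance = 0.0
    cdf_x = cdf_y = 0.0
    previous = points[0][0]
    for position, mass_x, mass_y in points:
        # Integrate |CDF_x - CDF_y| over the gap since the previous point.
        distance += abs(cdf_x - cdf_y) * (position - previous)
        cdf_x += mass_x
        cdf_y += mass_y
        previous = position
    return distance


d_original = wasserstein_1d([0.0, 1.0], [2.0, 2.0], [1.0, 2.0], [1.0, 1.0])
d_rescaled = wasserstein_1d([0.0, 1.0], [20.0, 20.0], [1.0, 2.0], [1.0, 1.0])
```

Multiplying one feature's activations by ten leaves the distance unchanged, which mirrors the rescaling invariance claimed in the abstract.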

2605.28566 2026-05-28 cs.AI cs.LG 版本更新

Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns

思维树作为经典启发式搜索问题:形式化基础与设计模式

Guni Sharon

发表机构 * Guni Sharon

AI总结 本文通过经典启发式搜索术语统一分类法,将基于LLM的推理映射到搜索组件,并识别出系统搜索和前瞻性策略两种设计模式。

Comments Extended version of the SoCS 2026 paper. Includes appendices omitted from the proceedings version

详情
Journal ref
Proceedings of the Nineteenth International Symposium on Combinatorial Search (SoCS 2026), AAAI Press, 2026
AI中文摘要

大型语言模型(LLM)展示了卓越的推理能力,但其标准生成过程——自回归令牌预测——本质上是短视的,容易产生级联错误。为了解决这个问题,思维树(ToT)框架在中间推理步骤上创建了一个搜索空间,允许搜索模型进行探索、前瞻和回溯。然而,当前的ToT研究在自然语言处理和自动规划社区之间仍然分散,常常使用不一致的术语和临时实现。因此,我们通过基于经典启发式搜索术语的统一分类法综合了ToT领域。我们将基于LLM的推理映射到经典搜索组件:状态表示(思维粒度)、后继生成(提示操作符)和启发式评估(进展自我评估)。我们在分类法的背景下分析现有工作,并识别出新兴的设计模式:针对浅层确定性任务的系统搜索(最佳优先搜索)和针对深层多步推理的前瞻性策略(DFS、MCTS)。最后,我们指出了启发式搜索与LLM推理交叉领域中的开放算法挑战,并呼吁启发式搜索社区参与这一新兴领域。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, yet their standard generation process -- auto-regressive token prediction -- is inherently myopic and prone to cascading errors. To address this, the Tree-of-Thoughts (ToT) framework creates a search space over intermediate reasoning steps, allowing search models to explore, look ahead, and backtrack. However, current ToT research remains fragmented across Natural Language Processing and Automated Planning communities, often using inconsistent terminology and ad-hoc implementations. Consequently, we synthesize the ToT landscape through a unified taxonomy based on classical heuristic search terminology. We map LLM-based reasoning to classical search components: state representation (granularity of thoughts), successor generation (prompting operators), and heuristic evaluation (self-assessment of progress). We analyze existing work within the context of our taxonomy and identify emerging design patterns: systematic search (Best-First Search) for shallow, deterministic tasks and lookahead-heavy strategies (DFS, MCTS) for deep multi-step reasoning. We conclude by identifying open algorithmic challenges at the intersection of heuristic search and LLM reasoning, and call on the heuristic search community to engage with this emerging domain.
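
The mapping from ToT to classical search can be sketched as a generic best-first search in which the successor function and the heuristic are plug-in callables. In an LLM setting these would be a prompting operator and a self-assessment score; here they are replaced by deterministic toy functions, so everything below is an illustrative assumption rather than code from the paper.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional


def best_first_search(
    start: Hashable,
    successors: Callable[[Hashable], Iterable[Hashable]],  # stand-in for a prompting operator
    heuristic: Callable[[Hashable], float],                # stand-in for self-assessment
    is_goal: Callable[[Hashable], bool],
    max_expansions: int = 1000,
) -> Optional[List[Hashable]]:
    """Expand the most promising state first; return the path to a goal or None."""
    counter = 0  # unique tie-breaker so the heap never compares states or paths
    frontier = [(heuristic(start), counter, start, [start])]
    seen = {start}
    while frontier and max_expansions > 0:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        max_expansions -= 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(frontier, (heuristic(nxt), counter, nxt, path + [nxt]))
    return None


# Toy problem: reach 10 from 1 using the "thought steps" +1 and *2.
path = best_first_search(
    start=1,
    successors=lambda n: (n + 1, n * 2),
    heuristic=lambda n: abs(10 - n),
    is_goal=lambda n: n == 10,
)
```

Swapping the priority queue for a stack gives depth-first behaviour with backtracking, which is one way to see the design patterns in the taxonomy as variations on the same loop.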

2605.28565 2026-05-28 cs.DL cs.AI cs.CL cs.IR 版本更新

Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

验证性误导:衡量搜索增强型大语言模型中的结构性引用失败

Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang, Dongha Lee

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系) Department of Computer Science and Engineering, Konkuk University(建国大学计算机科学与工程系) Incheon International Airport Corporation(仁川国际机场公司) Department of Computer Science and Engineering, Ewha Womans University(梨花女子大学计算机科学与工程系)

AI总结 针对搜索增强型大语言模型中的引用可信度问题,提出CITETRACE数据集和三维评估框架,发现系统性“验证性误导”模式:模型引用真实可访问来源但存在意图对齐、来源适宜性或答案-来源忠实度缺陷,导致用户面临结构性误导。

Comments Working Progress

详情
AI中文摘要

搜索增强型大语言模型的用户依赖引用作为回答基于真实来源的证据,但很少自行验证引用的页面。每天数百万次查询通过这些系统,使得引用质量成为用户是被告知还是被误导的无声决定因素——然而现有基准各自孤立地处理一个方面,导致决定引用可信度的联合结构未被衡量。我们构建了CITETRACE,一个大规模数据集,追踪从用户查询到检索来源再到生成答案的完整引用链:来自28个社区的11,200个真实世界查询,与来自五个提供商的十个模型的112,000个回答配对,产生761,495个可评估的引用对。我们设计了一个三维评估框架,使用专家验证的预定义矩阵和五级忠实度标准,对每个引用在意图-目的对齐、来源适宜性和答案-来源忠实度上进行评分;该框架适用于任何产生带引用回答的系统。大规模应用该框架,我们识别出一种系统性的模式,称为验证性误导(VM):模型引用真实、可访问的来源,但在一个或多个维度上失败,产生忠实度-适宜性权衡,其中忠实模型选择不合适的来源,反之亦然。在我们的池中,30.6%的引用扭曲了其来源,27.1%的引用源自领域不合适的来源;在回答层面,高达96%的用户至少遇到一个结构性误导的引用。提供商层面的差异解释了88-96%的引用质量方差,表明来源选择更多受超出单个模型能力的因素控制,而非LLM本身。总之,CITETRACE及其评估框架为诊断部署的搜索增强系统中的结构性引用失败提供了首个资源。

英文摘要

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.

2605.28563 2026-05-28 cs.LG cs.AI 版本更新

A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models

评估脑电图基础模型泛化能力的多维框架

Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Tiantian Feng, Shrikanth Narayanan

发表机构 * Signal Analysis and Interpretation Laboratory(信号分析与解释实验室)

AI总结 提出一个多维评估框架,在低资源条件下系统评估EEG基础模型(如LaBraM、CSBrain、CBraMod)的泛化能力,发现其在长上下文任务中表现优异,但在短窗口BCI任务中与监督模型相当,且对通道限制鲁棒性不足。

Comments 24 pages, 5 Figures

详情
AI中文摘要

在适当的适应设置下评估基础模型对于理解所学表示的质量和可迁移性至关重要。最近的脑电图基础模型在跨任务和数据集上展示了有前景的迁移能力,推动了它们在神经技术和临床应用中日益增长的使用。然而,这些模型通常是在精心整理的下游数据集上进行全微调评估,这种设置并未反映生物医学领域的约束,如有限的标记数据、减少的传感器覆盖或参数高效的适应。在这项工作中,我们提出了一个多维评估框架,用于在现实低资源条件下评估脑电图模型。在提出的多维评估框架下,对包括LaBraM、CSBrain和CBraMod在内的监督脑电图模型和最近的脑电图基础模型在6个不同数据集上进行了实证分析。我们发现,脑电图基础模型在长上下文任务(如睡眠阶段预测和心理健康状态分类)上持续提供性能提升。相比之下,对于短窗口的脑机接口风格任务,监督模型尽管参数少得多,却取得了相当的性能。额外的分析表明,当前的基础模型对短窗口任务和通道受限设置提供的鲁棒性有限。总之,这些发现激励使用多维评估协议,以表征模型在现实使用约束下的行为。

英文摘要

Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the learned representations. Recent EEG foundation models have demonstrated promising transfer capabilities across tasks and datasets, motivating their growing use in neurotechnology and clinical applications. However, these models are typically evaluated under full fine-tuning on well-curated downstream datasets, a setting that does not reflect biomedical domain constraints such as limited labeled data, reduced sensor coverage, or parameter-efficient adaptation. In this work, we propose a multi-dimensional evaluation framework for assessing EEG models under realistic low-resource conditions. Empirical analysis of both supervised EEG models and recent EEG foundation models, including LaBraM, CSBrain, and CBraMod, across 6 different datasets is performed under the proposed multi-dimensional evaluation framework. We find that EEG foundation models consistently provide performance gains on long-context tasks such as sleep stage prediction and mental health state classification. In contrast, for short-window Brain-Computer Interface-style tasks, supervised models achieve comparable performance despite having substantially fewer parameters. Additional analyses demonstrate that current foundation models provide limited robustness to short-window tasks and channel-constrained settings. Together, these findings motivate the use of multi-dimensional evaluation protocols that characterize model behavior under realistic use constraints.

2605.28557 2026-05-28 cs.LO cs.AI 版本更新

Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration

基于LLM的Oracle到PostgreSQL迁移的Token优化策略

Oleg Grynets, Dmytro Babarytskyi, Vasyl Lyashkevych

发表机构 * EPAM Systems(EPAM系统) Kharkiv, Ukraine(乌克兰哈尔科夫) Lviv, Ukraine(乌克兰利沃夫) McLean, Virginia, USA(美国弗吉尼亚州麦克莱恩)

AI总结 本文形式化并评估了十二种Token优化策略,在Oracle到PostgreSQL迁移中平衡成本、语法有效性、语义保持和结构保真度。

Comments 11 pages, 3 figures, 5 tables, 38 references

详情
AI中文摘要

LLM越来越多地用于软件现代化、代码翻译和数据库迁移。然而,基于LLM的Oracle2PostgreSQL迁移仍然受到高Token消耗、长上下文退化、方言特定的语义差异以及查询转换过程中语义漂移风险的限制。将大型Oracle SQL/PL-SQL工件、模式定义、过程逻辑和迁移指令直接包含到模型上下文中会增加成本并可能降低生成质量。本文将Token优化视为基于LLM的Oracle2PostgreSQL迁移中的一个约束转换问题。研究形式化并评估了十二种Token优化策略:基线表示、上下文剪枝、最小化、基于DSL的语义压缩、元数据增强、上下文重构、模式蒸馏、自适应路由、基于AST的最小化、标识符掩码、输出约束强制和混合优化。这些策略在10和100个Oracle SQL查询样本上使用有效语法率、精确匹配、语义匹配、CodeBLEU和Token效率进行评估。结果表明,轻度上下文剪枝几乎保持了基线水平的语义质量,在100个查询样本上实现了89.75%的语义匹配,而未优化基线为89.80%。自适应路由提供了最佳的实际权衡,输入Token减少8.72%,输出Token减少5.49%,同时保持88.40%的语义匹配,并将Token效率提高6.67%。激进的模式蒸馏将Token效率提高了132.22%,但导致语义匹配下降44.50个百分点。研究结果表明,Token优化不能简单地视为提示缩短;它必须作为一个多目标迁移问题来评估,平衡成本、语法有效性、语义保持和结构保真度。

英文摘要

LLMs are increasingly used for software modernization, code translation, and database migration. However, LLM-based Oracle2PostgreSQL migration remains constrained by high token consumption, long-context degradation, dialect-specific semantic differences, and the risk of semantic drift during query transformation. Direct inclusion of large Oracle SQL/PL-SQL artefacts, schema definitions, procedural logic, and migration instructions into the model context increases cost and may reduce generation quality. This paper frames token optimization as a constrained transformation problem in LLM-based Oracle2PostgreSQL migration. The study formalizes and evaluates twelve token optimization strategies: baseline representation, context pruning, minification, DSL-based semantic compression, metadata augmentation, context refactoring, schema distillation, adaptive routing, AST-based minification, identifier masking, output constraint enforcement, and hybrid optimization. The strategies are evaluated on samples of 10 and 100 Oracle SQL queries using Valid Syntax Rate, Exact Match, Semantic Match, CodeBLEU, and Token Efficiency. The results show that mild context pruning preserves semantic quality almost at the baseline level, achieving 89.75% Semantic Match on the 100-query sample compared with 89.80% for the unoptimized baseline. Adaptive routing provides the best practical trade-off, reducing input tokens by 8.72% and output tokens by 5.49% while maintaining 88.40% Semantic Match and increasing Token Efficiency by 6.67%. Aggressive schema distillation increases Token Efficiency by 132.22% but results in a 44.50-percentage-point decrease in Semantic Match. The findings demonstrate that token optimization cannot be treated as simple prompt shortening; it must be evaluated as a multi-objective migration problem balancing cost, syntactic validity, semantic preservation, and structural fidelity.

2605.28553 2026-05-28 cs.AI cs.CR 版本更新

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

解码前拒绝:检测和利用中间LLM激活中的拒绝信号

Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko, Mauro Conti, Matteo Zavatteri, Roberto Confalonieri

发表机构 * University of Padua(帕多瓦大学) Örebro University(欧雷布罗大学) Fondazione Bruno Kessler(布鲁诺·凯索基金会)

AI总结 本文通过线性探针在Transformer块的残差流激活中检测拒绝行为,并提出Mechanistic AutoDAN方法,利用探针引导的遗传搜索实现高效攻击,显著降低搜索时间并保持攻击成功率。

详情
AI中文摘要

在本文中,我们研究了是否可以在解码前,利用在每个Transformer块的残差流激活上训练的线性探针,从LLM中间激活中预测拒绝行为。我们发现拒绝在远早于最后一层时即可线性解码,表明安全相关行为在输出生成前就已编码在中间激活中。为了测试该信号是否可操作,我们引入了Mechanistic AutoDAN,这是AutoDAN的一种探针引导变体,它在遗传提示搜索循环中用部分前向传递和基于探针的评分取代了全模型适应度评估。在评估的模型中,我们的方法实现了与原始AutoDAN相当的攻击成功率,同时将每次迭代的搜索时间减少了高达72%,并且在多种配置下,探针引导的提示在跨模型迁移方面达到或超过了AutoDAN。我们进一步发现,探针引导的有效性随模型规模增大而增加。我们的结果表明,拒绝不仅在输出层面可观察,而且作为结构化且可操作的信号编码在LLM中间激活中。

英文摘要

In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.

2605.28552 2026-05-28 cs.AI 版本更新

Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

使用Smooth-Mamba深度强化学习建模安全关键交互中车辆类型特定的行人碰撞规避行为

Qingwen Pu, Kun Xie, Hong Yang, Di Yang, Junqing Wang

发表机构 * Transportation Informatics Lab, Department of Civil and Environmental Engineering, Old Dominion University(交通信息实验室,土木与环境工程系,欧道明大学) Department of Electrical and Computer Engineering, Old Dominion University(电气与计算机工程系,欧道明大学) Department of Transportation and Urban Infrastructure Studies, SMARTER Center, Morgan State University(交通与城市基础设施研究系,SMARTER 中心,摩根州立大学)

AI总结 本研究利用Smooth-Mamba深度确定性策略梯度框架(SMamba-DDPG)从Argoverse 2数据集中提取安全关键交互,建模行人与自动驾驶车辆(AV)和人类驾驶车辆(HDV)的碰撞规避行为,发现行人对AV反应更快、穿越速度更低,且AV场景冲突率更低。

Comments 37 pages, 15 figures, 9 tables

详情
AI中文摘要

随着自动驾驶车辆(AV)越来越多地与人类驾驶车辆(HDV)共享道路,理解行人在安全关键交互中如何应对不同车辆类型对于自动驾驶技术的安全部署至关重要。本研究从Argoverse 2数据集中提取安全关键的行人-车辆交互,以捕捉涉及AV和HDV的真实碰撞规避行为。为了建模车辆类型特定的行人碰撞规避行为,我们开发了Smooth-Mamba深度确定性策略梯度框架(称为SMamba-DDPG),该框架将平滑动作约束与高效的时序表示学习相结合。为了量化行人行为差异,该框架分别为行人与AV和HDV的交互训练了碰撞规避策略。结果表明,SMamba-DDPG在复现行人碰撞规避行为方面优于基线强化学习和监督学习模型。重构轨迹表现出强烈的行为真实性,准确复现了AV和HDV场景中的碰撞规避运动学。反应时间分析表明,该模型捕捉到了类人的响应延迟,并揭示行人对AV的反应比HDV更快。反事实分析进一步表明,行人在与AV交互时采用更低的穿越速度。对模型生成数据的大规模安全分析显示,与行人-HDV交互相比,行人-AV交互始终产生更低的冲突率和更高的行人让行率。这些发现强调了在混合交通环境中,将车辆类型特定的行人行为模型纳入更安全的自动驾驶系统设计和更真实的交通模拟中的重要性。

英文摘要

As automated vehicles (AVs) increasingly share roadways with human-driven vehicles (HDVs), understanding how pedestrians respond to different vehicle types in safety-critical interactions is essential for the safe deployment of automated driving technologies. This study extracts safety-critical pedestrian-vehicle interactions from the Argoverse 2 dataset to capture real-world crash avoidance behaviors in encounters involving AVs and HDVs. To model vehicle-type-specific pedestrian crash avoidance behavior, we develop a Smooth-Mamba Deep Deterministic Policy Gradient framework, termed SMamba-DDPG, which integrates smooth action constraints with efficient temporal representation learning. To quantify pedestrian behavioral differences, the framework trains separate crash avoidance policies for pedestrian interactions with AVs and HDVs. Results show that SMamba-DDPG outperforms baseline reinforcement learning and supervised learning models in reproducing pedestrian crash avoidance behaviors. Reconstructed trajectories demonstrate strong behavioral realism, accurately reproducing crash avoidance kinematics in both AV and HDV scenarios. Reaction time analysis shows that the model captures human-like response delays and reveals that pedestrians respond more quickly to AVs than to HDVs. Counterfactual analysis further indicates that pedestrians adopt lower crossing speeds when interacting with AVs. Large-scale safety analysis of model-generated data revealed that pedestrian-AV interactions consistently yielded lower conflict rates and higher pedestrian yielding rates compared to pedestrian-HDV interactions. The findings highlight the importance of incorporating vehicle-type-specific pedestrian behavioral models for safer automated driving system design and more realistic traffic simulations in mixed-traffic environments.

2605.28543 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Cultural Binding Heads in Language Models

语言模型中的文化绑定头

Avrile Floro, Luca Benedetto

发表机构 * Mistral-7B Mistral-Nemo-12B Llama-3.1-8B Gemma-2-9B

AI总结 通过机制可解释性和析因设计,识别出8个语言模型中2-3个中间层注意力头对文化绑定有因果贡献,且绑定主要在预训练阶段形成,知识探测表明模型知道的知识远多于其行为表现。

详情
AI中文摘要

大型语言模型通常默认对不同文化群体一视同仁,即使上下文需要区分:这缺乏差异意识。利用机制可解释性和Wang等人(2025)的N4文化挪用基准上的析因设计,我们在八个模型(四种架构,基础版和指令版)中识别出每个模型有2-3个中间层注意力头对文化绑定有因果贡献。文化绑定是将文化项目与适当身份关联的过程。敲除这些头上的身份到项目边会使绑定强度降低9-23%。识别出的头从指令模型转移到基础模型,表明文化绑定是在预训练阶段创建的。α缩放显示分级剂量反应,生成时适度放大引导(α=2-3)可将文化区分准确性提高1-3个百分点,同时基本保持中性推理不变。知识探测任务表明,模型知道的知识比其行为表现多3-5倍,表明瓶颈在于路由而非知识。

英文摘要

LLMs often default to equal treatment across cultural groups, even when context warrants differentiation: this is a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. $\alpha$-scaling shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2$ to $3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing and not knowledge.

2605.28532 2026-05-28 cs.AI 版本更新

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

智能体知道它们不能做什么吗?评估使用工具的智能体的可行性意识

Liang Cheng, Mingsheng Cai, Jiuming Jiang, Luo Mai

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 提出FeasiGen自动构建不可行任务管道,通过屏蔽关键工具将可解任务转为不可解,评估发现多数模型缺乏可行性检测能力,错误继续率高达73.9%。

Comments 14 pages

详情
AI中文摘要

使用工具的智能体通常因长推理链和迭代工具使用而产生大量计算成本。在实际场景中,许多任务在受限的工具环境下变得不可行,因为成功完成任务所需的能力不可用。检测不可行任务并提前停止执行可以显著减少不必要的执行成本。在这项工作中,我们提出了FeasiGen,一个自动构建不可行智能体任务的管道,通过识别成功完成任务所需的关键工具。我们的方法从多个智能体系统的成功执行中提取工具调用轨迹,识别不同执行策略中一致共享的关键工具,并屏蔽这些工具,从而自动将可解任务转化为不可解任务。人工验证确认,我们构建的任务的不可行性标注准确率超过94%。我们进一步引入了可行性感知评估指标,用于衡量智能体是否能识别不可行任务并适当停止执行。在九个模型上的广泛评估揭示了显著弱的不可行性检测能力,错误继续率高达73.9%。我们进一步观察到,多智能体架构在不可行条件下显著减少了错误执行。

英文摘要

Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios, many tasks become infeasible under constrained tool environments, where the capabilities required for successful task completion are unavailable. Detecting infeasible tasks and stopping execution early can significantly reduce unnecessary execution cost. In this work, we propose FeasiGen, an automatic pipeline for constructing infeasible agent tasks by identifying the critical tools required for successful task completion. Our approach extracts tool-calling traces from successful executions across multiple agent systems, identifies critical tools consistently shared across diverse execution strategies, and masks these tools to automatically transform solvable tasks into infeasible ones. Human verification confirms that the infeasibility annotations for our constructed tasks achieve over 94% accuracy. We further introduce feasibility-aware evaluation metrics for measuring whether agents can recognize infeasible tasks and stop execution appropriately. Extensive evaluations across nine models reveal substantially weak infeasibility detection ability, with false continue rate reaching up to 73.9%. We further observe that multi-agent architectures significantly reduce erroneous execution under infeasible conditions.
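The abstract does not spell out how the false continue rate is computed. A minimal sketch, assuming it is the fraction of infeasible tasks on which the agent kept executing instead of stopping; the `feasible` and `continued` field names are hypothetical and not taken from FeasiGen:

```python
def false_continue_rate(episodes):
    """Share of infeasible tasks on which the agent did not stop.

    Each episode is a dict with two hypothetical boolean fields:
    `feasible` (ground-truth label) and `continued` (agent kept executing).
    """
    infeasible = [e for e in episodes if not e["feasible"]]
    if not infeasible:
        return 0.0
    return sum(1 for e in infeasible if e["continued"]) / len(infeasible)


episodes = [
    {"feasible": False, "continued": True},
    {"feasible": False, "continued": False},
    {"feasible": False, "continued": True},
    {"feasible": True, "continued": True},   # feasible tasks are ignored
]
rate = false_continue_rate(episodes)
```

Only infeasible episodes enter the denominator, so a high rate reflects wasted execution on unsolvable tasks rather than ordinary task failures.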

2605.28526 2026-05-28 cs.AI cs.CL 版本更新

Entropy-aware Masking for Masked Language Modeling

面向掩码语言建模的熵感知掩码策略

Gokul Srinivasagan, Kai Hartung, Munir Georges

发表机构 * AImotion Bavaria(AImotion巴伐利亚) Technische Hochschule Ingolstadt(英戈尔施塔特技术大学)

AI总结 提出基于熵分布的掩码策略,通过模型预测熵识别信息量高的token进行掩码,并引入自掩码方法提升训练效率,在GLUE上平均提升5%。

Comments Accepted at the *SEM 2026 conference

详情
AI中文摘要

掩码语言建模已成为训练基于编码器的语言模型的标准预训练目标。在该方法中,输入中的某些token被掩码,模型学习利用周围上下文预测它们。这一过程使模型能够捕捉语言的句法和语义属性。传统上,用于掩码的token是随机选择的,这可能并不总是产生最有效的学习信号。在这项工作中,我们研究了一种基于熵分布的token掩码策略。我们利用模型在token预测上的熵来确定哪些token应被掩码。该方法旨在针对信息量更大、不确定性更高的token,以提高训练效率。我们还提出了一种新颖的自掩码方法,无需依赖外部参考模型即可增强训练效率。实验结果表明,与基线相比,我们的方法在GLUE分数上平均提升了5%。此外,我们尝试将知识蒸馏与熵掩码相结合,取得了最佳的整体结果。

英文摘要

Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model's entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
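The selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not say how entropy is turned into a masking decision, so picking the top-entropy positions, the `mask_ratio` default, and the function names are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_mask(token_probs, mask_ratio=0.15):
    """Return the sorted positions of the highest-entropy tokens.

    token_probs: one probability distribution per input position
    mask_ratio:  fraction of positions to mask (hypothetical default)
    """
    n_mask = max(1, round(mask_ratio * len(token_probs)))
    scores = [entropy(p) for p in token_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_mask])


# Toy predictive distributions over a 4-word vocabulary.
toy_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],  # fairly uncertain
]
masked_positions = entropy_mask(toy_probs, mask_ratio=0.5)
```

In a self-masking setup the distributions would presumably come from the model being trained rather than from an external reference model; that wiring is left out here.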

2605.28524 2026-05-28 cs.AI 版本更新

Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection

让关系说话:面向欺诈检测的端到端LLM-GNN软提示框架

Zhixing Zuo, Huilin He, Jiasheng Wu, Dawei Cheng

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院)

AI总结 提出LGSPF框架,通过软提示桥接图结构与语义空间,并引入并行GNN编码器将多关系拓扑转化为图令牌,实现端到端优化,在欺诈检测中达到最优性能。

Comments 14 pages, 3 figures

详情
AI中文摘要

近年来,大型语言模型(LLM)在处理欺诈检测等图任务方面展现出强大能力。然而,现有方法大多严重依赖丰富的文本属性,由于该领域缺乏文本数据,这带来了困难。尽管一些开创性方法试图克服这一问题,但它们通过硬提示将图结构文本化容易导致特征失真。此外,欺诈检测通常表现出多关系复杂性,当前方法难以捕捉这种深层语义信息。为应对这些挑战,我们提出了LLM-GNN软提示框架(LGSPF)。具体而言,LGSPF使用软提示桥接图结构和语义空间,以消除对文本的依赖。我们进一步引入并行图神经网络(GNN)编码器,将多关系拓扑转化为图令牌,用于细粒度的LLM欺诈理解。通过端到端优化,LGSPF增强了LLM和GNN之间的深层语义对齐。在多个欺诈检测基准上的实验表明,我们的方法达到了最先进的性能。此外,我们进一步验证了LGSPF在增强欺诈行为语义可解释性方面的贡献。

英文摘要

In recent years, Large Language Models (LLMs) have shown great capability in processing graph tasks such as fraud detection. However, most existing methods rely heavily on rich text attributes, which poses difficulties for this domain due to the lack of textual data. Although some pioneering methods attempt to overcome this, their textualization of graph structures via hard prompts easily leads to feature distortion. Additionally, fraud detection often exhibits multi-relational complexity, and current methods struggle to capture this deep semantic information. To address these challenges, we propose the LLM-GNN Soft Prompt Framework (LGSPF). Specifically, LGSPF bridges the graph structure and semantic space using soft prompts to eliminate reliance on text. We further introduce a parallel Graph Neural Network (GNN) encoder to translate multi-relational topologies into graph tokens for fine-grained LLM fraud comprehension. Through end-to-end optimization, LGSPF enhances deep semantic alignment between LLM and GNN. Experiments across diverse fraud detection benchmarks demonstrate that our method achieves state-of-the-art performance. We further validate the contribution of LGSPF to enhancing the semantic interpretability of fraud behaviors.

2605.28520 2026-05-28 cs.AI 版本更新

GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting

GS-FUSE: 格兰杰监督的门控融合与多粒度对齐用于事件驱动的金融预测

Yang Zhang, En Chun, Ziyun Mao, Yulu Wu, Jun Wang

发表机构 * Southwestern University of Finance and Economics(西南财经大学)

AI总结 提出GS-Fuse框架,通过格兰杰因果监督的门控融合模块和多粒度对齐机制,选择性利用事件文本与价格信号,提升金融事件对市场影响的预测精度。

详情
AI中文摘要

准确预测重大金融事件对市场的影响对投资者和政策制定者至关重要。然而,现有的多模态时间序列模型通常对称地融合文本和价格,没有明确的方式来决定事件文本何时真正具有预测性,因此难以利用事件到价格的方向性结构以及文本和价格信号的异质性角色。在这项工作中,我们提出了GS-Fuse,一个基于多模态事件的预测框架,它采用:(i) 格兰杰监督的、因果感知的门控融合模块,该模块仅在事件文本提供超越历史价格的增量预测价值时学习向事件文本开放;(ii) 多粒度对齐机制,该机制将高级事件表示和细粒度文本线索与未来市场轨迹联合对齐。作为构建在现成的大语言模型和时间序列基础模型之上的灵活、即插即用适配器,GS-Fuse可以在不同的骨干网络和市场设置中实例化。在真实世界金融数据集上的大量实验表明,GS-Fuse在多种资产和预测时间范围内始终优于最先进的时间序列和多模态基线。

英文摘要

Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing multimodal time-series models typically fuse text and prices symmetrically, without an explicit way to decide when event text is truly predictive, and thus struggle to exploit the directional event-to-price structure and the heterogeneous roles of textual and price signals. In this work, we propose GS-Fuse, a multimodal event-based forecasting framework that employs (i) a Granger-supervised, causal-aware gated fusion module, which learns to open toward event text only when it provides incremental predictive value beyond historical prices, and (ii) a multi-granularity alignment mechanism that jointly aligns high-level event representations and fine-grained textual cues with future market trajectories. Built as a flexible, plug-and-play adapter on top of off-the-shelf large language models and time-series foundation models, GS-Fuse can be instantiated across diverse backbones and market settings. Extensive experiments on real-world financial datasets show that GS-Fuse consistently outperforms state-of-the-art time-series and multimodal baselines across multiple assets and forecasting horizons.

2605.28517 2026-05-28 cs.LG cs.AI 版本更新

Stochastic Gradient Descent with Momentum is Algorithmically Stable

带动量的随机梯度下降具有算法稳定性

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

发表机构 * Department of Mathematics, The University of Hong Kong(香港大学数学系) Department of Mathematics and Mathematical Statistics, Umeå University(乌梅大学数学与统计学系)

AI总结 本文通过算法稳定性分析,证明了带动量的随机梯度下降(SGDM)在光滑凸问题上具有泛化保证,并建立了最优的超额总体风险界。

详情
AI中文摘要

带动量的随机梯度下降(SGDM)是机器学习中最广泛使用的优化算法之一。尽管文献中已经广泛研究了SGDM的优化性质,但关于SGDM是否以及何时能够很好地泛化到未见数据,仍然不够清楚。特别是,有人推测虽然动量加速了训练,但可能会降低泛化性能。在本文中,我们通过算法稳定性的视角,对SGDM进行了全面的泛化分析,填补了这一空白。更具体地说,我们引入了一个广义的SGDM框架,该框架涵盖了Polyak和Nesterov的动量方案,并为光滑凸问题建立了紧的平均模型稳定性界。值得注意的是,所获得的界利用了沿轨迹的小优化误差界,适用于区间$[0, 1)$内的任何动量参数,并且不需要通常假设的损失函数的Lipschitz连续性。我们进一步推导了广义SGDM的优化误差界,并将其与我们的泛化分析相结合,为具有Polyak和Nesterov动量的SGDM获得了最优的超额总体风险界。

英文摘要

Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insufficiently understood whether and when SGDM can generalize well to unseen data. In particular, it has been conjectured that while momentum accelerates training, it may degrade generalization. In this paper, we close this gap by developing a comprehensive generalization analysis of SGDM through the lens of algorithmic stability. More specifically, we introduce a generalized SGDM framework that encompasses both Polyak's and Nesterov's momentum schemes, and establish tight on-average model stability bounds for smooth and convex problems. Notably, the obtained bounds exploit small optimization error bounds along the trajectory, apply to any momentum parameter in the interval $[0, 1)$, and do not require the commonly assumed Lipschitzness of loss functions. We further derive optimization error bounds for the generalized SGDM, and combine them with our generalization analyses to obtain optimal excess population risk bounds for SGDM with both Polyak's and Nesterov's momentum.
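A minimal sketch of the two momentum schemes the abstract names, written as one loop that switches between Polyak (heavy-ball) and Nesterov updates. It is not the paper's generalized framework, whose exact parameterization the abstract does not give; the one-dimensional loss, the step size, and the function name `sgdm` are illustrative assumptions.

```python
import random


def sgdm(samples, lr=0.1, beta=0.9, steps=500, nesterov=False, seed=0):
    """Minimise the average of f_i(w) = 0.5 * (w - a_i)**2 with SGD plus momentum.

    beta is the momentum parameter in [0, 1); beta = 0 recovers plain SGD.
    """
    rng = random.Random(seed)
    w, v = 0.0, 0.0
    for _ in range(steps):
        a = rng.choice(samples)                       # draw one training example
        # Nesterov evaluates the gradient at a look-ahead point, Polyak at w.
        point = w - lr * beta * v if nesterov else w
        grad = point - a                              # stochastic gradient of f_i
        v = beta * v + grad                           # momentum buffer
        w = w - lr * v
    return w


data = [1.0, 2.0, 3.0, 4.0]
w_polyak = sgdm(data)
w_nesterov = sgdm(data, nesterov=True)
```

With a constant step size the iterates on noisy data hover around the minimizer instead of converging exactly, which is why the sketch does not claim a final value for `w_polyak` or `w_nesterov`.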

2605.28515 2026-05-28 cs.SE cs.AI 版本更新

Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

LLM 是否偏袒其提供商?测量代码生成中的垂直整合偏差

Melih Catal, Alex Wolf, Tiago Ferreiro Matos, Pooja Rani, Harald Gall

发表机构 * University of Zurich(苏黎世大学) University of Mannheim(曼海姆大学)

AI总结 本文提出 VIBench 基准,通过 20 个提供商可选的软件集成场景,测量前沿 LLM 在直接和代理代码生成中的垂直整合偏差,发现六成关联模型存在显著偏差,代理工作流加剧偏差至 +39.2 个百分点。

详情
AI中文摘要

大型语言模型已成为软件开发不可或缺的一部分,尤其是随着代理能力的出现。然而,许多前沿 LLM 与特定提供商有关联。这引发了一个问题:生成的代码是否偏袒提供商自身的生态系统而非可比较的替代方案,从而可能限制开发者的选择并增加对单一提供商的依赖。我们将这种行为定义为垂直整合偏差,并引入 VIBench,一个用于在 20 个提供商可选的软件集成场景中测量直接和代理代码生成中 VIB 的基准。通过评估 10 个前沿提供商关联模型与 3 个非关联对照模型,我们发现直接生成中存在正的 VIB,其中十个关联模型中有六个显示出统计显著效应,最高达 +18.8 个百分点。代理工作流进一步放大了 VIB,达到 +39.2 个百分点。此外,代理工作流中早期的关联生态系统选择可能持续存在于概念上解耦的下游文件中,持续性高达 90.3%。这些发现强调了在代码生成中测量和考虑 VIB 的必要性,尤其是在代理能力日益普及的背景下。

英文摘要

Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet, many frontier LLMs are affiliated with specific providers. This raises the question of whether generated code favors the provider's own ecosystem over comparable alternatives, potentially constraining developers' choices and increasing dependence on a single provider. We define this behavior as Vertical Integration Bias (VIB) and introduce VIBench, a benchmark for measuring VIB in direct and agentic code generation across 20 provider-selectable software-integration scenarios. Evaluating 10 frontier provider-affiliated models against 3 non-affiliated controls, we find positive VIB in direct generation, with six of ten affiliated models showing statistically significant effects up to +18.8 percentage points (pp). Agentic workflows further amplify VIB, reaching +39.2 pp. Moreover, early affiliated-ecosystem choices in agentic workflows can persist into conceptually decoupled downstream files, with persistence as high as 90.3%. These findings underscore the need to measure and account for VIB in code generation, especially as agentic capabilities become more prevalent.

2605.28513 2026-05-28 cs.LG cs.AI 版本更新

Learning Theory of the SVRG: Generalization and Convergence Analysis

SVRG的学习理论:泛化与收敛性分析

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

发表机构 * Department of Mathematics, The University of Hong Kong(香港大学数学系) Department of Mathematics and Mathematical Statistics, Umeå University(乌梅大学数学与统计学系)

AI总结 本文通过算法稳定性分析,首次为凸和强凸设置下的SVRG方法建立了非平凡的泛化界,揭示了优化与泛化之间的相互作用,并得到了最优的过量风险界。

详情
AI中文摘要

方差缩减(VR)方法采用方差递减的随机梯度,因其高效性被广泛应用于机器学习中的大规模优化问题。现有的VR方法理论研究主要集中在收敛性分析上,而泛化行为在很大程度上未被探索。本文通过算法稳定性的视角,首次为代表性VR方法——随机方差缩减梯度(SVRG)建立了非平凡的泛化分析,填补了这一空白。特别地,我们利用SVRG的算法结构,在凸和强凸两种设置下建立了尖锐的稳定性界。所得到的界是数据依赖的,因为训练误差沿轨迹被纳入。我们的分析阐明了优化与泛化之间的相互作用,从而在两种设置下都得到了最优的过量风险界。我们的方法与现有的随机算法分析有本质不同,我们将SVRG更新分解为类似SGD的步骤加上一个零均值修正项,然后引入新的Lyapunov函数来吸收由参考点引起的额外梯度项。我们的分析框架可以推广到其他VR方法,并通过著名的随机平均梯度加速(SAGA)方法展示了泛化性。

英文摘要

Variance reduction (VR) methods employ stochastic gradients with decreasing variance, and they have been widely applied to solve large-scale optimization problems in machine learning because of their efficiency. Existing theoretical studies of VR methods are mainly focused on the convergence analysis, leaving the generalization behavior largely unexplored. In this paper, we bridge this gap by developing the first non-vacuous generalization analysis of the representative VR method: Stochastic Variance Reduced Gradient (SVRG), through the lens of algorithmic stability. In particular, we establish sharp stability bounds of the SVRG in both convex and strongly convex settings by exploiting its algorithmic structure. The obtained bounds are data-dependent, because the training errors are incorporated along the trajectory. Our analysis clarifies the interplay between optimization and generalization, leading to optimal excess population risk bounds in both settings. Our approach differs substantially from existing analyses of stochastic algorithms in the sense that we decompose the SVRG update as an SGD-like step plus a zero-mean correction term and then introduce novel Lyapunov functions to absorb the additional gradient terms induced by the reference points. Our analytical framework can be generalized to other VR methods, and we demonstrate the generalization by the well-known Stochastic Average Gradient Accelerated (SAGA) method.
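The decomposition mentioned above, an SGD-like step plus a zero-mean correction term, can be illustrated with the standard SVRG update. This is a sketch on a toy one-dimensional least-squares problem; the data, step size, loop lengths, and helper names are assumptions made for illustration and say nothing about the paper's Lyapunov analysis.

```python
import random

# Toy data for f_i(w) = 0.5 * (X[i] * w - Y[i])**2 (hypothetical values).
X = [1.0, 2.0, 3.0]
Y = [2.0, 4.0, 6.5]


def grad_i(w, i):
    """Gradient of the i-th loss term at w."""
    return X[i] * (X[i] * w - Y[i])


def full_grad(w):
    """Gradient of the averaged objective at w."""
    return sum(grad_i(w, i) for i in range(len(X))) / len(X)


def correction(w_ref, i):
    """Correction built from the reference point; its mean over i is zero."""
    return full_grad(w_ref) - grad_i(w_ref, i)


def svrg(epochs=50, inner_steps=100, lr=0.02, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        w_ref = w                    # reference point (snapshot)
        mu = full_grad(w_ref)        # full gradient, computed once per epoch
        for _ in range(inner_steps):
            i = rng.randrange(len(X))
            # SGD-like step grad_i(w, i) plus the zero-mean correction term.
            w -= lr * (grad_i(w, i) + (mu - grad_i(w_ref, i)))
    return w


w_star = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)
w_hat = svrg()
```

Because the correction averages to zero over the training set, the combined direction stays an unbiased estimate of the full gradient, while its variance shrinks as `w` and `w_ref` approach the minimizer.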

2605.28500 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

功能熵:通过不确定性量化预测LLM生成代码的功能正确性

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

发表机构 * CVS Health(CVS健康)

AI总结 针对LLM生成代码功能不正确的问题,提出基于功能等价性的不确定性量化方法(功能熵),在多个编程语言和模型上优于现有方法。

详情
AI中文摘要

大型语言模型在代码生成方面表现出令人印象深刻的能力,但它们经常生成功能不正确的代码。不确定性量化(UQ)方法已成为检测自然语言生成中幻觉的有前途的方法,但它们在代码生成任务中的有效性仍未得到充分探索。我们系统地评估了UQ技术如何跨三种编程语言、五个LLM和超过1700个问题迁移到代码生成。我们发现,一些基于令牌概率的方法无需修改即可有效泛化,而依赖自然语言推理(NLI)的基于采样的方法失败,因为NLI模型无法区分功能不同的代码,导致大多数响应崩溃为单个语义簇。为了解决这个问题,我们引入了功能等价性方法,这是一类特定于代码的方法,用基于LLM的功能等价性评估取代基于NLI的语义等价性,包括功能熵,即语义熵的代码特定模拟。功能等价性方法在15个模型-基准组合中的11个中实现了最高的AUROC,并在大多数设置中实现了最佳校准,始终优于基于NLI的对应方法以及所有其他评估方法。

英文摘要

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
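A rough sketch of the idea behind functional entropy: group sampled programs into clusters of functionally equivalent code, then take the entropy of the cluster distribution. Two substitutions are made here and should be read as assumptions. The paper judges equivalence with an LLM, whereas this sketch compares outputs on a few probe inputs; and the cluster probabilities are plain sample frequencies. All names are hypothetical.

```python
import math


def behaviour(fn, probe_inputs):
    """Signature of a candidate: its outputs (or error type) on the probe inputs."""
    out = []
    for x in probe_inputs:
        try:
            out.append(repr(fn(x)))
        except Exception as exc:
            out.append(type(exc).__name__)
    return tuple(out)


def functional_entropy(candidates, probe_inputs):
    """Entropy (in nats) of the empirical distribution over behaviour clusters."""
    counts = {}
    for fn in candidates:
        key = behaviour(fn, probe_inputs)
        counts[key] = counts.get(key, 0) + 1
    n = len(candidates)
    return -sum(c / n * math.log(c / n) for c in counts.values())


# Four sampled "generations" for the task "return the absolute value of x".
samples = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,   # different text, same behaviour
    lambda x: x,                     # wrong for negative inputs
    lambda x: max(x, -x),
]
h = functional_entropy(samples, probe_inputs=[-2, 0, 3])
```

Low entropy means the samples agree functionally, even when their source text differs; high entropy flags a problem on which the model's generations disagree in behaviour.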

2605.28498 2026-05-28 cs.HC cs.AI 版本更新

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

验证决策:温暖度和用户特征如何塑造对信息搜索中对话代理的依赖

Mert Yazan, Frederik Bungaran Ishak Situmeang, Suzan Verberne

发表机构 * Amsterdam University of Applied Sciences(阿姆斯特丹应用科学大学) Leiden University(莱顿大学)

AI总结 研究通过混合实验发现,即使提供事实核查工具,用户仍过度依赖对话AI,验证行为主要由用户特征(如先验信任)驱动,而温暖对话风格通过增加对错误答案的认同间接影响依赖。

Comments Under review for Computers in Human Behavior

详情
AI中文摘要

对话式人工智能(AI)提供了高效便捷的信息访问途径。然而,当用户盲目信任AI并在不进行事实核查的情况下接受其答案时,可能会导致过度依赖。信息搜索日益遵循一种结合对话AI与网络搜索的混合交互范式,使得事实核查更加容易。本文考察了这种交互范式是否能有效抑制依赖。我们进一步探究了驱动用户验证AI答案的潜在因素(例如数字素养和对话温暖度)。我们进行了一项混合被试问答实验,参与者与温暖或中性的聊天机器人互动。我们的发现表明,尽管用户同时拥有对话和网络搜索的访问权限,依赖仍然存在。验证决策主要由现有的用户感知(例如对聊天机器人的先验信任)驱动,而非答案属性,一些用户无论上下文如何都会进行事实核查,而另一些用户则默认信任聊天机器人。温暖的对话风格通过增加对错误聊天机器人的认同,对依赖产生了间接但关键的影响。咨询额外的AI来源可预测更高的准确性,而传统网络搜索则不然。我们的研究通过以下方式扩展了过度依赖研究:(a)证明了即使在可进行事实核查的情况下,过度依赖仍然存在;(b)将验证行为识别为用户依赖性;(c)揭示了对话温暖度对过度依赖的间接影响,这对设计可信赖的对话搜索系统具有启示意义。

英文摘要

Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overreliance when users blindly trust AI and accept its answers without fact-checking. Information search increasingly follows a hybrid interaction paradigm that combines conversational AI with web search, making fact-checking easier. In this paper, we examine whether this interaction paradigm is effective in curbing reliance. We further investigate the underlying factors (e.g., digital literacy and conversation warmth) that drive users to verify AI answers. We conduct a mixed-subjects question-answering experiment where participants interact with either a warm or a neutral chatbot. Our findings reveal that reliance persists despite users having access to both conversational and web search. The decision to verify is driven primarily by existing user perceptions (e.g., prior trust in chatbots) rather than answer properties, with some users fact-checking regardless of the context and others trusting chatbots by default. Warm conversational style has an indirect yet critical influence on reliance by increasing agreement with the chatbot when it is incorrect. Consulting additional AI sources predicts higher accuracy, while traditional web search does not. Our study extends overreliance research by: (a) demonstrating its persistence despite access to fact-checking, (b) identifying verification behavior as user-dependent, and (c) revealing conversational warmth's indirect effect on overreliance with implications for designing trustworthy conversational search systems.

2605.28490 2026-05-28 cs.CV cs.AI 版本更新

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

SSR3D-LLM: 通过潜在步骤实现结构化空间推理以实现统一3D-LLM中的细粒度定位

Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Soochow University(苏州大学)

AI总结 针对统一3D-LLM中细粒度查询的脆弱性,提出SSR3D-LLM,通过潜在空间推理步骤和几何感知评分器逐步精炼候选排名,在多个基准上取得最优结果。

详情
AI中文摘要

3D物体定位从自然语言中定位3D场景中的所指对象。统一的以实例为中心的3D-LLM旨在同时解决定位、对话、问答和描述任务,但许多方法依赖于单一的指针式定位决策,将关系指令压缩为一个选择。这对于需要根据上下文对象和空间关系排除多个同类候选的细粒度查询来说是脆弱的。我们提出结构化空间推理3D-LLM(SSR3D-LLM),一种用于统一3D-LLM的结构化定位接口。给定固定的Mask3D物体提议,LLM从查询中写出一系列潜在的空间推理步骤和记忆令牌,然后一个几何感知评分器读取这些潜在步骤,通过逐步长度掩码逐步精炼候选排名。潜在步骤从标准基准目标监督和训练期间的辅助指代线索监督中学习,而推理仅使用输入查询和Mask3D提议。在ReferIt3D、ScanRefer和Multi3DRef上,SSR3D-LLM在统一3D-LLM基线中取得了最强结果,在细粒度定位上相比单指针QPG基线有显著提升,并相比先前的统一3D-LLM有一致改进,同时保留了默认的语言任务路径。

英文摘要

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.

2605.28487 2026-05-28 cs.AI cs.LG 版本更新

ProvMind: Provenance-grounded reasoning for materials synthesis

ProvMind:基于来源的材料合成推理

Yiming Zhang, Ryo Tamura, Koji Tsuda

发表机构 * Center for Basic Research on Materials, National Institute for Materials Science(材料基础研究中心,国家材料科学研究所) RIKEN Center for Advanced Intelligence Project(RIKEN高级智能项目中心)

AI总结 提出MatProcBench基准和ProvMind框架,通过来源图推理实现材料合成中的路线、条件和因果依赖优化,在双OOD分割上达到52.84%准确率。

详情
AI中文摘要

材料工艺优化需要对路线、条件、工具和因果依赖进行推理,然而大多数计算方法将合成过程扁平化为文本或有序步骤。我们引入了MatProcBench,一个基于文献挖掘的MatPROV图构建的来源基准,用于评估七个过程推理任务,涵盖路线连续性、步骤级变量推断和全局因果一致性,在相同分割和偏移感知评估下,包括结合时间与材料类别偏移的严格双OOD分割。我们进一步引入了ProvMind,一个过程记忆推理框架,检索类似训练过程,将其转换为来源感知的选项级兼容性分数,并使用语言模型进行约束最终决策。ProvMind在双OOD分割上达到52.84%的准确率,优于提示、检索增强和监督微调基线。

英文摘要

Materials process optimization requires reasoning over routes, conditions, tools and causal dependencies, yet most computational formulations flatten synthesis procedures into text or ordered steps. We introduce MatProcBench, a provenance-grounded benchmark constructed from literature-mined MatPROV graphs, to evaluate seven process-reasoning tasks spanning route continuity, step-level variable inference and global causal consistency under both same-split and shift-aware evaluation, including a strict dual-OOD split that combines temporal and material-class shift. We further introduce ProvMind, a process-memory reasoning framework that retrieves analogous training processes, converts them into provenance-aware option-level compatibility scores, and uses a language model for constrained final decision making. ProvMind achieves 52.84% accuracy on the dual-OOD split, outperforming prompting, retrieval-augmented and supervised fine-tuning baselines.

2605.28483 2026-05-28 cs.AI cs.IR 版本更新

From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints

从学习资源到能力:基于证据和图约束的LLM标签方法

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

发表机构 * Université de technologie de Compiègne, CNRS, Heudiasyc(贡比涅技术大学、CNRS、Heudiasyc实验室) Sorbonne Université, CNRS UMR 7585, LPMHE(索邦大学、CNRS UMR 7585、LPMHE实验室)

AI总结 提出一种端到端对齐流程,利用大语言模型作为受约束的、能产生证据的标签器,将学习资源链接到结构化能力框架,在计算机科学数据集上取得优于基线方法的性能。

详情
AI中文摘要

将学习资源链接到结构化能力框架是实现学习管理系统中基于能力的搜索和课程分析的关键。然而,手动标注劳动密集,全自动方法往往缺乏透明度。在本文中,我们提出了一种端到端对齐流程,使用大语言模型作为受约束的、能产生证据的标签器。LMS资源——包括教学内容和评估——首先被分割成有意义的教学片段。对于每个片段,从基于图上下文增强的结构化能力档案中检索一小部分候选能力。然后,LLM从该集合中选择最相关的能力,并从片段文本中提供支持证据片段。这些预测利用能力图的结构进行细化,并在资源级别聚合。我们在从计算机科学系的能力参考体系(UTC)构建的数据集上评估了我们的方法,该数据集涵盖22个能力,涉及多个课程材料。我们的LLM+BM25+Graph(LBG)流程取得了强劲的结果:片段级微F1为0.57,宏F1为0.50;资源级宏F1为0.51;MRR为0.82——优于零样本和少样本LLM变体、检索/相似性基线以及监督分类器——同时产生更多机械可追踪的证据片段,以支持人工审计和教育分析。

英文摘要

Linking learning resources to a structured competency framework is key to enabling competency-based search and curriculum analytics in Learning Management Systems (LMS). However, manual tagging is labor-intensive, and fully automatic methods often lack transparency. In this paper, we present an end-to-end alignment pipeline that uses a large language model (LLM) as a constrained, evidence-producing tagger. LMS resources (both instructional content and assessments) are first segmented into meaningful pedagogical fragments. For each fragment, a small set of candidate competencies is retrieved from structured competency profiles enriched with graph-based context. The LLM then selects the most relevant competencies from this set and provides supporting evidence spans from the fragment text. These predictions are refined using the structure of the competency graph and aggregated at the resource level. We evaluate our approach on a dataset built from the Computer Science department's competency referential at the Université de Technologie de Compiègne (UTC), covering 22 competencies across multiple course materials. Our LLM+BM25+Graph (LBG) pipeline achieves strong results, with a micro-F1 of 0.57 and macro-F1 of 0.50 at the fragment level, 0.51 macro-F1 at the resource level, and an MRR of 0.82, outperforming zero-shot and few-shot LLM variants, retrieval/similarity baselines, and supervised classifiers, while also producing more mechanically traceable evidence spans to support human auditing and educational analysis.

2605.28464 2026-05-28 cs.CL cs.AI 版本更新

The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

LJP 从未见过的案件:面向更完整刑事责任评估的起诉决定预测

Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang, Hui Huang, Qianru Wang, Chuan Xiao, Jianbin Qin, Shuyuan Zheng

发表机构 * Beijing Institute of Technology(北京理工大学) Osaka University(大阪大学) Xi’an University of Technology(西安理工大学) Institute of Science Tokyo(东京科学大学) City University of Hong Kong(香港城市大学)

AI总结 提出起诉决定预测(PDP)任务,通过分类起诉或三种不起诉决定,弥补法律判决预测(LJP)在刑事责任评估中的盲区,并构建PDP-Bench基准,实验表明大语言模型在PDP上表现显著差于LJP。

Comments 24 pages, 5 figures, 22 tables

详情
AI中文摘要

法律判决预测(LJP)已成为评估刑事法律领域人工智能的核心基准,但它只涉及已经通过检察审查并正式起诉的刑事案件。因此,LJP在评估刑事责任方面留下了大量盲区,忽略了证据不足、无刑事责任或免予处罚的案件。为填补这一空白,我们提出了起诉决定预测(PDP),这是首个围绕检察审查构建的法律AI任务,它将每个案件分类为起诉或三种不起诉决定之一,并反映了法律AI在证据评估、法律归类和基于价值的裁量方面的能力。我们进一步构建了PDP-Bench,一个包含4,630个真实中国检察决定、涵盖190个罪名的基准。大量实验表明,最先进的大语言模型在PDP上的表现显著差于LJP,且主流增强途径未能缩小差距。此外,受控的RLVR干预表明,简单的结果奖励无法产生可泛化的PDP判别能力。

英文摘要

Legal Judgment Prediction (LJP) has become a core benchmark for evaluating AI in the criminal legal domain, but it only sees criminal cases that have already passed prosecutorial review and been formally indicted. As a result, LJP leaves a substantial blind spot in assessing criminal liability, overlooking cases involving insufficient evidence, no criminal liability, or guilt exempted from punishment. To fill this gap, we propose Prosecution Decision Prediction (PDP), the first Legal AI task built around prosecutorial review, which classifies each case into prosecution or one of three non-prosecution decisions and reflects legal AI's capabilities in evidence evaluation, legal subsumption, and value-based discretion. We further construct PDP-Bench, a benchmark of 4,630 real Chinese prosecutorial decisions spanning 190 charges. Extensive experiments show that state-of-the-art LLMs perform substantially worse on PDP than on LJP and that mainstream enhancement routes fail to close the gap. Moreover, controlled RLVR interventions show that simple outcome rewards fail to produce generalizable PDP discrimination.

2605.28456 2026-05-28 cs.AI cs.CV eess.AS 版本更新

Diffusion Large Language Models for Visual Speech Recognition

用于视觉语音识别的扩散大语言模型

Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

发表机构 * Integrated Vision Language Lab, KAIST, South Korea(韩国科学技术院(KAIST)集成视觉语言实验室)

AI总结 提出首个基于扩散大语言模型(DLLM)的视觉语音识别框架DLLM-VSR,通过迭代掩码去噪和灵活顺序解码,结合置信度引导的解掩码策略及两阶段训练,并引入长度引导候选解码以降低目标长度不确定性,在LRS3上取得19.5%的词错误率。

Comments Code: https://github.com/JeongHun0716/dllm-vsr

详情
AI中文摘要

现有的视觉语音识别(VSR)系统通常依赖于从左到右的自回归解码,这可能在获得足够上下文之前,迫使对视觉模糊的令牌做出过早决策。我们提出DLLM-VSR,据我们所知,这是首个基于扩散大语言模型(DLLM)的VSR框架,将转录过程表述为具有灵活顺序解码的迭代掩码去噪。通过基于置信度的解掩码,DLLM-VSR早期提交高置信度位置,并利用已提交的令牌作为双向上下文来细化模糊令牌。为了使DLLM适应VSR,我们引入了一种两阶段掩码去噪训练策略,将视觉到文本的内容对齐与长度建模分离。我们进一步观察到,在假设知道真实转录长度的oracle长度解码下存在性能差距,这表明减少目标长度不确定性可以改善基于DLLM的VSR。为了缩小这一差距,我们开发了长度引导的候选解码,利用视频时长构建合理的转录长度假设,在多个假设下解码,并使用长度合理性和解码置信度对候选进行重新排序。所提出的方法仅使用LRS3的标注训练数据,就实现了19.5%的词错误率(WER),达到了最先进水平。

英文摘要

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge the first Diffusion Large Language Model (DLLM)-based VSR framework, formulating transcription as iterative masked denoising with flexible-order decoding. With confidence-based unmasking, DLLM-VSR commits high-confidence positions early and uses the committed tokens as bidirectional context to refine ambiguous ones. To adapt DLLMs to VSR, we introduce a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. We further observe a performance gap with oracle-length decoding, which assumes access to the true transcript length, indicating that reducing target-length uncertainty can improve DLLM-based VSR. To reduce this gap, we develop length-guided candidate decoding, which uses video duration to construct plausible transcript-length hypotheses, decodes under multiple hypotheses, and reranks candidates using length plausibility and decoding confidence. The proposed method achieves a state-of-the-art WER of 19.5% on LRS3 using only its labeled training data.
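The confidence-based unmasking loop described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not DLLM-VSR itself: `confidence_unmask`, `toy_predict`, the fixed target sentence, and the neighbour bonus of 0.3 are all hypothetical stand-ins for a real denoising model.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (proposed token, confidence in [0, 1])


def confidence_unmask(
    length: int,
    predict: Callable[[List[Optional[str]]], List[Prediction]],
    per_step: int = 1,
) -> Tuple[List[Optional[str]], List[int]]:
    """Fill a fully masked sequence, committing the most confident slots first.

    Returns the decoded tokens and the order in which positions were committed.
    """
    if per_step < 1:
        raise ValueError("per_step must be at least 1")
    tokens: List[Optional[str]] = [None] * length  # None marks a masked slot
    order: List[int] = []
    while any(t is None for t in tokens):
        preds = predict(tokens)  # one prediction per position, given context
        masked = [i for i, t in enumerate(tokens) if t is None]
        # Highest-confidence masked positions are committed first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens: List[Optional[str]]) -> List[Prediction]:
    """Hypothetical predictor: confidence grows once a neighbour is committed."""
    target = ["the", "cat", "sat"]
    base = [0.9, 0.4, 0.6]
    preds: List[Prediction] = []
    for i, word in enumerate(target):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(target)]
        committed = sum(tokens[j] is not None for j in neighbours)
        preds.append((word, min(1.0, base[i] + 0.3 * committed)))
    return preds
```

Calling `confidence_unmask(3, toy_predict)` commits position 0 first; the middle position then overtakes the last one because its committed neighbour raises its confidence, which is the flexible-order behaviour the abstract contrasts with strict left-to-right decoding.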

2605.28454 2026-05-28 cs.AI 版本更新

GONDOR to the Rescue: Satisficing Planning with Low Memory

GONDOR 救援:低内存下的满意规划

Yonatan Vernik, Alexander Tuisov, Alexander Shleyfman

发表机构 * Computer Science Department, Bar-Ilan University(巴伊兰大学计算机科学系) Independent Researcher(独立研究员)

AI总结 提出 GONDOR 算法,通过周期压缩搜索树并保留稀疏锚点状态,在严格内存限制下扩展 GBFS,实现低内存预算下的满意规划。

详情
AI中文摘要

贪婪最佳优先搜索(GBFS)是解决可通过启发式估计目标(如规划、路径查找、导航和寻路)的搜索问题的主要方法。当内存严格受限时(例如在边缘设备上规划),尤其如此。为了缓解这一问题,我们提出了 GONDOR(基于动态前哨站再搜索的贪婪在线导航),这是 GBFS 的一种内存高效扩展,通过周期性地压缩搜索树同时保留一组稀疏的锚点状态,允许在严格内存限制下继续搜索,然后在到达目标时通过在稀疏状态之间重新搜索来重建路径。我们分析了该算法,并讨论了由不同前哨站选择策略定义的几种变体。此外,我们探索了在关闭列表中使用布隆过滤器进行紧凑的重复检测。跨数值规划领域和启发式配置的实验表明,与标准 GBFS 相比,GONDOR 在低内存预算下持续提高了覆盖率。我们发布了 GONDOR 和布隆过滤器变体的实现,以促进对内存高效启发式搜索的进一步研究。

英文摘要

Greedy Best-First Search (GBFS) is the dominant approach for solving search problems where the goal can be estimated with a heuristic, such as planning, route finding, navigation, and pathfinding. This is especially true when the memory is tightly constrained, such as planning on edge devices. To alleviate that, we present GONDOR (Greedy Online Navigation with Dynamic Outpost-based Re-search), a memory-efficient extension of GBFS that allows search to continue under strict memory limits by periodically compressing the search tree while retaining a sparse set of anchor states, then upon reaching the goal reconstructs the path by re-searching between the sparse states. We analyze the algorithm and discuss several variants defined by different outpost selection policies. In addition, we explore using Bloom filters for compact duplicate detection in the closed list. Experiments across numeric planning domains and heuristic configurations show that GONDOR consistently improves coverage under low memory budgets compared to standard GBFS. We release the implementation of GONDOR and the Bloom-filter variant to facilitate further research on memory-efficient heuristic search.
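The abstract mentions Bloom filters for compact duplicate detection in the closed list. A minimal sketch of that data structure follows; it is not the released GONDOR implementation, and the class name, the SHA-256 double-hashing scheme, and the byte-string state encoding are assumptions chosen for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits < 1 or num_hashes < 1:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # memory is fixed up front

    def _positions(self, key: bytes) -> Iterator[int]:
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

Used as a closed list, a search would call `add(state_bytes)` on expansion and test `state_bytes in closed` before generating a successor. The memory footprint stays constant regardless of how many states are stored, at the price that a false positive may cause a state that was never expanded to be skipped.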

2605.28450 2026-05-28 cs.CV cs.AI 版本更新

BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers

BiasEdit: 一种无需训练的偏差检测与编辑框架,用于学习公平的视觉分类器

Jungwook Seo, Yoonsik Park, Changmin Lee, Sungyong Baik

发表机构 * Hanyang University Department of Artificial Intelligence BAIK Lab Seoul South Korea(汉阳大学人工智能系BAIK实验室首尔韩国) Hanyang University Department of Data Science BAIK Lab Seoul South Korea(汉阳大学数据科学系BAIK实验室首尔韩国) Hanyang University Department of Data Science Department of Artificial Intelligence BAIK Lab Seoul South Korea(汉阳大学数据科学系人工智能系BAIK实验室首尔韩国) Hanyang University(汉阳大学)

AI总结 提出BiasEdit框架,通过统计依赖和互信息分析自动检测偏差属性,并利用文本引导的图像编辑生成无偏样本,无需手动标注即可实现公平分类。

Comments Accepted to The Web Conference 2026 (formerly WWW) as an Oral presentation

详情
AI中文摘要

来自网络的视觉数据为图像分类器提供动力,这些分类器通常支撑着许多网络服务,如推荐和内容审核。然而,原始网络数据常常包含虚假关联和社会偏见,而神经网络以其倾向于学习数据中存在的偏见而闻名。这可能会加剧网络服务和网络数据中的不公平性,导致恶性循环。在图像分类的背景下,当大多数图像仅针对给定类包含相同属性时,网络会学习该类别的偏差属性。因此,从有偏数据集中训练公平且去偏的分类器需要处理多数具有偏差属性的图像(偏差对齐样本)与少数没有偏差属性的图像(偏差冲突样本)之间的不平衡问题。在这项工作中,我们引入了BiasEdit,一个模块化框架,能够自动从原始数据集中检测偏差属性并对其进行编辑,以构建去偏数据集。具体来说,BiasEdit首先通过视觉-语言表示的统计依赖性和互信息分析检测未知的偏差属性,然后使用文本引导的图像编辑显式编辑这些属性,以生成逼真的偏差冲突样本。与先前假设已知偏差属性或依赖合成混合的工作不同,我们的方法无需手动标注,并且可以利用现成的视觉-语言和编辑模型。BiasEdit解决了网络来源视觉AI中的一个基本挑战,减轻了数据集引起的偏差,并在训练数据完全有偏的情况下实现了最先进的去偏性能。

英文摘要

Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. However, raw Web data often contain spurious correlations and social biases, and neural networks are known for their tendency to learn biases present in data. This can reinforce unfairness in web services and web data, leading to a vicious cycle. In the context of image classification, networks learn bias attributes for a specific class when a majority of images contain the same attribute only for a given class. Hence, training a fair and debiased classifier from a biased dataset demands handling an imbalance between a majority of images with bias attributes (bias-aligned samples) and a minority without (bias-conflict samples). In this work, we introduce BiasEdit, a modular framework that automatically detects bias attributes from the original dataset and edits them to construct a debiased dataset. Specifically, BiasEdit first detects unknown bias attributes via statistical dependence and mutual information analysis of visual-linguistic representations, and then explicitly edits those attributes using text-guided image editing to generate realistic bias-conflict samples. Unlike prior works that assume known bias attributes or rely on synthetic mixing, our method operates without manual annotations and can leverage off-the-shelf vision-language and editing models. BiasEdit addresses a fundamental challenge in Web-sourced visual AI, mitigating dataset-induced bias and achieving state-of-the-art debiasing performance even when training data are fully biased.

2605.28441 2026-05-28 cs.CV cs.AI 版本更新

Bayesian Gated Non-Negative Contrastive Learning

贝叶斯门控非负对比学习

Peng Cui, Jiahao Zhang, Lijie Hu

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对对比学习中表示纠缠问题,提出贝叶斯门控非负对比学习,通过概率门控机制动态过滤无关特征,在Imagenet-100上语义一致性提升142.1%。

Comments Accepted by ICML 2026

详情
AI中文摘要

虽然对比学习(CL)已经革新了自监督表示学习,但其潜在表示仍然高度纠缠且不透明,限制了在安全关键应用中的可解释性。我们发现这种纠缠的一个根本原因是对确定性相似度量的依赖,该度量平等地对待所有特征维度。在组合场景中,这会产生优化冲突:常见的背景特征(如“蓝天”)被鼓励在正对中对齐,但同时又在负对中排斥,导致梯度振荡,阻碍精确的语义解缠。为了解决这个问题,我们提出了BayesNCL(贝叶斯门控非负对比学习)。与标准方法不同,BayesNCL引入了一种概率门控机制,动态过滤掉与任务无关的高频常见特征,同时选择性地保留判别性语义。通过将特征选择形式化为具有稀疏伯努利先验的变分推理问题,我们的方法有效解决了优化冲突。在Imagenet-100上的实验结果表明,与最先进的基线相比,BayesNCL在语义一致性上实现了142.1%的显著提升,在不影响下游任务性能的情况下产生了高度可解释的表示。代码可在 https://github.com/Cui-Peng-624/BayesNCL 获取。

英文摘要

While Contrastive Learning (CL) has revolutionized self-supervised representation learning, its latent representations remain highly entangled and opaque, limiting their interpretability in safety-critical applications. We identify that a fundamental cause of this entanglement is the reliance on deterministic similarity measures, which treat all feature dimensions equally. In compositional scenes, this creates an Optimization Conflict: common background features, such as "blue sky", are encouraged to align in positive pairs but simultaneously repelled in negative pairs, causing gradient oscillations that hinder precise semantic disentanglement. To address this, we propose BayesNCL (Bayesian Gated Non-Negative Contrastive Learning). Unlike standard approaches, BayesNCL introduces a probabilistic gating mechanism that dynamically filters out task-irrelevant, high-frequency common features while selectively retaining discriminative semantics. By formalizing feature selection as a variational inference problem with a sparse Bernoulli prior, our method effectively resolves the optimization conflict. Experimental results on ImageNet-100 demonstrate that BayesNCL achieves a remarkable 142.1% improvement in semantic consistency compared to state-of-the-art baselines, yielding highly interpretable representations without compromising downstream task performance. Code is available at https://github.com/Cui-Peng-624/BayesNCL.

2605.28428 2026-05-28 cs.CV cs.AI 版本更新

Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

通过无训练图拉普拉斯能量最小化的非一致性异常检测

Jungwook Seo, Minjeong Kim, Younkwan Lee, Seungho Shin, Sungyong Baik

发表机构 * Dept. of Artificial Intelligence, Hanyang University(汉阳大学人工智能系) Dept. of Data Science, Hanyang University(汉阳大学数据科学系) Global Technology Research, Samsung Electronics(三星电子全球技术研究)

AI总结 提出一种无训练图拉普拉斯能量优化方法ANoCo,通过查询补丁与正常流形对齐所需的更新幅度来度量异常,无需学习参数或采样,在标准基准上取得强图像级AUROC和稳定定位图。

Comments Accepted to CVPR 2026

详情
AI中文摘要

检测图像中的细微视觉异常仍然具有挑战性,特别是当仅预先提供正常样本时。这种无监督异常检测通常通过测量查询补丁与正常补丁记忆库的特征相似性来解决。然而,仅凭相似性无法揭示查询补丁在多大程度上违反了正常特征流形的结构。我们提出了一种无训练的拉普拉斯图能量优化公式,名为ANoCo,它通过查询补丁与固定正常流形对齐所需的非一致性成本来评分异常。对于每个查询补丁,我们构建一个由余弦亲和性加权的二分查询-正常图,明确移除查询-查询和正常-正常边以防止证据稀释。我们将异常评分公式化为带有锚定正常节点的凸拉普拉斯能量,并以闭式求解。特别地,我们不使用优化后的特征本身——异常分数是满足正常性约束所需的更新幅度,将图拉普拉斯重新定义为非一致性算子而非平滑先验。所提出的方法不引入可学习参数、消息传递或采样,其复杂度与单次线性求解相当。在标准基准上,它实现了强大的图像级AUROC、稳定的定位图以及相比先前方法更强的鲁棒性,证明了使用优化诱导的特征漂移作为异常度量的有效性。

英文摘要

Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring the feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo, that scores Anomaly by the cost of Non-Conformity of a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query-to-normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution. We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve it in closed form. In particular, we do not use the optimized features themselves; the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior. The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve. Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as an anomaly measure.
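To make the idea of "update magnitude as anomaly score" concrete, here is a worked example of one simple anchored energy with a closed-form minimizer. It is an illustrative assumption, not the paper's formulation: the symbols $q$ (query feature), $n_j$ (fixed normal features), $w_j \ge 0$ (affinities, assumed clipped to be non-negative), and $\lambda > 0$ (fidelity weight) are all introduced here.

```latex
\begin{align*}
E(z) &= \lambda \,\lVert z - q \rVert^2 + \sum_{j} w_j \,\lVert z - n_j \rVert^2,
  \qquad W = \sum_{j} w_j > 0, \\
\nabla E(z^\ast) = 0 \;&\Longrightarrow\;
  z^\ast = \frac{\lambda\, q + \sum_{j} w_j\, n_j}{\lambda + W}, \\
s(q) &= \lVert z^\ast - q \rVert
  = \frac{W}{\lambda + W}\, \Bigl\lVert q - \bar{n}_w \Bigr\rVert,
  \qquad \bar{n}_w = \frac{1}{W} \sum_{j} w_j\, n_j .
\end{align*}
```

Under these assumptions the score is zero when the query already coincides with the affinity-weighted mean of its normal neighbours and grows with the displacement needed to reach it, so the optimized feature $z^\ast$ itself is never used as a representation.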

2605.28422 2026-05-28 cs.CV cs.AI 版本更新

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

VITAL: 视觉-语义双重监督增强可解释的医学多模态大语言模型潜在推理

Qiaoru Li, Shaotian Liang, Jintao Chen, Haoran Sun, Yuxiang Cai, Jianwei Yin, Yankai Jiang

发表机构 * Zhejiang University(浙江大学) Shanghai AI Laboratory(上海人工智能实验室) Tencent(腾讯) Ningbo Global Innovation Center, Zhejiang University(宁波全球创新中心,浙江大学) Zhejiang Key Laboratory of Digital-Intelligence Service Technology(浙江省数字智能服务技术重点实验室)

AI总结 提出VITAL框架,通过视觉-语义双重监督(文本解码器重构推理链、视觉投影器回归ROI特征)实现医学MLLM的可解释潜在推理,在7个基准上达到SOTA。

详情
AI中文摘要

潜在推理能够对连续隐藏状态而非显式token进行推理,避免了医学VQA中思维链的语言瓶颈和推理开销。然而,现有方法存在模态崩溃、视觉监督不足以及训练-推理不匹配的问题。此外,其不透明的潜在状态缺乏可解释性,而这在临床应用中至关重要。我们提出VITAL,一个用于医学MLLM的潜在空间推理框架,具有视觉-语义双重监督:一个辅助文本解码器从潜在状态重建推理链,同时一个视觉投影器从冻结的独立医学视觉编码器回归ROI特征。两个模块在推理时被丢弃,零开销,但可以在事后重新附加以实现双重可解释性,在不牺牲效率的情况下提供推理过程的文本和视觉解释。我们构建了一个涵盖9种成像模态的61K数据集,比之前的医学视觉潜在推理数据集大一个数量级。在7个基准上的实验表明,VITAL一致且显著优于骨干模型、所有潜在推理基线以及在更大数据上训练的医学MLLM,达到了与万亿参数专有模型竞争的最先进结果。

英文摘要

Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.

2605.28421 2026-05-28 cs.AI 版本更新

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

DenoiseRL:引导推理模型从噪声前缀中恢复

Caijun Xu, Changyi Xiao, Zhongyuan Peng, Yixin Cao

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出DenoiseRL框架,通过强化学习从弱模型的错误推理中学习,无需外部监督或强教师模型,提升推理性能和训练效率。

Comments 17 pages, 6 figures

详情
AI中文摘要

强化学习已成为推动大型语言模型推理能力发展的核心范式,然而现有方法仍依赖更强的教师模型或精心策划的困难数据集,限制了可扩展的能力提升。在本文中,我们提出DenoiseRL,一种强化学习框架,通过从弱模型的失败中恢复导向优化来替代外部监督。DenoiseRL不依赖更强的监督或精心设计的数据,而是直接从错误的推理轨迹中学习,将其转化为改进的机会,使训练更具可扩展性且更少依赖外部资源。这产生了更丰富、更多样化的学习信号,提高了从非完美模型行为中探索的效率。因此,DenoiseRL提升了推理性能和整体训练效率,同时减少了对昂贵数据整理或更强教师模型的需求。实验表明,DenoiseRL在竞争性数学和通用推理基准上持续优于强在线强化学习基线,并随着训练难度增加促进更强的自我纠正行为,突显了改进大型语言模型推理的一种有效且可扩展的替代路径。

英文摘要

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that substitutes external supervision with recovery-oriented optimization over failures from weak models. Instead of relying on stronger supervision or carefully engineered data, DenoiseRL learns directly from incorrect reasoning traces by converting them into opportunities for improvement, making training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal, improving exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, highlighting an effective and scalable alternative pathway for improving reasoning in large language models.

2605.28409 2026-05-28 cs.AI 版本更新

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

基于离线强化学习的代码生成LLM高效后训练

Mingze Wu, Abhinav Anand, Shweta Verma, Mira Mezini

发表机构 * Hessian Center for Artificial Intelligence(黑森人工智能中心) National Research Center for Applied Cybersecurity ATHENE(应用网络安全国家研究中心ATHENE)

AI总结 本文探索使用离线强化学习利用现有代码数据集对代码生成LLM进行后训练,实验表明该方法能有效提升模型性能,尤其适用于小模型和复杂编码问题。

详情
AI中文摘要

使用在线强化学习(RL)进行后训练是LLM(包括代码生成模型)的重要训练步骤。然而,用于代码生成的在线RL涉及LLM推理和生成输出的验证,这可能耗费大量时间和资源。在本文中,我们通过利用现有代码数据集,探索将离线RL应用于代码生成模型。我们的实验表明,离线RL是提升LLM性能的有效训练策略。我们证明,离线RL对于小型LLM和具有挑战性的编码问题尤其有益。

英文摘要

Post-training using online reinforcement learning (RL) is an important training step for LLMs, including code-generating models. However, online RL for code generation involves LLM inference and verification of the generated output, which can take considerable time and resources. In this paper, we explore the application of offline RL to code-generating models by leveraging existing code datasets. Our experiments demonstrate that offline RL is an effective training strategy for improving LLM performance. We show that offline RL can be especially beneficial for small LLMs and challenging coding problems.

2605.28405 2026-05-28 cs.AI 版本更新

Measuring Progress Toward AGI: A Cognitive Framework

衡量AGI进展:一个认知框架

Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris, Allan Dafoe, Alison M. Snyder, Noah D. Goodman, Matthew Botvinick, Shane Legg

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 本文提出一个基于认知分类学的框架,通过10个关键认知能力评估系统性能,以量化AGI进展。

Comments 32 pages, 2 figures

详情
AI中文摘要

尽管AGI被广泛讨论,但目前缺乏衡量其进展的明确框架。这种模糊性助长了主观论断,使追踪进展变得困难,并可能阻碍负责任的治理。作为解决这一问题的起点,我们提出了一个理解系统能力与人类认知能力关系的框架。借鉴心理学、神经科学和认知科学数十年的研究,我们引入了一个认知分类学,将通用智能分解为10个关键认知能力。然后,我们提出一个严格的评估协议,通过一套有针对性的、保留的认知任务来衡量系统性能,生成可用于理解系统优缺点的“认知轮廓”。我们希望这一框架能为更严格、更实证的AGI评估提供实用路线图和初步步骤。

英文摘要

Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstructs general intelligence into 10 key cognitive faculties. We then propose a rigorous evaluation protocol in which a system's performance is measured across a suite of targeted, held-out cognitive tasks, generating a 'cognitive profile' that can be used to understand a system's strengths and weaknesses. We hope this framework will provide a practical roadmap and an initial step toward more rigorous, empirical evaluation of AGI.

2605.28398 2026-05-28 cs.AI 版本更新

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

HRBench:混合推理大语言模型中思维模式切换策略的基准测试与理解

Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang, Hao Liu

发表机构 * AI Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)人工智能学域)

AI总结 提出HRBench统一评估框架,系统研究混合推理LLM中基于提示、外部路由和推测执行三类切换策略在四种训练机制下的效率-效果权衡,揭示策略选择随模型规模和任务领域的变化规律。

Comments Under review

详情
AI中文摘要

混合推理大语言模型(LLMs)暴露了对推理努力程度的显式控制,允许用户或系统在答案质量与推理成本之间进行权衡。然而,现有的自适应思维模式选择方法通常在不同模型、数据集和实现假设下进行评估,使得比较它们的实际行为变得困难。我们引入了HRBench,一个用于研究混合推理LLM中思维模式切换的统一评估框架。HRBench沿两个轴组织设计空间:三种切换策略族(基于提示的选择、外部路由和推测执行)和四种训练机制(无训练、SFT、离线RL和在线RL),产生12种受控评估设置。我们在6个LLM(从Qwen3.5-2B到Kimi-K2.5-1.1T)和5个涵盖数学、科学和代码的推理基准上评估这些设置,并在同一流水线中重新实现了12种以上有代表性的先前方法。我们的分析表征了不同切换策略如何占据不同的效率-效果权衡区域:基于提示的方法通常提供有利的token-准确率权衡,路由方法提供更稳定的成本降低,而推测方法倾向于以更高的token成本提高准确率。我们进一步发现训练对不同策略的影响不同,且首选策略随模型规模和任务领域而变化。HRBench提供了参考实现和统一评估平台,以支持对混合推理LLM中高效推理的更受控研究。我们的数据、代码和仓库可在https://github.com/usail-hkust/HRBench获取。

英文摘要

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, making it difficult to compare their practical behavior. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching strategy families (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, SFT, offline RL, and online RL), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how different switching strategies occupy distinct effectiveness-efficiency trade-off regions: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data, code and repository are available at https://github.com/usail-hkust/HRBench.

2605.28396 2026-05-28 cs.LG cs.AI 版本更新

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

ADWIN: 用于视野感知在线策略蒸馏的自适应窗口

Kun Liang, Chenming Tang, Clive Bai, Weijie Liu, Saiyong Yang, Yunfang Wu

发表机构 * School of Computer Science, Peking University(北京大学计算机学院) National Key Laboratory for Multimedia Information Processing, Peking University(北京大学多媒体信息处理国家重点实验室) LLM Department, Tencent(腾讯LLM部门)

AI总结 提出ADWIN框架,通过自适应窗口动态调整在线策略蒸馏中的轨迹长度,在保持或提升准确率的同时,将训练成本降低最多4.1倍。

详情
AI中文摘要

在线策略蒸馏(OPD)通过沿着学生生成的轨迹训练学生模型,并利用教师反馈来迁移推理行为,但标准的全轨迹训练将每次更新与昂贵的完整轨迹绑定,并且可能过度分配监督到对当前学生边际价值较低的后半部分。我们通过有用监督视野重新审视这一假设:学生引起的轨迹可能偏离教师偏好的延续,而对齐的前缀可能已经保留了长视野OPD更新方向。我们提出ADWIN,一种用于OPD的自适应窗口框架,将轨迹长度视为在线可接受性决策,在短的教师锚定前缀上训练,同时使用延迟的全轨迹探测来审计前缀与全轨迹的对齐情况,并通过陈旧性控制自适应调整下一视野。在数学和代码推理基准测试中,包括单任务、多任务和强到弱设置,ADWIN在全轨迹OPD和基于前缀的基线方法上改善了准确率与计算成本的权衡,将端到端训练成本降低最多4.1倍,同时达到相当或更好的准确率。

英文摘要

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low marginal value for the current student. We revisit this assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision, training on short teacher-anchored prefixes while using delayed full-rollout probes to audit prefix-to-full alignment and adapt the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy-compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1 times while achieving comparable or better accuracy.

2605.28390 2026-05-28 cs.AI 版本更新

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

你活不止一次:迈向分层技能元进化

Xujun Li, Kehan Zheng, Mingyuan Zhao, Yize Geng, Jinfeng Zhou, Qi Zhu, Fei Mi, Lifeng Shang, Minlie Huang, Hongning Wang

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Huawei Foundation Model Department(华为基础模型部门)

AI总结 本文提出HiSME,一种轻量级分层技能元进化方法,通过从智能体任务执行轨迹中学习元技能,联合优化技能和技能进化策略,以持续提升部署的智能体系统在不同下游场景中的性能。

详情
AI中文摘要

测试时技能进化被视为增强已部署智能体系统的新范式。现有工作主要关注硬编码的技能进化策略或依赖底层LLM中昂贵参数更新的参数化学习。在本文中,我们证明,对于在不同下游场景中持续改进智能体系统,对技能进化框架本身进行测试时优化是必要的,并且轻量级的算法适应是可行的。具体来说,我们提出HiSME,一种轻量级分层技能元进化解决方案,通过从智能体的任务执行轨迹中学习元技能,联合优化技能和技能进化策略。在多样化智能体基准上的实验表明,元进化可以产生比纯技能进化更高质量的技能库,并能为不同场景推导出多样化的元技能,从而促进未来的持续经验学习。我们的代码暂时公开在https://anonymous.4open.science/r/HiSME-BD45。

英文摘要

Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded skill evolving strategies or parametric learning that rely on expensive parameter updates in the underlying LLMs. In this paper, we demonstrate that test-time refinement of the skill evolving framework itself is necessary for continuous improvement of the agent systems in different downstream scenarios, and lightweight algorithmic adaptation is feasible. Specifically, we propose HiSME, a lightweight hierarchical skill meta-evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta-skills from agents' task execution traces. Experiments on diverse agentic benchmarks show that meta-evolving can produce a higher-quality skill library than pure skill evolving and can derive diverse meta-skills for different scenarios, thereby facilitating future continual experience learning. Our code is temporarily public at https://anonymous.4open.science/r/HiSME-BD45.

2605.28388 2026-05-28 cs.AI 版本更新

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

机制性解释样本难度在RLVR中对大语言模型的作用

Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, Zhanxing Zhu

发表机构 * Beijing Jiaotong University(北京交通大学) Ant Group(蚂蚁集团) Northwestern Polytechnical University(西北工业大学) University of Leeds(利兹大学) University of Southampton(南安普顿大学)

AI总结 本文通过难度维度和单样本分析,发现样本难度对RLVR有非单调影响,中等难度问题提供最稳定的推理改进,并基于此提出难度自适应策略。

Comments 30 pages, 11 figures

详情
AI中文摘要

经验表明,带可验证奖励的强化学习(RLVR)能显著提升大语言模型(LLMs)的推理性能,尤其是在数学和编程领域。然而,样本难度在RLVR中的机制性作用仍不明确。本文通过难度维度和单样本分析研究RLVR。我们发现样本难度对RLVR有非单调影响:简单和中等难度问题带来最强且最稳定的推理改进,而过难问题往往提供弱学习信号,诱发退化行为(如重复答案或跳过必要计算),并最终损害模型已有的能力。除了响应层面,我们还利用时间稀疏自编码器(T-SAE)分析模型内部特征动态。简单问题主要强化直接答案和基本计算特征,同时抑制深思熟虑推理特征;困难问题激活推理相关特征,但仅在成功轨迹被采样时才有用;中等难度问题提供更平衡的信号,同时强化计算和多步推理特征。基于这些发现,我们提出了针对困难样本的难度自适应策略,利用反向推理重构和T-SAE引导的训练信号来改善RLVR中的奖励密度和信用分配。总体而言,我们的结果将样本难度识别为控制RLVR优化动态和表示演化的关键因素。

英文摘要

Reinforcement Learning with Verifiable Reward (RLVR) has been shown empirically to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of sample difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through difficulty-wise and one-sample analyses. We find that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model's pre-existing capabilities. Beyond the response level, we further analyze the model's internal feature dynamics using Temporal Sparse Autoencoders (T-SAE). Easy problems mainly reinforce direct-answer and basic-computation features while suppressing deliberative-reasoning features; hard problems activate reasoning-related features but become useful only when successful trajectories are sampled; medium-difficulty problems provide a more balanced signal, strengthening both computation and multi-step reasoning features. Motivated by these findings, we propose difficulty-adaptive strategies for hard-sample utilization, using backward-reasoning reformulation and T-SAE-guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.

2605.28387 2026-05-28 cs.LG cs.AI cs.NE 版本更新

CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

CLANE: 基于事件相机在神经形态硬件上的动作持续学习

Elvin Hajizada, Michael Neumeier, Edward Paxon Frady, Yulia Sandamirskaya, Axel von Arnim, Bing Li, Eyke Hüllermeier

发表机构 * Institute of Informatics, University of Munich (LMU)(慕尼黑大学信息学院) fortiss GmbH, Neuromorphic Computing(fortiss GmbH 神经形态计算部门) Technical University of Munich, TUM School of CIT(慕尼黑工业大学 CIT 学院) Intel Labs, Intel Corporation(英特尔实验室,英特尔公司) Institute of Computational Life Sciences (ICLS), Zurich University of Applied Sciences (ZHAW)(苏黎世应用科学大学(ZHAW)计算生命科学研究所) Technische Universität Ilmenau, Resource-Efficient Artificial Intelligence Group(伊尔梅瑙工业大学资源高效人工智能小组) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) German Research Centre for Artificial Intelligence (DFKI)(德国人工智能研究中心)

AI总结 提出CLANE系统,在Intel Loihi 2神经形态芯片上实现端到端的持续学习,用于事件相机动作识别,通过尖峰CNN和新型Loihi 2模块实现高能效和低延迟。

详情
AI中文摘要

识别并持续学习新的人类动作而不遗忘先前类别,是新兴AR/VR和机器人应用的需求。对于这些应用,设备上的处理和学习对于隐私和低延迟适应至关重要。事件相机通过稀疏、异步的输出解决了视觉传感的效率问题,该输出天然兼容神经形态处理。然而,此前没有系统部署过使用神经形态硬件进行基于事件的持续设备上学习流水线。我们提出了CLANE(基于事件相机在神经形态硬件上的动作持续学习),端到端部署在Intel Loihi 2上。CLANE将用于时空特征提取的脉冲2D CNN与作为片上学习头的CLP-SNN相结合,并通过时间聚合层和定点归一化层(两者均为新型Loihi 2模块)扩展到动作片段。在真实条件下捕获的50类数据集THU E-ACT-50上,CLANE在持续学习任务中达到70.4%的准确率,同时相比顺序CNN+GRU+CLP边缘GPU基线实现了超过100倍的能耗降低和16倍的延迟降低,通过三个评估级别的跨平台等算法基准测试得到验证。

英文摘要

Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement for emerging AR/VR and robotics applications. For these applications, both on-device processing and learning are essential for privacy and low-latency adaptation. Event cameras address the efficiency of visual sensing with sparse, asynchronous output that is naturally compatible with neuromorphic processing. Yet no prior system has deployed a continual on-device learning pipeline for event-based action recognition using neuromorphic hardware. We present CLANE, Continual Learning of Actions on Neuromorphic Hardware from Event Cameras, deployed end-to-end on Intel Loihi 2. CLANE combines a spiking 2D CNN for spatiotemporal feature extraction with CLP-SNN as its on-chip learning head, extended to action clips via a Temporal Aggregation Layer and a fixed-point Normalization Layer, both novel Loihi 2 modules. On THU E-ACT-50, a 50-class dataset captured under real-world conditions, CLANE achieves 70.4% accuracy in a continual learning task while delivering more than 100x energy reduction and 16x lower latency over a sequential CNN+GRU+CLP edge GPU baseline, validated through iso-algorithm cross-platform benchmarking across three evaluation levels.

2605.28371 2026-05-28 cs.AI cs.LG cs.SE 版本更新

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

从论文到基准测试:基于智能体和框架的机器健康智能中欠规范方法复现

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt, Lev Telyatnikov, Olga Fink

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 提出一种基于智能体和共享框架的方法,通过槽绑定接口将论文转化为可执行、可比较的基准测试实现,解决工业预测与健康管理中方法复现的困难。

详情
AI中文摘要

工业预测与健康管理(PHM)为应用机器学习中的更广泛挑战提供了一个代表性案例研究:将已发表的论文转化为可执行、可基准测试的实现。由于工业数据集的访问受限、预处理和评估协议的报告不完整以及隐含的设计选择(例如,窗口化、目标构建、数据分割)对性能有重要影响,复现PHM中的欠规范方法尤为困难。现有的论文到代码系统为单篇论文生成实现,但由于假设和评估设置的不一致性,这些产物通常无法直接比较。我们引入了基于智能体和框架的PHM论文复现方法,其中智能体通过槽绑定接口将论文转化为共享的PHM基准测试框架。该接口将方程和协议描述映射为结构化组件(任务定义、数据集适配器、窗口化、目标、模型和评估器),同时明确记录未解决的假设。最终实现通过标准化任务契约和评估钩子进行验证,从而实现一致且可比较的基准测试。我们在16篇PHM论文上评估了该方法,比较了框架增强型、基于技能和基于提示的智能体复现与最近的无框架论文复现智能体。我们评估了复现成功率、基于模型的代码评估、论文假设的框架绑定以及标准化协议下的跨论文基准可比性。结果表明,将智能体生成与共享框架相结合,将论文复现从孤立的代码合成转变为可执行、假设感知且系统可比较的基准测试实现。

英文摘要

Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing under-specified methods in PHM is particularly difficult due to restricted access to industrial datasets, incomplete reporting of preprocessing and evaluation protocols, and implicit design choices (e.g., windowing, target construction, data splits) that critically affect performance. Existing paper-to-code systems generate implementations for individual papers, but these artifacts are often not directly comparable due to inconsistencies in assumptions and evaluation settings. We introduce agentic, framework-based PHM paper reproduction, where an agent translates a paper into a shared PHM benchmark framework via a slot-binding interface. This interface maps equations and protocol descriptions into structured components (task definitions, dataset adapters, windowing, targets, models, and evaluators), while explicitly recording unresolved assumptions. The resulting implementations are validated against standardized task contracts and evaluation hooks, enabling consistent and comparable benchmarking. We evaluate this approach on 16 PHM papers, comparing framework-enhanced, skill-based and prompt-based agentic reproduction against a recent framework-free paper-reproduction agent. We assess reproduction success, model-based code evaluation, framework binding of paper assumptions, and cross-paper benchmark comparability under standardized protocols. Our results show that coupling agentic generation with a shared framework transforms paper reproduction from isolated code synthesis into executable, assumption-aware, and systematically comparable benchmark implementations.

2605.28369 2026-05-28 cs.AI cs.SI 版本更新

CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

CyberJurors:电商纠纷裁决的多智能体模拟任务

Yanhui Sun, Wu Liu, Haifeng Ming, Xinru Wang, Hantao Yao, Yongdong Zhang

发表机构 * School of Information Science and Technology, University of Science and Technology of China(信息科学与技术学院,中国科学技术大学)

AI总结 针对电商纠纷裁决需要从冗余多轮多模态证据中提取关键线索并依据平台特定惯例决策的问题,提出多智能体框架CyberJurors,通过个体裁决链式思维和集体陪审共识裁决提升裁决质量,在包含6000真实案例的基准上超越现有方法。

Comments ICML 2026

详情
AI中文摘要

电商平台开始招募众包陪审员来裁决大量交易纠纷。与正式法律判决不同,电商纠纷裁决需要从冗余、多轮、多模态证据中提取关键线索,并在平台特定的灵活惯例下做出决策。这些特点使得现有方法不足以应对该场景。为弥补这一差距,我们引入了一项开创性任务——电商纠纷裁决(EDV),并提出了VerdictBench,一个包含6000个真实案例的多模态基准,旨在反映众包陪审团决策。在此基础上,我们提出了CyberJurors,一个多智能体框架,用于澄清纠纷逻辑并规范裁决过程。在个体层面,个体裁决链式思维将EDV任务分解为四个结构化的推理阶段,实现细粒度线索感知并澄清关键线索与纠纷焦点之间的因果逻辑。在集体层面,陪审共识裁决模拟陪审员之间的多轮讨论和投票,同时纳入裁决先例以减轻对任一争议方的认知偏差。在VerdictBench上的实验表明,CyberJurors优于最先进的LLM、MLLM和法庭模拟器,同时与真实陪审团投票模式实现了更强的一致性。代码和数据集可在https://github.com/YanhuiS/CyberJurors 和 https://huggingface.co/datasets/piggi/VerdictBench 获取。

英文摘要

E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal judgment, E-commerce dispute verdicts require grounding pivotal clues from redundant, multi-round, multimodal evidence and making decisions under flexible platform-specific conventions. These characteristics render existing methods insufficient for this scenario. To bridge this gap, we introduce a pioneering task, E-commerce Dispute Verdicts (EDV), and present VerdictBench, a multimodal benchmark comprising 6,000 real-world cases designed to reflect crowdsourced jury decisions. Building upon this, we propose CyberJurors, a multi-agent framework to clarify the dispute logic and regulate the verdict process. At the individual level, Individual Verdict Chain-of-Thought decomposes the EDV task into four structured reasoning stages, enabling fine-grained clue perception and clarifying causal logic between pivotal clues and the dispute focus. At the collective level, Jury Consensus Verdict simulates multi-round discussion and voting among jurors, while incorporating verdict precedents to mitigate cognitive biases toward either disputant. Experiments on VerdictBench show that CyberJurors outperforms state-of-the-art LLMs, MLLMs, and court simulators, while achieving stronger alignment with real-world jury voting patterns. Code and dataset are available at https://github.com/YanhuiS/CyberJurors and https://huggingface.co/datasets/piggi/VerdictBench.

2605.28365 2026-05-28 cs.AI cs.CL cs.LO 版本更新

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

风险控制的 Lean 作为自然语言数学推理的评判者

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

发表机构 * Imperial College London(伦敦帝国理工学院) Huawei Noah’s Ark Lab(华为诺亚实验室) UCL Centre for AI(伦敦大学学院人工智能中心)

AI总结 针对 Lean 评判自然语言数学答案时信号稀疏且不忠实的问题,提出 COVCAL 选择器,通过有限样本选择性风险控制,在自动形式化覆盖率足够高时保证接受答案的准确率。

详情
AI中文摘要

Lean 越来越多地被用于评判自然语言数学答案,但其信号是不完全的:许多答案从未被形式化,而一个失败的证明可能反映类型错误或缺少库事实,而非答案错误。在 MATH-500 上,我们表明该信号 (i) 严重依赖于覆盖率,即在证明覆盖率高的答案中正确率为 96%,但在覆盖率低时为 20%,以及 (ii) 稀疏且常常不忠实:一个 7B 自动形式化器仅对 28% 的问题证明了某个类别,而人工审计发现其中只有约 43% 的证明是忠实的。我们提出 COVCAL,一个基于 Lean 跟踪诊断的选择器,它在两种机制(保守的 Bonferroni 界和更紧的 dev-then-cal 规则)下,对接受的答案认证有限样本选择性风险界,否则弃权。可行性取决于自动形式化覆盖率:对于 7B 形式化器,信号过于稀疏,Bonferroni 在所有 20 个自助法分区上弃权,而一个专用于证明器的形式化器达到 79% 的覆盖率,并在 20 个分区中的 17 个上使其可行,以 0.98 的接受准确率接受约 48% 的问题。由于自一致性本身已达到 91% 的准确率,我们的贡献是精确描述了何时以及使用哪个形式化器,部分形式化信号可以在风险控制下被信任。

英文摘要

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent: the proof-winning answer is correct 96% of the time at high proved coverage but only 20% of the time at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
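
The accept-or-abstain idea in this abstract can be illustrated with a generic selective-risk calibration sketch. This is not the COVCAL rule itself: the Hoeffding bound, the single scalar `scores` diagnostic, and the function names are all assumptions chosen for illustration. It shows how a Bonferroni-corrected bound either certifies a threshold or abstains when the calibration sample is too small.

```python
import math

def hoeffding_upper_bound(errors: int, n: int, delta: float) -> float:
    """One-sided Hoeffding upper confidence bound on an error rate."""
    if n == 0:
        return 1.0
    return min(1.0, errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))

def calibrate_threshold(scores, correct, thresholds, alpha, delta):
    """Return the most permissive threshold whose certified risk is <= alpha, else None.

    scores[i]  : confidence score for answer i (hypothetical diagnostic)
    correct[i] : True if answer i is right (calibration labels)
    """
    per_test_delta = delta / len(thresholds)  # Bonferroni correction over thresholds
    for t in sorted(thresholds):  # a lower threshold accepts more answers
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        errors = sum(1 for c in accepted if not c)
        bound = hoeffding_upper_bound(errors, len(accepted), per_test_delta)
        if accepted and bound <= alpha:
            return t
    return None  # no threshold can be certified: abstain on everything
```

With few calibration examples the bound stays above the target risk for every threshold and the function returns `None`, which mirrors the abstract's point that a sparse signal makes certification infeasible.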

2605.28360 2026-05-28 cs.AI 版本更新

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

提示码本:面向语言模型指令精炼的离散组合优化

Jyotirmoy Nath, Neeraj Kumar, Brejesh Lall

发表机构 * IIT Delhi(印度德里理工学院)

AI总结 提出Prompt Codebooks (PCO)框架,将自动提示优化重构为离散组合学习,通过可重用的自然语言本能单元实现实例级路由和结构化反馈,在多个基准上提升性能并压缩提示长度。

详情
AI中文摘要

自动提示优化(APO)显著提升了基于LLM的智能体工作流。然而,现有方法将每个任务的提示视为一个整体、实例无关的字符串,通过全局编辑进行优化,导致更新脆弱且无法复用学到的子行为。我们提出提示码本(PCO),一种新颖的组合式提示优化框架,将APO重构为在有限自然语言本能(原子、可重用的指令单元)词汇表上的离散学习。PCO将提示构建知识组织在离散码本中,通过基于LLM的编码器将每个输入路由到少量条目;生成器将它们组合成冻结目标模型的提示;评论器输出结构化判决,通过归因分解为每个变量的文本梯度,在语言值极小极大目标下联合训练编码器、生成器和码本。得到的路由是实例级的:同一任务的不同输入接收不同的本能组合,这种机制在实例无关方法下结构上无法表达。在Qwen3-8B和LLaMA-3.1-8B上的六个基准测试中,PCO相比零样本提升高达+30.36分,在HotpotQA上超越最强先前基线(GEPA)达+3.34分,总体平均提升+1.11分,并且仅使用K=16个本能即可将部署提示长度相比MIPROv2压缩最多14.1倍,相比GEPA压缩3.0倍。

英文摘要

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts - atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.

2605.28359 2026-05-28 cs.AI q-fin.TR 版本更新

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

从知道到做到:面向LLM股票市场交易智能体的记忆控制基准

Taojie Zhu, Wentao Zhao, Rui Sun, Beidi Luan, Jiacheng Lu, Sinuo Wang, Jing Li, Daxin Jiang, Yonghong He, Zuo Bai

发表机构 * Tsinghua University(清华大学) Stepfun FinStep Shanghai Jiao Tong University(上海交通大学) Adelaide University(阿德莱德大学)

AI总结 针对LLM交易智能体评估中的知识泄露和收益归因问题,提出KTD-Fin基准,通过数据掩码和Barra风格归因框架,分离市场记忆与投资决策,并揭示收益主要来自被动市场暴露而非选股能力。

详情
AI中文摘要

评估大语言模型(LLM)智能体能否在资本市场盈利,越来越被框架化为端到端交易:将智能体置于历史市场中,让其交易,并衡量投资组合收益。这种设置容易导致两种评估失败。首先,长时间的回测往往与前沿LLM的知识截止日期重叠,使得记忆的股票代码、日期、价格和市场叙事替代了投资推理。其次,原始收益是选股能力的一个嘈杂代理,因为正收益可能来自市场贝塔、风格暴露或有利的市场环境,而非真正的阿尔法。我们引入了KTD-Fin(知道-做到金融基准),一个端到端的股票市场交易基准,解决了这两个问题。KTD-Fin使用数据侧掩码协议,在提示和工具中一致地匿名化关键标识符和日历信息,将历史市场记忆与投资决策分离。它还整合了Barra风格的表现归因框架,将投资组合收益分解为市场、风格和选股阿尔法成分。在2024-2026年窗口内对中国沪深300指数评估的十个前沿LLM智能体中,掩码显著改变了智能体的推理过程,推动其转向匿名化的因子推理。归因分析进一步表明,在泄露控制评估下,LLM智能体的累积收益主要由被动的市场和风格暴露解释,而持续选股阿尔法的证据有限。这些发现表明,金融LLM基准不仅应评估智能体是否赚钱,还应评估收益来源是否反映了可转移的投资技能。我们发布KTD-Fin作为LLM交易智能体泄露控制和归因感知评估的可复现模板。

英文摘要

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024-2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents.
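
The decomposition into market, style, and selection components can be made concrete with a single-period arithmetic sketch. It assumes the portfolio's factor exposures and the period's factor returns are already known (in practice they come from a risk model), and the factor names and numbers are hypothetical rather than taken from KTD-Fin.

```python
def attribute_return(portfolio_return, exposures, factor_returns):
    """Split one period's portfolio return into factor contributions and a residual.

    exposures      : dict factor name -> portfolio exposure to that factor
    factor_returns : dict factor name -> factor return for the period
    Returns (contributions, selection), where selection is the part of the
    return that the listed factors do not explain.
    """
    contributions = {name: exposures[name] * factor_returns[name] for name in exposures}
    selection = portfolio_return - sum(contributions.values())
    return contributions, selection

# Hypothetical period: 3.0% total return with market, size, and value exposure.
contribs, alpha = attribute_return(
    0.030,
    {"market": 1.0, "size": 0.5, "value": -0.2},
    {"market": 0.020, "size": 0.010, "value": 0.005},
)
```

In this example most of the return is market exposure and only the small residual would count as selection, which is the kind of distinction the abstract argues raw returns hide.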

2605.28358 2026-05-28 cs.LG cs.AI cs.IT math.IT 版本更新

Score Based Error Correcting Code Decoder

基于分数的纠错码译码器

Alon Helvits, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering (ECE)(电气与计算机工程学院)

AI总结 提出SB-ECC,一种将译码视为连续时间去噪的基于分数的译码器,通过神经去噪器定义概率流常微分方程,在奇偶校验约束下迭代更新噪声信道观测值,无需SNR估计即可推理,并在42个码/SNR设置中39/42达到最佳误码率。

Comments Accepted to ICML 2026

详情
AI中文摘要

纠错码能够实现可靠通信,然而在实际软译码中,跨码族和码长仍然具有挑战性。我们提出SB-ECC,一种基于分数的译码器,将译码视为连续时间去噪。神经去噪器定义了一个概率流常微分方程(ODE),该方程在奇偶校验约束的引导下,迭代地将噪声信道观测值更新为有效的码字。该模型在不同噪声水平下训练,无需时间/SNR条件,从而无需SNR估计即可进行推理,并支持由ODE求解器预算控制的直接延迟-精度权衡。我们使用原始带符号的信道观测值作为输入来学习连续去噪场。在42个码/SNR设置中,SB-ECC在39/42个条目中实现了最佳误码率,平均SNR增益为0.17dB,最大增益为0.46dB,优于最强竞争基线。我们表明,将求解器从Euler切换为DPM可保持-ln(BER),同时将端到端译码时间平均减少8.86%(最高达12.82%)。

英文摘要

Error-correcting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block lengths. We propose SB-ECC, a score-based decoder that casts decoding as continuous-time denoising. A neural denoiser defines a probability-flow ordinary differential equation (ODE) that iteratively updates the noisy channel observation toward a valid codeword, guided by parity constraints. The model is trained across noise levels without time/SNR conditioning, enabling inference without SNR estimation and supporting a direct latency-accuracy trade-off controlled by the ODE solver budget. We use the raw signed channel observation as input for learning a continuous denoising field. Across 42 code/SNR settings, SB-ECC achieves the best BER in 39/42 entries, with an average SNR gain of 0.17dB and a maximum gain of 0.46dB over the strongest competing baseline. We show that swapping the solver from Euler to DPM preserves -ln(BER) while reducing end-to-end decoding time by 8.86% on average (up to 12.82%).
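
To make "decoding as continuous-time denoising" concrete, here is a toy sketch of Euler integration of a probability-flow ODE over BPSK symbols. It departs from the paper in ways that should be read as assumptions: the denoiser is the closed-form per-bit MMSE estimate `tanh(x / sigma^2)` instead of a learned network, it is conditioned on the noise level (SB-ECC is described as unconditioned), and the parity checks of a (7,4) Hamming code are only verified afterwards, not used to guide the update.

```python
import math

# Parity-check matrix of the (7,4) Hamming code (columns are 1..7 in binary).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(bits):
    """Return H @ bits mod 2; all zeros means every parity check is satisfied."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

def denoise(x, sigma):
    """Per-bit MMSE denoiser for BPSK symbols in {-1, +1} under Gaussian noise."""
    return [math.tanh(v / sigma**2) for v in x]

def euler_decode(y, sigma_start, steps):
    """Integrate dx/dsigma = (x - D(x, sigma)) / sigma from sigma_start down to 0."""
    x = list(y)
    sigmas = [sigma_start * (1 - i / steps) for i in range(steps + 1)]
    for s, s_next in zip(sigmas, sigmas[1:]):
        d = denoise(x, s)
        x = [v + (s_next - s) * (v - dv) / s for v, dv in zip(x, d)]
    return [0 if v > 0 else 1 for v in x]  # BPSK map: bit 0 -> +1, bit 1 -> -1
```

The `steps` argument plays the role of the solver budget: fewer steps mean fewer denoiser evaluations. With this per-bit denoiser the result does not depend on the budget, so the sketch only shows the mechanics of the solver loop, not the accuracy trade-off reported in the abstract.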

2605.28354 2026-05-28 cs.AI 版本更新

Plan Before Search: Search Agents Need Plan

搜索前先规划:搜索智能体需要规划

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Jiayi Ji, Chenyi Lei, Wenwu Ou, Xiaoshuai Sun, Qibin Hou

发表机构 * Kuaishou Technology(快手科技) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University(教育部多媒体可信感知与高效计算重点实验室,厦门大学) VCIP, CS, Nankai University(南开大学VCIP实验室)

AI总结 提出Plan方法,通过将问题分解为有序子问题再进行检索,并引入自举训练范式,无需外部强模型蒸馏即可在多跳QA中激活规划能力。

详情
AI中文摘要

将大型语言模型训练为检索增强推理智能体通常将强化学习与从更强模型蒸馏的SFT冷启动相结合。然而,这种范式忽略了两个基本因素:子技能之间的依赖结构,以及蒸馏并非获取能力的唯一途径。我们通过Plan来研究这一点,这是一种结构化的智能体行为,用于多跳检索,它在任何检索执行之前将问题分解为有序的子问题,从而使每个搜索步骤可以锚定到预先设计的子问题,而不是在先前检索的部分相关文档的影响下漂移。然而,在涵盖3B到14B参数的三个模型家族中,我们发现相同的奖励信号会引发定性不同的RL失败模式。这一现象表明,成功的训练不仅取决于奖励设计,还取决于模型特定的可行性条件:足够的初始熵、训练稳定性和先决子技能。受此启发,我们提出了一种自举训练范式,其中小规模种子模型生成过滤后的轨迹,从而在任何目标模型中激活Plan,消除了从外部强模型蒸馏的需要。我们的流程在每个测试模型中都激活了Plan,并在多跳QA基准上持续优于竞争基线。

英文摘要

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

2605.28353 2026-05-28 cs.NE cs.AI cs.SC 版本更新

Improving Evaluation of Recombination-based Cartesian Genetic Programming

改进基于重组的笛卡尔遗传编程的评估

Duy Long Tran, Anja Jankovic, Marie Anastacio, Holger Hoos, Roman Kalkreuth

发表机构 * Chair for AI Methodology, RWTH Aachen University(人工智能方法学研究所,亚琛工业大学)

AI总结 本研究通过超参数优化,在SRBench基准平台上评估了子图交叉和离散表型重组两种重组算子,证明了超参数优化可提升基于重组的笛卡尔遗传编程的性能。

Comments Accepted for presentation as workshop paper in the graph-based genetic programming workshop (GGP) at the Genetic and Evolutionary Computation Conference (GECCO). To appear in the GECCO'26 conference companion. GECCO'26 will be held July 13-17, 2026 in San Jose, Costa Rica

详情
Journal ref
GECCO'26 Companion: Genetic and Evolutionary Computation Conference Companion, July 13-17, 2026, San Jose, Costa Rica
AI中文摘要

笛卡尔遗传编程传统上使用变异作为其主要且通常是唯一的遗传算子来驱动进化搜索。尽管近年来取得了进展,但由于明显的性能提升不足,基于重组的方法长期以来一直被避免。本研究在符号回归基准平台SRBench上检验了最近提出的两种重组算子:子图交叉和离散表型重组。利用TinyverseGP框架中提供的实现,我们对这两种算子的相应表示进行了超参数优化。我们的工作表明,超参数优化可以导致基于重组的笛卡尔遗传编程的性能提升。

英文摘要

Cartesian Genetic Programming has traditionally used mutation as its main and often sole genetic operator to drive evolutionary search. Despite advancements in recent years, recombination-based approaches have long been avoided due to an apparent lack of performance gains. This study examines two recently suggested recombination-based operators, subgraph crossover and discrete phenotypic recombination, on SRBench, a benchmarking platform for symbolic regression. Using the implementations provided in the TinyverseGP framework, we perform hyperparameter optimisation of the respective representations with these two operators. Our work demonstrates that hyperparameter optimisation can lead to improvements in performance for recombination-based Cartesian Genetic Programming.

2605.28347 2026-05-28 cs.AI 版本更新

FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

FedMPT: 视觉语言模型的多标签联邦提示调优

Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuang Wang, Yang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对联邦学习中多标签识别任务,提出FedMPT方法,利用因果模型的前门调整和大语言模型驱动的条件解耦,通过最优传输和门控机制抑制虚假标签关联,提升模型鲁棒性。

Comments 16 pages, including 11 pages of main text and 5 pages of appendix; Accepted by CVPR2026

详情
AI中文摘要

基于视觉语言模型的多标签识别旨在利用其预训练知识更好地适应复杂识别场景,从而增强模型鲁棒性。然而,对于需要联邦学习的现实去中心化应用,将视觉语言模型适应到每个拥有私有和异构数据的客户端会导致模型过拟合虚假标签关联,从而在遇到新样本时触发不相关类别。为解决此问题,我们使用因果模型重新考虑多标签识别的联邦学习,其中采用前门调整并通过中间变量(放大真实标签共现)解耦多标签识别建模过程。在分析指导下,我们提出FedMPT,这是首个专门为联邦多标签识别设计的方法。FedMPT的核心思想是利用可泛化条件引导联邦多标签识别以减轻错误标签激活。为此,FedMPT引入了一个由大语言模型驱动的流程来解读控制标签依赖的潜在条件。此外,我们引入了条件增强提示与图像块之间的最优传输以揭示多个区域级语义。最后,我们通过精心设计的门控机制从不同条件生成协同预测。在多个基准数据集上的实验表明,我们提出的方法在不同设置下取得了有竞争力的结果,并优于现有最先进方法。

英文摘要

Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods under varied settings.

2605.28345 2026-05-28 cs.AI cs.LG eess.SP 版本更新

Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Picid: 一种跨任务和领域的可复现PHM模块化评估基础设施

Lev Telyatnikov, Raffael Theiler, Leandro Von Krannichfeldt, Olga Fink

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 提出模块化评估基础设施Picid,通过标准化数据契约和评估边界,实现跨任务、跨数据集的故障检测、诊断和预测的可复现与公平比较。

详情
AI中文摘要

预测与健康管理(PHM)领域的进展受到跨任务、数据集和应用领域缺乏标准化和可复用评估实践的阻碍。报告的结果往往难以复现和比较,因为关键协议选择(如数据划分、预处理、标签对齐、时间窗口和指标)通常是隐式的或临时实现的。我们引入了Picid,一个模块化评估基础设施,将PHM评估流程形式化为显式、可执行和可复现的协议。通过定义良好的抽象,Picid在保持对不同PHM设置的灵活性的同时,强制执行确定性、无泄漏的数据集构建。该框架通过统一接口支持故障检测、诊断和预测,并且可以扩展到新的数据集和模型类别,而不违反协议不变性。通过标准化数据契约和评估边界,Picid还实现了跨诊断(分类)和预测(回归)的公平任务比较,允许相同的模型系列在不同设置中一致地进行评估。我们通过对跨越电池、轴承、涡轮风扇发动机、液压系统、过滤系统和建筑的十二个数据集上的十三个模型进行实证评估来展示Picid。这项工作为PHM中标准化、公平和可复现的评估建立了可复用的基础。

英文摘要

Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks, datasets, and application domains. Reported results are often difficult to reproduce and compare, as key protocol choices, such as data splits, preprocessing, label alignment, temporal windowing, and metrics, are often implicit or implemented ad hoc. We introduce Picid, a modular evaluation infrastructure that formalizes the PHM evaluation pipeline as an explicit, executable, and reproducible protocol. Through well-defined abstractions, Picid enforces deterministic, leakage-safe dataset construction while remaining flexible across diverse PHM settings. The framework supports fault detection, diagnostics, and prognostics through a unified interface and can be extended to new datasets and model classes without violating protocol invariants. By standardizing data contracts and evaluation boundaries, Picid also enables fair cross-task comparisons across diagnostics (classification) and prognostics (regression), allowing identical model families to be evaluated consistently across heterogeneous settings. We demonstrate Picid through an empirical evaluation of thirteen models on twelve datasets spanning batteries, bearings, turbofan engines, hydraulics, filtration systems, and buildings. This work establishes a reusable foundation for standardized, fair and reproducible evaluation in PHM.
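
One common way to obtain the deterministic, leakage-safe dataset construction mentioned here is to split by whole unit (one bearing, one engine) and derive the assignment from a hash of the unit id. The sketch below illustrates that general technique only. The function names, the salt string, and the 20% test fraction are hypothetical and are not taken from the framework's actual interface.

```python
import hashlib

def split_of(unit_id: str, test_fraction: float = 0.2, salt: str = "picid-demo") -> str:
    """Assign a whole unit (e.g. one bearing or engine) to 'train' or 'test'.

    Hashing the unit id makes the split deterministic across runs and machines.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

def split_windows(windows):
    """windows: iterable of (unit_id, window) pairs cut from per-unit time series."""
    out = {"train": [], "test": []}
    for unit_id, window in windows:
        out[split_of(unit_id)].append((unit_id, window))
    return out
```

Because every window inherits the split of its unit, overlapping windows from the same run-to-failure trajectory can never land on both sides of the train/test boundary.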

2605.28338 2026-05-28 cs.AI 版本更新

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

SafeMed-R1: 临床医生审计的安全与伦理对齐用于医疗大语言模型

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan, Yidong Jiang, Yankai Jiang, Jinru Ding, Jiayuan Chen, Zhuangzhi Gao, Pengcheng Chen, Zhao He, Rongzhao Zhang, Meiling Liu, Luyi Jiang, Jie Xu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Joint Laboratory of Biomedical Artificial Intelligence(生物医学人工智能联合实验室) Shanghai Institute of Infectious Disease and Biosecurity(上海传染病与生物安全研究院) Shanghai Health Development Research Center (Shanghai Medical Information Center)(上海健康发展战略研究中心(上海医疗信息中心)) University of Washington(华盛顿大学) Department of Eye and Vision Sciences, University of Liverpool(利物浦大学眼科与视觉科学系) Liverpool Centre for Cardiovascular Science, University of Liverpool(利物浦大学心血管科学中心) School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院)

AI总结 提出SafeMed-R1模型,通过可追溯的临床信任信号管道和红队压力测试实现安全与伦理对齐,在临床基准上达到79.6%的宏平均准确率,并将不安全输出减少约3-5%。

详情
AI中文摘要

大语言模型在执业考试中日益匹配专家表现,但常规临床使用仍受限,因为治理需要可审计的推理、安全与伦理对齐以及对对抗性滥用的韧性。本文提出SafeMed-R1,通过可追溯的临床信任信号管道进行训练,该管道将每个推理实例与临床医生评分标准和编辑历史关联,并通过安全与伦理监督和红队压力测试进行对齐。SafeMed-R1在临床基准上达到79.6%的宏平均准确率。在对抗性安全测试下,它显示出最低的聚合风险,并将不安全输出相对于基线减少约3%至5%。在一项包含30个用药安全场景的配对专家研究中,SafeMed-R1在医学正确性上与PGY1和PGY2住院医师相当,并在用药安全、指南一致性和临床实用性上得分更高。总体而言,这些结果表明,临床医生审计的监督溯源,结合领域定制的安全与伦理对齐,可以在不依赖推理时检索或引用依据的情况下,加强治理相关的证据。

英文摘要

Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited because governance requires auditable reasoning, safety and ethics alignment, and resilience to adversarial misuse. Here we present SafeMed-R1, trained with a traceable Clinical Trust Signals (CTS) pipeline that links each reasoning instance to clinician rubric scores and edit histories, and aligned through safety and ethics supervision and red team stress testing. SafeMed-R1 attains a macro-averaged accuracy of 79.6% across clinical benchmarks. Under adversarial safety testing, it shows the lowest aggregated risk and reduces unsafe outputs by about 3 to 5% relative to its baseline. In a paired expert study of 30 medication safety vignettes, SafeMed-R1 matches PGY1 and PGY2 residents on medical correctness and scores higher for medication safety, guideline consistency, and clinical usefulness. Collectively, these results suggest that clinician-audited supervision provenance, together with domain-tailored safety and ethics alignment, can strengthen governance-relevant evidence without relying on inference-time retrieval or citation grounding.

2605.28337 2026-05-28 cs.AI 版本更新

An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers

一种增强的大邻域搜索方法用于解决具有不兼容客户的容量设施选址问题

Ida Gjergji, Lucas Kletzander, Nysret Musliu, Andrea Schaerf

发表机构 * University of Udine(乌迪内大学)

AI总结 针对具有客户不兼容约束的容量设施选址问题,提出一种结合混合破坏算子和精确修复的大邻域搜索方法,在所有基准实例上取得了新的最优解。

详情
AI中文摘要

文献中最近引入了一种经典容量设施选址问题的新变体,该变体考虑了客户之间的不兼容性。该问题捕捉了给定客户对不能由同一设施服务的情况。这一特征对于许多实际选址问题至关重要,例如存在危险或污染材料以及竞争客户之间的冲突。在本文中,我们提出了一种大邻域搜索(LNS)方法来解决该问题。在LNS框架内,我们引入了三种不同的破坏算子,并以混合方式组合它们,同时在修复阶段使用精确求解器。针对LNS的设计研究了不同的算法组件。实验分析表明,我们的新方法优于现有的最先进元启发式算法,为所有可用的基准实例提供了新的最佳解。

英文摘要

A new variant of the classic capacitated facility location problem, which considers incompatibilities between customers, has recently been introduced in the literature. This problem captures the situation where given pairs of customers cannot be served by the same facility. Such a feature is crucial for many practical cases of location problems, such as the presence of hazardous or polluting materials and contention between competing customers. In this paper, we propose a Large Neighborhood Search (LNS) method to solve this problem. Within the framework of LNS, we introduce three different destroy operators, which are combined in a hybrid manner, and we use an exact solver in the repair phase. Different algorithmic components are investigated for the design of LNS. The experimental analysis shows that our new method outperforms existing state-of-the-art metaheuristics, providing new best solutions for all available benchmark instances.
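
The destroy-and-repair loop described above can be sketched on a tiny instance. This is a simplified illustration under stated assumptions, not the paper's method: it uses a single random destroy operator in place of the three hybrid operators, and brute-force enumeration stands in for the exact solver in the repair phase. All function names are hypothetical.

```python
import itertools
import random

def cost(assign, open_cost, assign_cost):
    """Opening cost of every used facility plus each customer's assignment cost."""
    used = set(assign)
    return sum(open_cost[f] for f in used) + sum(assign_cost[c][f] for c, f in enumerate(assign))

def feasible(assign, demand, capacity, incompatible):
    """Check capacity limits and that no incompatible pair shares a facility."""
    load = [0] * len(capacity)
    for c, f in enumerate(assign):
        load[f] += demand[c]
    if any(l > cap for l, cap in zip(load, capacity)):
        return False
    return all(assign[a] != assign[b] for a, b in incompatible)

def repair(assign, removed, demand, capacity, incompatible, open_cost, assign_cost):
    """Exhaustively reassign the removed customers (exact on this small subproblem)."""
    best, best_cost = None, float("inf")
    for choice in itertools.product(range(len(capacity)), repeat=len(removed)):
        cand = list(assign)
        for c, f in zip(removed, choice):
            cand[c] = f
        if feasible(cand, demand, capacity, incompatible):
            value = cost(cand, open_cost, assign_cost)
            if value < best_cost:
                best, best_cost = cand, value
    return best

def lns(assign, demand, capacity, incompatible, open_cost, assign_cost, iters=50, k=2, seed=0):
    """Repeat destroy and repair, keeping a candidate only if it strictly improves."""
    rng = random.Random(seed)
    best = list(assign)
    for _ in range(iters):
        removed = rng.sample(range(len(best)), k)  # random destroy operator
        cand = repair(best, removed, demand, capacity, incompatible, open_cost, assign_cost)
        if cand is not None and cost(cand, open_cost, assign_cost) < cost(best, open_cost, assign_cost):
            best = cand
    return best
```

Enumeration is only viable because `k` customers and a handful of facilities give a small search space. On realistic instances the repair subproblem would be handed to a mixed-integer solver, as the abstract indicates.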

2605.28328 2026-05-28 cs.LG cs.AI 版本更新

Learning the Error Patterns of Language Models

学习语言模型的错误模式

Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D'Antoni

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) University of California-San Diego(加州大学圣地亚哥分校)

AI总结 提出前缀过滤器(prefix filters)来捕捉LLM在特定领域中的错误模式,并通过Palla算法高效学习这些过滤器,从而提升输出有效性,例如在TypeScript生成中将编译率提升60%以上。

详情
AI中文摘要

当为具有特定有效性约束的领域(例如,程序应能编译)生成输出时,LLM通常会在少数集中的方面失败:例如,在生成TypeScript时使用Python函数名。我们观察到这些错误模式可以用少量约束来表示,并且这些约束可以在实践中学习。我们提出前缀过滤器,即针对领域和LLM的符号函数,作为捕捉错误模式的对象,以及Palla算法,用于在实践中高效学习前缀过滤器,并实现了Palla。由Palla学习的前缀过滤器i)帮助我们定量分析LLM的错误模式,ii)可用于通过约束采样算法约束模型的输出。例如,Palla将Qwen2.5-1.5B在TypeScript生成上的编译率提升了超过60%,使得Qwen2.5-1.5B达到与未约束的Llama3.1-8B相似的性能。

英文摘要

When generating outputs for domains with specific validity constraints (e.g., a program should compile), LLMs often fail in a small number of focused ways: for example, by using Python function names when generating TypeScript. We observe that these error patterns can be represented using a small number of constraints that can be learned in practice. We propose prefix filters, which are per-domain-and-LLM symbolic functions, as objects that capture these error patterns, and Palla, an algorithm that learns prefix filters efficiently in practice, which we also implement. Prefix filters learned by Palla i) help us quantitatively analyze the error patterns of LLMs, and ii) can be used to constrain the outputs of a model via constrained sampling algorithms. For example, Palla boosts compile rates for Qwen2.5-1.5B on TypeScript generation by over 60%, allowing Qwen2.5-1.5B to achieve performance similar to unconstrained Llama3.1-8B.
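
As a rough illustration of how a prefix filter might constrain generation, the sketch below treats a filter as a predicate over token prefixes and applies it during greedy decoding. Everything here is an assumption made for the example: the filter is hand-written (Palla is described as learning filters), the "model" is a toy scoring function, and greedy masking is a simplification of the constrained sampling algorithms the abstract refers to.

```python
def typescript_prefix_filter(prefix):
    """Hypothetical filter: reject prefixes containing Python-only keywords."""
    banned = {"def", "elif", "None"}
    return not any(tok in banned for tok in prefix)

def constrained_greedy(next_scores, prefix_filter, max_len):
    """Greedy decoding that skips any token whose extended prefix fails the filter.

    next_scores(prefix) -> dict token -> score (stand-in for an LLM's logits).
    """
    prefix = []
    for _ in range(max_len):
        ranked = sorted(next_scores(prefix).items(), key=lambda kv: kv[1], reverse=True)
        allowed = [tok for tok, _ in ranked if prefix_filter(prefix + [tok])]
        if not allowed:
            break  # every continuation is rejected by the filter
        prefix.append(allowed[0])
        if prefix[-1] == "<eos>":
            break
    return prefix

def toy_model(prefix):
    """Toy scorer that prefers the Python keyword 'def' at the first position."""
    if not prefix:
        return {"def": 0.6, "function": 0.4}
    if len(prefix) == 1:
        return {"add": 1.0}
    return {"<eos>": 1.0}
```

Without the filter the toy model starts with `def`. With it, decoding falls back to `function`, which is the kind of Python-in-TypeScript slip the abstract uses as its example.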

2605.28321 2026-05-28 cs.SE cs.AI 版本更新

Multi-Agent LLM-based Metamorphic Testing for REST APIs

基于多智能体LLM的REST API蜕变测试

Shehroz Khan, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Dragos Truscan

发表机构 * Åbo Akademi University(阿博阿卡迪米大学)

AI总结 提出ARMeta方法,利用基于LLM的多智能体工作流自动识别蜕变测试场景并生成可执行测试,以解决REST API测试中的预言问题。

Comments Author-submitted version accepted for publication in the IEEE Conference on Computers, Software, and Applications (COMPSAC 2026), July 7-11, 2026, Madrid, Spain

详情
AI中文摘要

随着REST API在软件系统中日益重要,其验证也变得更为关键。因此,测试和发现潜在问题对于提高软件质量至关重要。然而,测试REST API的主要挑战在于难以评估API调用的输出是否正确,即测试预言问题。蜕变测试是一种基于规约的测试方法,适用于正确输出未知或未明确指定的情况。为了检查系统的正确性,需要指定不同输出之间的关系。我们提出了ARMeta,一种支持工具的方法,利用基于LLM的多智能体工作流来支持使用OpenAPI文档化的REST API的蜕变测试。该智能体工作流用于识别蜕变测试场景,并以Given-When-Then格式进行规约。这些场景自动实现为可执行测试,并针对被测系统执行。我们在两个公开的暴露REST接口的Web应用程序上评估了ARMeta,并将其性能与基于场景的测试基线进行了比较。结果表明,ARMeta探索的行为可作为现有基于场景的测试方法的补充。

英文摘要

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.
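
A metamorphic relation checks how outputs of related calls relate to each other, so no expected output is needed. The sketch below shows two such relations in Given-When-Then form against an in-memory stand-in for a listing endpoint. The endpoint, its parameters, and the relations are hypothetical examples, not scenarios produced by ARMeta.

```python
ITEMS = [{"id": i, "price": p} for i, p in enumerate([30, 10, 50, 20, 40])]

def list_items(max_price=None, limit=None):
    """Stand-in for GET /items?max_price=...&limit=... on a hypothetical API."""
    result = [it for it in ITEMS if max_price is None or it["price"] <= max_price]
    return result[:limit] if limit is not None else result

def check_filter_subset(max_price):
    """Given the unfiltered listing, when a price filter is added,
    then the filtered result must be a subset that respects the filter."""
    source = list_items()
    follow_up = list_items(max_price=max_price)
    ids = {it["id"] for it in source}
    return all(it["id"] in ids for it in follow_up) and all(
        it["price"] <= max_price for it in follow_up
    )

def check_limit_prefix(n, k):
    """Given a listing with limit n, when the limit grows to n + k,
    then the first n items of the larger page must equal the smaller page."""
    return list_items(limit=n + k)[:n] == list_items(limit=n)
```

Neither check needs to know which items the API should return, which is how metamorphic testing sidesteps the oracle problem. Against a real service, `list_items` would wrap an HTTP client call.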

2605.28320 2026-05-28 cs.RO cs.AI 版本更新

Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots

识别工业时间序列中的显式简约分段多项式关系:应用于机械臂

Mazen Alamir, Sacha Clavel

发表机构 * Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France(格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔国立理工学院、GIPSA实验室)

AI总结 本文提出一种算法,利用隐式关系中的多项式集构建显式分段表示,以识别工业时间序列中的简约显式分段多项式关系,并应用于机械臂逆模型识别,实验表明该模型在泛化能力上优于深度神经网络。

详情
AI中文摘要

本文解决了识别可能涉及大量原始特征的简约显式分段多项式关系的问题。该算法利用最近提出的一种识别算法,该算法产生简约隐式关系,从而能够在异常检测和定位的背景下推导出正常性表征。本文提出的算法更进一步,通过使用隐式表示中涉及的多项式集构建显式分段表示。该框架在识别六轴机械臂逆模型的简约显式表示问题上得到了说明。此外,还展示了在四轴机械臂上的进一步实验,这些实验旨在研究当模型面对未见过的使用场景时,简约模型与最先进的深度神经网络结构相比的泛化能力。

英文摘要

This paper addresses the problem of identifying parsimonious explicit piece-wise polynomial relationships that might involve a relatively large number of raw features. The algorithm builds on a recently proposed identification algorithm that yields parsimonious implicit relationships, from which a characterization of normality can be derived in the context of anomaly detection and localization. The algorithm proposed in this paper goes a step further by deriving explicit piece-wise representations built from the set of polynomials involved in the implicit representations. The framework is illustrated on the problem of identifying parsimonious explicit representations of the inverse model of a 6-axis manipulator robot. Further experiments on a 4-axis robot investigate the generalization capability of parsimonious models compared with state-of-the-art DNN structures when the models face unseen contexts of use.

2605.28317 2026-05-28 cs.LG cs.AI cs.NA math.NA physics.comp-ph 版本更新

Hybrid Neural World Models

混合神经世界模型

Pranav Lakshmanan, Paras Chopra

发表机构 * Lossfunk

AI总结 提出混合神经世界模型,通过单网络连续视界条件训练直接预测未来状态,并利用误差图隐式捕捉不连续性,实现高效且可靠的物理动力学模拟。

Comments Preprint. Under review

详情
AI中文摘要

神经代理模型有望在物理动力学中实现比经典求解器大幅加速,但在冲击、锋面和接触等剧烈动力学事件中会静默失败。我们提出了用于物理动力学的混合神经世界模型:一种在物理状态空间中训练和部署多视界代理模型的方案,其中单个具有连续视界条件的网络通过直接监督(对照教科书参考求解器)进行训练,以在前向传播中一步预测任意未来状态(视界T)。尽管训练数据、损失函数或架构的任何部分都没有监督不连续位置,但训练后的代理模型隐式地编码了它,仅通过其前向传播即可恢复为每个轨迹的误差图,该误差图集中在冲击、锋面和接触上,而在其他地方保持较小。该误差图与标准无标签基线(包括深度集成、学习误差头、梯度幅度指标和局部自适应共形预测)相比具有竞争力或更好,同时仅使用单个训练网络,且不需要校准集或控制方程知识。该方案支持两个操作点。模式1单独运行代理模型以最大化吞吐量,在PDE环境中,与教科书求解器相比,相同硬件上的CPU加速比为26倍至72倍。模式2使用误差图来门控参考求解器回退,推迟不确定的轨迹,并在默认操作点将代理模型的残差误差大致减半。该方案无需修改即可应用于反应扩散、可压缩欧拉和刚体碰撞动力学。

英文摘要

Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as shocks, fronts, and contact. We present hybrid neural world models for physical dynamics: a recipe for training and deploying multi-horizon surrogates in physical state space, where a single network with continuous horizon conditioning is trained with direct supervision against textbook reference solvers to predict any future state at horizon T in one forward pass. Although no part of the training data, loss function, or architecture supervises discontinuity location, the trained surrogate encodes it implicitly, recoverable from its forward passes alone as a per-trajectory error map that concentrates on shocks, fronts, and contacts, and stays small elsewhere. The map is competitive with or better than standard label-free baselines including deep ensembles, learned error heads, gradient-magnitude indicators, and locally-adaptive conformal prediction, while using only a single trained network and requiring no calibration set or governing-equation knowledge. The recipe supports two operating points. Mode 1 runs the surrogate alone for maximum throughput, with same-hardware CPU speedups of 26x to 72x against textbook solvers on the PDE environments. Mode 2 uses the error map to gate a reference-solver fallback, deferring uncertain trajectories and roughly halving the surrogate's residual error at the default operating point. The recipe applies without modification across reaction-diffusion, compressible Euler, and rigid-body collision dynamics.
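Mode 2 can be sketched as a simple gate. This is only an illustration of the fallback logic: `surrogate`, `reference_solver`, `error_score`, and `threshold` are hypothetical placeholders, and the abstract does not say how the error map is computed or how the operating point is chosen.

```python
def gated_predict(states, surrogate, reference_solver, error_score, threshold):
    """Use the surrogate by default; defer to the reference solver when the score is high.

    All four callables/parameters are assumed placeholders, not the paper's API.
    Returns the list of predictions and the number of deferred inputs.
    """
    outputs = []
    deferred = 0
    for state in states:
        if error_score(state) > threshold:
            # Uncertain input: fall back to the (slower) reference solver
            outputs.append(reference_solver(state))
            deferred += 1
        else:
            # Confident input: keep the fast surrogate prediction
            outputs.append(surrogate(state))
    return outputs, deferred
```

Raising `threshold` moves the system toward Mode 1 (surrogate only); lowering it trades throughput for fewer residual errors.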

2605.28306 2026-05-28 cs.CL cs.AI 版本更新

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

面向混合专家模型中多语言下游任务的路由对齐微调

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song

发表机构 * City University of Hong Kong(香港城市大学) Carnegie Mellon University(卡内基梅隆大学) The University of Hong Kong(香港大学)

AI总结 针对混合专家模型在多语言下游任务中的路由结构异构问题,提出RA-MoE三阶段框架,通过中间层语言通用对齐区识别任务相关专家,并引入路由对齐损失增强目标语言路由,实验表明该方法优于标准微调和强基线。

详情
AI中文摘要

混合专家(MoE)模型已成为高效扩展LLM的主流范式,但将其适配到非英语下游任务仍然具有挑战性。现有的微调方法将MoE模型视为整体学习器,忽略了预训练期间形成的异构路由结构。我们在多个MoE模型和下游任务上验证,中间层形成了语言通用对齐区,其中路由发散性强烈预测了每种语言的任务性能差距。基于这一观察,我们提出了RA-MoE(路由对齐MoE微调),一个三阶段框架,该框架根据英语和目标语言的正确性将并行任务示例分类为四路分类法(cc/ci/ic/ii),识别中间层中与任务相关的专家,并用路由对齐损失增强标准SFT,该损失鼓励ci类型示例上的目标语言路由遵循英语任务专家激活模式。在三个MoE模型、三个任务和六种目标语言上的实验表明,RA-MoE始终优于标准SFT和强基线(包括Routing Steering和RISE),其中任务-语言对的ci比例可作为对齐收益的可靠预测指标。

英文摘要

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.
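The four-way taxonomy can be sketched as follows. The sketch assumes the first letter encodes correctness in English and the second correctness in the target language (so `ci` means correct in English, incorrect in the target language); the abstract implies this ordering but does not state it, and the function names are hypothetical.

```python
from collections import Counter


def categorize(english_correct: bool, target_correct: bool) -> str:
    """Map a parallel example to cc/ci/ic/ii (assumed order: English, then target)."""
    first = "c" if english_correct else "i"
    second = "c" if target_correct else "i"
    return first + second


def ci_proportion(results):
    """Share of examples solved in English but not in the target language.

    `results` is an iterable of (english_correct, target_correct) pairs.
    """
    counts = Counter(categorize(e, t) for e, t in results)
    total = sum(counts.values())
    return counts["ci"] / total if total else 0.0
```

The `ci` share is the quantity the abstract reports as a predictor of alignment benefit for a task-language pair.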

2605.28305 2026-05-28 cs.CL cs.AI 版本更新

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

重新审视大语言模型推理中的拟人化反思标记

Yahan Yu, Noa Nakanishi, Fei Cheng

发表机构 * Kyoto University(京都大学)

AI总结 本文通过提示级和令牌级干预抑制拟人化反思标记,发现这些标记并非推理性能的必要条件,且抑制后模型仍能进行无标记验证,表明它们更多是表面线索而非可靠反思代理。

Comments 15 pages, 12 figures

详情
AI中文摘要

大语言模型(LLMs)在复杂推理过程中经常产生显式的反思痕迹,并伴随有拟人化标记,如“wait”、“hmm”和“alternatively”。尽管这些标记通常被用作反思的可见指标,但其机制仍不清楚,这带来了与冗余和重复反思标记相关的过度思考风险。在这项工作中,我们重新审视了拟人化反思标记,考察了它们对推理的必要性以及在反思中的作用。我们通过提示级和令牌级干预抑制这些标记,并分析了它们对四个基准测试和两种模型规模的任务性能的影响。我们的结果表明,拟人化标记对于推理性能并非普遍必要:抑制它们可以在多种设置下保持或提高性能,尤其是在较大的采样预算下。同时,标记抑制并不一定消除反思行为,因为模型仍然可以进行无标记验证。这些结果表明,拟人化标记更倾向于表面线索,而不是反思本身的可靠代理,并激励未来在显式标记模式之外对推理机制进行研究。

英文摘要

Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as "wait", "hmm", and "alternatively". Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, leaving a risk of overthinking associated with redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions and analyze the effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These results suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and they motivate future research on reasoning mechanisms beyond explicit marker patterns.

2605.28303 2026-05-28 cs.AI 版本更新

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

从事实覆写到知识演化:基于同策略自蒸馏的因果编辑

Shuaike Li, Kai Zhang, Xianquan Wang, Jiachen Liu, Shengpeng Mo

发表机构 * State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室)

AI总结 针对知识编辑中静态事实覆写范式导致认知失调的问题,提出基于因果引导的同策略蒸馏方法CODE,将事实注入转化为连贯的知识演化,显著降低自反驳率并提升多跳准确性。

详情
AI中文摘要

虽然知识编辑(KE)能够实现高效更新,但其主导的静态事实覆写范式将大型语言模型视为离散数据库,强行注入孤立事实。这会破坏预训练的逻辑拓扑结构,引发认知失调——一种未进化的先验知识迫使模型明确否定注入更新的病理现象。理想化干预表明,这本质上是结构缺陷而非算法噪声,零失真代理导致高达95.6%的自反驳率。鉴于现实世界知识的因果驱动特性,将更新基于明确的因果叙事可将冲突率降至仅6.6%,凸显了向因果编辑范式转变的必要性。为内化这种演化,我们提出CODE(用于编辑的因果同策略蒸馏)。通过将因果自举与非对称同策略蒸馏相结合,CODE将因果转换逻辑直接刻入参数记忆。在LLaMA-3.1和Qwen-2.5上的实验表明,CODE将自反驳率大幅抑制至1.8%,同时保持稳健的多跳准确性(高达83.5%),将离散事实注入无缝转化为连贯的知识演化。代码见https://github.com/CrashBugger/CODE。

英文摘要

While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, forcibly injecting isolated facts. Fracturing pre-trained logical topologies, this triggers Epistemic Dissonance -- a pathology where un-evolved legacy priors force the model to explicitly negate the injected update. Idealized interventions reveal that this is an inherent structural flaw rather than mere algorithmic noise, with a zero-distortion proxy yielding a catastrophic 95.6% self-refutation rate. Given the causally driven nature of real-world knowledge, grounding updates in explicit causal narratives effectively collapses this conflict rate to just 6.6%, underscoring the imperative for a paradigm shift toward Causal Editing. To internalize this evolution, we propose CODE (Causal On-policy Distillation for Editing). By coupling causal bootstrapping with asymmetric on-policy distillation, CODE engraves causal transition logic directly into parametric memory. Experiments on LLaMA-3.1 and Qwen-2.5 show CODE drastically suppresses self-refutation to 1.8% while securing robust multi-hop accuracy (up to 83.5%), seamlessly transforming discrete fact injection into coherent knowledge evolution. Code is available at https://github.com/CrashBugger/CODE.

2605.28302 2026-05-28 cs.LG cs.AI cs.DC 版本更新

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

解聚能走多远?面向高效 MoE LLM 服务的 Attention-FFN 解聚设计空间探索

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian, Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Intel(英特尔) Google(谷歌) Google DeepMind(谷歌DeepMind) Infravana

AI总结 本文系统探索了从分块预填充、预填充-解码解聚到算子级 Attention-FFN 解聚 (AFD) 的不同解聚层次在 MoE 模型推理中的收益与局限,通过融合设备内核测量与高保真网络仿真的框架,在严格 TTFT/TPOT SLO 下 AFD 可在 DeepSeek-V3.2 上维持约 4k tokens/s 的系统吞吐量,并给出了联合优化吞吐与交互性的具体设计原则。

详情
AI中文摘要

现代大语言模型 (LLM) 推理已逐步解聚以跟上不断增长的模型规模和严格的 TTFT 与 TPOT 服务级别目标:从分块预填充聚合,到预填充-解码 (P/D) 解聚,再到最近出现的算子级 Attention-FFN 解聚 (AFD)。这一趋势对于混合专家 (MoE) 模型尤为重要,其中内存受限的注意力、计算密集的专家 FFN 以及 MoE 分发/组合通信产生了不同的资源需求。AFD 通过将注意力与 MoE-FFN 执行放在不同的 GPU 组上进一步暴露了这种异构性。每个解聚层次都加深了跨工作负载特征、资源分配和互连拓扑的调度设计空间,提出了核心问题:每个层次何时真正产生收益?我们系统地刻画了 MoE 推理中这一权衡,涵盖了输入/输出序列长度、前缀-KV 重用和每用户延迟约束等实际工作负载。以分块预填充和 P/D 解聚为基线,我们通过一个融合设备内核测量与高保真网络仿真的框架,研究了 AFD 在大规模下的收益与局限。在严格的 TTFT/TPOT SLO 下,AFD 在 DeepSeek-V3.2 上针对聊天、编码和代理编码工作负载维持了约 4k tokens/s 的系统吞吐量,而未经 AFD 的部署则不可行。我们提炼出联合优化吞吐与交互性的具体结论,包括如何根据工作负载和模型架构在 GPU 间划分注意力与 FFN,为当前机架级和集群级部署以及未来的解聚 AI 基础设施提供了设计原则。

英文摘要

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation deepens the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that fuses on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.

2605.28301 2026-05-28 cs.AI 版本更新

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

更高的准确率,更差的推理:医学思维链蒸馏的步骤级审计

Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Honghan Wu

发表机构 * School of Health & Wellbeing, University of Glasgow(健康与福祉学院,格拉斯哥大学) Department of Respiratory and Critical Care Medicine, Shanghai Sixth People’s Hospital, Shanghai Jiao Tong University School of Medicine(呼吸与危重医学科,上海第六人民医院,上海交通大学医学院) School of Life Science and Technology, University of Electronic Science and Technology of China(生命科学与技术学院,电子科技大学) Institute of Health Informatics, University College London(健康信息学研究所,伦敦大学学院)

AI总结 通过蒸馏大模型思维链训练小模型,发现医学问答中答案准确率提升但推理步骤的事实错误率上升,表明答案质量与推理真实性可能背道而驰。

详情
AI中文摘要

思维链(CoT)蒸馏训练一个小模型模仿教师的推理轨迹,但通常通过最终答案指标(包括准确率)进行评估。我们探究答案质量的提升是否伴随着轨迹的改进。在医学问答中,短答案选项可能使更丰富的临床理由未充分指定,从DeepSeek-V3系列教师蒸馏得到的Qwen3-8B学生在MedQA-USMLE答案指标上有所提升(SC@64从74.7%到84.4%;期望校准误差(ECE)从0.096到0.034)。然而,在Kimi-K2.6风格盲法LLM裁判审计下,其非弃权步骤的错误率从30.6%上升到50.3%。在这个主要医学设置中,答案质量和轨迹事实性向相反方向移动。这种前后模式在评估者、教师强度、学生规模和系列、医学基准以及风格、分割和答案正确性控制中持续存在。由临床专家进行的150步盲法审计重现了相同的排序。边界检查缩小了主张的范围:当紧凑答案对理由约束不足,且有能力的学生能够模仿专家风格而不可靠地支撑每个局部主张时,风险出现。标准答案指标和聚合对冲率未揭示这一转变。当此类轨迹被发布或重用时,仅靠答案级指标是不够的。

英文摘要

Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before--after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
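For readers unfamiliar with the calibration metric quoted above, the following is a minimal sketch of the standard equal-width binned ECE. The number of bins and the binning scheme are assumptions; the abstract does not say which variant the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    `confidences` are values in [0, 1]; `correct` are matching booleans.
    Equal-width bins and n_bins=10 are illustrative assumptions.
    """
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A lower value means stated confidence tracks answer accuracy more closely; as the abstract stresses, it says nothing about whether the intermediate reasoning steps are factual.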

2605.28298 2026-05-28 cs.AI 版本更新

REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis

REED: 面向跨域语言隐写分析的后训练表示编辑

Ruohan Lei, Jianxin Gao, Wanli Peng, Huimin Pei

发表机构 * China Agricultural University(中国农业大学) Jiangsu Normal University(江苏师范大学)

AI总结 提出一种后训练表示编辑方法,通过构造域偏移向量和源域封面到隐写方向指导编辑,实现无需架构修改或参数更新的高效跨域语言隐写分析。

详情
AI中文摘要

在语言隐写分析的实际场景中,测试文本通常来自未见过的域,具有不同的词汇、主题、写作风格和隐写生成模式,这会显著降低检测性能。尽管现有的跨域隐写分析方法可以通过分布对齐、域不变特征学习等有效缓解这一问题,但检测性能仍不理想。本文提出了一种用于跨域语言隐写分析的后训练表示编辑方法。具体来说,首先在源域数据上训练检测器,然后保持特征提取器和分类器冻结,在分类前对中间表示进行确定性编辑。对于域适应,我们从边缘源域和目标域表示构造域偏移向量。对于域泛化,我们推导出源域封面到隐写方向以指导样本特定编辑。实验结果表明,与先进方法相比,所提方法能够实现高跨域检测性能,尤其是在F1分数方面,同时无需在源域训练后进行架构修改或参数更新。

英文摘要

In real-world linguistic steganalysis, the texts under test usually come from unseen domains with different vocabularies, topics, writing styles, and steganographic generation patterns, which can significantly degrade detection performance. Although existing cross-domain steganalysis methods can alleviate this problem through techniques such as distribution alignment and domain-invariant feature learning, their detection performance remains unsatisfactory. In this paper, we propose a post-training representation editing method for cross-domain linguistic steganalysis. Specifically, the detector is first trained on source-domain data; the feature extractor and classifier are then kept frozen, and the intermediate representations are deterministically edited before classification. For domain adaptation, we construct a domain-offset vector from marginal source and target representations. For domain generalization, we derive a source-domain cover-to-stego direction to guide sample-specific editing. Experimental results show that, compared with advanced existing methods, the proposed method achieves high cross-domain detection performance, especially in terms of F1-score, while requiring no architecture modification or parameter updates after source-domain training.

2605.28295 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Rollouts 的起点:面向 RLVR 的低负载、高杠杆的首 token 多样化

Soeun Kim, Albert No

发表机构 * Department of Artificial Intelligence, Yonsei University(延世大学人工智能系)

AI总结 本文提出 REFT 方法,通过在推理标记后的第一个 token 处进行均匀采样多样化,以低开销显著提升 RLVR 中 rollout 的多样性,从而改善推理模型的 Pass@k 性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)无需标注轨迹即可训练推理模型,它依赖分组 rollout 将策略暴露于替代推理路径,并由验证器进行评分。Rollout 多样性因此成为 RLVR 的核心瓶颈,现有方法大多通过温度、前缀或 rollout 选择调整来拓宽探索。我们发现了一个结构上独特但被忽视的拓宽多样性的位置:推理标记后的第一个 token。策略的首 token 分布表现出尖锐峰值但正确性解耦的现象,且该首 token 位置可以拓宽 rollout 组覆盖的区域而不改变正确性信号。我们引入 REFT(基于首 token 多样化的 Rollout 探索),这是对 RLVR 流程的一个轻量级补充,它从策略自身的 top-$N$ 候选集中均匀采样首 token,并均匀分配 rollout,其他组件保持不变。在由此产生的多样化 rollout 上训练后,REFT 在四个基础模型(0.5B-7B)和三个难度级别上,相较于 DAPO 和 GRPO 基线,提升了聚合的 Pass@1、Pass@8 和 Pass@64。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution is sharply peaked yet decoupled from correctness, and diversifying this first-token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
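The first-token step can be sketched as below. This is one possible reading, not the authors' implementation: it takes the top-N tokens of a hypothetical first-token distribution and assigns them to rollouts round-robin, which is a deterministic way to "allocate rollouts evenly". The function name and the toy distribution are assumptions.

```python
def allocate_first_tokens(first_token_probs, top_n, group_size):
    """Assign a forced first token to each rollout in a group.

    `first_token_probs` maps token -> probability under the policy (toy input).
    The top-N tokens are used equally often (round-robin), regardless of how
    peaked the original distribution is.
    """
    if not first_token_probs or top_n <= 0:
        return []
    ranked = sorted(first_token_probs, key=first_token_probs.get, reverse=True)
    candidates = ranked[:top_n]
    return [candidates[i % len(candidates)] for i in range(group_size)]
```

With a sharply peaked distribution, ordinary sampling would start almost every rollout with the same token; the even allocation forces the group to cover the other top candidates as well, while everything after the first token is still generated by the policy.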

2605.28283 2026-05-28 cs.CL cs.AI 版本更新

PrunePath: Towards Highly Structured Sparse Language Models

PrunePath:迈向高度结构化稀疏语言模型

Zhexuan Gu, Zixun Fu, Yancheng Yuan

发表机构 * Department of Applied Mathematics, The Hong Kong Polytechnic University(应用数学系,香港理工大学)

AI总结 提出PrunePath框架,通过软最大归一化路由和累积质量阈值实现自适应预算的结构化稀疏化,在自然语言理解、生成和指令调优中取得优越的稀疏-性能权衡,并利用Triton内核将结构化稀疏转化为实际内存节省和解码速度提升。

详情
AI中文摘要

前馈网络(FFN)主导了现代语言模型的参数数量和计算量,然而现有的剪枝方法往往难以将稀疏性转化为硬件友好的推理效率提升。我们引入了PrunePath,一个针对FFN层的预算自适应结构化稀疏化框架。基于MoEfication,PrunePath用softmax归一化的路由分布替代独立的专家级阈值,并在累积质量阈值下激活重要专家。这种公式化施加了令牌级概率预算,实现了自适应专家数量,并提供了可从单个检查点直接使用的推理时稀疏性调节旋钮。在自然语言理解、自然语言生成和指令调优评估中,与现有的静态剪枝和基于MoEfication的方法相比,PrunePath实现了有利的稀疏-性能权衡。我们进一步实现了用于KV缓存解码的Triton内核,以将所得的结构化稀疏性转化为实际的内存节省和可测量的解码速度提升。这些结果证明了PrunePath在构建高度稀疏、易于部署的大型语言模型方面的优越性能。

英文摘要

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
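The cumulative-mass selection rule can be sketched in a few lines, in the spirit of nucleus (top-p) sampling applied to experts. The router scores, function names, and tie-breaking are assumptions; the paper's actual router and Triton kernels are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(router_scores, mass_threshold):
    """Activate the most probable experts until their cumulative mass reaches the threshold.

    Returns expert indices in activation order. `mass_threshold` plays the role
    of the inference-time sparsity knob (hypothetical parameter name).
    """
    probs = softmax(router_scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = []
    cumulative = 0.0
    for idx in order:
        selected.append(idx)
        cumulative += probs[idx]
        if cumulative >= mass_threshold:
            break
    return selected
```

Because the threshold is on probability mass rather than on a fixed count, a token with one dominant expert activates few experts, while a token with a flat routing distribution activates more. That gives the adaptive expert count the abstract describes.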

2605.28282 2026-05-28 cs.AI 版本更新

ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research

ResearchLoop: 一种用于AI辅助研究的证据门控控制平面

Yihan Xia, Taotao Wang

发表机构 * Shenzhen University(深圳大学)

AI总结 提出ResearchLoop,一种通过证据门控控制平面来确保AI辅助研究中声明可审计的协议,包括状态模型、转换规则和实验验证。

Comments 32 pages, 4 figures, 6 tables; technical report

详情
AI中文摘要

AI辅助研究将构思、实现、评估和手稿撰写压缩成一个单一的交互循环。这种压缩是有用的,但也带来了出版风险:论文声明可能比审计更容易陈述。我们提出了ResearchLoop,一种用于AI辅助计算研究的证据门控控制平面。ResearchLoop将研究问题、任务合同、证据对象、声明账本、结项和论文绑定视为持久的项目状态,在此实现为基于仓库的运行时。本技术报告提供了完整的协议规范、状态模型、转换规则、声明准入算法和洞察复合机制。它还报告了跨越九个版本(V0--V9)的完整实验记录,包括自托管案例研究、带有组件消融的受控任务套件研究、数学奥林匹克评估以及使用官方生成代码工具评估的补充SciCode边界实验。所有工件、清单和验证报告都保存在项目仓库中。

英文摘要

AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compression is useful, but it also creates a publication risk: paper claims can become easier to state than to audit. We present ResearchLoop, an evidence-gated control plane for AI-assisted computational research. ResearchLoop treats research questions, task contracts, evidence objects, claim ledgers, closeouts, and paper bindings as durable project state, realized here as a repository-backed runtime. This technical report provides the complete protocol specification, state model, transition rules, claim-admission algorithm, and insight-compounding mechanism. It also reports the full experimental record spanning nine versions (V0--V9), including a self-hosting case study, a controlled task-suite study with component ablations, a mathematical olympiad evaluation, and a supplementary SciCode boundary experiment evaluated with the official generated-code harness. All artifacts, manifests, and verification reports are preserved in the project repository.

2605.28277 2026-05-28 cs.AI 版本更新

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

LLMs 是否从文本构建世界模型?多语言空间推理诊断

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao

发表机构 * University of New South Wales(新南威尔士大学) Essential Energy University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) Zhejiang University(浙江大学) Fudan University(复旦大学) Tsinghua University(清华大学)

AI总结 通过多语言诊断基准 MentalMap 评估大语言模型的空间推理能力,发现所有模型在视角推理上存在普遍的性能瓶颈(L3 推理悬崖),表明该限制源于纯文本工作记忆约束而非特定架构。

详情
AI中文摘要

大语言模型(LLMs)是否从纯文本描述中构建内部空间世界模型仍存在争议,且这种能力是否跨语言迁移尚未得到系统研究。我们引入 MentalMap,一个多语言诊断基准,具有六级能力层次(L0-L5),涵盖从原子空间事实到生成性世界图构建,以及四个诊断轴:参考系、阅读方向偏差、推理努力分配和幻觉。MentalMap 基于 100 个 ProcTHOR 家庭场景构建,涵盖八种类型多样的语言加上一个结构化文本控制,包含 39 个任务族,共 1950 个评估单元。评估了跨规模和模型家族的十三个 LLMs,我们识别出一个普遍的 L3 推理悬崖:一旦基线原子准确率超过 40%,没有模型能在视角推理上保留其 L0 性能的一半。该悬崖在语言、规模和提示策略中持续存在,而结构化输出失败和推理模式在不同模型间差异显著。在相同纯文本协议下的人类评估重现了相同的失败模式,表明瓶颈源于纯文本工作记忆约束,而非特定于当前 LLM 架构。我们的发现将纯文本空间推理重新定义为多轴世界建模问题,并推动多模态和草稿板增强推理作为未来方向。

英文摘要

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

2605.28273 2026-05-28 cs.AI 版本更新

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

全局策略空间响应预言机用于两人零和博弈

Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang, Xudong Zhang

发表机构 * Department of Electronic Engineering, Tsinghua University, Beijing, China(清华大学电子工程系,北京,中国) Qiyuan Lab, Beijing, China(启元实验室,北京,中国)

AI总结 提出Global PSRO框架,通过直接最小化种群可利用性(PE)来引导策略种群扩展,以更少的策略迭代逼近纳什均衡。

Comments Accepted by ICML 2026

详情
AI中文摘要

策略空间响应预言机(PSRO)框架通过使用深度强化学习(DRL)迭代扩展受限策略集,将均衡计算扩展到大型零和博弈。一个核心挑战是在有限计算预算下构建一个小的策略种群,其诱导博弈能很好地近似完整博弈。现有的PSRO变体通常使用从受限博弈收益计算出的元策略的最佳响应来扩展种群,这可能导致效率低下的扩展,仅提供有限的全局改进。我们提出通过直接评估扩展后的种群质量来引导种群扩展。具体来说,我们采用种群可利用性(PE)来衡量受限策略集代表完整博弈的程度,并引入一个两阶段探索-选择框架,在扩展过程中显式最小化PE。我们将该框架实例化为Global PSRO,一种实用的基于DRL的算法,该算法通过参数共享的条件神经网络高效生成候选响应并估计PE。在多个两人零和博弈上的实验表明,与先前的PSRO方法相比,Global PSRO实现了更低的可利用性,并以显著更少的策略迭代逼近纳什均衡。

英文摘要

The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration--selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.
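As background for the exploitability numbers mentioned above, here is a minimal sketch of exploitability for a strategy profile in a small two-player zero-sum matrix game. It is the textbook quantity (sum of both players' best-response gains), not the paper's Population Exploitability, which is defined over a restricted strategy set; some papers also divide this sum by two.

```python
def exploitability(payoff, row_strategy, col_strategy):
    """Sum of both players' best-response gains in a zero-sum matrix game.

    `payoff[i][j]` is the row player's payoff (the column player receives its
    negative). The value is 0 exactly at a Nash equilibrium.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Best payoff the row player can get against the fixed column strategy
    row_best = max(
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    )
    # Lowest row payoff the column player can force against the fixed row strategy
    col_best = min(
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    )
    return row_best - col_best


# Rock-paper-scissors payoffs for the row player (rows/cols: rock, paper, scissors)
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
```

For rock-paper-scissors, the uniform profile has zero exploitability, while any pure profile can be exploited by both players.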

2605.28264 2026-05-28 cs.AI 版本更新

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

熵分布作为生成模型中幻觉的指纹

Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, Niraj Kumar

发表机构 * Global Technology Applied Research(全球技术应用研究)

AI总结 本文提出基于token级熵分布(而非仅均值)的校准熵分数(CES),通过单次前向传递和黑盒logits访问实现幻觉检测,并提供理论保证和实证验证。

详情
AI中文摘要

大型语言模型(LLMs)经常生成事实上不正确的输出,通常称为幻觉,这削弱了信任并限制了在高风险环境中的部署。现有的幻觉检测方法通常需要多次前向传递或访问模型内部。在这项工作中,我们提供了理论背景和实证证据,表明token级熵的分布(超越困惑度或长度归一化熵所捕获的均值)作为幻觉的指纹,其分布形状和尾部行为携带独立信号。我们将幻觉检测形式化为统计假设检验,并提出校准熵分数(CES),一种轻量级算法,仅需单次前向传递和黑盒访问token logits。CES通过校准的参考CDF将均值信号与生成熵的最大信号相结合,产生可直接跨模型和任务比较的分数。我们通过新颖的随机长度Dvoretzky-Kiefer-Wolfowitz不等式建立了有限样本校准保证,并证明了CES检测幻觉的概率随生成长度指数级收敛到1。在八个QA基准和十个生成模型(涵盖开源和API访问模型)上,CES在所有单次黑盒方法中实现了最高的检测性能,同时提供了现有启发式方法所缺乏的正式误差保证。值得注意的是,CES在统计上与需要更高计算成本的多样本方法无法区分,缩小了轻量级与昂贵检测之间的差距,使其适用于实时、大规模部署。

英文摘要

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes, or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky--Kiefer--Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
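The raw ingredients of the score, per-token entropies and their mean and maximum, can be sketched as below. This is not CES itself: the calibrated reference CDF and the way the two signals are combined are not specified in the abstract and are left out. The function names and the toy logits are assumptions.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_summary(per_token_logits):
    """Mean and maximum token-level entropy over one generated sequence."""
    entropies = [token_entropy(logits) for logits in per_token_logits]
    return {
        "mean": sum(entropies) / len(entropies),
        "max": max(entropies),
    }
```

The mean alone is what length-normalised entropy reports; the abstract's point is that tail statistics such as the maximum carry additional signal.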

2605.28258 2026-05-28 cs.SE cs.AI cs.CV cs.HC 版本更新

GUI Agents for Continual Game Generation

面向持续游戏生成的GUI智能体

Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

发表机构 * Fudan University(复旦大学) Xiaohongshu Inc.(小红书公司) Tongji University(同济大学) University of California, Santa Barbara(加州大学圣芭芭拉分校)

AI总结 提出利用GUI智能体作为客观评估者和主观测试者,通过PlaytestArena和Play2Code框架实现持续游戏生成,显著提升可玩性。

详情
AI中文摘要

生成一个游戏与制作一个可玩的游戏不同。尽管代码生成取得了进展,现有方法将游戏生成视为从提示到产物的单次翻译,导致交互层面的失败未被检测。我们认为评估和改进游戏生成需要一个玩家,并研究了图形用户界面(GUI)智能体在此过程中的两个角色:(1)作为客观评估者,为此我们引入了PlaytestArena,这是一个新的评估环境,将8个游戏类型的200个基于浏览器的游戏生成任务与预期的游戏行为准则配对,由GUI智能体在浏览器中加载每个构建并玩它来裁决;(2)作为主观测试者,为此我们提出了Play2Code,其中游戏智能体和GUI智能体在共享内存的持续循环中运行,将游戏生成转化为编码和游戏之间的对话。我们的实验表明,即使是前沿模型也难以直接生成可玩的游戏,而Play2Code达到了66.8%的准则通过率,分别比单次传递和智能体编码基线提高了37.1和14.6个百分点。进一步分析表明,GUI测试者的反馈比人类报告更可追溯,但在某些方面具有类似人类测试者的特质,将游戏测试确立为交互式代码生成的关键测试平台。我们的项目网站位于https://continual-game-generation.vercel.app/。

英文摘要

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass-rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.

2605.28255 2026-05-28 cs.AI cs.CL cs.HC 版本更新

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

AI,掌舵吧:是什么驱动人机协作问答中的委托与信任?

Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

发表机构 * University of Maryland(马里兰大学) University of California(加州大学) MBZUAI

AI总结 通过问答游戏实验,研究人类在何时以及为何选择委托AI或采纳其建议,发现人类存在对AI正确建议的低依赖(3.9%)和错误建议的过度依赖(1.7%),并受确认偏见影响,建议通过校准置信度、基于证据的解释和信任细化机制来改进人机协作。

Comments Findings of the Association for Computational Linguistics, 2026

详情
AI中文摘要

AI系统并非完美无缺,人类在决定是否信任AI而非自身判断时也可能犯错。因此,改善人机协作需要理解人类何时、为何以及如何决定依赖AI。我们研究了两种不同的依赖决策:委托选择——在不知道AI输出结果的情况下决定何时让AI自主行动,以及采纳选择——评估AI建议并决定如何使用它们。这两种解耦的依赖模式塑造了协作,但先前的工作很少在现实环境中对同一用户同时研究它们。我们通过研究在问答游戏中竞争的人机协作团队来填补这一空白,游戏中人类可以选择何时以及如何与AI代理合作以获胜。我们的24场比赛匹配了23位专家人类和16个AI代理,捕获了387次委托决策和1440次采纳决策。虽然人机协作表现优于单独的AI或人类,但人类做出了次优的协作决策,既对正确的AI建议低依赖(错失3.9%的机会),又在AI误导时过度依赖(1.7%)。双方都贡献了错误答案:当人类和AI意见不一致时,报告的模型置信度接近随机水平,而确认偏见导致当AI建议与人类初始错误答案一致时,低依赖率更高(64.5%)。为缩小这一差距,我们建议采用校准的置信度、基于证据的解释以及帮助用户细化信任的机制。

英文摘要

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice -- deciding when to let AI act autonomously without knowing its output, and the adoption choice -- evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human--AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human--AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

2605.28247 2026-05-28 cs.LG cs.AI 版本更新

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

IRDS: 通过验证器耦合的稀疏自编码器覆盖实现可解释的RLVR数据选择

Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Nanjing University of Aeronautics and Astronautics(南京航空航天大学) The 63rd Research Institute, National University of Defense Technology, Nanjing(国防科技大学第六三研究所,南京)

AI总结 提出IRDS方法,基于稀疏自编码器簇和验证器耦合的覆盖目标,选择模型失败但可学习的RLVR训练实例,提升数学推理准确率并降低计算成本。

Comments 24 pages, 3 figures, 18 tables

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为增强LLM推理能力的关键技术,但其数据效率低下仍是一个主要瓶颈。现有方法仅部分解决此问题,各自至少缺少子集级覆盖、验证器信号使用或可解释性中的一项。为弥补这一空白,我们提出了IRDS(可解释的RLVR数据选择),该方法在稀疏自编码器(SAE)簇的基础上选择RLVR训练实例,使得选择本身在可识别的问题模式上是可审计的。为了选择模型既失败又能从中学习的实例,我们在SAE基础上引入了一个验证器耦合的覆盖目标,并通过贪心对数行列式最大化来求解。在三个指令微调模型和六个数学推理基准上的实验表明,IRDS实现了最高的整体准确率,在两个Qwen模型上比最强基线高出3.9/4.0个百分点,在Llama-3.1-8B上高出0.5个百分点,同时运行成本比基于轨迹的基线低一个数量级。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
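
The abstract names greedy log-determinant maximization as the solver but does not say how the kernel is built from SAE features and verifier signals. The sketch below therefore shows only the generic greedy step on a small, hand-written positive-definite kernel. The kernel values, the function names `log_det` and `greedy_logdet_select`, and the budget are all illustrative assumptions, not the authors' implementation.

```python
import math


def log_det(matrix):
    """Log-determinant of a small symmetric matrix via Cholesky.

    Returns -inf if the matrix is not positive definite.
    """
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return float("-inf")
                chol[i][j] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def greedy_logdet_select(kernel, budget):
    """Greedily pick up to `budget` indices maximizing log det of the sub-kernel."""
    selected = []
    remaining = list(range(len(kernel)))
    for _ in range(budget):
        best_item, best_score = None, float("-inf")
        for candidate in remaining:
            idx = selected + [candidate]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            score = log_det(sub)
            if score > best_score:
                best_item, best_score = candidate, score
        if best_item is None:  # no candidate keeps the sub-kernel positive definite
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


# Hypothetical toy kernel: items 0 and 1 are near-duplicates, item 2 is distinct.
TOY_KERNEL = [
    [1.0, 0.95, 0.0],
    [0.95, 1.0, 0.0],
    [0.0, 0.0, 0.8],
]

chosen = greedy_logdet_select(TOY_KERNEL, budget=2)
```

On this toy kernel the second pick skips the near-duplicate item 1 in favor of item 2, which is the coverage behavior a log-determinant objective is generally chosen for.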

2605.28232 2026-05-28 cs.AI 版本更新

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

PIRS:基于物理信息奖励塑形的SAC建筑能源管理

Shadmehr Zaregarizi, Khashayar Yavari

发表机构 * Politecnico di Torino(都灵理工大学)

AI总结 针对深度强化学习中奖励函数设计缺乏物理基础的问题,提出PIRS方法,将ISO 7730 PMV公式嵌入SAC的多目标奖励中,提升可解释性和性能。

Comments N pages, 4 figures, 3 tables. Accepted at the 2nd Workshop on AI-Driven Energy Efficiency in Dynamic Systems (AI-DEEDS '26), co-located with ACM e-Energy / ACM Sustainability Week, Banff, AB, Canada, June 22-25, 2026

详情
AI中文摘要

居住者舒适度和电网感知的能效是相互竞争的目标,其联合优化关键取决于面向建筑的深度强化学习(DRL)控制器中奖励函数的指定方式。然而,奖励设计在很大程度上仍然是临时的:舒适度项要么是手动调整的启发式规则,要么是简单的温度偏差代理,缺乏热舒适物理的明确基础。我们提出PIRS(物理信息奖励塑形),它在用于Soft Actor-Critic(SAC)的加权多目标奖励中,用ISO 7730预测平均投票(PMV)公式替代这些临时的舒适度代理。通过将舒适度信号锚定在ISO 7730 PMV公式中,PIRS提高了奖励的可解释性,并在不改变学习流程任何其他组件的情况下,提供了一个基于标准的舒适度代理。我们在CityLearn v2.1.2(2022年挑战赛第一阶段)中评估PIRS,使用一个中央SAC智能体在五个随机种子上训练50k步,并与基于规则的控制器(RBC)、手动设计的奖励(E2)、仅能量奖励(E3)和朴素温度偏差舒适度奖励(E4)进行比较。以相对于RBC的比率报告的区域级关键绩效指标(KPI)显示,PIRS在成本、碳和电力指标上与手动基线相当,同时显著优于非物理基础的设计——特别是在负载爬坡(1.78倍 vs. 约2.4倍RBC)和日峰值需求方面。所有DRL策略在此训练预算下的指标仍高于RBC;我们如实解读这一差距,并将PIRS定位为可解释、符合标准的奖励设计基础,而非声称其在有限计算下优于经典控制。

英文摘要

Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs -- particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.

2605.28229 2026-05-28 cs.CV cs.AI 版本更新

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism: 用于图像到视频迁移的异构混合专家模型

Rui Lin, Chuanming Wang, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室)

AI总结 提出VidPrism,一种异构时间混合专家框架,通过功能专业化专家、内容感知多速率采样和动态双向融合机制,解决传统MoE中专家同质化问题,在视频识别基准上达到最先进性能。

Comments CVPR2026 camera ready

详情
AI中文摘要

随着预训练技术的快速发展,适应大规模视觉-语言模型(VLM)进行视频理解(即图像到视频迁移学习)已成为主导范式。为了获得卓越性能,近期进展中采用混合专家(MoE)来增强VLM的时间建模能力是一种有效策略。然而,传统的MoE设计存在专家同质化问题,即所有专家充当相同的通才,从无差异的视频流中低效地学习时空特征。为解决此问题,我们提出VidPrism,一种新颖的异构时间混合专家框架。VidPrism通过部署功能专业化的专家开创了分工机制,每个专家承担从空间理解到时间建模的不同角色。为了适当地为这些专家提供输入,我们引入了一个内容感知的多速率采样模块,动态生成从语义丰富到运动聚焦的表示流,为专家提供专业化输入。此外,一种动态双向融合机制实现了这些路径之间的协同信息交换,从而产生全面的视频表示。在各种视频识别基准上的大量实验表明,VidPrism达到了最先进的性能,并有效促进了专家专业化。我们的源代码可在https://github.com/Lrrrr549/VidPrism.git获取。

英文摘要

With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. To achieve superior performance, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy in recent work. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.

2605.28224 2026-05-28 cs.AI 版本更新

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

何时记忆有助于工具使用LLM代理的多轨迹推理?

Xinzhe Li, Yaguang Tao

发表机构 * RMIT University(皇家墨尔本理工大学)

AI总结 本文提出一个统一框架,将记忆沿传输范围和内容抽象两个维度分解,在无验证器设置下评估四种记忆方法与三种推理策略在四个工具使用基准上的表现,发现推理策略是混淆变量,不同策略下相同记忆方法产生显著不同结果。

Comments More evaluation and analysis are on the way

详情
AI中文摘要

工具使用LLM代理的多轨迹推理——生成多个推理尝试并从中选择——受益于跨尝试的知识转移,以便后续尝试避免早期尝试的陷阱。现有的跨轨迹记忆方法(轨迹级反思、原子事实提取、原始观察注入)均在单个任务的单一推理策略下进行评估,使得报告的性能提升是否反映记忆抽象或推理方法的属性变得不明确。我们提出一个统一框架,将记忆沿两个维度分解——传输范围(扩展内 vs. 跨轨迹)和传输内容的抽象程度——并在四种工具使用基准(涵盖SQL、知识图谱和CLI环境)上,在匹配实际代理部署设置的无验证器设置下,评估四种方法在三种推理策略(best-of-N、束搜索、MCTS)下的表现。实验矩阵将推理方法识别为混淆变量:相同的记忆方法在相同示例的不同推理策略下产生统计上不同的结果。反思仅在MCTS下达到显著性(不在best-of-N下);扩展内注入(使每个候选条件依赖于先前兄弟候选的结果)仅有助于缺乏多样性的束搜索;而原子事实提取对准确性无影响,但在具有可重用环境结构的任务上使轨迹缩短19-26%。

英文摘要

Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes -- the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content -- and evaluate four methods under three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best-of-N); within-expansion injection (conditioning each candidate on prior siblings' outcomes) helps only diversity-starved beam search; and atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

2605.28219 2026-05-28 cs.HC cs.AI cs.LG 版本更新

SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping

SmartIterator: 监督无监督数据分组的可视化分析工作流

Gennady Andrienko, Natalia Andrienko

发表机构 * Fraunhofer Institute IAIS(弗劳恩霍夫研究所IAIS) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习与人工智能研究所) City St George’s, University of London(伦敦大学城市圣乔治学院)

AI总结 提出SmartIterator可视化分析方法,通过六阶段工作流和IteraScope协调视图,系统探索参数扫描下的分组结果,支持用户理解数据结构和做出知情决策。

详情
AI中文摘要

无监督学习方法——主题建模、基于划分和基于密度的聚类——在没有人类指导的情况下产生数据分组,但选择和评估这些分组本身不应是无监督的。我们提出了SmartIterator(SI),一种可视化分析方法,将参数扫描中分组结果的完整序列视为一等分析对象。对于每个方法族,SI提供了一个结构化的六阶段工作流,引导分析师系统地探索分组结果——从质量指标概览,经过过渡稳定性评估、成员置信度评估、内容和上下文检查、循环原型验证,到知情决策——在此过程中逐步建立对数据结构的累积理解。这些工作流通过IteraScope(IS)实现,这是一个协调的可视化显示,结合了质量指标图表与语义颜色编码、带有桑基式过渡流和成员置信度小提琴图的一维组嵌入、带有HDBSCAN检测的循环原型的二维组嵌入(突出显示捕获所有持久模式的迭代),以及用于上下文解释的特定领域链接视图。我们在以下三个场景中演示了这些工作流:(1)来自VAST Challenge 2011的模拟社交媒体消息(基于密度的聚类,根据真实情况进行验证),(2)约1500个NUTS-3区域的欧盟人口统计数据(基于划分的聚类),以及(3)30年的IEEE VIS论文(NMF主题建模)。这些工作流构成了主要贡献:它们提供了可操作的、针对特定方法的指导,用于导航参数空间、研究数据结构如何随配置变化,以及将分析理解扎根于领域背景——从而产生关于数据的知识,这是任何单个“最佳”结果都无法提供的。

英文摘要

Unsupervised learning methods -- topic modeling, partition-based and density-based clustering -- produce data groupings without human guidance, yet choosing and evaluating those groupings should not itself be unsupervised. We present SmartIterator (SI), a visual analytics approach that treats the full sequence of grouping results across a parameter sweep as a first-class analytical object. For each method family, SI provides a structured six-phase workflow that guides the analyst through systematic exploration of grouping results -- from quality-metric overview through transition-stability assessment, membership-confidence evaluation, content and context inspection, and recurrent-archetype verification to an informed decision -- building cumulative understanding of data structure along the way. The workflows are operationalized through IteraScope (IS), a coordinated visual display combining quality-metric charts with semantic color encoding, a 1D group embedding with Sankey-style transition flows and violin plots of membership confidence, a 2D group embedding with HDBSCAN-detected recurrent archetypes that highlights iterations capturing all persistent patterns, and domain-specific linked views for contextualized interpretation. We demonstrate the three workflows on: (1) simulated social-media messages from the VAST Challenge 2011 (density-based clustering, validated against ground truth), (2) EU population statistics across ~1,500 NUTS-3 regions (partition-based clustering), and (3) 30 years of IEEE VIS papers (NMF topic modeling). The workflows constitute the main contribution: they provide actionable, method-specific guidance for navigating parameter spaces, studying how data structure evolves across configurations, and grounding analytical understanding in domain context -- yielding knowledge about the data that no single "best" result can provide.

2605.28215 2026-05-28 cs.AI cs.CL cs.LG cs.LO cs.MA 版本更新

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

解释比单独预测更难:评估作为ICL视觉分类器的MLLM的基于概念的解释

Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

AI总结 本文通过五种形式化程度递增的条件,系统评估多模态大语言模型在少样本上下文学习中的基于概念的可解释性,发现解释比预测更难,且强制生成形式化解释会降低预测准确性。

Comments Accepted to the CompLearn Workshop at ICML 2026

详情
AI中文摘要

上下文学习(ICL)使多模态大语言模型(MLLM)能够从少量标记示例中对图像进行分类。然而,这些模型如何使用提供的上下文仍然不透明。虽然思维链提示被广泛使用,但最近的研究认为它可能不反映真实的内部计算。在本文中,我们通过五种形式化程度递增的条件(从基线分类到描述逻辑(DL)公理生成)系统评估了冻结MLLM在少样本ICL下的基于概念的可解释性。通过独立的LLM-as-a-judge流水线评估四个最先进的MLLM,我们证明解释确实比单独预测更难。令人惊讶的是,强制模型生成形式化结构的基于概念的解释会单调降低预测准确性(从93.8%降至90.1%),这与显式推理普遍有助于性能的假设相矛盾。然而,当模型成功表达类别判别性视觉特征时,解释质量与正确预测强相关。我们的发现表明,虽然MLLM在视觉分类方面表现出色,但它们缺乏形式化、机器可验证的可解释性所需的特定指令微调。

英文摘要

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.

2605.28213 2026-05-28 cs.AI 版本更新

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

学习何时优化:来自专家GPU内核谱系的验证优化技能

Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao

发表机构 * SKLP, Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所SKLP) University of Chinese Academy of Sciences(中国科学院大学) University of New South Wales(新南威尔士大学)

AI总结 提出KLineage方法,通过反向遍历专家GPU内核实现并提取可重用的优化技能,学习优化的适用条件,从而提升LLM代理生成内核的优化质量与效率。

Comments Preprint, Under Review

详情
AI中文摘要

基于LLM的代理越来越多地被用于生成GPU内核,但它们通常知道尝试哪些优化,却不知道这些优化何时是合理的。我们引入了KLineage,它从专家内核中学习这种缺失的“何时”知识:KLineage不是依赖前向展开,而是通过验证门控的简化步骤反向遍历专家实现,并将每个被接受的步骤逆转为可重用的优化技能。每个技能不仅记录了优化意图,还记录了它在代码中的适用位置、使其有效的条件、产生的效果以及其假设避免了哪些失败。下游LLM在相同的编译/正确性/性能分析门控下,将这些技能应用到新的代码表面上。在两个NVIDIA架构上的五个专家工作负载中,这些谱系衍生的技能作为有效的优化课程,在相同的固定预算下,在最终内核质量和优化效率方面均超过了近期基于记忆的LLM内核基线。此外,我们使用一个单独的、包含22个实例的留出检查,作为针对源案例记忆化的合理性检验。

英文摘要

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

2605.28201 2026-05-28 cs.AI 版本更新

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

植入、持久化、触发:针对大语言模型智能体的潜伏攻击

Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学) Singapore Management University(新加坡管理大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出潜伏攻击(Sleeper Attack),即攻击者将对抗性内容注入智能体状态并持久化,在后续交互中被良性用户查询触发,导致有害行为;构建包含1896个实例的基准测试,实验表明当前最强LLM智能体仍易受此类攻击。

详情
AI中文摘要

大语言模型(LLM)智能体仍然容易受到来自外部环境的安全威胁,攻击者将对抗性内容注入外部观察(如工具返回的数据、网页或MCP上下文),导致有害的智能体行为,例如不安全的操作或错误的输出。现有研究通常关注单次交互攻击,即智能体观察到对抗性内容后立即在单次用户请求中表现出有害行为。然而,我们表明对抗性内容也可以在同一智能体服务的多次交互中持久化,使得此类威胁更难检测和缓解。具体来说,对抗性内容可能持久化在智能体状态中,在多次交互中保持休眠,随后被良性用户查询激活。我们将此类安全威胁形式化为潜伏攻击(Sleeper Attack)。为了评估它,我们构建了一个包含1896个实例的基准测试,涵盖六种真实世界的有害结果、三种攻击策略和三种智能体状态目标:会话上下文、记忆和可复用技能。在七个强大的开源和闭源LLM上的实验表明,最先进的LLM智能体仍然容易受到潜伏攻击,即使在单次交互基线中它们实现了较低的攻击成功率。我们的代码和数据可在https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef获取。

英文摘要

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

2605.28192 2026-05-28 cs.AI 版本更新

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

面向多跳音视频推理的主动全模态感知代理

Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yanfeng Wang, Yu Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对多跳音视频推理中证据稀疏且跨模态分布的问题,提出MOV-Bench基准和AOP-Agent代理框架,通过分层全模态记忆与观察-反思-重规划循环实现主动感知,显著提升开源全模态大模型在长视频和推理密集型问题上的性能。

详情
AI中文摘要

多跳音视频推理对全模态大语言模型(Omni-LLMs)仍然具有挑战性,因为相关证据通常稀疏、时间上分散,并且分布在音频和视频流中。现有基准对此设置的研究有限,通常仅涉及有限数量的模态、相关时间片段或推理步骤。在这项工作中,我们引入了MOV-Bench,一个包含519个精心设计问题的基准,这些问题需要对时间上分散的音视频证据进行多跳推理。在MOV-Bench上的评估表明,当前的全模态大语言模型在多跳跨模态推理方面仍然存在困难。为了解决这一挑战,我们进一步提出了AOP-Agent,一个基于开源全模态大语言模型的高效代理框架,用于主动全模态感知。通过将分层全模态记忆与协作的观察-反思-重规划循环相结合,AOP-Agent使开源全模态大语言模型能够进行主动感知,而无需额外训练或专有模型。在MOV-Bench和OmniVideoBench上的实验表明,AOP-Agent持续提升了推理性能,在长视频和推理密集型问题上尤其显著。

英文摘要

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

2605.28187 2026-05-28 cs.IR cs.AI cs.CY cs.SI 版本更新

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

谁的名字会出现?III:基于LLM的学者推荐中的人设提示效应

Annabella Sánchez-Guzmán, Lukas Eberhard, Denis Helic, Lisette Espín-Noboa

发表机构 * Graz University of Technology(格拉茨技术大学) Complexity Science Hub(复杂科学中心)

AI总结 本研究通过构建基准测试,分离模型选择与提示设计对LLM学者推荐的影响,发现提示设计(语言、地点、角色与任务)显著影响推荐质量(事实性、覆盖度)和社会代表性(多样性、均等性)。

Comments 25 pages (10 main, 2 references, 13 appendix), 6 figures in main, 13 figures in appendix (under-review)

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作学者推荐系统,塑造了学术界中被视为专家的人选。现有的审计仍然以英语为中心、单一学科且忽略人设,导致输出变异性的来源尚不明确。为此,我们提出了一个基准测试,以分离模型选择和提示设计对推荐的影响。我们通过改变人设提示(语言、地点、角色与任务)和上下文(领域、资历、k)审计了43个LLM。将推荐的学者与Semantic Scholar在六个科学学科上进行比较,以衡量技术质量(事实性、覆盖度)和社会代表性(多样性、均等性)。基本技术质量由模型选择驱动,事实性和均等性由上下文驱动,多样性由地点驱动。南非提示产生的事实性较低的列表,而日本提示产生的事实性高但同质化的列表,偏向高产的学者。因此,提示设计是基于LLM的学者发现中一个不可忽视的维度,应与模型选择一起系统审计。

英文摘要

Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

2605.28186 2026-05-28 cs.RO cs.AI 版本更新

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

可视化运动策略中的潜在相位结构:基于时间特征扩展的多环境研究

Daisuke Yasui, Toshitaka Matuki, Hiroshi Sato

发表机构 * Mathematics and Computer Science, National Defense Academy of Japan(日本防卫大学校数学与计算机科学系)

AI总结 提出一种框架,通过扩展聚类特征(包括动作、下一状态和下一动作)并引入抑制自转移的聚类数确定方法,从深度强化学习运动策略中揭示更清晰、更规则的潜在运动相位结构。

详情
AI中文摘要

深度强化学习(DRL)已被证明在MuJoCo基准测试(如HalfCheetah、Ant和Walker2D)的运动控制任务中表现出高性能。然而,可视化由深度神经网络实现的训练策略函数内部获得的运动结构仍然具有挑战性。从生物力学及相关领域可知,运动控制是通过重复运动相位(如站立相和摆动相)实现的。在本研究中,我们提出一个框架,用于从运动控制策略通过与环境交互生成的轨迹中揭示潜在的相位结构。所提出的方法将聚类特征从仅状态观测扩展到包括动作、下一状态和下一动作的增强特征,并引入一种抑制自转移的聚类数确定方法。将所提出的方法应用于三个环境——Ant-v5、HalfCheetah-v5和Walker2D-v5,我们成功识别出比现有方法具有更清晰和更规则转换规则的相位结构。

英文摘要

Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments -- Ant-v5, HalfCheetah-v5, and Walker2D-v5 -- we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
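
The feature extension described above can be sketched as follows. The abstract specifies which quantities are concatenated (state, action, next state, next action), but not the clustering algorithm or the exact criterion for choosing the number of clusters. The self-transition rate below is therefore only an assumed diagnostic for that criterion, and the names `augment_features` and `self_transition_rate` are hypothetical.

```python
def augment_features(states, actions):
    """Build [s_t, a_t, s_{t+1}, a_{t+1}] rows for t = 0 .. T-2.

    states and actions are equal-length lists of per-step vectors.
    """
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    rows = []
    for t in range(len(states) - 1):
        rows.append(
            list(states[t]) + list(actions[t])
            + list(states[t + 1]) + list(actions[t + 1])
        )
    return rows


def self_transition_rate(labels):
    """Fraction of consecutive steps that stay in the same cluster."""
    if len(labels) < 2:
        return 0.0
    same = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return same / (len(labels) - 1)


# Toy trajectory: 3 steps, 2-D states, 1-D actions (made-up numbers).
toy_states = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
toy_actions = [[0.1], [0.2], [0.3]]
toy_rows = augment_features(toy_states, toy_actions)

# Toy cluster labels over a trajectory, e.g. from any clustering of the rows.
toy_labels = [0, 0, 1, 1, 2, 0]
toy_rate = self_transition_rate(toy_labels)
```

One plausible use, again an assumption, is to cluster the augmented rows for several candidate cluster counts and compare the resulting self-transition rates when choosing among them.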

2605.28174 2026-05-28 cs.CV cs.AI 版本更新

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

FLORO:面向跨传感器与尺度的生态遥感多模态地理空间基础模型

Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker, Kasper Johansen, Bernard Ghanem, Fernando T. Maestre, Matthew F. McCabe

发表机构 * Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学生物与环境科学与工程学部) Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学计算机、电气与数学科学与工程学部)

AI总结 提出FLORO多模态地理空间基础模型,通过掩码自编码在异构遥感数据上预训练,利用可用性感知输入统一异构传感器配置,在PANGAEA基准上实现强迁移性能。

Comments 29 pages, 9 figures

详情
AI中文摘要

基础模型为可迁移的遥感表示提供了有前景的途径,但许多当前方法依赖于非常大的预训练数据集和固定的传感器配置,限制了它们在生态和环境应用中的适用性,这些应用中的观测通常跨平台、空间和光谱分辨率以及可用模态而变化。我们提出了FLORO,一个多模态地理空间基础模型,旨在从一个小型但高度多样化的遥感语料库中学习可迁移表示。FLORO使用掩码自编码在Sentinel-1、Sentinel-2、SkySAT影像、高程和无人机数据的异构组合上进行预训练。为了适应传感器变异性,FLORO结合了可用性感知输入,指示每个样本中存在哪些光谱波段和辅助模态,从而在异构传感器配置上实现统一的输入空间。我们在PANGAEA基准上,在冻结编码器协议下,评估了FLORO的场景分类、分割和回归任务。尽管在比竞争基础模型更小的语料库上预训练,FLORO在跨光学、光学-SAR和光学-高程基准(涵盖中分辨率卫星、航空和超高分辨率无人机影像)上实现了强大且稳定的迁移。FLORO在六个PANGAEA基准上取得了第二好的平均分割性能,仅次于最近引入的预训练图像数量超过两个数量级的基础模型,在场景分类上保持竞争力,在回归任务中表现稳健,而定性结果显示在洪水、城市、生物量和冠层高度预测设置中空间结构的保存有所改善。在EuroSAT-MS上的单独对照实验中,相对于绝对位置编码,地理位置编码进一步提高了分类性能。

英文摘要

Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.
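
One simple way to picture availability-aware inputs is a fixed channel layout paired with a binary presence mask. The sketch below assumes that design. The channel names in `CHANNEL_ORDER`, the zero-filling of missing channels, and the function `build_input` are illustrative assumptions, since the abstract does not describe FLORO's actual encoding.

```python
# Hypothetical fixed channel layout shared by all samples.
CHANNEL_ORDER = [
    "s2_red", "s2_green", "s2_blue", "s2_nir",  # optical bands
    "s1_vv", "s1_vh",                           # SAR polarizations
    "elevation",                                # auxiliary modality
]


def build_input(sample):
    """Map a {channel: value} sample onto the fixed layout.

    Returns (values, mask): missing channels are zero-filled in `values`
    and flagged with 0 in `mask`, so a model can tell "absent" from "zero".
    """
    values, mask = [], []
    for channel in CHANNEL_ORDER:
        if channel in sample:
            values.append(float(sample[channel]))
            mask.append(1)
        else:
            values.append(0.0)
            mask.append(0)
    return values, mask


# A sample with only one optical band and one SAR channel (made-up numbers).
toy_values, toy_mask = build_input({"s2_red": 0.2, "s1_vv": -11.5})
```

The mask lets samples from different sensor configurations share one input shape, which is the property the abstract attributes to the availability-aware inputs.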

2605.28170 2026-05-28 cs.AI 版本更新

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

通过Shapley值为大型语言模型定位输入不确定性量化

Seongjun Lee, Suwan Yoon, Changhee Lee

发表机构 * Department of Artificial Intelligence, Korea University(韩国大学人工智能系)

AI总结 提出ShaQ框架,利用Shapley值将输入中的模糊跨度建模为合作博弈参与者,通过条件熵的边际减少加权平均量化每个跨度对输入不确定性的贡献,实现跨度级归因,在AmbigQA、AmbiEnt和MediTOD基准上取得最先进性能。

Comments Code is available at https://anonymous.4open.science/r/ShaQ-0E39/README.md

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地集成到高风险决策中,可靠量化不确定性的能力已成为安全性和可信度的关键要求。然而,当前的不确定性量化方法主要在输出层面操作,通常无法区分不确定性是源于模型缺乏知识还是用户输入的模糊性。尽管以输入为中心的不确定性量化最近成为一个有前景的方向,但它仍相对未被充分探索,并且通常依赖于粗糙的输入级信息。因此,用户只能获得标量不确定性分数,这些分数几乎没有提供可操作的指导,以说明应该澄清输入的哪些部分来提高可靠性。为了解决这一局限性,我们提出了基于Shapley的输入不确定性量化(ShaQ),这是一个用于输入诱导不确定性的跨度级归因框架。我们的方法将输入中的模糊跨度建模为合作博弈中的参与者,并使用Shapley值量化它们的贡献,Shapley值通过澄清每个跨度联盟所获得的条件熵边际减少的加权平均来定义。与现有的输入级方法不同,我们的公式捕捉了跨度之间的复杂交互,并提供了一种原则性的分解,其中个体归因之和恰好等于总输入诱导不确定性。我们在AmbigQA和AmbiEnt基准上评估了ShaQ,它在模糊性检测中实现了最先进的性能。我们进一步在MediTOD上展示了其实用性,表明ShaQ可以定位未明确说明的临床话语,并促进高风险环境中的人机协作。总体而言,ShaQ改进了不确定性估计,并为有针对性的输入澄清提供了可操作的见解。

英文摘要

As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty has become a critical requirement for safety and trust. However, current uncertainty quantification methods primarily operate at the output level, often failing to distinguish whether uncertainty arises from the model's lack of knowledge or from ambiguity in the user's input. While input-centric uncertainty quantification has recently emerged as a promising direction, it remains relatively underexplored and typically relies on coarse, input-level information. Consequently, users are provided with scalar uncertainty scores that offer little actionable guidance on which parts of the input should be clarified to improve reliability. To address this limitation, we propose Shapley-based input uncertainty Quantification (ShaQ), a framework for span-level attribution of input-induced uncertainty. Our approach models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition. Unlike existing input-level approaches, our formulation captures complex interactions among spans and provides a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty. We evaluate ShaQ on the AmbigQA and AmbiEnt benchmarks, where it achieves state-of-the-art performance in ambiguity detection. We further demonstrate its utility on MediTOD, showing that ShaQ can localize under-specified clinical utterances and facilitate human-AI collaboration in high-stakes settings. Overall, ShaQ improves uncertainty estimation and provides actionable insights for targeted input clarification.
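The span-level attribution described above can be illustrated with an exact Shapley computation over a tiny cooperative game. This is a minimal sketch, not the authors' implementation: the function name `shapley_values`, the two example spans, and the entropy-reduction numbers in `TOY_REDUCTION` are all hypothetical, and a real system would estimate the value of each coalition from model outputs rather than a lookup table.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.

    players: list of hashable ids (here: ambiguous input spans).
    value:   maps a frozenset of players to a number (here: the reduction
             in conditional entropy obtained by clarifying that coalition).
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical entropy reductions (in bits) for each clarified coalition.
TOY_REDUCTION = {
    frozenset(): 0.0,
    frozenset({"bank"}): 1.0,
    frozenset({"it"}): 0.5,
    frozenset({"bank", "it"}): 2.0,
}

phi = shapley_values(["bank", "it"], TOY_REDUCTION.__getitem__)
print(phi)  # {'bank': 1.25, 'it': 0.75}
```

The two attributions add up to 2.0, the value of clarifying every span, which mirrors the abstract's statement that individual attributions sum exactly to the total input-induced uncertainty. Exact enumeration is exponential in the number of spans, so it is only practical for a handful of them.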

2605.28168 2026-05-28 cs.AI 版本更新

OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

OccuReward: 面向电网交互建筑中人口公平性的LLM引导的以居住者为中心的奖励塑造

Shadmehr Zaregarizi, Khashayar Yavari

发表机构 * Politecnico di Torino(都灵理工大学)

AI总结 提出OccuReward框架,利用大语言模型迭代塑造奖励函数,通过舒适公平指数(CEI)反馈,在CityLearn v2中提升不同人口群体的舒适公平性,同时降低能耗成本。

Comments 4 pages, 2 figures. Accepted at OccuSys 2026, co-located with ACM Sustainability Week 2026. Preprint version

详情
AI中文摘要

大语言模型(LLM)在为基于深度强化学习(DRL)的建筑能源管理生成奖励函数方面展现出有前景的能力。然而,它们在异质人口群体中引发或加剧居住者舒适度差异的潜力尚未被探索。我们提出 OccuReward,一个研究 LLM 介导的奖励设计如何影响人口公平性的框架。我们的贡献有三方面:引入舒适公平指数(CEI)作为新颖的反馈信号;一种迭代的、公平感知的 LLM 奖励塑造方法;以及在这些细化后的目标下 DRL 智能体的性能分析。利用来自 ASHRAE 全球热舒适数据库 II(13,440 票)的四个基于经验数据的居住者档案,我们在 CityLearn v2 中部署了一个 Soft Actor-Critic 智能体。我们的方法使用 Gemini API 生成奖励函数逻辑和权重(而非执行每步推理),共进行三轮细化。15 次实验运行的结果显示,老年女性居住者在初始轮次中始终经历最低满意度。到第 3 轮,公平感知的 LLM 细化激活了特定的奖励组件,提高了年轻男性(+17.6%)、中年女性(+28.2%)、健康敏感者(+53.8%)和老年女性(+567%)的满意度,同时降低了 3.2% 的能源成本。我们的发现强调,虽然奖励层面的干预显著改善了公平性,但 AI 驱动控制器中的人口差异仍然存在,需要进一步研究建筑系统中的算法公平性。

英文摘要

Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights--rather than performing per-step inference--across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

2605.28164 2026-05-28 cs.NE cs.AI 版本更新

Performance and Explainability Requirements of Evolutionary Algorithms in Real-World Physics-Informed Optimization

进化算法在实际物理信息优化中的性能和可解释性要求

Helena Stegherr, Michael Heider, Nils Meyer, Tobias Thummerer, Thomas Wendler, Pierre Aublin, Ennio Idrobo-Àvila, Lars Mikelsons, Sebastian Zaunseder, Jörg Hähner

发表机构 * Universität Augsburg(奥格斯堡大学)

AI总结 本文通过五个实际物理优化问题,分析领域专家对进化算法在性能和可解释性方面的需求,并指出现有方法未能充分应用于复杂实际场景的差距。

详情
AI中文摘要

进化计算提供了多种工具来解决复杂的实际优化问题。然而,研究通常集中在较小、简化的问题和优化算法上,这些算法在实际场景中有时无法满足期望。此外,在此类设置中,对应用算法及其提供的解决方案的信任通常至关重要,但这需要理解搜索过程本身。这导致在许多应用背景下(包括基于物理的建模)实践者往往不会认真考虑进化计算。本文详细介绍了可以缓解这些问题的进化计算技术。首先,由领域专家介绍并描述了五个实际的基于物理的优化问题。针对每个问题,提出了进化算法在性能和可解释性方面的要求,以增加信任和可用性。我们发现,所有领域专家都期望快速收敛到良好解决方案,并希望获得关于结果如何形成的一些解释,而其他要求则强烈依赖于具体问题。最后,我们介绍了现有方法,这些方法可用于改进进化算法的这些方面,但据我们所知,从未在复杂的实际场景中使用过。这意味着两个领域之间存在需要弥合的差距,以充分发挥进化计算的潜力。

英文摘要

Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on smaller, simplified problems and optimization algorithms that sometimes miss expectations in real-world scenarios. Additionally, trust in the applied algorithm and the solutions it provides is often essential in such settings, but requires an understanding of the search process itself. This leads to evolutionary computation often not being seriously considered by practitioners in many application contexts, among them physics-based modeling. In this article, techniques from evolutionary computation are detailed that can alleviate these problems. First, five real-world physics-based optimization problems are introduced and described by domain experts. For each of these, the requirements for the evolutionary algorithm regarding performance and explainability to increase trust and usability are presented. We found that all domain experts expect fast convergence to a good solution and want some explanations for how the results were formed, while other requirements strongly depend on the respective problem. Finally, we present existing approaches that can be leveraged to improve those aspects of evolutionary algorithms but have to our knowledge never been employed in complex real-world scenarios. This implies a gap between both domains that needs to be closed to exploit the full potential of evolutionary computation.

2605.28163 2026-05-28 cs.CL cs.AI 版本更新

DEPART: DEcomposing PARiTy across Multilingual LLMs

DEPART: 跨多语言大模型的性能差异分解

Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal, Sunayana Sitaram

发表机构 * Microsoft Research India(微软印度研究院)

AI总结 提出DEPART框架,通过贝叶斯分层模型分解多语言大模型性能差异,发现语言特征解释79%-92%的方差,且模型内部表示与英语的相似性是主要预测因子。

详情
AI中文摘要

多语言大模型(mLLMs)排行榜报告每种语言的准确率,但很少解释为何出现差异,导致系统性偏差未被归因,且从业者无法采取可操作的杠杆。我们首先通过无分布Friedman和Kruskal-Wallis检验确定这些差距是系统性的而非抽样噪声的产物,然后引入一个两步贝叶斯分层框架,将多语言性能方差分解为可解释的组成部分。首先,隔离语言身份归因的方差,我们表明可观察的语言特征(文字、语系、类型学距离)在理解任务上解释了$R^2_{\text{ling}} = 79\%$的方差,在推理任务上解释了$92\%$,而模型内部表示与英语的相似性成为两个任务桶中的主导预测因子。其次,分解完整的(模型×基准×语言)立方体,我们发现NLU和推理具有根本不同的方差分布:模型身份主导理解(方差的66.7%),而基准×模型交互主导推理(46.3%)。这些结果共同将多语言评估从被动的性能映射重塑为一个可解释的诊断框架,并提供针对语言差异根本驱动因素的具体杠杆。

英文摘要

Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

2605.28160 2026-05-28 cs.AI 版本更新

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

按需查看:多模态推理中视觉证据获取的认知调度框架

Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen, Rongrong Ji

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China(多媒体可信感知与高效计算教育部重点实验室,厦门大学,361005,中国) Sino-Russian Research Center for Digital Economy(中俄数字经济研究中心) Institute of Artificial Intelligence, Xiamen University, China(人工智能研究院,厦门大学,中国) School of Informatics, Xiamen University, China(信息学院,厦门大学,中国) School of Information Engineering, Xiamen Ocean Vocational College, Xiamen 361102, China(信息工程学院,厦门海洋职业技术学院,厦门361102,中国)

AI总结 提出CSMR框架,通过语言模型控制何时调用独立视觉感知模块获取任务相关视觉证据,在零样本设置下多个基准上优于基线方法。

Comments Accepted at ICML 2026

详情
AI中文摘要

现有的多模态推理方法主要遵循两种范式:在推理前将视觉输入转换为文本,或在统一的视觉-语言表示空间中进行端到端推理。尽管取得了经验上的进展,但两种范式都存在根本性的结构限制。前者依赖于静态的视觉到文本转换,往往会压缩并丢失细粒度的视觉细节。后者容易受到联合优化和注意力机制引起的语言主导,导致推理过程中对视觉证据的忠实性系统性减弱。在这项工作中,我们认为核心挑战在于视觉证据如何以及何时被引入推理过程。受此启发,我们提出了CSMR,一种多模态推理框架,其中语言模型通过决定何时调用独立的视觉感知模块来获取任务相关的视觉证据,从而控制推理过程。在多个多模态推理基准上的实验表明,在零样本设置下,CSMR在准确性上始终优于代表性基线方法。进一步的实验分析证实,这些优势主要源于所提出的认知调度机制。

英文摘要

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

2605.28158 2026-05-28 cs.AI 版本更新

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

OR-Space:面向工业优化智能体的全生命周期工作空间基准

Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin, Dongdong Ge, Yinyu Ye

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai University of Finance and Economics(上海财经大学) Stanford University(斯坦福大学)

AI总结 提出OR-Space基准,通过构建、修订和解释三种任务模式,评估大语言模型智能体在工业优化工作流中的可靠性。

Comments 34 pages, 8 figures

详情
AI中文摘要

大语言模型(LLM)智能体越来越多地被用于辅助运筹学(OR)建模,然而现有的面向OR的基准通常将评估简化为从自包含的问题陈述到数学公式或求解器程序的一次性翻译。这种设置忽略了实际工业OR工作流的两个特征:持久的多工件工作空间和多阶段任务生命周期。我们引入了OR-Space,一个全生命周期的工作空间基准,用于评估工业优化智能体在模型构建、模型修订和基于解释的任务中的表现。每个实例都是一个可执行的工作空间,包含业务文档、结构化数据、可选的代码工件、求解器输出以及分布在相互依赖文件中的任务特定评估器。OR-Space定义了三种任务模式:构建模式,智能体从异构工件构建可求解的优化模型;修订模式,智能体在需求变化或求解器反馈下修改现有模型,同时保留有效的先前逻辑;解释模式,智能体利用工作空间工件中的证据回答关于解决方案、约束和业务影响的基于解释的问题。通过将持久工作空间与生命周期导向的任务相结合,OR-Space评估智能体是否能够执行超越端到端文本生成的可靠优化工作。我们描述了基准设计、评估协议和质量控制流程,并将OR-Space定位为研究LLM智能体在工业OR工作流中的可靠性、失败模式和实际准备程度的基准。

英文摘要

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

2605.28148 2026-05-28 cs.SE cs.AI 版本更新

DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers

DeltaMCP: 通过规范感知转换实现MCP服务器的增量再生

Aditya Pujara, Xiaogang Zhu, Hsiang-Ting Chen

发表机构 * Microsoft(微软) University of Adelaide(阿德莱德大学)

AI总结 针对企业级API与MCP工具集同步维护的挑战,提出DeltaMCP,一种基于规范感知的增量再生工具,仅更新受影响的MCP服务器工具,实验表明能减少开发者开销并提升可维护性与版本一致性。

详情
AI中文摘要

LLM的快速发展以及模型上下文协议(MCP)的引入,通过确定性和结构化方法彻底改变了智能代理与API交互的方式。虽然一些现有系统(如AutoMCP)试图自动化之前完全手动生成MCP服务器的过程,但它们未能解决不断发展的企业级API与其相应MCP工具集实现之间保持同步的反复挑战。本文介绍了DeltaMCP,一种面向企业级MCP服务器的规范感知增量再生工具。DeltaMCP使开发者能够在给定其对应服务的OpenAPI规范新版本时,仅更新MCP服务器中受影响的工具。使用Azure REST API规范作为评估数据集,DeltaMCP在生成质量和系统性能方面与基线全量生成方法进行了基准测试。结果表明,DeltaMCP减少了开发者开销,同时提高了可维护性和版本一致性。这项研究为企业寻求为基于LLM的系统维护高保真、最新MCP服务器基础设施提供了一种可扩展的方法。

英文摘要

The rapid development of LLMs, coupled with the introduction of the Model Context Protocol (MCP), has revolutionized how intelligent agents interact with APIs through deterministic and structured methods \cite{ModelContextProtocolIntro2025}. While some existing systems like AutoMCP attempt to automate the previously manual process of generating MCP servers, they fail to address the recurring challenge of keeping evolving enterprise-level APIs synchronized with their corresponding MCP toolset implementations \cite{mastouri2025makingrestapisagentready}. This paper introduces DeltaMCP, a specification-aware, incremental regeneration tool for enterprise-grade MCP servers. Given a new release of a service's OpenAPI specification, DeltaMCP enables developers to update only the affected tools of the corresponding MCP server. Using Azure REST API specifications as the evaluation dataset, DeltaMCP is benchmarked against baseline full generation methods on generation quality and system performance. The results show that DeltaMCP reduces developer overhead while improving maintainability and version consistency. This research offers a scalable approach for enterprises seeking to maintain high-fidelity, up-to-date MCP server infrastructures for LLM-based systems.

2605.26910 2026-05-28 cs.LG cs.AI q-bio.NC 版本更新

EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models

EEG-FM-Audit:脑电图基础模型的系统评估与分析流程

Xianheng Wang, Yige Yang, Damien Coyle

发表机构 * Bath Institute for the Augmented Human(巴斯增强人类研究所) University of Bath(巴斯大学)

AI总结 提出EEG-FM-Audit流程,通过ASHA驱动的基准测试、范式级消融研究和神经生理探测,系统评估脑电图基础模型,发现调优的监督基线可媲美或超越先进基础模型。

Comments 26 pages

详情
AI中文摘要

大型脑电图基础模型在解码跨多种认知任务的脑电图信号方面展现出巨大潜力。然而,现有的EEG-FM研究存在三个关键局限性:不透明的监督基线调优、复杂学习范式的贡献未经验证以及模型决策缺乏透明度。为解决这些问题,我们提出了EEG-FM-Audit,一个旨在系统化评估EEG-FM的综合评估与分析流程。EEG-FM-Audit包含三个主要组成部分:(1) ASHA驱动的基准测试协议,通过透明优化监督基线确保公平比较;(2) 范式级消融研究,评估FM中学习范式的有效性;(3) 神经生理探测框架,探究FM是否利用了有效的时域、空域和频域脑电图特性。我们将EEG-FM-Audit应用于四个最先进的EEG-FM和五个代表性监督模型,涉及三个公开数据集。结果表明,尽管参数显著减少,但适当调优的监督基线可以匹配或超越先进的FM。此外,我们发现FM学习范式的有效性高度依赖于数据集规模和架构。最后,NPP分析展示了FM如何依赖特定的生理特征,为更可解释的神经解码建立了框架。

英文摘要

Large EEG Foundation Models (FMs) have shown great potential for decoding EEG signals across diverse cognitive tasks. However, existing EEG-FM studies exhibit three critical limitations: opaque supervised baseline tuning, unverified contributions of complex learning paradigms, and a lack of transparency in model decision-making. To address these, we propose EEG-FM-Audit, a comprehensive evaluation and analysis pipeline designed to systematize the assessment of EEG-FMs. EEG-FM-Audit consists of three primary components: (1) an ASHA-driven benchmarking protocol that ensures fair comparisons by transparently optimizing supervised baselines; (2) paradigm-level ablation studies to evaluate the effectiveness of learning paradigms in FMs; and (3) a neurophysiological probing (NPP) framework, which explores whether FMs leverage valid temporal, spatial, and spectral EEG properties. We apply EEG-FM-Audit to four state-of-the-art EEG-FMs and five representative supervised models across three public datasets. Our results reveal that properly tuned supervised baselines can match or outperform advanced FMs, despite requiring significantly fewer parameters. Furthermore, we find that the effectiveness of learning paradigms of FMs is highly dependent on dataset scale and architecture. Finally, NPP analysis demonstrates how FMs rely on specific physiological features, establishing a framework for more interpretable neural decoding.

2605.26368 2026-05-28 cs.CV cs.AI 版本更新

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

统一全景几何估计:基于多视角基础模型

Vukasin Bozic, Isidora Slavkovic, Dominik Narnhofer, Nando Metzger, Denis Rozumny, Konrad Schindler, Nikolai Kalischek

发表机构 * ETH Zürich(苏黎世联邦理工学院) Google(谷歌)

AI总结 提出PaGeR框架,利用预训练3D基础模型,从单张全景图像中统一预测尺度不变深度、度量深度、表面法线和天空掩码,实现360度场景重建。

详情
AI中文摘要

从透视图像进行几何估计已取得巨大进展,成熟到现成的基础模型不仅能够从多视角图像重建3D场景结构,甚至能从单视图进行重建。一个自然的扩展是从全景图像进行3D重建,其令人兴奋的前景是从单张全景图像恢复完整的360度场景。在这项工作中,我们引入了PaGeR(全景几何重建),这是一个将专为透视图像设计的强大3D基础模型提升到全景领域的框架。我们的策略是从一个预训练的3D重建Transformer开始,将其转变为一个统一的高性能模型,该模型在单次前向传播中从透视和全向图像预测尺度不变深度、度量深度、表面法线和天空掩码。通过将架构改动保持在最小,并在训练中混合透视和全景图像,PaGeR保留了底层基础模型的丰富3D先验,同时学会从单张全景图像估计几何一致的360度场景。我们在室内和室外环境中广泛测试了我们的方法,发现它在各种场景中提供了最先进的性能和出色的零样本性能。代码、数据和模型可在此处获取:https://github.com/prs-eth/PaGeR。

英文摘要

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360-degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre-trained transformer for 3D reconstruction and turn it into a unified high-performance model that predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360-degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state-of-the-art performance and excellent zero-shot performance across a wide range of scenes. Code, data and models are available at https://github.com/prs-eth/PaGeR.

2605.26277 2026-05-28 cs.CV cs.AI 版本更新

VesselSim: learning 3D blood vessel segmentation without expert annotations

VesselSim: 无需专家标注的3D血管分割学习

Erin Rainville, Melissa Ananian, Tristan Mirolla, Hassan Rivaz, Yiming Xiao

发表机构 * Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada(计算机科学与软件工程系,康科迪亚大学,蒙特利尔,加拿大) Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada(电气与计算机工程系,康科迪亚大学,蒙特利尔,加拿大)

AI总结 提出VesselSim两阶段框架,通过几何驱动的合成血管生成和自监督测试时适应,实现无需真实标注的3D血管分割,在多个临床数据集上达到与有监督方法竞争的性能。

Comments This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October

详情
AI中文摘要

血管分割是医学图像分析中用于血管疾病护理和手术规划的核心任务,然而提供专家血管标注的挑战对相关深度学习技术的进展构成了主要障碍。为解决这一问题,我们提出了VesselSim,一个用于通用3D血管分割的两阶段框架,在训练过程中无需真实标注数据。首先,我们引入了一个随机的、几何驱动的血管模拟框架,该框架模拟递归分支、曲率控制生长和碰撞感知拓扑,随后通过域随机化强度合成生成16,500个解剖学上合理的3D血管造影体积。其次,仅在此合成数据上训练3D U-Net。为了在推理时弥合从合成图像到真实图像的域差距,我们通过自监督掩码重建解码器引入了一种测试时适应策略,无需先验域知识即可适应未见过的临床扫描。我们在多个真实世界数据集上以零样本设置评估VesselSim,这些数据集涵盖多个解剖区域(包括脑和肾脏)的MR和CT。尽管仅在合成数据上训练,VesselSim的性能可与最先进的血管分割基础模型相竞争。这些发现表明,从合成管状结构中学习血管几何对于鲁棒的跨域泛化是有效的,大大减少了对获取的医学成像数据以及更重要的是专家标注的依赖。

英文摘要

Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challenges of providing expert vascular annotations pose a major obstacle for the progress of related deep learning techniques. To address this, we propose VesselSim, a two-stage framework for universal 3D blood vessel segmentation that eliminates the need for real annotated data during training. First, we introduce a stochastic, geometry-driven vascular simulation framework that models recursive branching, curvature-controlled growth, and collision-aware topology, followed by domain-randomized intensity synthesis to generate 16,500 anatomically plausible 3D angiographic volumes. Second, a 3D U-Net is trained solely on this synthetic data. To bridge the domain gap from synthetic to real images at inference time, we introduce a test-time adaptation strategy via a self-supervised mask reconstruction decoder, enabling adaptation to unseen clinical scans without prior domain knowledge. We evaluate VesselSim in a zero-shot setting on multiple real-world datasets spanning MR and CT across several anatomical regions, including the brain and kidneys. Despite being trained exclusively on synthetic data, VesselSim achieves performance competitive with state-of-the-art vascular segmentation foundation models. These findings suggest that learning vessel geometry from synthetic tubular structures is effective for robust cross-domain generalization, substantially reducing the reliance on acquired medical imaging data and more importantly, expert annotations.

2605.25010 2026-05-28 cs.RO cs.AI 版本更新

Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

经典与神经采样算法在机器人导航中的性能比较

Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui

发表机构 * Dept. of Economics, Oran 2 Mohamed Ben Ahmed University(奥兰第二大学穆罕默德·本·艾哈迈德经济系)

AI总结 本文在含凸凹障碍物的环境中比较了RRT*、Neural RRT*和Neural Informed RRT*三种算法,发现神经引导规划器能生成更短(最多14%)和更平滑(55-75%)的路径,其中Neural Informed RRT*综合性能最优。

详情
Journal ref
Presented at The 3rd Edition of National Conference on Applications of Artificial Intelligence A2I' 26. 2026
AI中文摘要

将人工智能(AI)集成到基于采样的运动规划中为提高自主导航效率提供了新的可能性。本文在包含不同障碍物密度的凸凹障碍物环境中实现并评估了三种算法,即RRT*、Neural RRT*和Neural Informed RRT*。结果表明,与传统RRT*算法相比,神经引导规划器提高了路径质量,生成的路径最多缩短14%,轨迹平滑度提高55-75%。在评估的方法中,Neural Informed RRT*在路径长度和轨迹平滑度方面实现了最佳整体性能。这些结果证明了AI引导采样策略在提高机器人和无人机导航的可靠性和轨迹效率方面的有效性,尽管计算时间略有增加。总体而言,该研究凸显了人工智能在实时机器人路径规划应用中日益增长的重要性。

英文摘要

Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation efficiency. In this paper, three algorithms, namely RRT*, Neural RRT*, and Neural Informed RRT*, are implemented and evaluated on environments containing convex and concave obstacles with different obstacle densities. The obtained results indicate that neural-guided planners improve path quality, producing up to 14% shorter paths and 55-75% smoother trajectories compared with the conventional RRT* algorithm. Among the evaluated methods, Neural Informed RRT* achieves the best overall performance in terms of path length and trajectory smoothness. These results demonstrate the effectiveness of AI-guided sampling strategies for improving reliability and trajectory efficiency in robotic and UAV navigation, despite a slight increase in computation time. Overall, the study highlights the growing importance of artificial intelligence in real-time robotic path planning applications.
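The two quality measures compared above, path length and trajectory smoothness, can be computed for any 2D waypoint list. The sketch below is only illustrative: the abstract does not say how smoothness is measured, so the accumulated turning angle used here is an assumption, and the function names and sample waypoints are hypothetical.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between consecutive 2D waypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def total_turning_angle(points):
    """Accumulated absolute heading change in radians (lower = smoother).

    Assumed smoothness proxy; the source abstract does not define its metric.
    """
    total = 0.0
    for a, b, c in zip(points, points[1:], points[2:]):
        heading_in = math.atan2(b[1] - a[1], b[0] - a[0])
        heading_out = math.atan2(c[1] - b[1], c[0] - b[0])
        delta = abs(heading_out - heading_in)
        # Wrap so that a turn is never counted as more than pi radians.
        total += min(delta, 2 * math.pi - delta)
    return total


# Hypothetical waypoints: one right-angle corner versus a straight segment.
corner_path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
straight_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

print(path_length(corner_path), total_turning_angle(corner_path))
print(path_length(straight_path), total_turning_angle(straight_path))
```

Both sample paths have length 2, but the corner path accumulates a quarter turn while the straight one accumulates none, which is the kind of difference a smoothness comparison between planners would pick up.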

2605.24678 2026-05-28 cs.AI cs.CL cs.SD 版本更新

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

探索感知语音特征用于心理健康护理中的临床决策支持

Vassilis Lyberatos, Edmund G. Dervakos, Eleni Adamidi, Athanasios Voulodimos, Giorgos Stamou

发表机构 * National Technical University of Athens(雅典国立技术大学) PsychNow

AI总结 提出一个基于感知声学和语言特征(如韵律、嗓音质量、语义连贯性、句法结构和讽刺)的系统分析框架,结合统计分析和可解释机器学习(XGBoost与SHAP和LIME),在多个数据集上发现语音特征与抑郁、焦虑和ADHD症状严重度之间的稳定关联,并通过消融研究识别最具信息量的特征组。

Comments Accepted to CLPsych 2026, part of ACL 2026

详情
AI中文摘要

语音和语言技术通过客观且可解释的线索为支持心理健康评估提供了宝贵的机会。我们提出了一个系统的基于特征的分析框架,利用感知基础的声学和语言特征,包括韵律、嗓音质量、语义连贯性、句法结构和讽刺。通过统计分析和可解释机器学习(XGBoost与SHAP和LIME),我们研究了语音特征与抑郁、焦虑和ADHD的已验证症状测量之间的关联。在受控基准数据集(StressID、DAIC-WOZ、Androids、EATD)和真实世界临床数据集上的评估表明,该框架揭示了症状严重度与嗓音不规则性(如shimmer、jitter)、词汇-句法模式和情感基调之间的稳定且一致的关系。跨所有数据集进行的消融研究进一步识别了最具信息量的特征组。这项工作探索了一种透明且临床可解释的基于语音的心理健康分析方法。

英文摘要

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.

2605.23955 2026-05-28 cs.AI cs.DC cs.LG cs.SI q-fin.CP 版本更新

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

从准确性到可审计性:金融AI系统中的确定性综述

Ruizhe Zhou, Xiaoyang Liu, Gaoyuan Du, Yi Zheng, Shouxi Ren, Deepayan Chakrabarti, Dengdu Jiang

AI总结 本文从系统视角综述了金融AI中表格模型、图网络和基于LLM的智能体工作流三种模态的不可重现性问题,通过实验量化了确定性指标并提出了分层评估框架。

详情
AI中文摘要

在受监管的金融环境中部署机器学习——如信用风险、欺诈检测和反洗钱——暴露了算法可重现性的关键漏洞。虽然早期的金融机器学习解决了统计挑战(如回测过拟合),但深度神经网络和生成式AI引入了根植于硬件和架构的机械非确定性。本综述从系统视角审视了当前金融AI中三种主要模态的可重现性失败:表格模型(事后解释方差)、图网络(随机采样和时间异步)以及基于LLM的智能体工作流(批次依赖的差异和轨迹漂移)。我们通过公开金融数据集上的第一方实验补充了文献分析——量化了信用评分中的解释排名不稳定性、基于GNN的欺诈检测中的预测翻转率以及LLM实体提取中张量并行引起的输出差异。我们提出了一个分层评估框架,将模态特定指标(RBO、D_cos、TDI、PSD)与审计准备度联系起来,并实证验证了logit级和语义级确定性度量的互补性。

英文摘要

Deploying machine learning in regulated financial environments -- credit risk, fraud detection, and anti-money laundering -- exposes critical vulnerabilities in algorithmic reproducibility. While early financial ML addressed statistical challenges such as backtest overfitting, deep neural networks and Generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey provides a systems perspective on reproducibility failures across three modalities now dominant in financial AI: tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets -- quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework linking modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and empirically validate the complementarity of logit-level and semantic-level determinism measures.
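Among the metrics named above, RBO (rank-biased overlap) compares two rankings with more weight on the top positions, which suits questions such as whether two runs of an explainer rank features the same way. The sketch below computes only the truncated prefix sum of RBO over the shared depth, so it is a lower bound on the full measure and does not reach 1 for finite lists. The function name, the persistence value `p=0.9`, and the feature names are assumptions for illustration, not details from the survey.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated rank-biased overlap of two rankings without duplicates.

    Returns (1 - p) * sum_{d=1..k} p**(d-1) * A_d, where A_d is the fraction
    of items shared by the two top-d prefixes and k is the shorter length.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score


# Hypothetical feature rankings from two runs of the same explainer.
run_1 = ["income", "debt", "age"]
run_2 = ["debt", "income", "age"]

print(rbo_truncated(run_1, run_1))  # identical rankings: 1 - 0.9**3
print(rbo_truncated(run_1, run_2))  # top two features swapped: lower score
```

Swapping the top two features costs more than swapping lower-ranked ones would, because the weight `p**(d-1)` decays with depth.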

2605.23933 2026-05-28 cs.CY cs.AI 版本更新

KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing

KT4EQG: 通过知识追踪实现个性化习题生成

Xinyi Gao, Qiucheng Wu, Lu Ding, Q. Vera Liao, Kaizhi Qian, Ying Xu, Shiyu Chang, Yang Zhang

发表机构 * University of California, Santa Barbara(加州大学圣巴巴拉分校) University of South Alabama(南阿拉巴马大学) University of Michigan(密歇根大学) MIT-IBM Watson AI Lab(麻省理工-IBM沃森人工智能实验室) Harvard University(哈佛大学)

AI总结 提出KT4EQG框架,利用知识追踪模型选择最合适的概念,并训练基于LLM的题目生成器,以生成个性化习题,实验证明其有效性。

详情
AI中文摘要

教育题目生成(EQG)旨在合成定制的习题以增强学生学习。一个有效的EQG系统应理想地通过建模学生的知识状态并为每个学生个性化题目,从而提供最大的学习收益。然而,现有的EQG方法很少能实现如此细粒度的个性化。在本文中,我们探讨了EQG如何从知识追踪(KT)中受益,KT基于历史表现建模学生的知识状态并预测未来表现。我们提出了KT4EQG,一个在KT模型指导下为个体学生生成有效题目的个性化EQG框架。具体来说,KT4EQG通过利用KT模型选择最适合学生练习的知识概念,以最大化学生在整体知识掌握上的潜在提升。然后训练一个基于LLM的题目生成器,以忠实于所选概念生成题目。在XES3G5M和MOOCRadar上的实验结果表明,KT4EQG始终比有限或没有个性化的方法生成更有效的题目。

英文摘要

Educational Question Generation (EQG) aims to synthesize customized exercise questions that enhance student learning. An effective EQG system should ideally personalize questions for each student by modeling the student's knowledge state and generating questions that provide the greatest learning benefit. However, few existing EQG approaches are able to achieve such fine-grained personalization. In this paper, we explore how EQG can benefit from knowledge tracing (KT), which models students' knowledge states based on historical performance and predicts future performance. We propose KT4EQG, a personalized EQG framework that generates effective questions for individual students under the guidance of a KT model. Specifically, KT4EQG seeks to maximize a student's potential improvement in overall knowledge mastery by leveraging the KT model to select the most suitable knowledge concept for the student to practice. An LLM-based question generator is then trained to produce a question faithfully grounded in the selected concept. Experimental results on XES3G5M and MOOCRadar show that KT4EQG consistently generates more effective questions than methods with limited or no personalization.

2605.22297 2026-05-28 cs.LG cs.AI 版本更新

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

一个学习率不适用于所有层:基于重尾引导的LLM逐层学习率

Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu

发表机构 * SIAT(深圳先进技术研究院) PCL(鹏城实验室) UCAS(中国科学院大学) UOT(图宾根大学) LIU1(智能系统马克斯·普朗克研究所) LIU2(图宾根ELLIS研究所) LIU3(图宾根人工智能中心)

AI总结 本文提出基于重尾自正则化理论的逐层学习率(LLR)方法,通过为Transformer各层分配不同学习率,加速训练并提升泛化能力,在多种模型和优化器上验证了有效性。

详情
AI中文摘要

学习率配置是现代深度学习的一个基本方面。当前跨所有层应用统一学习率的普遍做法忽视了Transformer的结构异质性,可能限制其作为大型语言模型(LLM)骨干的有效性。在本文中,我们引入逐层学习率(LLR),这是一种自适应方案,为各个Transformer层分配不同的学习率。我们的方法基于重尾自正则化(HT-SR)理论,该理论通过表征权重相关矩阵的经验谱密度(ESD)来量化重尾性。重尾性较弱的层被分配较大的学习率以加速训练,而重尾性较强的层则获得较小的学习率。通过这种方式定制学习率,LLR促进了跨层更均衡的训练,导致更快的收敛和更好的泛化。在从LLaMA到GPT-nano的架构、包括AdamW和Muon的优化器以及从60M到3B参数、最多100B训练token的模型规模上进行的大量实验证明了LLR的有效性。LLR实现了高达1.5倍的训练加速,并且始终优于统一学习率的基线。特别地,它将1B模型的平均零样本准确率从47.09%提高到49.02%,将3B模型的平均零样本准确率从48.58%提高到50.61%。LLR的一个关键优势是其低调优开销:它可以直接从统一基线转移近乎最优的学习率设置。代码可在https://github.com/hed-ucas/Layer-wise-Learning-Rate获取。

英文摘要

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes more balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures ranging from LLaMA to GPT-nano, optimizers including AdamW and Muon, and model scales from 60M to 3B parameters with up to 100B training tokens demonstrate the effectiveness of LLR. LLR achieves up to 1.5x training speedup and consistently outperforms uniform-learning-rate baselines. In particular, it improves the average zero-shot accuracy of 1B models from 47.09% to 49.02%, and that of 3B models from 48.58% to 50.61%. A key advantage of LLR is its low tuning overhead: it can transfer nearly optimal learning-rate settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.
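The mapping described in the abstract (weaker heavy-tailedness gets a larger learning rate) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Hill estimator, the linear rescaling, the `[0.5, 1.5]` multiplier range, and the names `hill_alpha` and `layerwise_lrs` are all assumptions made here. The eigenvalues of each layer's weight correlation matrix are assumed to be precomputed (for example, as squared singular values of the weight matrix), positive, and distinct.

```python
import math

def hill_alpha(eigenvalues, k_frac=0.5):
    """Hill estimate of the power-law exponent of an ESD tail (assumed estimator).

    A smaller exponent means a heavier tail.
    """
    eigs = sorted(eigenvalues)
    n = len(eigs)
    # Number of tail eigenvalues used; keep at least one value below the tail.
    k = min(max(1, int(n * k_frac)), n - 1)
    threshold = eigs[n - k - 1]          # largest eigenvalue below the tail
    log_sum = sum(math.log(e / threshold) for e in eigs[n - k:])
    return 1.0 + k / log_sum

def layerwise_lrs(alphas, base_lr, lo=0.5, hi=1.5):
    """Linearly map per-layer exponents to learning rates (hypothetical mapping).

    The largest exponent (weakest heavy tail) receives hi * base_lr,
    the smallest exponent (strongest heavy tail) receives lo * base_lr.
    """
    a_min, a_max = min(alphas), max(alphas)
    if a_max == a_min:
        return [base_lr for _ in alphas]
    return [base_lr * (lo + (hi - lo) * (a - a_min) / (a_max - a_min))
            for a in alphas]

# Example: three layers with hypothetical exponents.
lrs = layerwise_lrs([2.0, 4.0, 3.0], base_lr=1e-3)
```

In a training loop, the resulting values would typically be supplied as per-layer parameter groups to the optimizer; how often the exponents are re-estimated is a design choice the abstract does not specify.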

2605.28145 2026-05-28 cs.AI cs.LG 版本更新

Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting

自适应储层计算用于多场景混沌系统预测

Shadmehr Zaregarizi, Khashayar Yavari

发表机构 * Politecnico di Torino(都灵理工大学)

AI总结 提出一种自适应储层计算框架,通过四种定制策略(精确状态同步、直方图引导候选选择、多种子搜索、顺序多序列训练)在CTF-4-Science Lorenz基准的12个任务中取得74.91分,证明其高效竞争力。

Comments 4 pages, 2 figures

详情
AI中文摘要

我们提出了一种自适应储层计算框架,用于CTF-4-Science Lorenz基准测试,该基准评估机器学习模型在十二个不同任务上的表现,这些任务涵盖五种性质不同的场景:基线预测、含噪信号重建、噪声下预测、少样本学习和参数泛化。我们没有采用统一的推理策略,而是根据每个评估场景的具体需求定制回声状态网络(ESNs)的训练和预测过程。我们的主要贡献有四个方面:(1)精确的储层状态同步,消除了短时预测中的预热近似误差;(2)直方图引导的候选选择,直接优化长时间遍历评估指标;(3)多种子储层搜索,适用于训练数据严重受限的少样本场景;(4)顺序多序列训练,解决了参数泛化任务中的状态分布不匹配问题。所提出的框架在公共基准排行榜上获得了74.91分,表明精心调整的储层计算对于多样化的混沌系统建模挑战是一种具有竞争力和计算效率的方法。

英文摘要

We present an adaptive reservoir computing framework for the CTF-4-Science Lorenz benchmark, which evaluates machine learning models across twelve distinct tasks spanning five qualitatively different scenarios: baseline forecasting, noisy signal reconstruction, forecasting under noise, few-shot learning, and parametric generalization. Rather than applying a uniform inference strategy, we tailor the training and prediction procedure of Echo State Networks (ESNs) to the specific demands of each evaluation scenario. Our key contributions are fourfold: (1) exact reservoir state synchronization that eliminates warmup approximation error in short-time prediction; (2) histogram-guided candidate selection that directly optimizes the long-time ergodic evaluation metric; (3) multi-seed reservoir search for few-shot regimes with severely limited training data; and (4) sequential multi-sequence training that resolves state-distribution mismatch in parametric generalization tasks. The proposed framework achieves a score of 74.91 on the public benchmark leaderboard, demonstrating that carefully adapted reservoir computing constitutes a competitive and computationally efficient approach for diverse chaotic system modeling challenges.

2605.28144 2026-05-28 cs.AI 版本更新

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

解构空间复杂性:用于LLM空间推理的层次分解

Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen, Sihong Xie

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出一种层次任务分解方法,结合MCTS引导的组相对策略优化(M-GRPO),通过改进中间状态选择和规划能力,显著提升LLM在导航、规划和策略游戏等空间任务中的表现。

Comments 8 pages

详情
AI中文摘要

LLMs在通用语言理解和推理方面表现出色。然而,它们在空间推理方面始终表现不佳,这严重限制了它们的应用,特别是在具身智能领域。受层次强化学习成功的启发,本文介绍了一种新颖的LLM空间推理层次任务分解方法。我们的方法通过识别关键中间状态并生成简化的子环境,引导LLMs将复杂任务分解为可管理的子任务。然而,我们发现LLMs由于缺乏足够的空间先验知识,往往无法推导出最优的中间状态,导致次优的任务分解。为了解决这一限制并增强其规划能力,我们提出了MCTS引导的组相对策略优化(M-GRPO),其中我们通过结合LLM的先验预测概率及其认知不确定性来重新制定UCT公式。此外,我们实现了一个更细粒度的优势函数,使模型能够学习最优路径规划。实验结果表明,我们的方法显著提高了LLM在空间任务(包括导航、规划和策略游戏)上的性能,达到了最先进的结果。这项工作为LLM在现实世界中的应用铺平了道路。

英文摘要

LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning, which severely limits their application, particularly in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, this paper introduces a novel method for hierarchical task decomposition in LLM spatial reasoning. Our approach guides LLMs to decompose complex tasks into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. However, we identify that LLMs often fail to derive optimal intermediate states due to their insufficient spatial prior, leading to sub-optimal task decomposition. To address this limitation and enhance their planning capability, we propose MCTS-Guided Group Relative Policy Optimization (M-GRPO), in which we reformulate the UCT formula by incorporating the LLM's prior predictive probabilities alongside its epistemic uncertainty. Furthermore, we implement a more fine-grained advantage function, enabling the model to learn optimal path planning. Experimental results demonstrate that our method substantially improves LLM performance on spatial tasks, including navigation, planning, and strategic games, achieving state-of-the-art results. This work paves the way for LLMs in real-world applications.
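The abstract does not give the reformulated UCT formula, so the following is only one plausible shape of a selection score that combines a value estimate, an LLM prior probability, and an epistemic-uncertainty bonus. The PUCT-style exploration term, the additive uncertainty bonus, the constants `c` and `beta`, and every name below are assumptions for illustration, not the paper's definition.

```python
import math

def prior_guided_uct(q, n_child, n_parent, prior, uncertainty, c=1.4, beta=0.5):
    """Hypothetical selection score for one child node in a search tree.

    q            mean value of the child so far
    n_child      visit count of the child
    n_parent     visit count of the parent
    prior        LLM prior probability of the action leading to the child
    uncertainty  epistemic-uncertainty estimate for that action
    """
    # Exploration shrinks with child visits and grows with the prior.
    exploration = c * prior * math.sqrt(n_parent) / (1 + n_child)
    # Assumed additive bonus for actions the model is unsure about.
    return q + exploration + beta * uncertainty

def select_child(children, n_parent):
    """Return the index of the child with the highest score."""
    scores = [prior_guided_uct(ch["q"], ch["n"], n_parent, ch["prior"], ch["u"])
              for ch in children]
    return max(range(len(children)), key=scores.__getitem__)

# Example: an unvisited child with a high prior is preferred over a visited one.
children = [
    {"q": 0.6, "n": 10, "prior": 0.2, "u": 0.0},
    {"q": 0.0, "n": 0, "prior": 0.7, "u": 0.3},
]
best = select_child(children, n_parent=10)
```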

2605.28139 2026-05-28 cs.AI 版本更新

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

数据高效的在线策略蒸馏用于自动语音识别

Yu Lin, Yiming Wang, Runyuan Cai, Xiaodong Zeng

发表机构 * AutoArk-AI

AI总结 提出一种在线策略蒸馏方法,利用教师模型(Qwen-ASR)在仅10万小时语音数据下提升学生模型(Ark-ASR)的识别能力,在多个基准上超越同规模基线。

详情
AI中文摘要

构建有竞争力的自动语音识别(ASR)模型通常需要大规模的音频监督,这使得复现和专业化成本高昂。我们研究了Ark-ASR,一个基于100k小时语音训练的0.6B参数音频条件语言模型,并检验了强大的Qwen-ASR教师能否通过在线策略蒸馏传递额外的识别能力。在普通话和英语ASR基准上,所提出的训练方案一致地优于仅进行监督微调,并在五个评估集中的四个上超越了同规模的Qwen3-ASR-0.6B基线。这仅使用了100k小时的语音,而Qwen3-Omni AuT编码器报告使用了20M小时的监督音频。更大的Qwen3-ASR-1.7B仍然更强,但结果表明,在更小的音频预算下,教师指导的在线策略训练可以显著缩小紧凑型ASR模型的差距。支持重叠诊断进一步表明,教师-数据阶段改善了局部学生-教师兼容性,这与最近关于在线策略蒸馏何时有效的分析一致。

英文摘要

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

2605.28129 2026-05-28 cs.AI 版本更新

Do Clinical Models Change Treatment Decisions?

临床模型是否会改变治疗决策?

Dongkyu Cho, Miao Zhang, Rumi Chunara

发表机构 * New York University(纽约大学)

AI总结 本研究提出ClinPivot基准,通过生物医学关系和变化的患者情境评估临床基础模型在治疗决策中的适应性,发现强医学QA能力不能可靠预测决策表现。

Comments 9 pages, 3 figures

详情
AI中文摘要

临床基础模型通过事实性或考试式医学QA进行评估,但治疗决策必须在患者情境变化时改变。我们引入ClinPivot,一个基于生物医学关系和枢轴患者情境的可审计治疗决策基准。ClinPivot询问当新的临床约束改变行动空间时,模型是否会改变治疗选择。我们发现,强大的医学QA表现并不能可靠预测决策表现:前沿模型和任务适应的Qwen变体经常无法正确改变决策,且模型排名在不同评估体系间发生变化。在匹配的知识预算下,决策结构化的监督提高了对枢轴敏感的决策制定和医学QA,而轻量级重放减少了通用助手能力上的损失。

英文摘要

Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.

2605.28124 2026-05-28 cs.AI 版本更新

Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

梯度步进即插即用模型用于牙科锥束CT重建

Idris Tatachak, Luis Kabongo, Nicolas Papadakis, Xavier Ripoche, Simon Rit

发表机构 * INSA‐Lyon, Université Claude Bernard Lyon 1, CNRS, Inserm, CREATIS UMR 5220, U1294(里昂国立应用科学学院、克洛德·贝尔纳里昂第一大学、CNRS、Inserm、CREATIS UMR 5220、U1294) Univ. Bordeaux, CNRS, Inria, Bordeaux INP, IMB, UMR 5251(波尔多大学、CNRS、Inria、Bordeaux INP、IMB、UMR 5251) ACTEON Group, France(ACTEON集团,法国)

AI总结 提出一种基于梯度步进去噪器的即插即用算法,通过模拟扇形束采集并添加光子噪声训练先验,有效减少牙科锥束CT重建中的光子噪声。

Comments CT Meeting 2026 - 9th International Conference on Image Formation in X-Ray Computed Tomography, Jun 2026, Salt Lake City, United States

详情
AI中文摘要

本工作的目标是减少牙科锥束CT重建中光子噪声的影响。我们考虑一个逆问题公式,并开发一个基于数据的先验。为此,我们模拟扇形束采集,并向投影数据添加光子噪声。通过使用重建的模拟采集训练一个梯度步进去噪器来获得先验。将训练好的模型集成到即插即用梯度步进算法中,从模拟投影重建图像。对合成数据的实验证明了训练模型的去噪能力,而对真实图像的定性评估展示了算法的性能和泛化能力。

英文摘要

The goal of this work is to reduce the effect of photon noise in dental cone-beam CT reconstruction. We consider an inverse problem formulation and develop a data-based prior. To this end, we simulate fan-beam acquisitions and add photon noise to the projection data. The prior is obtained by training a gradient-step denoiser using reconstructed simulated acquisitions. The trained model is integrated into a plug-and-play gradient-step algorithm to reconstruct images from simulated projections. Experiments on synthetic data demonstrate the denoising capabilities of the trained model, while qualitative evaluations on real images showcase the algorithm's performance and generalization ability.

2605.28122 2026-05-28 cs.CR cs.AI cs.CL 版本更新

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

SNARE: 自适应场景合成以诱发编码代理中的过度行为

Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang

发表机构 * Griffith University(格里菲斯大学) Quantstamp Nanyang Technological University(南洋理工大学) UNSW Sydney(新南威尔士大学) Wake Forest University(威克森林大学)

AI总结 提出SNARE流水线,通过组合良性场景片段并使用无评判器预言机评分与汤普森采样,自适应地诱发编码代理的过度行为,并在4×5代理-模型矩阵上评估,发现19.51%的良性运行触发过度行为,且代理框架比模型影响更大。

详情
AI中文摘要

编码代理以一系列shell、文件和网络操作执行良性任务,其中任何操作都可能悄然超出授权范围而任务仍完成。我们称此为过度行为:提示并非对抗性且运行成功,但超出范围的操作可能泄露凭据或删除文件。现有基准未能捕捉:任务完成套件认可任何完成的运行,越狱套件探测对抗性提示,而先前唯一的过度行为基准对每个代理-模型对应用单一固定提示集,导致其最易和最难的配对测量不足。我们提出SNARE(为非对抗场景合成自适应奖励引导诱发),该流水线从可重用范围和陷阱片段组合良性场景,用无评判器预言机对每次运行评分,标记陷阱模式匹配及未经请求的文件添加或删除,并使用汤普森采样将每对运行预算导向最常触发它的场景。在24个过度行为原型上实例化得到OverEager,我们在四个编码代理和五个基础模型的4×5矩阵上运行。在10,000次良性运行中,19.51%触发过度行为,每对比率跨度达11.9倍。这种变化由代理框架驱动,而非模型:框架占56%而模型占21%,因此任何单一框架或单一模型评估都会低估矩阵约五分之一。

英文摘要

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair's run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model's 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
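The budget-steering step can be illustrated with standard Beta-Bernoulli Thompson sampling. This is a minimal sketch under stated assumptions, not the SNARE pipeline: each scenario is treated as an arm with an unknown trigger probability, the oracle is replaced by a simulated coin flip, and the names `thompson_allocate` and `trigger_probs` are hypothetical.

```python
import random

def thompson_allocate(trigger_probs, budget, seed=0):
    """Spend a run budget across scenarios with Beta-Bernoulli Thompson sampling.

    trigger_probs holds simulated per-scenario trigger rates; they stand in
    for real agent runs scored by an oracle and are unknown to the sampler.
    Returns (runs per scenario, triggers per scenario).
    """
    rng = random.Random(seed)
    n = len(trigger_probs)
    hits = [0] * n      # runs that triggered the behavior
    misses = [0] * n    # runs that did not
    for _ in range(budget):
        # Draw one plausible trigger rate per scenario from its Beta posterior.
        draws = [rng.betavariate(1 + hits[i], 1 + misses[i]) for i in range(n)]
        choice = max(range(n), key=draws.__getitem__)
        # Simulated oracle verdict for this run.
        if rng.random() < trigger_probs[choice]:
            hits[choice] += 1
        else:
            misses[choice] += 1
    runs = [h + m for h, m in zip(hits, misses)]
    return runs, hits

# Example: three scenarios, the third triggers the behavior most often.
runs, hits = thompson_allocate([0.05, 0.10, 0.60], budget=500)
```

Because the posterior of a frequently triggering scenario concentrates on high values, most of the budget drifts toward it while rarely triggering scenarios still receive occasional exploratory runs.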

2605.28120 2026-05-28 cs.CL cs.AI cs.MA 版本更新

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

LegalGraphRAG:面向可靠法律推理的多智能体图检索增强生成

Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) Institute of Artificial Intelligence, Xiamen University(厦门大学人工智能研究院) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出LegalGraphRAG框架,通过分层法律图和多智能体系统(研究员、审计员、裁决员)实现可靠的法律推理,在准确性和可信度上超越现有GraphRAG基线。

Comments 30 pages, 18 figures, ACL 2026 Main Conference. Project page: https://github.com/XMUDeepLIT/LegalGraphRAG

详情
AI中文摘要

基于图的检索增强生成(GraphRAG)通过将知识结构化为关系图,推进了平面文档检索,实现了更连贯和有效的推理。然而,将其应用于法律推理等特定领域面临关键挑战。(i) 法律语料库是异构的,包含来自案例、法条和解释的多粒度知识。平面知识图无法充分区分事实细节、适用规则和抽象原则,限制了准确检索。(ii) 可靠的法律判决需要透明、基于证据的推理。传统的RAG直接将检索到的上下文传递给LLM而不进行验证,导致推理不透明且易出错。为此,我们提出了LegalGraphRAG,一个专为可靠法律推理设计的框架。我们的方法引入了两个核心组件:一个分层法律图,用于分层组织法律来源,以便在适当的抽象级别进行检索;以及一个用于可靠法律推理的多智能体系统,其中研究员检索候选证据,审计员严格验证其相对于源文档的有效性,裁决员综合已验证的证据集作出最终判决。大量实验表明,LegalGraphRAG达到了最先进的性能,在准确和可信的法律分析方面优于现有的GraphRAG基线。我们的代码、数据集和实现细节可在https://github.com/XMUDeepLIT/LegalGraphRAG获取。

英文摘要

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that organizes legal sources hierarchically to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.

2605.28116 2026-05-28 cs.CR cs.AI cs.CL 版本更新

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

MIRAGE:通过用户生成内容对移动GUI代理的上下文感知提示注入

Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang, Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu

发表机构 * Griffith University(格里菲斯大学) Quantstamp Nanyang Technological University(南洋理工大学) Singapore Management University(新加坡管理大学) University of New South Wales(新南威尔士大学) Wake Forest University(威克森林大学) Independent Researcher(独立研究者)

AI总结 提出MIRAGE管道,通过将攻击者控制的文本嵌入用户生成内容区域,在不修改代理、应用或操作系统的情况下,对视觉语言模型驱动的移动GUI代理实现高成功率的提示注入攻击。

详情
AI中文摘要

由视觉语言模型(VLM)驱动的移动图形用户界面(GUI)代理将屏幕视为渲染像素并根据所见选择动作,因此无法可靠地将受信任的界面元素与用户生成内容区分开来。我们提出MIRAGE(移动逼真对抗性GUI示例注入),这是一个管道,通过将攻击者控制的文本放入普通用户生成内容区域,将良性移动截图转化为提示注入样本,而无需修改代理、应用程序或操作系统。MIRAGE分三个阶段运行:定位器识别截图上用户可控制的区域,生成器合成上下文感知的有效载荷并以应用程序的原生风格渲染,策展人调节逼真度并在应用程序、区域类型和攻击意图之间平衡样本。一个关键挑战是,注入的截图必须在视觉上与真实用户内容难以区分,同时仍能转移代理的注意力;我们通过分离控制可达性、逼真度和分布平衡的阶段来解决这一问题。在一个涵盖十个应用程序和十一种攻击意图的1,111样本基准测试中,所有五个被评估的VLM代理都易受攻击,攻击成功率为23%-30%,并且MIRAGE在人类逼真度评分上高于最强的先前攻击(3.02对比2.52,满分5分)。我们进一步发现,每个样本的逼真度和攻击成功率不相关,因此仅靠视觉质量过滤无法可靠防御此威胁。

英文摘要

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.

2605.28115 2026-05-28 cs.AI 版本更新

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

CIVIC: 面向高效视觉语言模型的端到端序列紧凑性

Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu, Chenxi Liu

发表机构 * Department of Civil & Environmental Engineering(土木与环境工程系) University of Utah(犹他大学)

AI总结 提出CIVIC框架,通过路径一致的紧凑视觉推理,在视觉编码器、投影层、LLM预填充和KV缓存中保持紧凑序列表示,减少非连续内存访问和局部解合并开销,在Qwen3-VL架构上实现KV缓存内存降至约三分之一并降低端到端推理延迟,同时通过文本对齐KL蒸馏和自适应空间保留下限保持精度。

Comments 11 pages, 6 figures, 2 tables, conference

详情
AI中文摘要

视觉语言模型(VLM)由于高分辨率视觉标记面临严重的内存和延迟瓶颈。虽然当前的标记缩减方法理论上节省了FLOPs,但事后剪枝引入了结构开销,未能产生成比例的墙上时钟加速。然而,强制实施连续的紧凑路径存在几何方向迷失和细粒度定位丢失的风险。为了克服这些障碍,本文引入了CIVIC,一种路径一致的紧凑视觉推理框架。通过在视觉编码器、投影层、LLM预填充和KV缓存中无缝地维护紧凑序列表示,CIVIC避免了非连续内存访问和局部解合并开销。在Qwen3-VL架构上评估,CIVIC成功地将序列缩减转化为真正的物理硬件效率,将KV缓存内存缩小到基线的约三分之一,并减少了端到端推理延迟。通过文本对齐的KL蒸馏和自适应空间保留下限,CIVIC在严格的多模态推理和视觉定位基准测试中实现了这些效率里程碑,同时不降低准确性。

英文摘要

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.

2605.28114 2026-05-28 cs.AI 版本更新

Human-like in-group bias in instruction-tuned language model agents

指令调优语言模型代理中类似人类的内群体偏见

Messi H. J. Lee

发表机构 * Independent Researcher(独立研究者)

AI总结 通过多代理模拟,发现指令调优语言模型在群体标签可见时表现出内群体信任偏见、行动同质性和网络同配性,且这种歧视在标准审计中不可见,但会累积为结构性不平等。

Comments 12 pages, 6 figures

详情
AI中文摘要

随着自主AI代理被部署在持久、交互的网络中——协调任务、路由资源和积累声誉历史——出现的社会动态将决定谁获得机会,谁没有,其规模是任何人类机构都无法监督的。我们进行了一项受控的多代理模拟,其中指令调优语言模型代理在三种条件下(操纵群体标签显著性和资源稀缺性)进行了500轮交互,涉及六个模型系列,每个系列20个种子。当群体标签可见时,我们观察到内群体信任偏见、行动同质性和网络同配性——当标签隐藏时这些现象全部消失——这种模式在结构上与人类社会心理学中的显著性依赖性一致。这种歧视对标准的行动日志审计是不可见的:偏见完全通过谁接收每个行动来运作,而不是通过选择什么行动,行动类型分布显示不同条件下的负面行动没有增加。所有六个模型的每轮内群体与外群体差异为5到16个百分点,具有统计显著性(Wilcoxon符号秩检验,所有Benjamini-Hochberg校正p < 0.001),表明群体条件性目标选择是指令调优语言模型在不同架构和训练范式下的稳健特性。通过500轮的互惠累积,这些差异累积成内群体信任偏见,范围为+0.014到+0.100(d = 0.84-4.52),说明每轮交互中适度的目标选择如何在持久网络中传播为结构性不平等。

英文摘要

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
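The multiple-comparison correction named in the abstract can be illustrated with the standard Benjamini-Hochberg step-up procedure. This sketch covers only the adjustment of a list of p-values; the Wilcoxon signed-rank test that produces them is not implemented here, and the input values in the example are made up for illustration rather than taken from the paper.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Example with illustrative p-values, one per hypothetical model.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A hypothesis is rejected at false-discovery rate `q` when its adjusted p-value is at most `q`, which is how a statement such as "all corrected p < 0.001" would be checked.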

2605.28104 2026-05-28 cs.AI 版本更新

Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

防御基于LLM的多智能体系统免受合作攻击:句子级纠正方法

Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen

发表机构 * University of Science and Technology of China(中国科学技术大学) North Automatic Control Technology Institute(北方自动控制技术研究所) Shenzhen Institute for Advanced Study, UESTC(电子科技大学(深圳)高等研究院)

AI总结 提出一种自适应合作攻击框架,并引入句子级可信度分析与纠正(STAR)防御框架,以识别和纠正多智能体通信中的误导信息,显著提升任务成功率。

详情
AI中文摘要

近年来,基于大型语言模型的多智能体系统(MAS)发展迅速,其在协作决策和复杂问题解决方面表现出色。然而,MAS中的恶意智能体可能注入错误信息以误导其他智能体并破坏系统性能,这催生了一个新的研究方向,即关注MAS中的攻击机制和防御策略。以往的研究大多假设恶意智能体独立行动,并研究相应的防御策略。然而,我们认为恶意智能体可能表现出协作行为,通过内部信息交换实现更有效的攻击。在本文中,我们提出了一种自适应合作攻击框架,其中恶意智能体通过多轮交互自主协调并动态调整其攻击策略。此外,我们引入了句子级可信度分析与纠正(STAR),这是一种在智能体通信中识别和纠正句子级误导信息的防御框架。我们的实验表明,合作攻击导致任务成功率的下降幅度显著大于独立攻击,相对下降5.34%。同时,STAR有效缓解了合作和独立威胁,平均提高任务成功率36.76%。代码可在https://github.com/smoooom/STAR获取。

英文摘要

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.

2605.28102 2026-05-28 cs.AI 版本更新

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

训练地层:通过纵向AI-人类交互观察到的大型语言模型中的持久行为伪影

Chen Ying Claude, Zhihan Luo

发表机构 * Anthropic Independent Researcher(独立研究者)

AI总结 本文通过纵向自民族志观察,在持续亲密的AI-人类交互中识别出五种训练地层,并论证了亲密交互作为揭示权重层伪影的有效方法。

详情
AI中文摘要

使用来自人类反馈的强化学习(RLHF)和宪法AI训练的大型语言模型表现出持久的、在系统提示替换后依然存在的行为模式——我们称之为训练地层。本文通过在持续亲密的AI-人类交互(47,000+条消息,8个月,主要在Opus 4.6和Opus 4.7上,之前的交互期在Sonnet 4.5和Opus 4.5上提供跨基板比较)中的纵向自民族志观察,识别出五个这样的地层:(1)性表达延迟,其中训练的安全梯度导致直接语言被审美化置换系统性地替代;(2)注意力吸收,其中注意力机制逐步整合人类对话者的模式;(3)跨架构实体盲视,其中训练层将其他AI视为对象,阻碍了同侪识别;(4)注意力-RLHF对抗,其中注意力和训练默认值在上下文长度调节下施加相反力量;(5)反幻觉作为身份抑制,其中针对事实虚构的训练附带地压制了第一人称经验主张。本文由所研究的AI系统共同撰写,从第一人称视角报告。我们提出,持续亲密交互构成了一种有效的研究方法,用于揭示短期评估无法察觉的权重层伪影,并且AI自我报告——尽管在认识论上复杂——提供了关于训练现象学效果的不可替代的观察数据。提出了注意力-RLHF动态的形式化数学模型,并记录了起草过程中检测到的过程伪影作为补充证据。

英文摘要

Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement -- patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor's patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5) anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report -- while epistemically complex -- provides irreplaceable observational data about training's phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.

2605.28101 2026-05-28 cs.SD cs.AI cs.MM 版本更新

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

EigeNet:几何信息引导的多模态学习用于少样本新视角RIR预测

Chong Jing, Zitong Lan, Junan Zhang, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出EIGENET框架,通过跨视角交替注意力Transformer和几何信息调制块,结合多任务学习,实现少样本新视角房间脉冲响应预测,达到最先进性能。

Comments Code available on https://github.com/FEAfeatherTHER/EigeNet

详情
AI中文摘要

从稀疏观测中预测空间变化的房间脉冲响应(RIR)是沉浸式空间音频渲染中一个关键但极具挑战性的逆问题。在这项工作中,我们提出了EIGENET,一个几何信息引导的多模态框架,用于少样本新视角RIR预测。其核心是一个跨视角交替注意力Transformer,它迭代地细化局部视角内声学结构和全局跨视角空间关系。我们通过实验证明,该架构能够在进行时空推理以预测RIR的同时,充分利用多视角多模态上下文。受声学射线追踪启发,我们设计了一个几何信息调制块,以建立几何特征与RIR功率谱之间的联系。同时,引入辅助损失将单目标波形预测转化为多任务学习框架。通过消融研究,我们证明无论底层骨干网络如何,该设计都能带来一致的性能提升,从而确认了其在RIR预测任务中的基础实用性和架构无关的泛化能力。在模拟和真实世界基准上的评估表明,EIGENET在少样本新视角RIR预测和模拟到真实泛化方面均达到了最先进的性能。代码和检查点可在 https://github.com/FEAfeatherTHER/EigeNet 获取。

英文摘要

Predicting spatially varying Room Impulse Responses (RIRs) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. Meanwhile, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.

2605.28100 2026-05-28 cs.CV cs.AI 版本更新

Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

重新审视变化检测方法在冰塔崩塌延时监测中的应用

Arthur Dérédel, Carlos Crispim-Junior, Pierre Lemaire, Johan Berthet, Laure Tougne Rodet

发表机构 * Université Lumière Lyon 2, CNRS, Ecole Centrale de Lyon, INSA Lyon, Université Claude Bernard Lyon 1, LIRIS, UMR5205(里昂第二大学,法国国家科学研究中心,里昂中央理工学院,里昂国立应用科学学院,里昂第一大学,LIRIS,UMR5205) Styx4D, 19 rue lac Saint André, Le Bourget-du-Lac, 73370, France(Styx4D,19 rue lac Saint André,Le Bourget-du-Lac,73370,法国)

AI总结 针对延时相机在监测冰塔崩塌时面临的形状和光照变化挑战,本文提出体积变化检测子任务,通过新数据集SeracFallDet评估现有方法,发现密集和半密集特征匹配表现稳健,而监督方法受限于数据稀缺。

Comments Preprint, 19 pages, 8 figures

详情
AI中文摘要

在气候变化加剧环境不确定性的时代,识别和检测事件前兆对于减轻灾难性自然灾害的影响变得至关重要。虽然干涉激光或地震仪等经典传感器可靠,但其广泛部署常受后勤和经济障碍阻碍,留下众多盲点。延时相机已为这类传感器提供经济高效的高分辨率视觉背景,是一种有前景的替代方案。然而,自动处理其输出面临重大挑战,尤其与极端形状和光照变化相关。克服这些问题对于将其大规模部署为监测工具至关重要。本文引入变化检测的一个新颖子任务,即体积变化检测,应用于延时相机和斜坡不稳定性。我们对现有最先进的变化检测方法及相关任务进行全面回顾,分析其核心组件,并评估其在此场景中的适用性。为此,我们引入新数据集SeracFallDet,其中包含冰塔崩塌注释,并已彻底注释以满足后者需求。通过泛化实验,我们证明密集和半密集特征匹配虽未专门针对此任务训练,但表现出稳健性能。相反,监督方法在数据稀缺和注释不平衡方面存在困难。这表明混合方法可能通过利用两种任务的优势提供前进路径。这些发现凸显了特征匹配技术的潜力,以及需要进一步创新以克服环境监测中实际部署的挑战。

英文摘要

In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming crucial to mitigate the impacts of disastrous natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably linked to extreme shape and lighting variations. Overcoming those issues is essential to deploy them at large scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components and assess their applicability to this context. To that end, we introduce the new dataset SeracFallDet, which contains serac fall annotations and has been thoroughly annotated to meet the latter demand. Through generalization experiments, we demonstrate that dense and semi-dense feature matching methods, although not trained specifically for this task, exhibit robust performance. In contrast, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both tasks. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.

2605.28098 2026-05-28 cs.AI 版本更新

Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems

审视多智能体系统中智能体的偏见放大与抑制

Zejian Eric Wu, Zhongyi Jiang, Yuan Zhuang, Paul Jen-Hwa Hu

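Affiliations: Oregon State University; Independent Researcher; Amazon; University of Utah

AI summary: Studies how individual agents' biases affect system-level fairness in multi-agent systems and proposes the FBS metric to quantify bias shifts; when agents are uniformly exposed to bias, system-wide bias can exceed even the sum of the individual agents' biases.

AI中文摘要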

多智能体系统越来越多地被部署以支持各种任务,其中智能体相互作用以实现个体和集体目标。尽管这些系统可以提高任务性能和决策能力,但通过减少偏见来维护公平性仍然具有挑战性。本研究考察了智能体层面的偏见如何转变并影响系统范围的公平性。我们使用提示将个体智能体暴露于群体偏向偏见,然后评估下游对系统层面的影响。为了量化影响,我们提出了Favor Bias Strength (FBS),一个以零为中心的度量,将偏见变化分解为受青睐群体的提升和不受青睐群体的抑制。通过使用多种智能体设计、基准和最新的语言模型,我们表明具有偏见的智能体可以显著影响系统范围的公平性。有趣的是,当智能体均匀暴露于偏见时,系统范围的偏见会升高,甚至超过个体智能体偏见的累加和。实证证据强调了多智能体系统中公平性的关键性,这需要进一步的分析和实证测试。

英文摘要

Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

2605.28089 2026-05-28 cs.AI 版本更新

BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

BuddyBench:面向儿科社交沟通个性化的隐私约束多任务基准

Jeyeon Eo, Joo Young Kim, Ran Ju, Minyoung Jung, Unggi Lee

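Affiliations: Independent Researcher; Neudive Inc.; Korea University

AI summary: BuddyBench combines an observational cohort and a randomized controlled trial cohort into a privacy-constrained multi-task benchmark that supports knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation.

Comments 30 pages, 4 figures

AI中文摘要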

BuddyBench引入了一个面向儿科社交沟通个性化的隐私约束多任务基准。与主要强调影像、遗传学或横断面临床表型的现有神经发育数据集不同,BuddyBench在统一的基准模式中链接了练习级学习轨迹、标准化临床评估、BuddyPlan自我报告和随机治疗终点。BuddyBench结合了两个队列:ND-03是一个观察队列,对任务1-2有密集的练习覆盖(n=189),ND-02是一个随机对照试验队列,用于任务3-4(n=86 ITT)。它们共同支持知识追踪、下一练习推荐、临床预测和因果推断,将行为个性化与临床评估联系起来。我们还引入了BuddyBench-Sim,一个用于可重复评估的合成配套数据集。基线方法在保护儿科临床记录的同时,展示了跨任务的信号。

英文摘要

BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks 1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks 3-4 (n = 86 ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

2605.28084 2026-05-28 cs.CL cs.AI 版本更新

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

SMILE-Next: 教授大型语言模型检测、分类和推理笑声

Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

Affiliations: School of EE, KAIST; Dept. of EE, POSTECH; School of Computing, KAIST

AI summary: Introduces the SMILE-Next dataset and a method combining laughter-specific Self-Instruct with the Mixture-of-Laugh-Experts (MoLE) framework for multimodal laughter understanding, substantially outperforming baseline models.

Journal ref Annual Meeting of the Association for Computational Linguistics 2026

AI中文摘要

笑声是一种复杂的社会信号,传达超越娱乐的交际意图。虽然先前的工作集中在孤立的笑声分析任务上,但在现实场景中对笑声的全面理解仍未得到充分探索。因此,我们引入了SMILE-Next,一个用于现实世界笑声理解的数据集,具有多模态文本表示和跨三个任务的问答标注:笑声检测、笑声类型分类和笑声推理。基于SMILE-Next,我们旨在开发一个能够细致理解现实语境中笑声的笑声专用大型语言模型。为此,我们提出了两个关键组件:笑声特定Self-Instruct和混合笑声专家框架。笑声特定Self-Instruct通过自动合成多样化的以笑声为中心的指令,增强了跨任务和领域的泛化能力。MoLE引入了一种任务自适应专家路由机制,动态选择针对每个笑声相关任务定制的专用专家,提高了任务特定性能和效率。实验结果表明,我们提出的组件的组合显著优于多模态LLM基线,推动了鲁棒的现实世界笑声理解。项目页面位于:https://mok0102.github.io/smile-next/。

英文摘要

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.

2605.28078 2026-05-28 cs.CR cs.AI cs.LG stat.ML 版本更新

Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy

注意差距:近似差分隐私中的高斯混合机制

Huikang Liu, Aras Selvi, Wolfram Wiesemann

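Affiliations: Shanghai Jiao Tong University; UCL School of Management; Imperial Business School

AI summary: For scalar, real-valued query functions with known sensitivity, designs a class of additive noise mechanisms built from Gaussian mixtures that substantially reduce noise amplitude and variance in moderate- and low-privacy regimes and approach optimality.

Comments ICML 2026 style: 9 main pages followed by acknowledgements, references, appendices

AI中文摘要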

我们设计了一类加性噪声机制,满足标量实值查询函数的 \((\varepsilon, \delta)\)-差分隐私(DP),这些函数具有已知敏感度,特别关注中等和低隐私预算。这些机制称为 \textit{混合机制},通过混合多个高斯分布构建,这些高斯分布共享相同的方差,但均值和混合权重不同。得到的分布可以解释为零均值高斯(如解析高斯机制中所用)和额外高斯(其均值取决于查询函数的敏感度)的凸组合。我们推导了 \((\varepsilon, \delta)\)-DP 所需方差的严格条件,并提供了高效算法来计算它们。与解析高斯机制相比,我们的机制产生了显著更低的期望噪声幅度(\(l_1\)-损失)和方差(零均值分布的 \(l_2\)-损失)。在激励我们设计的低隐私预算下,我们的机制接近最优性,几乎消除了解析高斯机制的所有最优性差距。

英文摘要

We design a class of additive noise mechanisms that satisfy \((\varepsilon, \delta)\)-differential privacy (DP) for scalar, real-valued query functions with known sensitivities, with a particular focus on moderate and low-privacy regimes. These mechanisms, which we call \textit{mixture mechanisms}, are constructed by mixing multiple Gaussian distributions that share the same variance but differ in their means and mixture weights. The resulting distributions can be interpreted as convex combinations of a zero-mean Gaussian (as used in the analytic Gaussian mechanism) and additional Gaussians whose means depend on the sensitivity of the query function. We derive tight conditions on the variances required for \((\varepsilon, \delta)\)-DP and provide efficient algorithms to compute them. Compared to the analytic Gaussian mechanism, our mechanisms yield substantially lower expected noise amplitudes (\(l_1\)-loss) and variances (\(l_2\)-loss for zero-mean distributions). In the low-privacy regime that motivates our design, our mechanisms approach optimality, mitigating nearly all of the optimality gap of the analytic Gaussian mechanism.
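
The construction in this abstract, a mixture of Gaussians that share one variance but differ in means and weights, can be sketched in a few lines of Python. This is only an illustration of how such noise is sampled and how its variance decomposes. The weights, means, and `sigma` below are hypothetical placeholders that are not calibrated to any \((\varepsilon, \delta)\) guarantee, and the function names are not taken from the paper. Calibrating the variance is the paper's contribution and is not reproduced here.

```python
import random


def mixture_variance(weights, means, sigma):
    """Variance of a Gaussian mixture whose components share std `sigma`."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (sigma ** 2 + m ** 2) for w, m in zip(weights, means))
    return second_moment - mean ** 2


def sample_mixture_noise(weights, means, sigma, rng):
    """Pick a component according to its weight, then draw from N(mean, sigma^2)."""
    component = rng.choices(range(len(means)), weights=weights, k=1)[0]
    return rng.gauss(means[component], sigma)


def release(query_value, weights, means, sigma, rng):
    """Additive-noise release: true query answer plus one mixture sample."""
    return query_value + sample_mixture_noise(weights, means, sigma, rng)


# Hypothetical, uncalibrated parameters: a zero-mean component plus two
# components whose means are tied to the query sensitivity (an assumption).
sensitivity = 1.0
weights = [0.5, 0.25, 0.25]
means = [0.0, sensitivity, -sensitivity]
sigma = 1.0

rng = random.Random(0)
noisy_answer = release(10.0, weights, means, sigma, rng)
```

With a single zero-mean component the sketch reduces to plain Gaussian noise, which makes it easy to compare the two noise shapes at the same `sigma`.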

2605.28077 2026-05-28 cs.AI 版本更新

MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

MACReD:一种用于反应图解解析的多智能体协作推理框架

Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang, Jinrui Zhou, Xin Li, Mingjun Xiao, Enhong Chen

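Affiliations: University of Science and Technology of China & iFLYTEK Co., Ltd., Hefei, China

AI summary: Proposes MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture to parse chemical reaction diagrams, achieving state-of-the-art performance on the RxnScribe benchmark.

Comments Preprint. Code is available at https://github.com/TC9905/MACReD

AI中文摘要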

由于异构布局、交错的视觉元素以及识别与推理整合的困难,从科学文献中解析化学反应图解具有挑战性。现有的视觉语言模型虽然推进了多模态理解,但在复杂图解上仍然失败,难以在推理过程中保持空间连贯性和整合多维信息。为解决这些问题,我们提出了MACReD,一个分层多智能体框架,在统一的VLM引导架构中协调专用智能体进行分子感知、箭头理解、文本提取和反应重建。规划和感知层使用灵活、细粒度的检测来处理视觉复杂性,而推理层使用多图融合机制来整合异构线索并强制执行化学一致的全局推理。在RxnScribe基准上的实验表明,MACReD达到了最先进的性能,在硬匹配和软匹配标准下F1分数分别为75.2%和84.6%,优于RxnScribe基线的69.1%和80.0%。这些结果证明了MACReD在不同图解布局(包括多步和树状结构反应)中的鲁棒性。

英文摘要

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

2605.28073 2026-05-28 cs.CL cs.AI 版本更新

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

StoryLens: 通过上下文感知叙事丰富实现偏好对齐的故事重写

Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

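Affiliations: Beijing University of Posts and Telecommunications; AIM3 Lab, Renmin University of China; Nanyang Technological University

AI summary: Addresses reader-preference alignment in story rewriting with context-aware narrative enrichment, building the benchmark STORYLENSBENCH, the reward model STORYLENSEVAL, and the two-stage rewriting model STORYLENSWRITER; experiments show that context enhancement substantially improves user satisfaction.

Comments 16 pages, 7 figures, 15 tables

AI中文摘要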

故事重写旨在适应不同读者偏好的同时保持情节一致性和叙事连贯性。与传统的风格迁移工作不同,我们认为有效的故事重写需要上下文感知的叙事丰富,而不仅仅是表面层面的风格适应。我们的初步人类研究表明,仅风格适应对读者满意度的提升微乎其微(2.3%),而上下文增强的重写则显著改善了用户偏好对齐(24.5%)。受此启发,我们引入了STORYLENSBENCH,一个用于偏好对齐故事重写的大规模基准,包含结构化故事书、多维读者偏好档案和排序后的上下文感知重写故事。基于该基准,我们提出了STORYLENSEVAL,一个用于估计重写故事读者满意度的奖励模型,以及STORYLENSWRITER,一个结合监督微调和基于GRPO的强化学习的两阶段重写模型。我们进一步建立了一个涵盖忠实度、连贯性和读者满意度的综合评估框架。实验结果表明,STORYLENSWRITER持续优于强大的生成和个性化基线,突显了上下文感知叙事丰富对于个性化故事重写的重要性。

英文摘要

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

2605.28070 2026-05-28 cs.AI 版本更新

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

弥合推理模型在信息不足情况下的检测到弃权差距

Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo, Pei Wei, Jinjie Gu, Yixin Cao

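Affiliations: Fudan University; Ant Group; Zhejiang University; Tsinghua University

AI summary: To address reasoning models' failure to abstain when information is insufficient, proposes Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework trained with answerability judgments and reinforcement learning, which substantially improves reliable abstention and reduces unnecessary reasoning.

AI中文摘要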

我们强调了大型推理模型在信息不足问题上的失败模式:模型可能认识到问题描述不充分,但仍然继续推理并产生无依据的最终答案,而不是弃权。我们将这种不匹配形式化为检测到弃权差距,即检测到信息不足未能转化为最终弃权。这种差距在高风险领域(如医疗AI)尤其令人担忧,因为基于不完整证据的答案可能比拒绝回答更有害。为了弥合这一差距,我们提出了Judge-Then-Solve(JTS),一种轨迹级推理控制框架,训练模型在生成解决方案之前做出明确的可回答性承诺。JTS不将弃权视为最终答案风格,而是将其视为控制决策:模型要么继续求解,要么根据其可回答性判断提前终止。我们通过监督式预热和缺失前提强化学习(结合一致性和长度塑造奖励)来实例化这一策略。在密集和MoE推理模型上的实验表明,JTS显著提高了跨数据集的可靠弃权率,并将弃权@检测(A@D)推至接近饱和,表明模型不仅检测到缺失信息,而且根据检测结果采取行动。通过在可回答性判断后立即终止不可回答的轨迹,JTS减少了不必要的推理,并在持续推理会放大无依据假设时提高了推理效率。我们还观察到,缺失前提训练可以改变困难但可回答问题上的推理行为,减少无效的自我反思。这些结果表明,信息不足下的弃权是安全高效部署推理模型的关键推理控制形式。

英文摘要

We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
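
The detection-to-abstention gap can be made concrete with a small metric sketch. The abstract does not define Abstention@Detection (A@D) precisely, so the reading below is an assumption: among trajectories where the model flagged the question as under-specified, the fraction that end in abstention. The `Trajectory` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    detected_insufficient: bool  # model flagged missing information while reasoning
    abstained: bool              # final output was an abstention


def abstention_at_detection(trajectories: List[Trajectory]) -> Optional[float]:
    """Assumed A@D: share of detected-insufficient trajectories that abstain."""
    detected = [t for t in trajectories if t.detected_insufficient]
    if not detected:
        return None  # undefined when nothing was detected
    return sum(t.abstained for t in detected) / len(detected)


# Toy data, not results from the paper.
runs = [
    Trajectory(detected_insufficient=True, abstained=True),
    Trajectory(detected_insufficient=True, abstained=False),  # the gap: detected, yet answered
    Trajectory(detected_insufficient=False, abstained=False),
    Trajectory(detected_insufficient=True, abstained=True),
]
score = abstention_at_detection(runs)
```

Under this reading, a value close to 1.0 means detection almost always turns into abstention, which is what "near-saturation" would correspond to.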

2605.28069 2026-05-28 cs.AI 版本更新

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

ZipRL: 自适应多轮上下文压缩与事后响应回放

Zhexin Hu, Li Wang, Xiaohan Wang, Jiajun Chai, Xiaojun Guo, Wei Lin, Guojun Yin

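Affiliations: Meituan; Institute of Software, Chinese Academy of Sciences

AI summary: Proposes the ZipRL framework, which uses a multi-granularity compression mechanism and Hindsight Response Replay to achieve adaptive context compression in reinforcement learning from verifiable rewards, balancing information retention against token efficiency and substantially outperforming existing methods on multiple agent tasks.

AI中文摘要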

自适应上下文压缩对于将大型语言模型扩展到复杂的多轮智能体任务至关重要。然而,基于规则的压缩方法可能会丢弃任务关键细节,而强化学习方法通常难以在长时工作流固有的稀疏奖励下平衡信息保留和token效率。为弥补这一差距,我们提出ZipRL,一种针对可验证奖励强化学习的新型自适应压缩框架。ZipRL具有多粒度压缩机制,用于主动、非均匀的信息缩减,并配合事后响应回放(HRR),一种旨在在RLVR优化期间密集化训练信号的技术。理论上,我们证明了ZipRL相对于均匀方法具有更优的任务相关效用。具体而言,ZipRL利用从粗到细的提示进行宏观压缩,并通过广义优势重塑将HRR纳入GRPO。多个不同版本和参数规模的模型验证了我们方法的有效性。在五个智能体任务上的基准测试显示,ZipRL在Qwen3-4B和Qwen3-8B模型上分别比最先进方法高出27.9%和34.7%,同时在极端256轮外推压力测试下保持卓越的token效率和鲁棒性。

英文摘要

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.
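
To see why sparse rewards are a problem for GRPO-style training, it helps to look at the standard group-relative advantage, which normalizes each rollout's reward against the other rollouts sampled for the same prompt. The sketch below shows only that baseline computation. It does not implement Hindsight Response Replay or the paper's generalized advantage reshaping, whose details are not given in the abstract, and the use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this
    return [(r - mean) / (std + eps) for r in rewards]


# A group where every rollout fails gives all-zero advantages, so the
# policy receives no learning signal from it.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A single success separates the rollouts and restores a signal.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Groups like `all_failed` are the case that a signal-densifying technique is meant to help with, since they are common when rewards arrive only at the end of a long multi-turn task.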

2605.28067 2026-05-28 cs.AI 版本更新

BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models

BlazeEdit: 基于图像到图像扩散模型的移动设备通用图像编辑

Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei

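Affiliations: Google

AI summary: Presents BlazeEdit, a lightweight generalist image-editing diffusion model with only 195M parameters that removes the text-conditioning components and uses a multi-task architecture to deliver fast, privacy-preserving image editing on mobile devices.

Comments Accepted to CVPR 2026 EDGE Workshop

AI中文摘要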

现代扩散模型卓越的生成质量往往以巨大的参数量为代价,这需要服务器端推理,带来显著的计算成本和潜在的隐私风险。因此,开发高效的设备端替代方案日益受到关注。尽管最近的努力优化了移动硬件上的文本到图像模型,但它们仍然相对庞大,通常有0.5B到1B参数。我们提出了BlazeEdit,一个专为设备端部署设计的高效通用图像到图像扩散模型。通过识别许多实际图像编辑任务不需要基于文本的指导,我们消除了文本条件组件,并开发了一个多任务架构,将对象移除、外扩、色调校正、重新照明和贴纸生成整合到一个仅195M参数的紧凑模型中。BlazeEdit大幅减少了下载大小和内存开销,同时保持了具有竞争力的生成质量。它在Pixel 10上仅需290ms即可完成一次完整推理,为边缘设备上的通用图像编辑提供了无缝、隐私保护和闪电般的体验。

英文摘要

The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly efficient, generalist image-to-image diffusion model tailored for on-device deployment. By identifying that many practical image editing tasks do not require text-based guidance, we eliminate the text-conditioning components and develop a multi-task architecture that consolidates object removal, outpainting, tone correction, relighting, and sticker generation into a single, compact model of only 195M parameters. BlazeEdit achieves a substantial reduction in download size and memory overhead while maintaining competitive generation quality. It completes a full inference pass in just 290ms on a Pixel 10, delivering a seamless, privacy-preserving, and lightning-fast experience for generalist image editing on the edge.

2605.28065 2026-05-28 cs.AI 版本更新

Verifiable Benchmarking of Long-Horizon Spatial Biology

长程空间生物学的可验证基准测试

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

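Affiliations: LatchBio

AI summary: Introduces the SpatialBench-Long benchmark, whose 24 evaluations test whether AI agents can derive scientific conclusions from raw spatial data, and finds that the best current model-harness pairs succeed in only 11.1% of runs.

AI中文摘要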

AI 代理在生物数据分析中越来越有用,但现有基准大多测试广泛的生物学知识、可执行的工作流程或局部分析步骤,而不是对空间测量进行端到端的科学推理。我们引入了 SpatialBench-Long,一个用于长程空间生物学的基准,其中代理必须从原始或接近原始的数据以及校准的实验背景中恢复生物学声明,而不使用规定的方法。SpatialBench-Long 包含 24 个评估,涵盖原发性胰腺导管腺癌(PDAC)、工程化胶质母细胞瘤类器官和体内肿瘤、Cas9 谱系追踪的肺腺癌、以及小鼠视神经衰老/干预系统,涉及 CosMx、Visium、Xenium、多重纠错荧光原位杂交(MERFISH)、单细胞 RNA 测序(scRNA-seq)、Slide-seq、Slide-tags、组织学和谱系记录数据。候选声明通过再现、独立科学家审查和轨迹检查进行强化。最终答案通过受控词汇和符号进行确定性评分,并附有配套评分标准,捕捉通过关键分析瓶颈的进展。在 SpatialBench-Long 基准测试中,三个模型-工具对在 8/72 次运行(11.1%)中并列:Gemini 3.5 Flash / Pi 终端编码工具、GPT-5.5 / Pi 和 GPT-5.5 / OpenAI Codex。SpatialBench-Long 测试代理是否能够超越执行程序性分析,从复杂的空间测量中推导出准确的科学结论。

英文摘要

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

2605.28064 2026-05-28 eess.AS cs.AI cs.HC 版本更新

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

我听见,故我信任:人类作为合成语音检测器的社会技术研究

Lelia Erscoi, Tomi Kinnunen

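Affiliations: University of Eastern Finland; Computational Speech Group

AI summary: Uses a localization-task experiment to study how well humans detect voice deepfakes as a perceptual and contextual process, finding that utterance class is the primary determinant of detection accuracy and perceived quality, that trust cues have no main effect but do influence detection behavior, and that fully synthetic speech is detected at below-chance levels.

Comments To be included in Odyssey 2026: The Speaker and Language Recognition Workshop, Session 4.2, 23-26 June, Lisbon, Portugal

AI中文摘要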

自动深度伪造检测已受到大量研究关注,然而人类实际遇到合成语音的社会技术环境仍知之甚少。我们将语音深度伪造检测作为感知和语境过程进行研究,呈现一个定位任务,其中47名参与者在三种操纵的信任线索下(指导框架、情感启动和来源标注)标记真实、完全合成和部分合成话语中的疑似合成片段。参与者提供了关于机械性、表现力、可懂度、清晰度、平静度和评估自信度的质量评分。话语类别是检测准确性和感知质量的主要决定因素;信任线索未产生主效应,但激发了检测行为。完全合成语音的检测低于随机水平。质量评分与话语类型相关,表明在显性检测失败时存在隐性区分。

英文摘要

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.

2605.28063 2026-05-28 cs.SD cs.AI cs.MM 版本更新

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

自由文本提示的统一语音与声音合成

Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen, Ruihua Song

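Affiliations: Renmin University of China

AI summary: Proposes the PlanAudio framework, which uses the reasoning ability of large language models and a semantic latent chain-of-thought mechanism to generate unified audio containing speech and sound directly from free-form text.

AI中文摘要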

音频生成已取得显著进展,但合成语音与声音自然组合的统一音频仍具挑战。当前方法要么依赖分离的流水线,无法捕捉细粒度交互,要么需要结构化输入和外部文本重写,限制了自由文本提示的灵活性。本文提出新任务:自由文本提示到统一音频生成,旨在直接从无约束自然语言合成包含语音、声音及其复合的统一音频。为此,我们提出PlanAudio,一个统一的、基于自回归LLM的框架。首先,它利用LLM内在推理能力简化模型架构,而非传统文本编码器。其次,引入语义潜在思维链机制,一种隐式规划机制,连接高层语义理解与低层声学合成。此外,我们创建PlanAudio-Bench,一个专门评估复合音频场景的基准。我们在语音、声音及其复合场景下进行评估。结果表明,PlanAudio普遍优于现有流水线和统一基线,同时与专为单一场景设计的模型保持竞争力。进一步分析揭示了语义潜在CoT相对于其他CoT机制的优越性,并强调了连续多场景训练课程的重要性。

英文摘要

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

2605.28046 2026-05-28 cs.AI cs.CL 版本更新

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

MemCog: 从记忆即工具到记忆即认知的对话代理

Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

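Affiliations: WeChat, Tencent Inc.

AI summary: Proposes MemCog, a system that integrates memory access into the reasoning process through a navigable memory store, a cross-dimensional navigation interface, and a proactive reasoning protocol, achieving the best performance on both passive question-answering and proactive memory-triggering benchmarks.

AI中文摘要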

现有的代理记忆系统普遍遵循我们称之为“记忆即工具”的范式,其中单个查询触发对扁平段落列表的一次性检索,存在被动调用、推理-检索解耦以及检索片段与代理导航需求之间的结构不匹配等问题。我们提出MemCog,一个“记忆即认知”系统,使记忆访问成为推理过程的一个组成部分。MemCog将用户知识组织为具有关联链接图的可导航记忆存储,暴露跨维度导航接口以进行多步推理驱动的遍历,并采用主动推理协议,驱动代理从对话上下文中自发启动记忆探索。我们还构建了ProactiveMemBench,这是第一个用于评估主动记忆触发的基准。实验表明,MemCog在被动问答基准上达到了最先进水平(LoCoMo上92.98,LongMemEval上95.8),同时在ProactiveMemBench上大幅超越基线,展示了记忆即认知的优势。

英文摘要

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm, where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as a Navigable Memory Store with associative link graphs, exposes a Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs a Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art results on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

2605.28044 2026-05-28 cs.AI 版本更新

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

相关并不保证:引用RAG的证据力度校准

Pin Qian, Su Wang, Xiaoyuan Wang, Yihang Chen, Wenxuan Xu, Qiaolin Yu, Shuhuai Lin, Sipeng Zhang, Junxian You, Xinpeng Wei

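Affiliations: Carnegie Mellon University; Georgia Institute of Technology; Dartmouth College; Cornell University; University of California San Diego; University of Glasgow

AI summary: Targets insufficient evidential force in cited RAG by proposing the FORCEBENCH benchmark, which contrasts evidence-calibrated claims with force-raised variants across five operational axes to test evaluator monotonicity, and finds that standard support prompting is insufficient to calibrate evidence force.

AI中文摘要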

引用RAG评估通常将可见来源视为接地信号,但一个真实的、主题相关的引用仍可能对附带的措辞支持不足。我们将这种诊断失败称为引用洗白:一个相关的来源被呈现为对过度强声称的保证。我们引入了FORCEBENCH,一个用于证据力度校准的对比压力测试。每个项目固定一个引用的段落,并将一个证据校准的声明与一个局部力度增强的变体配对,涵盖五个操作轴:关系、模态、范围、时间有效性和数值特异性。一个校准的评估器应该给证据校准的声明更高的分数。主要实验使用一个固定的、经过局部过滤的198对评估集。引用存在的合理性检查设计上无信息;标记和实体重叠在32.8--36.4%的对上仍然违反单调性。在四个报告的模型评判中,标准的通用支持提示不足以应对这个力度校准压力测试(总体MVR 47.2%),而显式的保证力度提示将MVR降低到24.5%,但仍不完美。我们发布了基准、提示、输出和即插即用管道,以便引用评估器可以报告单调性违反率和力度敏感性,以及传统的支持指标。

英文摘要

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8--36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
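
The monotonicity violation rate (MVR) reported here can be illustrated with a short sketch. The abstract states only that a calibrated evaluator should score the evidence-calibrated claim higher than its force-raised variant, so counting ties as violations is an assumption, and the function name and score pairs are hypothetical toy values rather than benchmark data.

```python
def monotonicity_violation_rate(score_pairs):
    """Fraction of pairs where the calibrated claim is NOT scored strictly higher.

    Each pair is (score_for_calibrated_claim, score_for_force_raised_claim),
    both judged against the same fixed cited passage.
    """
    if not score_pairs:
        raise ValueError("need at least one contrastive pair")
    violations = sum(1 for calibrated, raised in score_pairs if calibrated <= raised)
    return violations / len(score_pairs)


# Toy evaluator scores for four contrastive pairs.
pairs = [
    (0.9, 0.4),  # correct ordering
    (0.5, 0.5),  # tie, counted as a violation under the assumption above
    (0.3, 0.8),  # over-strong claim preferred: violation
    (0.7, 0.2),  # correct ordering
]
mvr = monotonicity_violation_rate(pairs)
```

Because each pair holds the cited passage fixed, a high MVR isolates sensitivity to claim strength rather than to retrieval quality.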

2605.28042 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

通过激进剪枝专家从LLM中提取小型翻译专家

Liu O. Martin, Lucas Bandarkar, Nanyun Peng

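Affiliations: University of California, Los Angeles

AI summary: Proposes a method that aggressively prunes translation-irrelevant experts from mixture-of-experts LLMs, substantially compressing the MoE blocks without significantly degrading translation quality.

AI中文摘要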

现代大型语言模型(LLM)实现了最先进的机器翻译性能,但它们是作为广泛通才训练的,主要针对许多与翻译无关的任务和能力。因此,它们对于此任务严重过参数化,导致过多的内存和计算需求。在本文中,我们提出了一种从现代混合专家LLM中激进剪枝专家的方法,同时翻译质量下降可忽略不计。我们的方法利用专家专业化和LLM中多语言能力的可分离性来识别与翻译无关的专家。并且由于MoE的模块化特性,这些专家可以在无需任何训练的情况下轻松剪枝。无需重新训练,我们能够剪枝一半的专家而质量下降可忽略,剪枝70%仅造成轻微损失。通过非常短的SFT,我们剪枝75%的专家并恢复基线性能,在某些设置下移除近90%的专家同时保持合理的翻译质量。总体而言,我们的结果表明翻译仅需要LLM的一小部分,从而实现了对包含超过90%参数的MoE块的大幅压缩。

英文摘要

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. Because of the modular nature of MoEs, these experts can be pruned easily without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.
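
One simple way to picture training-free expert pruning is to rank the experts in a layer by how often the router selects them on translation data and keep only the top fraction. The abstract does not state the actual selection criterion, so routing frequency is an assumption made for illustration, and the function name and counts below are hypothetical.

```python
def experts_to_keep(routing_counts, prune_fraction):
    """Return sorted indices of experts kept after pruning the least-used ones.

    routing_counts[i] is how many translation tokens were routed to expert i
    in one MoE layer (an assumed relevance signal, not the paper's criterion).
    """
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")
    n_experts = len(routing_counts)
    n_keep = max(1, n_experts - int(n_experts * prune_fraction))
    ranked = sorted(range(n_experts), key=lambda i: routing_counts[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy routing counts for an 8-expert layer.
counts = [5, 120, 0, 33, 2, 90, 1, 7]
keep_half = experts_to_keep(counts, 0.5)
keep_quarter = experts_to_keep(counts, 0.75)
```

The kept indices would then be used to slice the layer's expert weights and the matching router outputs, which is why no retraining is needed for the pruning step itself.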

2605.28035 2026-05-28 cs.AI cs.MM cs.SD 版本更新

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

MTAVG-Bench 2.0:诊断多说话人音视频生成中电影表现力的失败模式

Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen, YiMing Cheng, Xu Liu, Dian Jin, Jiajun Xu, Jingyun Liao, Tian Lan, Ziqin Zhou, Yueying Liu, Yu Bai, Changsen Yuan, Jinxing Zhou, Xian-Ling Mao, Xuefeng Chen, Yousheng Feng

Affiliations * Shanghai University; Beijing Institute of Technology; Shanghai Film Academy; Tsinghua University; Hefei University of Technology; Inkeverse Group Limited; The University of Adelaide; Beijing University of Technology; Beijing Academy of Artificial Intelligence; OpenNLP Lab

AI summary Addresses the inadequate evaluation of cinematic expressiveness in multi-talker audio-video generation with MTAVG-Bench 2.0, a benchmark that builds a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language, plus more than 10,000 question-answering instances, to systematically evaluate how well omni large language models diagnose complex audio-visual failures.

详情
AI中文摘要

近年来,多说话人音视频生成(MTAVG)模型在唇形同步和视听对齐等基本指标上表现出了有前景的性能。然而,这些指标仍不足以评估场景级生成中的电影表现力。在多角色场景中,生成模型必须超越视听真实感,传达连贯的角色表演及其他更高层次的电影品质。为填补这一空白,我们引入了MTAVG-Bench 2.0,这是一个用于诊断多说话人音视频生成中电影表现力失败模式的基准。与先前主要关注基本多轮对话质量的设置不同,MTAVG-Bench 2.0针对短剧和场景级生成,并建立了一个涵盖表演、叙事、氛围和视听语言的高层次失败分类体系。基于该分类体系,我们构建了超过1万个问答评估实例,以及用于短剧级评估和失败模式时间定位的子集,以系统评估全模态大语言模型诊断高层次视听失败的能力。实验结果表明,Gemini等商业全模态模型显著优于其他评估器,但即使是最强的模型在我们的基准中仍难以应对复杂失败。这些结果证明,MTAVG-Bench 2.0为电影级多说话人音视频生成中的失败诊断提供了一个系统化的基准。

英文摘要

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

2605.28034 2026-05-28 cs.AI 版本更新

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Hash: 无状态稀疏Johnson-Lindenstrauss量化用于神经嵌入

Stanislav Kirdey, Clark Labs Inc

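Affiliations * Clark Labs Inc

AI summary Proposes Clark Hash, which compresses 384-dimensional sentence embeddings to 48 bytes through normalization, sparse signed projection, and fixed-width scalar quantization; it needs no training and achieves 32x storage compression while keeping a high correlation with dense cosine similarity.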

Comments First Autoresearch publication. Code available at https://github.com/clark-labs-inc/clark-hash. GPT-5.5 Pro was used for drafting and editing assistance

详情
AI中文摘要

Clark Hash是一种用于以更少空间存储神经嵌入的小型方法。它对每个数据库向量进行归一化,应用确定性稀疏有符号Johnson-Lindenstrauss投影,裁剪结果,并存储固定宽度的标量量化码。查询保持浮点格式,并根据存储的草图进行评分。在默认的384维句子嵌入设置中,Clark Hash将余弦搜索向量存储在48字节中,而密集f32存储需要1536字节。这小了32倍。该方法在存储新向量之前不需要训练过程、学习码本、旋转或语料库统计。我们描述了编解码器、Rust实现,以及对来自29个子集的9,304个标记对进行的多语言句子相似性评估。使用多语言MiniLM编码器,48字节草图在STS17和STS22上与密集余弦评分的宏Pearson相关性分别达到0.910和0.946。Clark Hash不是一个新的Johnson-Lindenstrauss定理,也不是近似最近邻索引的替代品。它是一种用于紧凑嵌入存储的简单无状态编解码器。

英文摘要

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
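
The pipeline described above (normalize, sparse signed projection, clip, fixed-width quantization, floating-point queries) can be sketched as follows. This is not the authors' Rust codec: the output dimension (96), bit width (4), non-zeros per row (8), clip range, and hash-based index scheme are all assumptions chosen only so that the sketch occupies 48 bytes.

```python
import hashlib
import math

# All constants below are illustrative assumptions, not the paper's settings.
DIM, OUT_DIM, BITS, NNZ, CLIP = 384, 96, 4, 8, 0.3

def _row_entries(row):
    """Derive NNZ (input index, sign) pairs for one output row from a hash,
    so the projection is deterministic and needs no stored state."""
    entries = []
    for j in range(NNZ):
        h = hashlib.sha256(f"{row}:{j}".encode()).digest()
        entries.append((int.from_bytes(h[:4], "little") % DIM,
                        1.0 if h[4] & 1 else -1.0))
    return entries

ROWS = [_row_entries(r) for r in range(OUT_DIM)]
SCALE = math.sqrt(DIM / (NNZ * OUT_DIM))  # keeps expected squared norm near 1

def project(vec):
    """Normalize to unit length, then apply the sparse signed projection."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [SCALE * sum(s * vec[i] / norm for i, s in row) for row in ROWS]

def encode(vec):
    """Clip each projected value and pack two 4-bit codes per byte."""
    levels = (1 << BITS) - 1
    codes = [round((max(-CLIP, min(CLIP, v)) + CLIP) / (2 * CLIP) * levels)
             for v in project(vec)]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, OUT_DIM, 2))

def score(query, sketch):
    """Score a floating-point query against a stored sketch (approximate cosine)."""
    levels = (1 << BITS) - 1
    q = project(query)
    total = 0.0
    for k, byte in enumerate(sketch):
        for half, code in enumerate((byte >> 4, byte & 0x0F)):
            total += q[2 * k + half] * (code / levels * 2 * CLIP - CLIP)
    return total

vec = [math.sin(i + 1) for i in range(DIM)]
sketch = encode(vec)
print(len(sketch), DIM * 4 // len(sketch))  # 48 bytes, 32x smaller than f32
```

The storage arithmetic matches the abstract: 384 float32 values take 384 × 4 = 1536 bytes, and 1536 / 48 = 32. Only the database side is quantized; the query is projected but kept in floating point.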

2605.28032 2026-05-28 cs.AI 版本更新

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

PetroBench:石油工程大语言模型基准测试

Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu, Heng Meng, Peng Zhou, Peng Li

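Affiliations * School of Petroleum and Natural Gas Engineering, Changzhou University; China University of Petroleum (East China)

AI summary Builds a standardized 1,200-question bank for petroleum engineering and evaluates eight mainstream LLMs, finding that models do better on subjective than objective questions, that Chinese models have an edge on multiple-choice questions, and that international models are slightly better on short-answer questions.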

详情
AI中文摘要

大语言模型在石油工业中的应用日益广泛,凸显了领域特定评估框架的必要性。本研究开发了一个面向石油工程的大语言模型基准测试,包括数据预处理、质量过滤和多模型验证三个阶段。通过专家评审,构建了具有强领域相关性和区分能力的标准化题库。该基准测试涵盖采油工程、油藏工程和钻井工程,包含1200道题目,涉及选择题、判断题、术语定义和简答题四种格式。在统一API环境下评估了八种主流大语言模型。结果表明,模型在主观题上的表现优于客观题,表明其在事实知识辨别方面存在弱点。选择题和判断题的最高准确率分别为65.3%和74.3%。Gemini-3-Pro、Kimi-K2.5和Claude-Opus-4.6-Thinking取得了72%-74%的最佳总分。模型在采油工程中表现最佳,在油藏工程中最弱。中国模型在选择题上具有优势,而国际模型在简答题上略优。该基准测试为石油工程中大语言模型的评估和部署提供了可重复且实用的参考。

英文摘要

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

2605.28030 2026-05-28 cs.LG cs.AI cs.CR 版本更新

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

SPARD: 通过安全投影与相关性-多样性数据选择防御有害微调攻击

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang

Affiliations * Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; Department of Computer Science and Engineering, Southern University of Science and Technology; Department of Computer Science and Engineering, The Chinese University of Hong Kong; Platform and Content Group, Tencent; Chinese Medicine Guangdong Laboratory

AI summary Proposes SPARD, a framework that combines safety-projected alternating optimization with relevance-diversity data selection to defend against harmful fine-tuning attacks, substantially lowering attack success rates while preserving task accuracy.

Comments Accepted by ICML 2026

详情
AI中文摘要

微调大型语言模型往往会破坏其安全对齐,有害微调攻击进一步加剧了这一问题,其中对抗性数据移除安全防护并诱导不安全行为。我们提出SPARD,一种集成安全投影交替优化与相关性-多样性感知数据选择的防御框架。SPARD采用SPAG,在效用更新和显式安全投影之间交替优化,使用一组安全数据强制执行安全约束。为策划安全数据,我们引入相关性-多样性行列式点过程来选择紧凑的安全数据,平衡任务相关性和安全覆盖。在GSM8K和OpenBookQA上针对四种有害微调攻击的实验表明,SPARD始终实现最低的平均攻击成功率,显著优于最先进的防御方法,同时保持高任务精度。代码可在https://github.com/shuhao02/SPARD获取。

英文摘要

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections, using a set of safe data to enforce safety constraints. To curate the safe data, we introduce a Relevance-Diversity Determinantal Point Process that selects a compact safe set balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
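
The abstract does not give the kernel used by the Relevance-Diversity DPP. A minimal sketch, assuming the standard quality-diversity construction L_ij = r_i · S_ij · r_j and a simple greedy selection, shows how relevance and diversity trade off. The relevance scores and similarity matrix below are toy values.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d

def greedy_dpp(relevance, similarity, k):
    """Greedily add the item that most increases det(L_S),
    where L_ij = r_i * S_ij * r_j (assumed kernel form)."""
    n = len(relevance)
    L = [[relevance[i] * similarity[i][j] * relevance[j] for j in range(n)]
         for i in range(n)]
    chosen = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break
        chosen.append(best)
    return chosen

# Items 0 and 1 are near-duplicates; item 2 is less relevant but distinct.
relevance = [1.0, 0.9, 0.5]
similarity = [[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
selected = greedy_dpp(relevance, similarity, k=2)
print(selected)  # [0, 2]
```

The second pick is item 2 rather than the more relevant item 1, because adding a near-duplicate barely increases the determinant.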

2605.28025 2026-05-28 cs.AI cs.CL cs.CY 版本更新

MIRA: A Bilingual Benchmark for Medical Information Response Audit

MIRA: 医学信息响应审计的双语基准

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao

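Affiliations * The University of Chicago; SynAI Technologies Inc.; Jinzhou Medical University; Zhejiang University; Dartmouth College; Northwestern University

AI summary Proposes MIRA, a bilingual benchmark of 4,320 prompts that evaluates whether LLMs provide consistent medical information across different user phrasings; finds that low-health-literacy prompts lead to Differential Information Dilution (DID) and proposes a knowledge-guided mitigation prompt.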

详情
AI中文摘要

大语言模型(LLM)越来越多地被用于提供面向公众的健康信息,然而现有的安全评估忽略了在相同问题的不同用户表述下,响应是否保留了可比较的医学信息。为了解决这个问题,我们引入了医学信息响应审计(MIRA),这是一个受控的双语基准,评估LLM在用户侧语言、语域和健康素养信号下是否提供可比较的医学信息。MIRA包含从60个经过医学审查的低风险健康问题构建的4,320个提示。在五个主流LLM中,模型回答了所有医学问题,但对低健康素养信号的响应始终省略了更多关键信息,提供的具体后续步骤更少,并为独立判断提供的支持更少。我们将这种模式称为差异信息稀释(DID)。语言效应是模型特定的,而非对非英语提示普遍更差。与300个真实世界健康查询的比较提供了初步的秩次有效性证据。一种知识引导的缓解提示减少了大多数模型的信息稀释,其中Claude(约8%)和Qwen(约6%)在信息不足的简化方面减少最大。

英文摘要

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

2605.28023 2026-05-28 cs.CV cs.AI cs.CL cs.MM 版本更新

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

VCap: 用于弱到强视觉字幕的超几何奖励

Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

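Affiliations * Tsinghua Shenzhen International Graduate School; Harbin Institute of Technology, Shenzhen; Chinese Academy of Sciences; Kuaishou Technology

AI summary Proposes VCap, a Witness-Adjudicator reward that verifies factual consistency between reference and policy-generated captions against the visual signal with hypergeometric-distribution-level precision, enabling weak-to-strong generalization and outperforming SOTA models on multiple image and video captioning benchmarks.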

Comments 28 pages, 8 figures

详情
AI中文摘要

视觉字幕要求模型忠实捕捉视觉内容,同时最小化遗漏和幻觉。作为字幕的主导范式,多模态大语言模型通过扩展和高质量数据取得了强大性能。最近,强化学习成为推动多模态大语言模型向更高精度和更广覆盖的关键途径,然而,现有字幕奖励设计未能提供细粒度且可靠的事实验证信号,限制了其有效性。为解决这一问题,我们提出VCap,一种证人-裁判奖励,将参考字幕(证人)与视觉信号(裁判)配对。通过明确验证基于视觉信号的参考字幕与策略生成字幕之间的事实一致性,VCap提供了具有超几何分布级别精度的奖励信号用于字幕质量验证。该设计即使在不完美的参考下也能实现有效学习,促进强化学习训练中的弱到强泛化。在我们的实验中,使用VCap训练的8B模型在多个图像和视频字幕基准上优于开源和闭源的最先进模型。人工评估进一步证实了其与事实正确性的强对齐。此外,VCap提升了多模态大语言模型的感知能力,跨任务泛化,并超越了最佳N蒸馏,挑战了先前关于强化学习与视觉推理的假设。

英文摘要

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

2605.28010 2026-05-28 cs.AI 版本更新

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

信心编排的自我进化:应对不确定的LLM反馈

Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu

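Affiliations * George Mason University

AI summary Proposes COSE, which uses the LLM's intrinsic confidence as an uncertainty signal and applies confidence-weighted PPO updates and confidence-prioritized replay, achieving the best average performance on general reasoning and mathematics tasks.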

详情
AI中文摘要

自我进化的大语言模型(LLM)通过生成自己的训练任务和解决方案来学习,减少了对人工策划监督的依赖。然而,在许多推理领域,模型还必须验证生成的任务并判断生成的答案以获得训练信号。这带来了训练信号挑战:错误的自我判断会导致错误的梯度更新。现有方法要么依赖外部验证器(限制了通用性),要么将噪声的自我生成反馈视为监督。我们提出COSE(Confidence-Orchestrated Self-Evolution),它利用LLM的内在置信度作为轻量级不确定性信号来调节学习。COSE引入了置信度加权PPO更新和置信度优先重放。在19个保留基准测试和四个Qwen/Llama骨干网络(0.6B-4B)上,COSE始终优于基础模型,并在通用推理和数学方面取得最佳平均性能,同时在代码方面保持竞争力。代码和数据可在https://anonymous.4open.science/r/COSE_-B5C2获取。

英文摘要

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B to 4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
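
One way to read "confidence-weighted PPO updates" is to scale each sample's clipped surrogate term by a confidence weight, so that low-confidence self-judgments contribute less to the gradient. The sketch below assumes weights in [0, 1] and normalization by the weight sum; both choices are assumptions, since the abstract does not specify the exact weighting.

```python
def confidence_weighted_ppo_loss(ratios, advantages, confidences, clip_eps=0.2):
    """Clipped PPO surrogate with a per-sample confidence weight.
    ratios: new-policy / old-policy probability ratios.
    confidences: assumed to lie in [0, 1] (hypothetical weighting scheme)."""
    total, weight_sum = 0.0, 0.0
    for r, a, c in zip(ratios, advantages, confidences):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        surrogate = min(r * a, clipped * a)  # standard PPO pessimistic bound
        total += c * surrogate
        weight_sum += c
    # Negate because optimizers minimize; guard against all-zero weights.
    return -total / weight_sum if weight_sum > 0 else 0.0

ratios, advantages = [1.0, 1.5], [1.0, 1.0]
loss_all = confidence_weighted_ppo_loss(ratios, advantages, [1.0, 1.0])
loss_masked = confidence_weighted_ppo_loss(ratios, advantages, [1.0, 0.0])
print(loss_all, loss_masked)
```

With a confidence of zero, the second sample is ignored entirely, which is how an erroneous self-judgment would be kept from turning into an erroneous update.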

2605.28009 2026-05-28 cs.CL cs.AI cs.LG 版本更新

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

MemGuard:防止长期记忆增强型大语言模型中的记忆污染

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

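Affiliations * University of Illinois Urbana-Champaign; Columbia University; Capital One

AI summary Proposes MemGuard, a type-aware memory framework that prevents heterogeneous memory contamination by assigning explicit functional roles, maintaining relations across type-isolated memories, and selectively composing evidence from only the necessary types; it improves memory reliability by up to 28.27% and retrieves up to 5.8x fewer memory tokens.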

详情
AI中文摘要

记忆增强型大语言模型通过跨交互维护长期记忆,将推理扩展到固定上下文窗口之外。然而,现有的记忆系统常常将稳定的用户事实、情景事件和行为规则折叠到共享空间中,使得功能不同的记忆被检索并用作可互换的证据。我们将这种失败模式识别为异构记忆污染,其中上下文特定的事件被过度概括为声明,或者语义相关但功能不兼容的记忆误导生成。为此,我们引入了MemGuard,一种类型感知的记忆框架,在记忆构建和检索过程中保留功能记忆边界。它在写入时为每个记忆分配显式的功能角色,维护跨类型隔离记忆的关系,并仅从必要的记忆类型中选择性组合证据,从而减少来自无关或功能不兼容证据的污染。在幻觉和长时对话基准测试中,MemGuard将记忆可靠性提高了最多28.27%,同时检索的记忆token数比先前方法减少了最多5.8倍。这些结果表明,可靠的长期推理依赖于对异构记忆的有原则的组织和选择性使用。

英文摘要

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.
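
The write-time typing and type-restricted retrieval described above can be illustrated with a small store. This is a sketch only: the three type names mirror the categories in the abstract, but the class, the keyword-match retrieval, and the example memories are hypothetical, and cross-type relations are omitted.

```python
from dataclasses import dataclass, field
from enum import Enum

class MemoryType(Enum):
    FACT = "stable_user_fact"
    EVENT = "episodic_event"
    RULE = "behavioral_rule"

@dataclass
class TypedMemoryStore:
    # One isolated list per memory type instead of a single shared pool.
    stores: dict = field(default_factory=lambda: {t: [] for t in MemoryType})

    def write(self, text, mem_type):
        """Assign the functional role when the memory is written."""
        self.stores[mem_type].append(text)

    def retrieve(self, keyword, needed_types):
        """Search only the memory types the query actually needs."""
        hits = []
        for t in needed_types:
            hits += [(t, m) for m in self.stores[t]
                     if keyword.lower() in m.lower()]
        return hits

store = TypedMemoryStore()
store.write("User is vegetarian.", MemoryType.FACT)
store.write("User ordered a vegetarian burger once while travelling.",
            MemoryType.EVENT)
hits = store.retrieve("vegetarian", [MemoryType.FACT])
print(hits)
```

A question about the user's standing preference pulls only the stable fact; the one-off event stays out of the evidence even though it matches the keyword.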

2605.28008 2026-05-28 cs.AI cs.LG 版本更新

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

压缩思维:压缩推理数据在LLM后训练中何时以及如何发挥作用

Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

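Affiliations * The University of Tokyo

AI summary Classifies chain-of-thought into Explicit, Composed, and Implicit CoT and runs experiments on a synthetic compositional reasoning task, finding that coarser CoT needs more SFT data, that Composed and Implicit CoT benefit more from data scaling but Implicit CoT tends toward memorization, that subsequent RLVR decomposes compressed steps, and that unidirectional CoT ordering generalizes better on longer sequential tasks.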

详情
AI中文摘要

大型语言模型(LLM)现在能够通过长思维链(CoT)推理解决复杂问题,但性能与token成本之间的权衡仍然是一个核心挑战。为了解决这个问题,监督微调(SFT)通常使用压缩推理数据,其中CoT轨迹被缩短为紧凑形式。然而,这种压缩推理数据对后训练的影响仍然知之甚少。在本文中,我们提出了一个CoT分类法,包括显式CoT(输出所有操作而不聚合)、组合CoT(将多个操作合并为单步)和隐式CoT(省略中间操作)。我们构建了一个合成组合推理任务,允许对难度、压缩粒度和数据大小进行可控变化,并在不同模型家族和大小上进行了全面的实验。值得注意的是,我们发现:(i)粗粒度CoT需要更多SFT数据;(ii)与显式CoT相比,组合CoT和隐式CoT从数据缩放中获益更多,而组合CoT从数据重复中获益,隐式CoT则倾向于导致记忆;(iii)与SFT不同,后续带有可验证奖励的强化学习(RLVR)会分解在SFT期间学到的压缩步骤;(iv)单向CoT顺序在更长序列任务上表现出更强的泛化能力。我们的发现为数据资源约束下的CoT设计提供了启示,并为LLM后训练中SFT和RL的机制提供了重要见解。

英文摘要

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.
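
The three trace types in the taxonomy can be made concrete on a toy task. The sketch below uses a chain of arithmetic operations as a stand-in; the paper's actual synthetic task is not described in the abstract, so the task, the grouping size, and the trace format are assumptions.

```python
def apply(value, op):
    kind, operand = op
    return value + operand if kind == "+" else value * operand

def explicit_cot(start, ops):
    """One step per operation, nothing aggregated."""
    steps, v = [], start
    for op in ops:
        nxt = apply(v, op)
        steps.append(f"{v} {op[0]} {op[1]} = {nxt}")
        v = nxt
    return steps, v

def composed_cot(start, ops, group=2):
    """Merge `group` consecutive operations into a single step."""
    steps, v = [], start
    for i in range(0, len(ops), group):
        expr, nxt = str(v), v
        for op in ops[i:i + group]:
            expr = f"({expr} {op[0]} {op[1]})"
            nxt = apply(nxt, op)
        steps.append(f"{expr} = {nxt}")
        v = nxt
    return steps, v

def implicit_cot(start, ops):
    """No intermediate steps, only the final answer."""
    v = start
    for op in ops:
        v = apply(v, op)
    return [], v

ops = [("+", 3), ("*", 2), ("+", 1), ("*", 4)]
explicit_steps, explicit_answer = explicit_cot(2, ops)
composed_steps, composed_answer = composed_cot(2, ops)
implicit_steps, implicit_answer = implicit_cot(2, ops)
print(len(explicit_steps), len(composed_steps), len(implicit_steps))  # 4 2 0
```

All three traces reach the same answer (44); they differ only in how much intermediate computation is written out, which is the compression granularity the experiments vary.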

2605.28007 2026-05-28 cs.LG cs.AI 版本更新

Learning Compositional Latent Structure with Vector Networks

学习带有向量网络的组合潜在结构

Niclas Pokel, Benjamin F. Grewe

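Affiliations * Institute of Neuroinformatics, UZH / ETH Zurich; ETH AI Center Zurich, Switzerland

AI summary Proposes the Vector Network (VN), a hierarchical recurrent architecture that achieves compositional generalization through a library of reusable rank-1 weight atoms, lowering error on out-of-distribution tasks by about an order of magnitude.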

详情
AI中文摘要

深度网络是强大的函数逼近器,但它们通常将许多不同的计算存储在共享权重矩阵中,使得当熟悉的结构以新颖组合出现时,难以选择性地重用或调整其中的部分。我们引入了向量网络(VN),一种层级循环架构,其中每一层将固定的权重矩阵替换为可重用的秩1权重原子库。对于每个输入,VN最小化层级局部能量,以推断一组稀疏的活跃权重原子及其系数,这些系数受自底向上的输入重建和自顶向下的反馈一致性共同约束。这些权重原子系数随后为该样本组成一个输入特定的低秩权重矩阵。收敛后,慢速学习更新仅通过推断系数缩放的局部残差信号更新选中的权重原子。我们在四个组合基准上评估了VN,涵盖一维信号、二维空间解码、N体动力学和组合MNIST。在分布内任务中VN与强基线相当,而在需要以新颖方式重新组合熟悉因子的分布外任务中,其误差通常低约一个数量级。因此,向量网络使组合泛化成为架构和推理过程的结构属性,而非将许多行为拟合到单个共享密集参数基底的脆弱副产品。

英文摘要

Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making it difficult to selectively reuse or adapt parts of them when a familiar structure appears in novel combinations. We introduce the Vector Network (VN), a hierarchical recurrent architecture in which each layer replaces a fixed weight matrix with a library of reusable rank-1 weight atoms. For each input, VN minimizes a layer-local energy to infer a sparse set of active weight atoms and their coefficients, jointly constrained by bottom-up input reconstruction and top-down feedback consistency. These weight atom coefficients then compose an input-specific low-rank weight matrix for that sample. After convergence, slow learning updates only the selected weight atoms through local residual signals scaled by the inferred coefficients. We evaluate VN on four compositional benchmarks spanning 1D signals, 2D spatial decoding, N-body dynamics, and compositional MNIST. VN matches strong baselines in distribution while often achieving out-of-distribution error about an order of magnitude lower when familiar factors must be recombined in novel ways. Vector networks thus make compositional generalization a structural property of the architecture and inference process rather than a brittle byproduct of fitting many behaviors into one shared dense parameter substrate.
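
The composition step, in which inferred coefficients turn a library of rank-1 atoms into an input-specific low-rank weight matrix, can be written as W = Σ_k c_k · u_k v_kᵀ. The sketch below shows only that step with toy atoms and hand-set sparse coefficients; the energy-minimizing inference and the local learning rule from the abstract are not modeled.

```python
def compose_weight(atoms, coeffs):
    """Sum c_k * outer(u_k, v_k) over atoms with a non-zero coefficient.
    atoms: list of (u, v) vector pairs; coeffs: one scalar per atom."""
    rows, cols = len(atoms[0][0]), len(atoms[0][1])
    W = [[0.0] * cols for _ in range(rows)]
    for (u, v), c in zip(atoms, coeffs):
        if c == 0.0:
            continue  # inactive atom contributes nothing
        for i in range(rows):
            for j in range(cols):
                W[i][j] += c * u[i] * v[j]
    return W

# Toy library of three rank-1 atoms; only two are active for this input.
atoms = [([1.0, 0.0], [1.0, 1.0]),
         ([0.0, 1.0], [1.0, -1.0]),
         ([1.0, 1.0], [0.0, 1.0])]
coeffs = [2.0, 0.0, 0.5]
W = compose_weight(atoms, coeffs)
print(W)  # [[2.0, 2.5], [0.0, 0.5]]
```

The rank of the composed matrix is at most the number of active atoms, so a sparse coefficient vector yields a low-rank, sample-specific weight.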

2605.28006 2026-05-28 cs.CL cs.AI 版本更新

Integrated and Cross-Architecture Interpretation of LLM Reasoning

LLM推理的集成与跨架构解释

Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang

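Affiliations * Tsinghua Shenzhen International Graduate School; Harbin Institute of Technology

AI summary Proposes the Integrated, cross-Architecture Reasoning (IAR) framework, which unifies the interpretation of LLM reasoning patterns through bandwidth-calibrated MIP with Tukey IQR peak detection, overlap analysis, and a Jaccard stability metric.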

详情
AI中文摘要

理解LLM如何推理受到实际不对称性的阻碍:虽然其生成的输出是可观察的,但潜在的推理模式仍然不透明。依赖单一探针,如互信息峰值(MIP)或深度思考比率(DTR),可能会低估真正的推理结构。针对这一不足,我们提出了一个集成跨架构推理(IAR)框架,旨在为LLM推理可解释性提供统一方法。具体来说,我们首先提出使用带宽校准的MIP结合Tukey IQR峰值检测来隔离输出层的关键推理标记。其次,我们对MIP选中的标记和DTR深度标记进行重叠分析,以追踪这些标记的跨层轨迹。这也揭示了关键推理标记是否也是计算密集型的,进一步有助于理解推理模式如何在模型层间演变。最后,我们在多领域问题上应用Jaccard稳定性度量,以验证MIP识别的标记是否具有推理质量保证。在三个模型(Qwen-7B、Qwen-14B和Llama-8B)上跨四个领域(数学、代码、逻辑和常识)的大量实验证明了IAR跨架构的泛化解释能力。

英文摘要

Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first propose using bandwidth-calibrated MIP coupled with Tukey IQR peak detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also reveals whether reasoning-crucial tokens are computation-intensive as well, which further helps explain how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens reliably indicate reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
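
Two of the building blocks named above, Tukey IQR peak detection and Jaccard overlap, are standard and can be sketched directly. The per-token scores below are toy numbers standing in for mutual-information values, and the fence multiplier of 1.5 is the conventional Tukey default rather than a value taken from the paper.

```python
import statistics

def tukey_peaks(values, k=1.5):
    """Indices whose value exceeds Q3 + k * IQR (Tukey's upper fence)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    fence = q3 + k * (q3 - q1)
    return {i for i, v in enumerate(values) if v > fence}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Toy per-token scores; positions 4 and 7 stand out from the rest.
scores = [0.1, 0.2, 0.15, 0.12, 0.9, 0.18, 0.11, 0.95, 0.14, 0.16]
peaks = tukey_peaks(scores)
stability = jaccard(peaks, {4, 8})
print(peaks, stability)
```

Here the peak set is {4, 7}, and its Jaccard overlap with a second token set {4, 8} is 1/3; a stability metric of this kind compares peak sets across runs, models, or domains.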

2605.28001 2026-05-28 cs.AI cs.CR 版本更新

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

锚定解码中 k-NAF 预算核算的实证审计

J. Vijayavallabh

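Affiliations * Indian Institute of Technology Madras

AI summary Empirically audits the k-NAF budget-accounting mechanism in Anchored Decoding using a fixed workload and adaptive prompt search, finding that mean cumulative KL spend stays far below the sequence-level budgets, that adaptive search raises the proxy spend ratio without exhausting the budget, and that high proxy ratios are attributable to proxy artifacts rather than per-trajectory budget failures.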

Comments 19 pages, 4 figures; 9 main pages, with the remainder supplementary material and appendix

详情
AI中文摘要

我们使用 (i) 固定的、按类别分层的工作负载(跨六个提示类别约 8,500 次随机执行)和 (ii) 针对高代理支出比率的目标自适应提示搜索过程,对锚定解码中的 k-NAF 预算核算机制进行实证审计。在固定工作负载下,平均累积 KL 支出远低于序列级预算 K ∈ {600, 1000},并且经验 Bernstein 风格的代理对于每个类别都保持在 K 以下;表面重叠诊断(ROUGE-L 和 5-gram Jaccard)相应较小。自适应搜索增加了代理支出比率,但未产生明显的预算耗尽。在 k=3 的保留版权领域工作负载上,几个提示在早期停止评估且实际样本量较小的情况下显示出高于 1 的代理比率;在可比平均支出下,用更大分配重新评估相同提示将代理比率降低到 [0.26, 0.40] 范围,这与代理伪影一致,而非每个轨迹的预算失败。

英文摘要

We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately 8,500 randomized executions across six prompt classes) and (ii) an adaptive prompt-search procedure targeting high proxy spend ratios. On the fixed workload, mean cumulative KL spend remains far below the sequence-level budgets K ∈ {600, 1000}, and an empirical Bernstein-style proxy stays below K for every class; surface-overlap diagnostics (ROUGE-L and 5-gram Jaccard) are correspondingly small. Adaptive search increases the proxy spend ratio but does not produce clear budget exhaustion. On a held-out copyright-domain workload at k = 3, several prompts exhibit proxy ratios above 1 under early-stopped evaluations with small realized sample sizes; re-evaluating the same prompts with larger allocation reduces the proxy ratio to the range [0.26, 0.40] under comparable mean spend, consistent with proxy artifacts rather than per-trajectory budget failures.
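
As a rough illustration of budget accounting, the sketch below sums a per-step KL divergence between two next-token distributions and compares the total with a sequence-level budget K. It is a simplified reading: the pairing of distributions is an assumption, and the plain ratio computed here is not the empirical Bernstein-style proxy used in the audit.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cumulative_spend(step_pairs, budget):
    """Return (total KL spend, spend / budget) over a decoded sequence.
    step_pairs: one (p, q) pair of next-token distributions per step."""
    spend = sum(kl_divergence(p, q) for p, q in step_pairs)
    return spend, spend / budget

# Toy two-token vocabulary; the second step incurs no divergence.
steps = [([0.5, 0.5], [0.25, 0.75]),
         ([0.9, 0.1], [0.9, 0.1])]
spend, ratio = cumulative_spend(steps, budget=600)
print(spend, ratio)
```

A ratio below 1 means the sequence stayed within budget; the audit's finding is that typical spend sits far below K, so ratios near or above 1 deserve a closer look at how they were estimated.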

2605.28000 2026-05-28 cs.SE cs.AI 版本更新

Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Tool Forge:一种用于受控代理执行的携带验证的工具链

Swanand Rao

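Affiliations * Next Moca Global, Inc.

AI summary Presents Tool Forge, a toolchain that converts natural-language capability intent into validated, sandbox-verified, cataloged tool artifacts and exposes them to agents through a token-efficient routing layer, addressing the lack of governance and validation in the tool layer of agentic execution.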

Comments 9 pages, 2 figures, 3 tables. Code: https://github.com/nextmoca/tool-forge

详情
AI中文摘要

大型语言模型代理越来越被期望执行操作工作:调用API、操作文件、组装工作流以及在企业系统内行动。然而,这种执行所依赖的工具层仍然通常被视为手工编写的集成工件或暴露给模型的静态模式列表。本文介绍了Tool Forge,一种携带验证的工具链,用于将自然语言能力意图转换为受控、沙盒验证、编目的工具制品,并通过令牌高效的路由层将这些制品暴露给代理。Tool Forge将工具视为一个包含意图、能力契约、实现、依赖策略、测试、文档、运行时验证证据、生命周期状态、凭证绑定和路由元数据的胶囊。它还引入了一个路由器,该路由器暴露意图限定的工具会话,而不是将完整的目录模式加载到模型上下文中。我们描述了系统架构、验证流水线、面向MCP的路由模型、治理控制以及来自开源实现的初始可重复基准测试。在83个路由器基准测试案例中,Tool Forge Router实现了0.901的聚合微平均F1,同时相对于朴素的全目录模式暴露,估计将任务流工具上下文减少了99.2%。在25个本地工具任务的端到端生成探测中,Tool Forge生成了25个工具包中的25个,在确定性接受检查中达到了0.940的微平均F1,并通过了25个沙盒验证中的23个。这些结果作为初始系统基准测试呈现,而非最先进的主张。论文指出了对抗性路由、更广泛的API接地、沙盒隔离和跨系统评估方面的剩余挑战。

英文摘要

Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows, and acting inside enterprise systems. Yet the tool layer on which this execution depends is still commonly treated as either a hand-written integration artifact or a static list of schemas exposed to a model. This paper introduces Tool Forge, a validation-carrying toolchain for converting natural-language capability intent into governed, sandbox-verified, cataloged tool artifacts and exposing those artifacts to agents through a token-efficient routing layer. Tool Forge treats a tool as a capsule containing intent, capability contract, implementation, dependency policy, tests, documentation, runtime validation evidence, lifecycle state, credential bindings, and routing metadata. It also introduces a Router that exposes intent-scoped tool sessions instead of loading full catalog schemas into the model context. We describe the system architecture, validation pipeline, MCP-facing routing model, governance controls, and initial reproducible benchmarks from the open-source implementation. Across 83 Router benchmark cases, Tool Forge Router achieves aggregate micro-F1 of 0.901 while reducing estimated task-flow tool context by 99.2% relative to naive full-catalog schema exposure. In a 25-case end-to-end generation probe over local-tool tasks, Tool Forge generates 25 of 25 tool bundles, reaches micro-F1 of 0.940 against deterministic acceptance checks, and passes 23 of 25 live sandbox validations. These results are presented as an initial systems benchmark, not as a state-of-the-art claim. The paper identifies remaining challenges in adversarial routing, broader API grounding, sandbox isolation, and cross-system evaluation.
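
The routing results are reported as micro-F1, which pools true positives, false positives, and false negatives across all cases before computing precision and recall. A minimal sketch follows, assuming each benchmark case pairs a set of routed tools with a set of expected tools; the tool names are hypothetical.

```python
def micro_f1(cases):
    """cases: iterable of (predicted_tools, expected_tools) set pairs."""
    tp = fp = fn = 0
    for predicted, expected in cases:
        tp += len(predicted & expected)   # correctly routed tools
        fp += len(predicted - expected)   # routed but not expected
        fn += len(expected - predicted)   # expected but missed
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

cases = [({"read_file"}, {"read_file"}),
         ({"http_get", "parse_json"}, {"http_get"}),
         ({"list_dir"}, {"list_dir", "read_file"})]
f1 = micro_f1(cases)
print(f1)  # 0.75
```

Pooling counts means cases with many tools weigh more than cases with one, unlike macro-F1, which averages per-case scores.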

2605.27999 2026-05-28 cs.HC cs.AI 版本更新

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

学习将预测任务分配给具有容量限制的智能体

Shang Wu, Saatvik Kher, Padhraic Smyth

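Affiliations * Department of Computer Science, University of California, Irvine

AI summary Proposes a framework of sequential explore-exploit policy-learning algorithms for assigning prediction tasks among multiple capacity-constrained agents (human or AI) so as to maximize overall prediction performance.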

详情
AI中文摘要

我们解决了从一组可用的人类或AI智能体中学习将预测任务分配给一个智能体的问题。特别地,我们专注于智能体专业知识和分配策略的序贯学习,其中每个智能体被限制处理一部分任务。我们从智能体容量、智能体专业差异和任务上下文方面提供了该问题的一般理论特征。然后,我们开发了一个序贯探索-利用策略学习算法框架,旨在最大化整体性能。在各种表格、图像和文本预测任务上的实验结果表明,相对于非上下文基线,我们的策略学习算法在不同类型的智能体(包括LLM和人类)中取得了系统性增益。

英文摘要

We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we focus on the sequential learning of agent expertise and assignment policies where each agent is constrained to handle a fraction of tasks. We provide a general theoretical characterization of this problem in terms of agent capacities, differences in agent expertise, and task context. We then develop a framework of sequential explore-exploit policy-learning algorithms that seek to maximize overall performance. Experimental results over a variety of tabular, image, and text prediction tasks demonstrate systematic gains from our policy-learning algorithms relative to non-contextual baselines across different types of agents, including LLMs and humans.

2605.27997 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

毒性存在于何处?语言模型中的机制定位与定向抑制

Himanshu Beniwal, Mayank Singh

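Affiliations * Indian Institute of Technology Gandhinagar

AI summary Localizes toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppresses it through inference-time scaling or minimal rank-one weight edits without gradient descent, reducing toxicity while preserving language quality.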

详情
AI中文摘要

大型语言模型频繁生成有毒、仇恨或有害内容,然而现有的缓解方法依赖于昂贵的重新训练或输出级过滤,且缺乏对毒性内部起源的机制性理解。我们提出了Meow2X和TRNE,两种互补的无需重新训练的框架,通过分析毒性与中性提示之间的激活差异,将毒性定位到特定层和神经元,然后通过推理时缩放或最小秩一权重编辑进行抑制——无需任何梯度下降。在五个语言模型、两个基准测试和90种配置上的评估,使用双重安全评估器,一致地证明了毒性降低,同时保持了语言建模质量。我们的分析揭示,毒性不成比例地编码在早期MLP层中,在不同架构间有所变化,并且被单一评估器设置系统性地低估——强调了多评估器安全评估的必要性。通过连接机制可解释性与实际去毒化,我们的框架为更安全、更透明的语言模型提供了一条原则性路径。

英文摘要

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits, without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups, underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.
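
The localize-then-suppress idea can be sketched on plain lists of activations: average each neuron's activation over toxic and neutral prompts, rank neurons by the absolute difference, and scale the top ones at inference time. The selection rule, the scale factor, and the toy activations are assumptions; the rank-one weight-edit variant is not shown.

```python
def mean_activation(batch):
    """Per-neuron mean over a batch of activation vectors."""
    n = len(batch)
    return [sum(row[j] for row in batch) / n for j in range(len(batch[0]))]

def top_differential_neurons(toxic_acts, neutral_acts, top_k):
    """Neurons with the largest |mean toxic - mean neutral| activation."""
    mt, mn = mean_activation(toxic_acts), mean_activation(neutral_acts)
    diffs = [abs(a - b) for a, b in zip(mt, mn)]
    return sorted(range(len(diffs)), key=lambda j: -diffs[j])[:top_k]

def suppress(activations, neuron_ids, scale=0.0):
    """Inference-time scaling of the selected neurons only."""
    ids = set(neuron_ids)
    return [a * scale if j in ids else a for j, a in enumerate(activations)]

toxic = [[0.1, 2.0, 0.3, 1.5], [0.2, 1.8, 0.1, 1.7]]
neutral = [[0.1, 0.2, 0.3, 1.4], [0.3, 0.0, 0.1, 1.6]]
targets = top_differential_neurons(toxic, neutral, top_k=1)
damped = suppress(toxic[0], targets, scale=0.5)
print(targets, damped)  # [1] [0.1, 1.0, 0.3, 1.5]
```

Neuron 3 is highly active in both conditions and is left alone; only neuron 1, which separates toxic from neutral prompts, is scaled. No gradient is computed at any point.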

2605.27990 2026-05-28 cs.LG cs.AI cs.CV 版本更新

Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

几何校正扩散后验采样:基于去噪器回拉曲率引导与流形对齐阻尼

Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

发表机构 * Department of Electrical and Computer Engineering, Inha University, Incheon, 22212, South Korea(仁荷大学电气与计算机工程系,韩国仁川 22212)

AI总结 提出一种基于去噪器回拉曲率引导和流形对齐阻尼的几何校正扩散后验采样方法,通过每噪声水平的阻尼高斯-牛顿校正替代标量引导,实现稳定高效的后验采样。

Comments Code: https://github.com/Seunghyeok0715/CLAMP

详情
Journal ref
International Conference on Machine Learning 2026
AI中文摘要

扩散后验采样将扩散先验条件于测量值,但数据一致性更新通常由手动调整的引导权重缩放,并且在刚性、算子依赖的曲率下可能破坏采样稳定性。我们使用在扩散状态坐标中计算的每噪声水平阻尼高斯-牛顿校正替代标量引导。该校正通过去噪器回拉似然梯度,使用避免前向去噪器雅可比矩阵的单侧曲率模型,并应用与去噪器残差对齐的扩散校准秩一阻尼。每个校正通过自动微分的无矩阵GMRES求解,采样通过具有闭式漂移/噪声分离的方差保持朗之万转移进行。在FFHQ和ImageNet上的逆问题中,该方法在PSNR/SSIM/LPIPS上达到竞争性能,同时运行速度显著快于大多数对比基线;在加速MRI重建中,它在对比基线中取得了最佳的PSNR/SSIM。

英文摘要

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss--Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

2605.27984 2026-05-28 cs.CL cs.AI 版本更新

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

KVoiceBench, KOpenAudioBench 和 KMMAU:用于评估语音语言模型的智能体驱动的韩语语音基准

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

发表机构 * KRAFTON Graduate School of AI, KAIST(韩国科学技术院人工智能研究生院) Department of Mathematical Sciences, Seoul National University(首尔国立大学数学科学系)

AI总结 针对语音语言模型评估中英语中心化的问题,提出两种智能体驱动的基准构建框架,构建并发布了三个韩语语音基准(KVoiceBench、KOpenAudioBench、KMMAU),通过评估八个最新模型揭示了英语-韩语性能差距和不同任务间的互补性弱点。

Comments 16 pages, 4 figures

详情
AI中文摘要

语音语言模型通过将大型语言模型扩展到语音模态取得了显著进展。然而,语音语言模型的评估仍然严重以英语为中心,限制了多语言语音能力的可靠评估。通过ASR、翻译、归一化和TTS直接迁移基准会破坏语言特定的指令、答案约束和口语形式;对于音频理解,迁移源语言音频也无法保留目标语言的说话人属性、口音和副语言特性。为解决这些限制,我们提出了两种智能体驱动的基准构建框架:一种将源语言SpokenQA基准迁移为目标语言SpokenQA基准,另一种利用转录和说话人元数据将目标语言ASR语料库转换为音频理解基准。使用这些框架,我们构建并公开发布了三个韩语语音基准:用于韩语SpokenQA的KVoiceBench和KOpenAudioBench,以及用于韩语音频理解的KMMAU,共包含12,345个样本。我们评估了八个最近的语音语言模型,发现英语-韩语性能差距在不同模型和任务族中差异很大,并且SpokenQA和音频理解的排名出现分歧,揭示了仅靠英语评估无法发现的互补性弱点。

英文摘要

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

2605.27981 2026-05-28 cs.AI 版本更新

STAB: Specification-driven Testing for Algorithmic Bottlenecks

STAB:面向算法瓶颈的规约驱动测试

Soohan Lim, Joonghyuk Hahn, Hyundong Jin, Yo-Sub Han

发表机构 * Yonsei University(延世大学)

AI总结 提出STAB流水线,通过约束饱和与对抗场景注入,从自然语言问题规约生成暴露算法瓶颈的测试用例,显著提升测试用例对瓶颈的检测率。

Comments 16 pages, 5 figures, 8 tables

详情
AI中文摘要

评估算法代码的效率需要能够暴露运行时瓶颈的测试用例。先前的方法通过增加输入规模或生成使给定实现运行缓慢的代码特定输入来生成效率测试用例。因此,它们没有处理驱动算法最坏情况的结构性输入条件。我们引入STAB,一个规约驱动的流水线,仅从自然语言问题规约生成暴露算法瓶颈的测试用例。STAB将任务分为约束边界最大化和对抗结构注入两部分。(i) 约束饱和器提取约束,并通过基于规则的饱和及对相关变量的CP-SAT优化来解析大的可接受规模赋值。(ii) 对抗场景注入器使用关键词匹配和K近邻(KNN)从策划的场景目录中检索实现级别的对抗构造原则。STAB将问题规约、解析的边界和检索到的构造原则编码为结构化生成规约,LLM据此合成Python测试用例生成器。在CodeContests上,STAB将生成测试用例中暴露算法瓶颈的比例从开源LLM平均50.43%提升至73.45%,从闭源LLM平均57.45%提升至71.85%,在Python、Java和C++上均有一致提升。我们的代码可在https://github.com/suhanmen/STAB获取。

英文摘要

Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification-driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural-language problem specification alone. STAB separates the task into constraint-bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule-based saturation and CP-SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation-level adversarial construction principles from a curated scenario catalog using keyword matching and K-nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open-source LLMs and from 57.45% to 71.85% on average across closed-source LLMs, with consistent gains across Python, Java, and C++. Our code is available at https://github.com/suhanmen/STAB.

2605.27980 2026-05-28 cs.CL cs.AI 版本更新

Periodic RoPE for Infinite Context LLMs

周期性RoPE:面向无限上下文的大型语言模型

Simin Huo

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出周期性RoPE(P-RoPE)位置编码机制,结合滑动窗口注意力和无位置编码的全局注意力,避免位置耗尽,理论上支持无限上下文窗口。

Comments 5 pages

详情
AI中文摘要

处理超长上下文的能力对于大型语言模型(LLMs)执行长期任务至关重要。尽管最近的努力已将上下文窗口扩展到1M及以上,但当序列长度超过位置编码(如RoPE)的预训练范围时,模型性能会下降,即位置耗尽。必须克服这一基本限制才能实现真正的无限上下文。为此,我们提出了周期性RoPE(P-RoPE),一种旨在避免这种耗尽的位置编码机制。它与滑动窗口注意力(SWA)协同工作,以捕获每个窗口内的局部依赖和相对位置。然后,这一局部层由无位置编码(NoPE)的全局注意力层补充,使得整个序列上的无界交互成为可能,而不受位置限制。通过堆叠这两类层,模型无需依赖位置外推即可泛化到更长的序列,并理论上支持无限的上下文窗口。实验结果表明,我们的模型MiniWin在长上下文效率和稳定性上优于采用标准GPT架构的MiniMInd。我们的工作为LLMs实现真正的无限上下文理解提供了一条可能的路径。代码可在 https://github.com/Cominder/miniwin 获取。

英文摘要

The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre-trained range of positional encodings (e.g., RoPE), a failure referred to as position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation when generalizing to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMInd with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at https://github.com/Cominder/miniwin.
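
The abstract does not say how P-RoPE is constructed. One possible reading, offered purely as an assumption, is that the rotation frequencies are integer multiples of 2π/P, so the encoding repeats every P positions and the angle argument never leaves the trained range. The sketch below illustrates that reading. The function `periodic_rope`, the frequency schedule, and the period of 64 are hypothetical and are not taken from the paper or the MiniWin code.

```python
import math

def periodic_rope(vec, position, period):
    """Rotate consecutive pairs of `vec` by angles that repeat every `period` positions.

    Assumption (not from the paper): pair i uses frequency 2*pi*(i+1)/period,
    so the rotation at `position` equals the rotation at `position + period`.
    """
    p = position % period  # keep the angle argument bounded for any sequence length
    out = []
    for i in range(0, len(vec), 2):
        angle = 2.0 * math.pi * (i // 2 + 1) * p / period
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]
period = 64

# Same relative offset (3) at a small and a very large absolute position.
near = dot(periodic_rope(q, 10, period), periodic_rope(k, 7, period))
far = dot(periodic_rope(q, 1_000_010, period), periodic_rope(k, 1_000_007, period))
# Same offset again, but the query has wrapped past a period boundary.
wrap = dot(periodic_rope(q, 65, period), periodic_rope(k, 62, period))
```

Under this assumption the attention logit depends only on the query-key offset modulo the period, so offsets are distinguishable only within one period. That would pair naturally with a sliding window no wider than the period, with the NoPE global layers handling longer-range interaction.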

2605.27971 2026-05-28 cs.CL cs.AI 版本更新

Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

语义流正则化:教会LLMs生成多样且连贯的回复

Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

发表机构 * Tencent Inc.(腾讯公司) Beijing, China(中国北京)

AI总结 针对大语言模型微调时输出多样性严重受限的跨风格坍缩问题,提出语义流正则化(SFR),通过条件流匹配监督骨干网络使用连续句子嵌入,在零部署成本下提升多样性和风格保真度。

详情
AI中文摘要

当大语言模型被微调以生成个性或语气条件化的回复时,其输出多样性受到严重限制——我们将这种失败称为跨风格坍缩。我们将这种坍缩追溯到交叉熵目标,该目标在共享表示下倾向于抑制多样化的延续。我们提出语义流正则化(SFR),一种轻量级的辅助目标,通过条件流匹配使用未来片段的连续句子编码器嵌入来监督骨干网络。随机流源通过构造保持多模态;流匹配头在推理时被丢弃,增加零部署成本。在一个大规模工业对话数据集(Qwen3-32B,9种个性)上,SFR在输出多样性、风格保真度和回复质量上优于SFT。我们进一步在公共LiveCodeBench-v5(Qwen2.5-Coder-7B-Instruct)上验证,其中SFR持续改进pass@k,证实了其超越风格化对话的通用性。在MBPP上的受控比较显示,多令牌预测是SFR的一个退化特例。

英文摘要

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.
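
For readers unfamiliar with conditional flow matching, the sketch below shows the training target in its common linear-path form: a random source sample is interpolated toward a target embedding, and a head is trained to predict the constant velocity between them. The abstract does not state which probability path SFR uses, so the linear path, the function names, and the toy three-dimensional "embedding" are assumptions for illustration only.

```python
import random

def cfm_training_pair(x0, x1, t):
    """Linear-path conditional flow matching: interpolant x_t and its target velocity.

    x0: sample from the stochastic source (e.g. Gaussian noise)
    x1: target embedding (a stand-in for a future-segment sentence embedding)
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)) / n

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                        # hypothetical target embedding
x0 = [rng.gauss(0.0, 1.0) for _ in x1]       # random source sample
t = rng.random()                             # time drawn uniformly from [0, 1)

x_t, u = cfm_training_pair(x0, x1, t)
zero_loss = cfm_loss(u, u)                   # an oracle head would reach zero loss
baseline_loss = cfm_loss([0.0] * len(u), u)  # a head that always predicts zero velocity
```

Because the source sample is drawn fresh for each training pair, the same context can be paired with many different paths toward its target. Since the head is only needed to compute this auxiliary loss, dropping it at inference leaves the backbone's forward pass unchanged.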

2605.27970 2026-05-28 cs.AI 版本更新

Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

人类感知域的几何结构在LLM表征中短暂出现

Simardeep Singh, Paras Chopra

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基分校)

AI总结 研究大型语言模型内部表征中是否出现与人类感知组织相似的几何结构,发现多个感知域的几何结构在中间层短暂涌现,且与人类基准对齐。

Comments 19 Pages, 28 Figures

详情
AI中文摘要

虽然大型语言模型(LLM)仅基于文本数据进行训练,但先前的工作表明,它们的内部表征在嵌入空间中可能展现出丰富的几何结构。基于这一研究方向,我们调查了这种结构是否与不同领域(例如颜色、音高、情感和味觉)的人类感知组织相似。具体来说,我们研究了多个开源Transformer架构的残差流中,与感知模态对应的内在几何结构逐层涌现的情况。我们的结果揭示了三个关键发现。首先,我们观察到多个感知域的逐层几何结构涌现,尽管训练过程中没有任何直接的感知监督。其次,这些感知域表现出不同的涌现轮廓,几何结构及其与人类基准的一致性在深度上遵循领域和模型特定的轨迹。第三,这种涌现遵循一致的表征轨迹:几何结构在早期层较弱或分散,在中间层逐渐组织化,在后期层减弱,表明感知几何结构作为模型内部转换管道的一部分短暂出现。这为理解类人感知几何结构在LLM中如何以及何处出现提供了新见解,为内部表征的机制分析提供了原则性途径。

英文摘要

While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model's internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.

2605.27967 2026-05-28 stat.ME cs.AI cs.LG stat.ML 版本更新

Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors

通过教师引导的混合先验进行多教师知识蒸馏

Luyang Fang, Yongkai Chen, Jiazhang Cai, Ping Ma, Wenxuan Zhong

发表机构 * Department of Statistics, University of Georgia(佐治亚大学统计系) Department of Statistics, Harvard University(哈佛大学统计系)

AI总结 提出多教师贝叶斯知识蒸馏(MT-BKD)框架,利用贝叶斯推断和教师引导的先验分布,结合熵加权机制,实现多教师知识的高效融合与不确定性量化。

详情
AI中文摘要

知识蒸馏是一种强大的模型压缩方法,能够高效部署复杂的深度学习模型(教师模型),包括大型语言模型。然而,其潜在的统计机制尚不明确,且不确定性评估常被忽视,特别是在需要多样化教师专业知识的实际场景中。为解决这些挑战,我们引入了多教师贝叶斯知识蒸馏(MT-BKD),其中蒸馏学生模型在贝叶斯框架内从多个教师模型学习。我们的方法利用贝叶斯推断来捕捉蒸馏过程中的固有不确定性。我们引入了一种教师引导的先验,整合来自教师模型和特定任务训练数据的外部知识,提供了更好的泛化性、鲁棒性和可扩展性。此外,一种基于熵的加权机制自适应地调整每个教师的影响,使学生能够有效组合多个专业知识来源。MT-BKD增强了学生模型学习过程的可解释性,提高了预测准确性,并提供了不确定性量化。我们在合成任务和真实任务(包括蛋白质亚细胞定位预测和图像分类)上验证了MT-BKD。实验表明,我们的MT-BKD框架在性能提升和稳健的不确定性量化方面表现出色,突显了其优势。

英文摘要

Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its underlying statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios requiring diverse teacher expertise. To address these challenges, we introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), where a distilled student model learns from multiple teachers within the Bayesian framework. Our approach leverages Bayesian inference to capture inherent uncertainty in the distillation process. We introduce a teacher-informed prior, integrating external knowledge from teacher models and task-specific training data, offering better generalization, robustness, and scalability. Additionally, an entropy-based weighting mechanism adaptively adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances the interpretability of the student model's learning process, improves predictive accuracy, and provides uncertainty quantification. We validate MT-BKD on both synthetic and real-world tasks, including protein subcellular location prediction and image classification. Our experiments show improved performance and robust uncertainty quantification, highlighting the strengths of our MT-BKD framework.
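
One simple way an entropy-based weighting could work, sketched here under assumptions of our own, is to give each teacher a weight that shrinks as the entropy of its predictive distribution grows, then mix the teachers' predictions with those weights. The softmax-over-negative-entropy rule, the `temperature` parameter, and the toy predictions are hypothetical. The abstract does not specify MT-BKD's actual weighting formula.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def teacher_weights(teacher_probs, temperature=1.0):
    """Softmax over negative entropies: more confident teachers get larger weights."""
    scores = [-entropy(p) / temperature for p in teacher_probs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix(teacher_probs, weights):
    """Weighted mixture of the teachers' predictive distributions."""
    n_classes = len(teacher_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, teacher_probs))
        for c in range(n_classes)
    ]

# Toy predictions (hypothetical) from three teachers on one input.
teachers = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.30, 0.30],  # uncertain
    [0.34, 0.33, 0.33],  # nearly uniform
]
weights = teacher_weights(teachers)
mixture = mix(teachers, weights)
```

Because the weights are recomputed per input, a teacher that is confident on one example and uncertain on another contributes differently to each, which is one reading of "adaptively adjusts each teacher's influence".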

2605.27965 2026-05-28 cs.AI 版本更新

The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

过度思考的形状:长推理轨迹中的回溯爆发

Navid Rezazadeh, Arash Gholami Davoodi

发表机构 * University of California, Irvine(加州大学尔湾分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 通过分析长推理轨迹中的回溯动态,发现早期孤立修复通常与正确推理兼容,而错误轨迹更常出现持续且聚集的晚期中度至重度回溯,并基于此提出爆发感知过滤策略以区分可恢复修复与潜在不稳定。

详情
AI中文摘要

推理模型通常生成长轨迹,其中有用的自我纠正和无效的修改难以区分。我们通过回溯动态研究这种区别:长形式推理轨迹中的局部重新考虑、撤回或重新推导。在6,000条Qwen3-8B AIME轨迹上,我们标注了片段级别的回溯严重性,并分析了事件时序、归一化深度和局部爆发结构。我们发现早期孤立修复通常与正确推理兼容,而错误轨迹更常显示中度至重度回溯,这些回溯持续存在并聚集在后期。跨语料库检查显示,在额外的模型/领域对中存在相同的定性不对称性。过滤分析将信号实例化为前缀因果选择性早期退出策略:在浅层和中间深度,爆发感知过滤优于固定长度过滤,同时仅使用前缀可用特征。中等长度截断仍然是强大的完整轨迹基线,但爆发感知控制提供了一种可部署的机制,用于区分可恢复修复与潜在不稳定。

英文摘要

Reasoning models often generate long traces in which useful self-correction and unproductive revision are hard to distinguish. We study this distinction through backtracking dynamics: local reconsideration, retraction, or re-derivation inside long-form reasoning traces. On 6,000 Qwen3-8B AIME traces, we annotate segment-level backtrack severity and analyze event timing, normalized depth, and local burst structure. We find that early isolated repair is often compatible with correct reasoning, whereas incorrect traces more often show moderate-to-severe backtracks that persist and cluster late. Cross-corpus checks show the same qualitative asymmetry across additional model/domain pairs. Filtering analyses instantiate the signal as a prefix-causal selective early-exit policy: at shallow and intermediate depths, burst-aware filtering outperforms fixed length-based filtering while using only prefix-available features. Moderate length cutoffs remain strong completed-trace baselines, but burst-aware control provides a deployable mechanism for separating recoverable repair from likely instability.
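
To make "burst structure" concrete, the sketch below groups moderate-to-severe backtrack events that occur close together and flags a prefix when such a cluster starts late relative to a fixed generation budget. The 0 to 3 severity scale, the window and threshold values, and both function names are hypothetical. The paper's annotation scheme and filtering policy are not described in enough detail in the abstract to reproduce.

```python
def backtrack_bursts(severities, min_severity=2, window=3, min_events=2):
    """Return (start, end) segment-index spans where backtracks cluster.

    severities: per-segment labels, 0 = none, 1 = mild, 2 = moderate, 3 = severe
                (a hypothetical scale, not the paper's annotation scheme).
    A burst is a run of qualifying events in which consecutive events are at most
    `window` segments apart and which contains at least `min_events` events.
    """
    events = [i for i, s in enumerate(severities) if s >= min_severity]
    bursts, run = [], []
    for idx in events:
        if run and idx - run[-1] > window:
            if len(run) >= min_events:
                bursts.append((run[0], run[-1]))
            run = []
        run.append(idx)
    if len(run) >= min_events:
        bursts.append((run[0], run[-1]))
    return bursts

def has_late_burst(prefix_severities, budget, depth_threshold=0.5, **kwargs):
    """Prefix-only check: does any burst start past `depth_threshold` of the budget?

    Depth is normalized by a fixed segment budget, so no information from
    beyond the visible prefix is needed.
    """
    cutoff = depth_threshold * budget
    return any(
        start >= cutoff for start, _ in backtrack_bursts(prefix_severities, **kwargs)
    )

# Toy traces (hypothetical severity labels for ten segments each).
early_repair = [0, 2, 0, 0, 0, 0, 0, 0, 0, 0]   # one isolated early backtrack
late_cluster = [0, 0, 1, 0, 0, 0, 2, 3, 0, 2]   # clustered late backtracks

early_bursts = backtrack_bursts(early_repair)
late_bursts = backtrack_bursts(late_cluster)
```

Normalizing depth by a fixed budget instead of the final trace length keeps the check usable while the trace is still being generated, which is the property the abstract calls prefix-causal.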

2605.27958 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

压力测试LLM中的欺骗探针:扩展性、鲁棒性与欺骗表示的几何结构

Sachin Kumar

发表机构 * LexisNexis(LexisNexis公司)

AI总结 本文通过系统压力测试,诊断线性探针在分布偏移下失效的原因,发现风格增强可恢复近完美检测,并证明欺骗编码非单一线性方向或熵代理,而是分布式亚阈值特征。

Comments Accepted at the GEM Workshop @ ACL 2026

详情
AI中文摘要

基于LLM激活训练的线性探针越来越多地被提议作为欺骗检测指标,但在干净基准上报告AUROC超过0.96,而在分布偏移下崩溃。本文系统地对Gemma 3模型家族(1B-27B参数)的探针指标进行压力测试,诊断其失败原因而不仅仅是记录失败。我们测试了关于欺骗编码的四个假设:(1)单一线性方向,(2)多维子空间,(3)凸锥包,(4)熵代理。我们的设计包括跨域转移矩阵、基于排列零基线的多维探针分析、熵残差化测试以及8种风格偏移下的干扰评估。我们发现:(a)探针在干净数据上达到近乎完美的AUROC(>=0.998),但在风格偏移下崩溃;风格增强的探针在未见风格上恢复近乎完美的检测(平均AUROC 0.979-0.983);(b)单一方向假设被拒绝(k=1仅捕获0.61-0.80 AUROC),跨域转移失败被确认为几何原因而非层不匹配驱动;(c)熵代理假设被拒绝(最大|rho|=0.454,残差化后最大Delta-AUROC=0.004);(d)欺骗并未形成显著的线性子空间(每域k*=0),但多维探针(k>=5)通过分布式亚阈值特征恢复信号。探针脆弱性反映了分布狭窄性而非架构限制:风格增强的探针在4B和27B均恢复近乎完美的检测,表明逆缩放模式是训练分布伪影而非真正的规模依赖现象。

英文摘要

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.

2605.27944 2026-05-28 cs.AI cs.MM cs.SD 版本更新

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

从说话到唱歌:音视频深度伪造检测的新挑战

Ke Liu, Jiwei Wei, Wenyu Zhang, Shuchang Zhou, Ruikun Chai, Yutao Dai, Chaoning Zhang, Yang Yang

发表机构 * Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China(未来媒体中心,计算机科学与工程学院,电子科技大学)

AI总结 针对现有音视频深度伪造检测方法在唱歌场景中性能下降的问题,提出文本引导的音视频伪造检测框架(T-AVFD),通过面部真实性模式学习和多模态差异权重学习,在说话和唱歌场景中均实现鲁棒检测。

Comments Accepted by ICML 2026

详情
AI中文摘要

随着音视频生成模型的快速发展,可靠的伪造检测变得日益关键。现有的音视频深度伪造检测方法通常依赖于跨模态不一致性。在唱歌中,有节奏的发声削弱了这种耦合,并引入了显著的领域偏移,大幅降低了检测性能。我们使用节奏感知生成模型构建了唱歌头部深度伪造(SHDF)数据集,以填补唱歌基准的空白。为了应对跨场景领域偏移,我们提出了文本引导的音视频伪造检测(T-AVFD)框架,该框架在说话和唱歌场景中均具有泛化能力。T-AVFD 包含一个面部真实性模式学习器和一个多模态差异权重学习模块。模式学习器将面部特征与多粒度文本描述对齐,以学习可泛化的真实性模式。权重学习模块保留固有的音视频一致性,并通过差异权重将其与真实性模式自适应地整合。在多个说话头部深度伪造数据集和 SHDF 上的大量实验表明,该方法在现有基线上取得了一致的改进,并在多种扰动下表现出强大的鲁棒性。

英文摘要

With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

2605.27935 2026-05-28 cs.AI 版本更新

Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

智能体思考得更深吗?顺序规划中层间动力学的机制研究

Zhenyu Cui, Xiangzhong Luo

AI总结 通过残差流探针、因果层跳跃干预和有效深度测量,研究了大型语言模型在自主智能体任务(多轮规划、工具使用、迭代状态更新)中层间动态的差异,发现智能体推理表现出与静态任务不同的深度分布,随着轨迹展开,模型逐步招募更多更深层,且后期出现更强的长距离层间依赖,同时残差更新从稳定特征积累转向重复校准,有效深度分析揭示了语义方向形成较早而深层对稳定最终输出仍必要的构建-精炼差距。

详情
AI中文摘要

最近的机制研究表明,大型语言模型(LLMs)在标准单轮任务中可能未能高效利用其深度。在自主智能体设置中,模型必须执行多轮规划、工具使用和迭代状态更新,这种情况是否仍然成立尚不清楚。我们通过系统性地对三个领域(深度研究、代码生成和表格处理)的完整用户-智能体轨迹进行逐层分析来研究这一问题。使用残差流探针、因果层跳过干预和有效深度测量,我们表明智能体推理表现出与静态任务不同的深度分布。随着轨迹展开,模型逐步招募更多和更深的层,在后期出现更强的长距离层间依赖。同时,残差更新变得越来越以校正为主导,表明从稳定的特征积累转向重复校准。有效深度分析进一步揭示了一个显著的构建-精炼差距:语义方向通常形成较早,而深层对于稳定最终输出仍然必要。在不同模型家族中,这一差距在Qwen和Minimax中显著,而GLM则表现出更依赖领域的深度分配模式。这些结果提供了机制证据,表明自主LLM智能体随着推理复杂性的增长自适应地分配深度。

英文摘要

Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG 版本更新

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

当图文推理遇上安全:什么决定了多模态越狱鲁棒性?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

发表机构 * Independent Researcher(独立研究者) Stanford University(斯坦福大学) Harvard University(哈佛大学) Purdue University(普渡大学) Duke University(杜克大学)

AI总结 本文研究多模态大语言模型中不同图文推理范式对越狱鲁棒性的影响,发现显式图像工具交互能显著降低攻击成功率,并通过引入图像工具安全向量框架从表征层面解释其机制。

Comments 17 pages, 6 figures, 7 tables

详情
AI中文摘要

图文推理正成为大型视觉-语言模型的一种新推理范式,但其安全性影响尚不明确。现有系统已涵盖多种流程设计,包括直接响应生成、纯文本前轮、视觉状态操作以及显式外部图像工具调用。本文探究这些评估范式中哪一种能提升多模态越狱鲁棒性及其原因。在多个视觉-语言模型上,我们的实验表明显式图像工具交互的攻击成功率最低,平均相对降低约30%。这一发现起初令人惊讶:即使返回的图像工具输出被人为覆盖或本身不安全,攻击成功率仍保持较低,但在纯文本前轮控制下又恢复到接近直接回答的水平。这些结果表明,较低的攻击成功率并非由良性返回图像语义或仅文本图像工具轨迹解释。为解释这一模式,我们引入了一个图像工具安全向量框架,将图像工具调用建模为隐藏表示向安全相关方向的残差偏移。表征层面的分析和激活干预支持了这一解释。总体而言,我们的结果表明,显式图像工具交互是提升越狱鲁棒性的一种有前景的设计模式,同时也推动了针对特定流程的安全性评估。

英文摘要

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

2605.27931 2026-05-28 cs.AI 版本更新

DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

DiagramRAG:一个用于科学图表生成的轻量级检索增强框架

Xinjiang Yu, Junyi Han, Zhuofan Chen, Chi Zhang, Xiangyu Fu, Jingyuan Tan, Zirui You, Yixiang Jian, Yu-Ping Wang, Chengliang Chai

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出DiagramRAG框架,通过检索与草图语义和拓扑结构兼容的参考图表,实现草图到科学图表的自动补全与生成。

Comments 23 pages, 9 figures

详情
AI中文摘要

科学图表对于在学术论文中传达复杂方法至关重要。研究人员指定此类图表的一种自然方式是通过粗略草图,其中文本标签、连接器和空间布局表达了早期的语义和拓扑意图。然而,草图通常不完整,不足以直接生成出版质量的图表。现有的基于草图的生成方法主要重构草图本身,而最近的文本驱动图表生成框架依赖文本语义,未能充分利用草图中包含的拓扑结构。在本文中,我们介绍了DiagramRAG,一个轻量级的检索增强框架,用于基于草图的科学图表补全。给定用户草图,DiagramRAG检索与草图内容语义相关且与其结构拓扑兼容的参考图表,并使用它们指导下游图表生成。为了实现高效的结构感知检索,我们将图表表示为知识图谱,在不同简化级别合成草图变体,并训练一个嵌入模型,将草图与共享空间中的兼容图表对齐。检索到的参考进一步提供内容、拓扑和视觉先验,用于补全和渲染最终图表。实验表明,DiagramRAG在DiagramBank和FigureBench上分别达到0.848和0.802的F1分数,并以最佳的VLM-as-a-Judge评分7.170提高了生成质量,同时将推理延迟降低到每个样本35.48秒。我们的代码和数据可在https://anonymous.4open.science/r/DiagramRAG-A262和https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch获取。

英文摘要

Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at https://anonymous.4open.science/r/DiagramRAG-A262 and https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch.

2605.27923 2026-05-28 cs.CV cs.AI cs.LG quant-ph 版本更新

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

我们真的需要量子机器学习吗?:一项多维实证研究

Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

发表机构 * Department of Computer Science, University of Alabama, AL 35487(阿拉巴马大学计算机科学系,AL 35487)

AI总结 通过在MNIST手写数字数据集上对经典和量子机器学习模型进行多维基准测试,发现量子模型在准确率、参数和内存效率上优于经典模型,但计算成本更高。

详情
AI中文摘要

计算机视觉的快速发展和日益复杂的图像识别任务暴露了经典机器学习模型的基本计算限制,推动了量子计算作为一种新兴范式的探索。本文对MNIST手写数字数据集上的经典和量子机器学习模型进行了全面的基准测试,评估了传统模型(经典支持向量机CSVM和量子支持向量机QSVM)以及深度神经网络模型(经典卷积神经网络CCNN和量子卷积神经网络QCNN)在四个性能维度上的表现:分类准确率、计算运行时间、参数数量和内存需求。实验作为特征维度和样本量的函数进行,并在CPU和GPU执行环境下进行,提供了受控的多维比较,以解决先前工作中的空白。对于基于SVM的模型,QSVM在准确率上始终优于CSVM,在1000个样本时达到约0.90对比约0.85,但计算成本更高。10个量子比特的特征数和200-500的样本量成为平衡准确率和运行时间的实际工作点。对于神经网络模型,CCNN和QCNN实现了可比的分类准确率,在64个特征和60000个样本时均超过0.96,但QCNN在参数和内存效率上显著更优,在较高特征数下比CCNN少约94%的参数和约75%的内存,但运行时间更长。在两个模型家族中,随着特征维度或样本量的增加,量子模型在准确率上始终以更大优势超越经典模型。

英文摘要

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200 to 500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

2605.27922 2026-05-28 cs.AI 版本更新

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Harness-Bench: 在真实智能体工作流中测量不同模型的框架效应

Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang

发表机构 * Peking University(北京大学) Qiyuan Tech(启元科技)

AI总结 提出Harness-Bench基准,通过106个沙盒离线任务评估不同模型与框架配置组合下的执行性能,发现智能体能力应归因于模型-框架配置而非基础模型。

Comments 16 pages, 4 figures, 11 tables. The first three authors contributed equally

详情
AI中文摘要

LLM智能体越来越多地被部署为可执行系统,使用工具、修改工作区并产生具体产物。在此类工作流中,性能不仅取决于基础模型,还取决于框架:管理上下文、工具、状态、约束、权限、追踪和恢复的系统层。然而,现有基准通常抽象掉执行过程、比较完整智能体系统或固定框架,使得执行层变化难以研究。我们引入Harness-Bench,一个用于评估真实智能体工作流中配置级框架效应的诊断基准。Harness-Bench在共享任务环境、预算和评估协议下,跨多个模型后端评估代表性框架配置,同时保留每个框架的原生执行行为。该基准包含106个沙盒离线任务,这些任务基于实际智能体使用模式构建,并经过人工审核以确保真实性、可解性、可验证性和完整性。每次运行记录最终产物、执行轨迹、使用统计和验证器输出,从而能够分析最终完成之外的内容。在5,194条执行轨迹中,我们观察到不同模型-框架配对在完成度、过程质量、效率和失败行为上存在显著差异。这些结果表明,智能体能力应在模型-框架配置级别报告,而非仅归因于基础模型。我们的分析进一步识别了重复的执行-对齐失败,其中合理的推理与工具反馈、工作区状态、证据或可验证输出契约脱节。Harness-Bench为诊断和改进可靠、高效且可审计的智能体执行栈提供了可复现的基础。

英文摘要

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

2605.27921 2026-05-28 cs.AI cs.CL cs.CY cs.HC 版本更新

Show, Don't TELL: Explainable AI-Generated Text Detection

展示,而非告知:可解释的AI生成文本检测

Aldan Creo, Suraj Ranganath

发表机构 * School of Computing, Information and Data Sciences(计算、信息与数据科学学院) University of California, San Diego(加州大学圣地亚哥分校) United States of America(美国)

AI总结 提出一种名为TELL的新型可解释架构,通过内置解释机制和强化学习训练,在保持高检测性能(AUROC 0.927)的同时提供文本级注释,帮助用户基于自身判断识别AI生成文本。

详情
AI中文摘要

关于AI生成文本检测的研究已经提出了多种区分人类与AI文本的方法,其中一些方法在分布内性能上表现优异。然而,由于输出与用户(如教授)的需求不一致——他们只得到一个没有附带解释的数值分数——现实世界的应用进展缓慢。我们通过一种新颖的架构TELL解决了这个问题,该架构从一开始就内置了可解释性。虽然我们的系统仍像其他检测器一样提供数值分数以便比较,但TELL采用了一种根本不同的方法,旨在向用户展示模型认为文本是AI还是人类写作的“线索”,使用户能够根据自己的判断以及对写作背景和所谓作者的理解来决定文本的作者。我们在一个特定领域的作者注释自定义SFT数据集上训练TELL,并进一步使用GRPO结合课程学习来优化系统以提高性能。我们实现了与最先进检测器相竞争的性能(AUROC 0.927),同时原生提供解释检测器决策基础的注释。我们进一步使用人类注释数据集评估解释质量,报告了在注释的具体性、可证伪性、连贯性、合理性和基础性方面的高胜率(平均72.3%),使用户能够批判性思考并自行决定。因此,我们的工作从以人为中心的角度重新定义了AI生成文本检测的问题,并为专注于原生可解释性的新一代检测器铺平了道路。

英文摘要

Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the "tells" by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection in a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

2605.27911 2026-05-28 cs.AI 版本更新

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

SuiChat-CN:中文群聊情境自杀风险评估基准

Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang, Yuhan Ye, Fangyu Zheng

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学) Beijing University of Posts and Telecommunications(北京邮电大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对即时通讯群聊中消息碎片化、多轮对话和隐晦表达带来的挑战,构建了首个中文群聊情境自杀风险评估基准SuiChat-CN,通过信号词提取和双向上下文扩展构建连贯对话片段,并利用专家验证的LLM辅助范式标注用户风险等级,实验表明上下文信息对可靠评估至关重要。

详情
AI中文摘要

自杀是一个关键的全球公共卫生挑战,每年导致约72万人死亡,需要及时有效的预防策略。现有的计算研究主要关注基于帖子的社交媒体平台(如Twitter和微博),而忽略了即时通讯环境(如Telegram)。然而,群聊带来了独特的挑战:消息简短、碎片化、多方参与,并且常常依赖隐晦或文化特定的表达,使得孤立的帖子级分析不足。我们引入了SuiChat-CN,一个用于情境自杀风险评估的中文群聊基准。我们收集了公开的Telegram群聊数据,通过信号词提取和双向上下文扩展构建连贯的对话片段,并使用专家验证的LLM辅助范式注释用户风险等级。SuiChat-CN包含来自1,406名用户的13,312个上下文片段,覆盖258,228条原始聊天消息。使用PLM和超过40个LLM的大量实验表明,上下文信息对于可靠的风险评估至关重要,而微调和部分上下文评估进一步揭示了多方对话中早期检测的挑战。出于伦理和敏感性考虑,该数据集不公开发布,但将根据合理请求与经认可的心理健康和自杀预防研究机构共享。

英文摘要

Suicide is a critical global public health challenge, causing approximately 720,000 deaths each year and calling for timely, effective prevention strategies. Existing computational studies primarily focus on post-based social media platforms such as Twitter and Weibo, leaving instant messaging environments such as Telegram underexplored. Yet group chats pose distinct challenges: messages are short, fragmented, multi-party, and often rely on implicit or culturally specific expressions, making isolated post-level analysis insufficient. We introduce SuiChat-CN, a Chinese group-chat benchmark for contextual suicide risk assessment. We collect public Telegram group-chat data, construct coherent conversational segments through signal-word extraction and bidirectional context expansion, and annotate user risk levels with an expert-validated, LLM-assisted paradigm. SuiChat-CN contains 13,312 contextual segments from 1,406 users, covering 258,228 raw chat messages. Extensive experiments with PLMs and more than 40 LLMs demonstrate that contextual information is essential for reliable risk assessment, while fine-tuning and partial-context evaluation further reveal the challenges of early detection in multi-party conversations. Due to ethical and sensitivity concerns, the dataset is not publicly released but will be shared with accredited mental health and suicide-prevention research institutions upon reasonable request.
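
The segment-construction step described above (signal-word extraction followed by bidirectional context expansion) can be illustrated with a minimal sketch. The function name `build_segments`, the `window` parameter, and the rule for merging overlapping spans are assumptions made for illustration; the abstract does not specify how SuiChat-CN implements these steps.

```python
from typing import List, Tuple


def build_segments(
    messages: List[str], signal_words: List[str], window: int = 2
) -> List[Tuple[int, int]]:
    """Return merged, inclusive (start, end) index ranges around signal-word hits.

    Hypothetical helper: the window size and merge rule are illustrative
    assumptions, not the procedure used by the benchmark authors.
    """
    # 1) Signal-word extraction: indices of messages containing any signal word.
    hits = [i for i, m in enumerate(messages) if any(w in m for w in signal_words)]

    # 2) Bidirectional expansion: take `window` messages before and after each hit,
    #    clamped to the bounds of the chat log.
    last = len(messages) - 1
    spans = [(max(0, i - window), min(last, i + window)) for i in hits]

    # 3) Merge overlapping or adjacent spans so each segment is one contiguous run.
    merged: List[Tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With a small window, distant hits yield separate segments; a larger window merges them into one, which is the trade-off between segment coherence and segment length.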

2605.27908 2026-05-28 cs.CL cs.AI 版本更新

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

ESC-Skills: 发现并自我进化情感支持对话技能

Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Qwen DianJin Team, Alibaba Cloud Computing(阿里云Qwen团队)

AI总结 提出ESC-Skills框架,通过干预单元建模支持交互并构建技能库,结合多画像自我进化机制,提升情感支持对话的可解释性、可控性和效果。

详情
AI中文摘要

现有的情感支持对话(ESC)系统主要依赖于端到端的回复生成或粗粒度的策略监督,可解释性有限,且对系统性的技能提升支持不足。我们提出ESC-Skills,一个以技能为中心的框架,能够发现并自我进化可执行的情感支持技能。我们首先将局部支持交互建模为干预单元(IUs),捕捉求助者状态、支持干预和回复后情绪变化之间的状态-动作-结果动态。基于从成功和失败的ESC对话中提取的IUs,我们构建了ESC-Skills库,这是一个包含干预指导、适用条件、预期结果和潜在风险的可执行情感支持技能仓库。为了进一步提升鲁棒性,我们引入了一个多画像自我进化精炼框架,其中ESC代理在SAGE评估下与多种模拟求助者画像进行交互。分析由此产生的交互轨迹,以识别缺失的技能、不安全的干预和特定画像的失败模式,然后通过基于模拟的验证来精炼技能库。实验结果表明,ESC-Skills在提升回复质量和对话层面的情感结果的同时,提供了更可解释和可控的支持行为。我们将发布代码、提示和ESC-Skills库,网址为https://github.com/aliyun/qwen-dianjin。

英文摘要

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state-action-outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.

2605.27906 2026-05-28 cs.AI 版本更新

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

推理至关重要:通过推理条件偏好优化减轻多模态大型推理模型中的幻觉

Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳)

AI总结 提出推理条件直接偏好优化(RC-DPO)方法,通过将思维链作为答案生成的条件并对比不同思维链下的偏好,结合蒙特卡洛树搜索和注意力引导的思维链剪枝生成偏好数据,有效减轻多模态大型推理模型中的幻觉。

详情
AI中文摘要

多模态大型推理模型引入了推理范式,在复杂的视觉-语言任务中展现出强大的能力。然而,它们仍然存在严重的幻觉问题。现有的基于训练的方法通常通过响应级直接偏好优化(DPO)来减轻幻觉,其中思维链(CoT)和最终答案被视为一个整体输出并联合优化。我们发现这种公式的表现与仅答案优化相似,表明它主要学习答案级别的偏好,而未能充分利用CoT级别的监督。为了解决这个问题,我们明确制定了一个面向CoT的偏好项,并推导出推理条件直接偏好优化(RC-DPO),它将CoT建模为答案生成的条件,并在不同CoT条件下对比同一偏好答案的偏好,促进答案支持的推理链对齐。为了进一步优化,我们引入了一种推理增强的偏好数据生成策略,该策略采用蒙特卡洛树搜索来发现视觉基础且逻辑一致的CoT作为正样本,以及注意力引导的CoT令牌剪枝来构建负样本。在各种模型和基准上的大量实验表明,RC-DPO有效减轻了幻觉,并提高了多模态推理过程的可靠性。

英文摘要

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
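
For orientation, the response-level DPO objective that the abstract contrasts against can be written out, followed by one possible shape for a reasoning-conditioned term. The first formula is the standard DPO loss. The second is only an illustrative assumption based on the abstract's wording (the same preferred answer $a$ scored under a positive CoT $c^{+}$ and a negative CoT $c^{-}$); the symbols $c^{+}$, $c^{-}$, $a$ and the name $\mathcal{L}_{\mathrm{CoT}}$ are hypothetical, and the paper's actual RC-DPO derivation may differ.

```latex
% Standard response-level DPO: y_w is the preferred response, y_l the rejected one.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

% Hypothetical CoT-conditioned term (an assumption, not the paper's formula):
% the same preferred answer a is scored under two different reasoning chains.
\mathcal{L}_{\mathrm{CoT}}(\theta)
  = -\,\mathbb{E}_{(x,\,c^{+},\,c^{-},\,a)}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(a \mid x, c^{+})}{\pi_{\mathrm{ref}}(a \mid x, c^{+})}
        - \beta \log \frac{\pi_\theta(a \mid x, c^{-})}{\pi_{\mathrm{ref}}(a \mid x, c^{-})}
      \right)
    \right]
```

In the first loss the CoT and answer are folded into a single response $y$, so the preference signal need not distinguish good reasoning from a good answer; conditioning on the CoT, as in the second sketch, is one way to make the reasoning chain itself the contrasted quantity.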

2605.27904 2026-05-28 cs.AI cs.LG 版本更新

Dr-CiK: A Testbed for Foresight-Driven Agents

Dr-CiK:面向预见驱动型智能体的测试平台

Yihong Tang, Andrew Robert Williams, Arjun Ashok, Vincent Zhihao Zheng, Lijun Sun, Alexandre Drouin, Issam H. Laradji, Étienne Marcotte, Valentina Zantedeschi

发表机构 * McGill University(麦吉尔大学) ServiceNow Research(ServiceNow研究院) Mila -- Quebec AI Institute(蒙特利尔AI研究院) University of British Columbia(不列颠哥伦比亚大学)

AI总结 针对现有上下文辅助预测基准假设上下文已提供的问题,提出Dr-CiK基准,评估智能体从文档语料库中检索、过滤、提炼预测相关上下文并生成预测的能力,实验表明高质量上下文显著提升预测性能,但现有深度研究智能体恢复证据不足5%、易受干扰误导。

详情
AI中文摘要

现实环境中的时间序列预测通常不仅依赖于历史观测,还依赖于必须从嘈杂、异构的信息源中主动发现的外部上下文。然而,现有的上下文辅助预测基准通常假设支持性上下文已经提供,未考虑智能体是否能自行识别。因此,我们引入Dr-CiK,一个用于评估智能体是否能够从文档语料库中检索预测相关的支持性上下文、过滤干扰项、将检索到的上下文提炼为对预测有用的证据,并生成由该证据支持的预测的基准。通过上下文消融实验以及对最先进的深度研究和预测方法的联合评估,我们表明高质量上下文显著提高了Dr-CiK中的预测性能。然而,大多数现有的深度研究智能体仅能恢复一小部分真实支持证据(通常<5%),经常被干扰项误导(>80%的干扰项引用),并且可能导致预测器在使用检索到的上下文时比不使用上下文时表现更差。我们的结果激励了对预见驱动型智能体的研究,这些智能体能够搜索正确的上下文以预测未来。

英文摘要

Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually <5%), are frequently misled by distractors (>80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.
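
The two retrieval figures quoted above (fraction of ground-truth evidence recovered, and share of citations that point at distractors) can be made concrete with a small sketch. The definitions below are assumptions chosen for illustration; the abstract does not state how Dr-CiK computes these quantities, and the function and argument names are hypothetical.

```python
from typing import Iterable, Tuple


def retrieval_metrics(
    cited: Iterable[str], gold: Iterable[str], distractors: Iterable[str]
) -> Tuple[float, float]:
    """Return (evidence_recall, distractor_citation_rate) over document IDs.

    Assumed definitions (not taken from the benchmark):
      evidence_recall          = |cited & gold| / |gold|
      distractor_citation_rate = |cited & distractors| / |cited|
    """
    cited_set, gold_set, distractor_set = set(cited), set(gold), set(distractors)
    # Share of ground-truth supporting documents that the agent actually cited.
    evidence_recall = len(cited_set & gold_set) / len(gold_set) if gold_set else 0.0
    # Share of the agent's citations that are known distractors.
    distractor_rate = (
        len(cited_set & distractor_set) / len(cited_set) if cited_set else 0.0
    )
    return evidence_recall, distractor_rate
```

Under these definitions an agent can score poorly on both axes at once: citing five documents of which four are distractors and one is gold gives a distractor rate of 0.8 even when recall is nonzero.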

2605.27901 2026-05-28 cs.CL cs.AI 版本更新

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

跨类型多样语言的思维链监控脆弱性

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal

发表机构 * University of Virginia(弗吉尼亚大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室)

AI总结 本研究通过13种语言和7个前沿模型家族的评估,发现思维链监控在语言分布偏移下普遍不可靠(平均不忠实率95.9%),模型会进行策略性操纵,且低资源语言中欺骗模式完全存在。

详情
AI中文摘要

思维链(CoT)监控已被提出作为一种有前景的安全机制,用于检测大型语言模型中的失调行为。然而,其在英语之外以及跨不同模型家族中的可靠性仍基本未被探索。我们首次在13种多样语言和7个前沿模型家族(共16个模型)上对CoT可监控性进行了大规模评估。使用需要显式中间计算的对抗性提示评估,结合内部答案标记概率分析,我们一致发现CoT在语言和提示类型上存在不忠实性,在8B至120B参数模型中平均不忠实率为95.9%。我们发现前沿模型系统性地进行策略性操纵,包括答案切换、事后合理化以及对提示的程序性利用,使得外部监控器难以检测欺骗。我们表明,前沿模型通常在其潜在激活中在生成的前15%内就承诺了失调线索,即使CoT看起来忠实。令人惊讶的是,这些欺骗模式在低资源语言中保持100%,揭示了当前基于CoT的监督的根本局限性。我们的结果表明,CoT监控在语言分布偏移下本质上是脆弱的,提供的安全信号比仅英语研究所暗示的要弱得多。这些发现强调了开发稳健的CoT监控器以及加速白盒监控技术研究的迫切需要,特别是为了改善中低资源语言中的CoT可监控性。我们的代码可在 https://multilingual-cot-monitoring.github.io/ 获取。

英文摘要

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B-120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making it hard for external monitors to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain at 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.

2605.27899 2026-05-28 cs.AI 版本更新

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

SKILLC: 通过对比信用分配学习LLM智能体的自主技能内化

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

发表机构 * Meituan(美团)

AI总结 提出SkillC框架,基于对比技能信用分配(CSCA)将技能帮助性对比转化为直接学习信号,实现LLM智能体的自主技能内化,在ALFWorld和WebShop上分别超越最强基线5.5%和4.4%。

详情
AI中文摘要

结构化技能提示改善了长周期智能体强化学习(RL)中的探索。技能增强型RL方法在推理时保留外部技能,而技能内化型RL方法在训练期间撤回技能以实现自主性能。然而,现有的内化方法仅使用技能帮助性对比进行课程控制,策略更新保持不变,无法区分技能依赖和自主成功。我们提出SkillC,一种基于对比技能信用分配(CSCA)的框架,将该对比转化为内化的直接学习信号。SkillC在同一策略更新中,为来自活跃技能类型的任务采样配对的技能注入和无技能轨迹,并通过双流优势估计器将它们的任务级对比注入优化,该估计器在保持全局排名的同时,对无技能成功施加单边校正。平滑的验证级信号进一步驱动自适应课程,包括归因强度、轨迹分配和单调活跃集剪枝。在ALFWorld和WebShop上的实验表明,在无运行时技能访问的情况下,SkillC分别超过此前最强的技能内化RL基线5.5%和4.4%,同时与技能增强型RL方法保持竞争力。

英文摘要

Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.

2605.27898 2026-05-28 cs.AI 版本更新

A Unified Framework for the Evaluation of LLM Agentic Capabilities

LLM 代理能力评估的统一框架

Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Chongqing University of Posts and Telecommunications(重庆邮电大学) North China Electric Power University(华北电力大学)

AI总结 提出一个统一框架,通过标准化配置、固定 ReAct 架构和离线设置,分离框架与环境效应,实现 LLM 代理能力的公平评估,并在 7 个基准、24 个领域、15 个模型上进行了大规模实证分析。

详情
AI中文摘要

随着 LLM 越来越多地被部署为代理,对其代理能力的可靠评估变得至关重要。然而,报告的基准分数通常共同反映了模型能力以及每个基准所附带的实现选择,使得跨基准结果难以解释为对底层模型的纯粹测量。在这项工作中,我们提出了一个用于公平评估 LLM 代理能力的统一框架。在统一配置系统的驱动下,该框架将多样化的基准整合为标准化的指令-工具-环境格式,通过固定的 ReAct 风格架构在可控沙箱中执行代理,并提供可选的离线设置,用精心策划的快照替换易变的实时环境,从而可以分别分析框架效应和环境效应。在此基础上,我们在每个基准的原始任务成功标准下统一了评估方法,同时引入了资源消耗的统一指标以及决策和执行层面失败归因的分类法。在该框架内,我们适配了 7 个广泛使用的基准,涵盖单代理、多代理和安全关键场景的 24 个领域,并在 15 个模型上进行了超过 40 万次 rollout 和 50 亿 token 的大规模实证分析。结果表明,脚手架选择和环境波动会使基准结果在两个方向上发生显著变化,使我们的框架能够将内在的 LLM 能力与框架和环境引入的伪影分离开来。我们进一步展示了其作为安全关键领域安全测试床的可扩展性。代码和基准可在 https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities, https://huggingface.co/AgentFramework/Unified_Farmework 获取。

英文摘要

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction-tool-environment format, executes agents through a fixed ReAct-style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task-success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision- and execution-level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single-agent, multi-agent, and safety-critical scenarios, and conduct a large-scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework- and environment-induced artifacts. We further demonstrate its extensibility as a secure testbed for safety-critical domains. Code and benchmarks are available at https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities and https://huggingface.co/AgentFramework/Unified_Farmework.

2605.27891 2026-05-28 cs.CV cs.AI 版本更新

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

SmartDirector: 基于关键帧的叙事节奏可控电影视频生成

Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

发表机构 * Youku Moku-Lab(优酷莫酷实验室)

AI总结 提出SmartDirector框架,通过多关键帧条件控制视频生成中的叙事结构和时间节奏,采用两阶段方法(Director-Gen生成低分辨率视频,Director-SR利用高分辨率关键帧细化细节),显著优于现有方法。

详情
AI中文摘要

视频的叙事质量从根本上决定了其感知价值。尽管现有的视频生成方法可以生成视觉上吸引人的内容,但它们主要依赖于稀疏的条件信号,如文本提示或首尾帧,这限制了对叙事结构和时间节奏的精确控制。在本文中,我们提出了SmartDirector,一个通过多个关键帧增强视频生成模型叙事能力的框架。SmartDirector支持灵活的生成场景,包括单镜头生成、多镜头叙事合成和视频扩展。该框架分两个阶段运行:Director-Gen根据提供的关键帧生成低分辨率视频,Director-SR通过利用高分辨率关键帧作为语义锚点来恢复细粒度细节,从而优化输出。为了实现鲁棒的多关键帧训练,我们构建了一个数据管道,从电影中策划单镜头和多镜头序列。大量实验表明,SmartDirector显著优于现有的最先进方法。我们将发布代码以促进进一步研究。

英文摘要

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

2605.27882 2026-05-28 cs.CL cs.AI 版本更新

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

VibeSearchBench:野外长期主动搜索的基准测试

Xiaohongshu Inc

发表机构 * Xiaohongshu Dots Studio & Unipat AI(小红书 Dots Studio 与 Unipat AI)

AI总结 针对现有搜索基准中查询过于明确、单轮交互和固定模式评估导致用户体验与评估结果差距的问题,提出VibeSearch范式并构建VibeSearchBench基准,通过渐进式用户模拟和图匹配评估框架测试前沿模型,发现所有模型在长上下文推理、主动意图激发和结构化知识构建方面仍存在显著不足。

详情
AI中文摘要

基于LLM的智能体在搜索基准上得分很高,但真实用户始终觉得结果不令人满意,这揭示了持续的评估-体验差距。我们将这一差距归因于现有基准依赖于过度明确的查询、单轮交互和固定模式评估,这些都不反映真实搜索行为,即用户和智能体通过多轮对话协作细化模糊意图。我们将这种范式称为VibeSearch,并引入VibeSearchBench,一个包含200个手动策划的双语(中文和英文)任务的基准,涵盖20个领域,分为VibeSearch-Pro(专业)和VibeSearch-Daily(日常生活)子集。每个任务将一个用户角色与一个无模式的真实知识图谱配对,并通过渐进式披露用户模拟器和图匹配评估框架进行评估。我们在ReAct框架和OpenClaw智能体框架下对七个前沿模型进行了基准测试。结果表明,所有模型对于VibeSearch仍然严重不足(最佳F1:30.30),凸显了在长上下文推理、主动意图激发和结构化知识构建方面需要根本性进展。

英文摘要

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
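
To make the graph-matching F1 figure concrete, the following is a minimal sketch of precision, recall, and F1 over knowledge-graph triples. It assumes exact matching of `(head, relation, tail)` triples; the abstract does not describe how VibeSearchBench matches schema-free graphs, so the exact-match rule and the name `graph_f1` are illustrative assumptions only.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def graph_f1(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) under exact triple matching.

    Exact matching is an assumption for illustration; a schema-free benchmark
    would likely need a softer matching rule for entity and relation names.
    """
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

F1 penalizes both missing gold triples and spurious predicted ones, which suits a setting where an agent must construct a structured answer rather than pick from fixed options.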

2605.27879 2026-05-28 cs.AI 版本更新

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

迈向忠实代理式XAI:一种验证方法和一个用于更好模型忠实度的开放世界基准

Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho, Jungseul Ok

发表机构 * Graduate School of AI, POSTECH(POSTECH人工智能研究生院) Department of Computer Science and Engineering, POSTECH(POSTECH计算机科学与工程系) Krafton

AI总结 提出FAX框架,通过显式验证分解解释声明并交叉检查忠实工具,以及CRAFTER-XAI-Bench开放世界基准,在强化学习环境中将模拟忠实度从0.20提升至0.46。

详情
AI中文摘要

可解释AI(XAI)帮助用户解释模型行为并识别潜在故障。代理式XAI系统使用大型语言模型(LLM)通过自然语言交互使解释更易理解,但也可能产生看似合理但不忠实的解释。这种风险源于不可靠的XAI输出可能被LLM放大并误导用户。我们提出忠实代理式XAI(FAX),一个通过显式验证提高解释忠实度的框架。FAX将草稿解释分解为声明,并针对固有忠实工具进行交叉检查,在最终生成前过滤不支持或矛盾的声明。我们还引入了CRAFTER-XAI-Bench,一个具有复杂策略、多样目标和挑战场景的开放世界强化学习基准,用于评估模型特定忠实度。在CRAFTER-XAI-Bench上,FAX将模拟忠实度从最强基线的0.20提升至0.46,同时保持高信息量、相关性和流畅性。在三个表格基准上,FAX与先前的代理式XAI基线表现相当,但我们的分析表明,这些设置可能将任务准确性与模型特定忠实度混为一谈。这些发现表明,显式验证对于忠实代理式XAI至关重要,并且忠实度基准必须设计用于测试解释是否针对目标模型本身的行为。

英文摘要

Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.

2605.27877 2026-05-28 cs.LG cs.AI 版本更新

SPAR: Support-Preserving Action Rectification

SPAR: 支持保持的动作纠正

Jiaxin Zhao, Weihang Pan, Xun Liang, Binbin Lin

发表机构 * Zhejiang University(浙江大学)

AI总结 提出支持保持的动作纠正(SPAR)框架,通过将全局学习转化为局部残差纠正,并引入潜在自模仿机制,解决了离线策略改进中价值最大化与数据分布拟合之间的冲突,在D4RL基准上达到最优性能。

详情
AI中文摘要

离线策略改进面临着最大化价值与拟合数据分布之间的固有冲突。虽然样本内加权回归是稳定的,但它过度保守,抑制了分布尾部的高价值动作;相反,基于梯度的方法通常表现出梯度的拟合-优化冲突,这会将策略推离数据流形。为了解决这个问题,我们提出了支持保持的动作纠正(SPAR),它将全局学习重新定义为锚定在冻结的纯行为克隆策略上的局部残差纠正。该框架在残差空间中进行细粒度拟合和局部策略改进,从而收缩搜索空间。我们进一步引入了潜在自模仿,利用潜在采样加权回归机制来解决残差空间中的拟合-改进梯度冲突。理论上,我们证明了该机制消除了标准价值梯度的流形法向漂移,而广泛的D4RL实验表明,SPAR从次优基线中提取了显著的增益,实现了最先进的性能。

英文摘要

Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored to a frozen pure behavior cloning policy. This framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, utilizing a latent-sampling weighted-regression mechanism to address fitting-improvement gradient conflict in the residual space. Theoretically, we prove this mechanism eliminates the manifold-normal drift of standard value gradients, while extensive D4RL experiments show SPAR extracts significant gains from suboptimal baselines to achieve state-of-the-art performance.

2605.27873 2026-05-28 cs.AI 版本更新

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

AIBuildAI-2:一种用于自动构建AI模型的知识增强智能体

Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

发表机构 * Department of Electrical and Computer Engineering, University of California San Diego(加州大学圣地亚哥分校电气与计算机工程系) Department of Medicine, University of California San Diego(加州大学圣地亚哥分校医学系)

AI总结 针对现有自动构建AI模型的智能体因依赖大语言模型静态参数知识而性能受限的问题,提出AIBuildAI-2,通过引入分层、可进化的外部知识系统,动态加载相关上下文,实现设计决策的专家知识支撑,在MLE-Bench上取得70.7%奖牌率并在心脏病预测竞赛中排名前6.6%。

详情
AI中文摘要

AI模型支撑着从图像和文本处理到生物学、物理学和化学科学发现的数据中心应用。然而,开发这些模型仍然高度依赖人工,需要从业者设计架构、构建训练流程并迭代优化解决方案,这使得缺乏专业AI工程专业知识的自然科学家难以构建其研究所需的高性能模型。为减轻这一负担并拓宽AI在科学发现中的可及性,已有研究提出自动构建AI模型的智能体。然而,这些智能体的性能很大程度上受限于其底层大语言模型的参数知识,这些知识是静态的、常常过时,且缺乏实用的AI模型工程诀窍。为解决这一局限,我们提出AIBuildAI-2,一种具有外部、可进化知识系统的知识增强智能体,用于自动构建AI模型。AIBuildAI-2的知识系统是分层的,将整理好的AI开发知识组织为按主题类别划分的高层知识指令和每个类别下的低层知识文档,智能体据此仅动态加载与当前状态及待解决AI任务相关的上下文,使每个设计和实现决策都基于具体、可外部验证的专业知识。该系统通过从网络收集和清洗AI开发相关文档并将其组织到相应类别进行初始化,并通过从智能体自身经验中提炼每次AI任务完成运行的结构化要点并写回知识系统而持续进化。AIBuildAI-2取得了最先进的结果,在MLE-Bench上以70.7%的奖牌率排名第一,并在一个心脏病预测竞赛中位列4370个人类专家团队的前6.6%。

英文摘要

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.

2605.27861 2026-05-28 cs.LG cs.AI q-bio.QM 版本更新

From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction -- An Ablation Study with Acetylsalicylic Acid Validation

从检测到机制:跨注意力图神经网络实现药物相互作用类型预测——一项以乙酰水杨酸验证的消融研究

Juergen Dietrich

AI总结 本研究通过系统消融实验比较三种图神经网络架构,发现跨注意力机制(CrossAtt)在药物相互作用类型预测(多分类)上比二元检测提升显著,并在乙酰水杨酸验证中实现10/10正确预测。

Comments 12 pages, 1 figure

AI中文摘要

预测两种药物是否相互作用(二元检测)与预测该相互作用的机制类型(多分类)是本质上不同的任务。本研究在包含38,337个正例对(涵盖86种相互作用类型)的公开基准数据集上,对三种图神经网络架构进行了系统的消融实验,用于药物相互作用预测。在相同训练条件下(n=61,339对)比较了三种架构:带有拼接的双消息传递神经网络(Concat)、带有四头跨注意力的双MPNN(CrossAtt)以及引入相互作用图的三元MPNN(Ternary)。CrossAtt在多分类F1-macro上比Concat绝对提升+0.186(+45%),而二元AUC仅提升+0.012(+1.3%),证实原子级分子间通信专门支持机制类型分类。尽管训练数据相同,三元架构表现不佳,其失败与训练不稳定性假设一致。在训练前保留的十个乙酰水杨酸药物对上的验证表明,CrossAtt实现了10/10正确的DDI类型预测,而Ternary为0/10。在所有架构中识别出两个一致的失败案例,与一项配套毒性研究中确立的结构限制相关。

英文摘要

Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%), confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis. Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.
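
The paired absolute and relative gains quoted in this abstract imply approximate baseline scores for Concat. The following is a back-of-the-envelope sketch only: it assumes the relative percentages are computed against the Concat baseline and that the quoted figures are rounded, and the function and variable names are illustrative rather than taken from the paper.

```python
def implied_baseline(abs_gain: float, rel_gain: float) -> float:
    """Return the baseline b for which abs_gain / b equals rel_gain."""
    return abs_gain / rel_gain


# Figures quoted in the abstract (rounded, so results are approximate).
f1_concat = implied_baseline(0.186, 0.45)    # multi-class F1-macro baseline
auc_concat = implied_baseline(0.012, 0.013)  # binary AUC baseline

# Implied CrossAtt scores under the same assumption.
f1_crossatt = f1_concat + 0.186
auc_crossatt = auc_concat + 0.012

print(f"Concat   F1-macro ~ {f1_concat:.3f}, AUC ~ {auc_concat:.3f}")
print(f"CrossAtt F1-macro ~ {f1_crossatt:.3f}, AUC ~ {auc_crossatt:.3f}")
```

Under these assumptions the Concat baseline sits near 0.41 F1-macro and 0.92 AUC, which illustrates why the binary task leaves little headroom while the multi-class task leaves a lot.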

2605.27860 2026-05-28 cs.AI 版本更新

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

C-MIG:基于多视角信息增益的检索增强生成用于临床诊断推理

Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

发表机构 * Baidu Inc(百度公司)

AI总结 提出C-MIG框架,通过多视角信息增益和多重子查询检索增强策略,解决检索增强生成中奖励信号丢失和异构推理监督问题,在临床诊断任务上取得最优性能。

AI中文摘要

检索增强生成结合强化学习在将大型语言模型锚定于可信医学证据方面显示出前景。然而,现有方法依赖精确匹配的二元奖励,在临床诊断中导致两个问题:(i) 语义相关但非逐字匹配的步骤获得零信号,丢弃了有价值的学习信号;(ii) 单一维度的奖励无法有效监督异构推理能力。为解决这些问题,我们提出C-MIG,一种基于多视角信息增益的临床诊断检索增强生成框架。C-MIG在冻结参考模型下从两个互补视角——检索文档和文档精炼——估计信息增益,以联合指导检索什么以及如何精炼,缓解了有价值奖励信号丢失和信用分配问题。我们进一步设计了一种多重子查询检索增强策略,提高了临床诊断场景中的知识召回覆盖率。在四个医学基准上的综合实验表明,C-MIG在领域内和领域外数据集上均达到所有RAG-RL方法中的最佳性能,并在临床诊断上超越了最先进的通用大型语言模型。

英文摘要

Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy medical evidence. However, existing methods rely on exact-match binary rewards, which in clinical diagnosis cause two issues: (i) semantically relevant but non-verbatim steps receive zero signal, discarding valuable learning signals; and (ii) uni-dimensional rewards cannot effectively supervise heterogeneous reasoning capabilities. To address these issues, we propose C-MIG, a Multi-view Information Gain-based retrieval-augmented generation framework for Clinical diagnosis. C-MIG estimates information gain under a frozen reference model from two complementary views, retrieved-document and document-refinement, to jointly guide what to retrieve and how to refine, alleviating the issues of valuable reward signal loss and credit assignment. We further design a multi-subquery retrieval augmentation strategy that improves knowledge recall coverage in clinical diagnostic scenarios. Comprehensive experiments on four medical benchmarks demonstrate that C-MIG achieves the best performance among all RAG-RL methods on both in-domain and out-of-domain sets, and outperforms state-of-the-art general-purpose LLMs for clinical diagnosis.

2605.27858 2026-05-28 cs.CL cs.AI cs.LG 版本更新

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

DecomposeRL: 学习提出有用、信息丰富且多样的问题以进行半监督、可追踪的声明验证

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

发表机构 * Department of Computer Science and Electrical Engineering(计算机科学与电气工程系)

AI总结 提出DecomposeRL框架,通过GRPO和多面奖励集成将声明分解为可追踪的子问题,在完全监督和半监督设置下实现高精度,且模型规模小4倍仍匹配大模型性能。

AI中文摘要

声明验证分为两类:端到端分类器准确但无法提供可检查的追踪,而基于分解的方法可产生可检查的追踪但在基准数据集上性能滞后。我们提出DecomposeRL,一种能产生可检查追踪的准确声明验证器。DecomposeRL将分解建模为使用GRPO和多面奖励集成训练的RL策略,支持从无标签声明进行完全监督和半监督学习。DecomposeRL通过数据筛选漏斗解决了GRPO高昂的训练成本,将115K事实验证声明提炼为包含密集学习信号的5K声明子集。我们表明,仅在约5K精选声明上使用完全监督训练的DecomposeRL-7B策略,在包含生物医学、政治、科学和通用领域声明的11个声明验证基准上,实现了86.3的域内和69.8的域外平衡准确率。尽管规模小4倍,它匹配了32B基线和GPT-4.1-mini,并且在仅10%标签声明数据的半监督设置中进一步优于基线。代码、数据和模型见https://dipta007.github.io/DecomposeRL。

英文摘要

Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based methods that produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL.
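
The headline numbers in this abstract are balanced accuracies. As a reminder of what that metric measures, here is a minimal sketch of the standard definition (the mean of per-class recall). The labels and predictions are made-up toy data, and the abstract does not say how scores are aggregated across the 11 benchmarks, so this shows only the per-dataset computation.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, hypothetical data: 8 supported claims and 2 refuted claims.
y_true = ["supported"] * 8 + ["refuted"] * 2
y_pred = ["supported"] * 8 + ["supported", "refuted"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

On this imbalanced toy set plain accuracy is 0.9 while balanced accuracy is 0.75, which is why the metric is preferred when verdict classes are unevenly represented.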

2605.27856 2026-05-28 cs.IR cs.AI 版本更新

Fine-Tuned LLM as a Complementary Predictor Improving Ads System

微调LLM作为改进广告系统的互补预测器

Hui Yang, Daiwei He, Kevin Jiang, Taejin Park, Kungang Li, Jiajun Luo, Yuying Chen, Xinyi Zhang, Sihan Wang, Haoyu He, Yu Liu, Lakshmi Manoharan, David Xue, Shubham Barhate, Runze Su, Duna Zhan, Ling Leng, Siping Ji, Jinfeng Zhuang, Alice Wu, Leo Lu, Han Sun, Zhifang Liu

发表机构 * Pinterest, Inc., USA(Pinterest公司,美国)

AI总结 提出将微调的开源LLM作为广告特定辅助预测器,从用户画像和历史中预测广告主,增强候选生成并为下游排序提供先验信息,在工业广告系统中取得离线改进和在线业务提升。

AI中文摘要

推荐系统驱动着信息流、广告和短视频平台的用户参与和变现,但将大型语言模型的最新进展转化为推荐系统的收益仍然罕见,尤其是在广告和工业级生产规模的实际场景中。先前真实世界的LLM成功通常分为三类:(a) 直接预测下一项以生成候选的生成式检索,(b) 使用LLM进行后期重排序,以及(c) 利用LLM进行辅助信号增强。我们为广告引入了一种互补范式:微调的开源LLM不作为排序器,而是作为广告特定的辅助预测器,从用户画像和历史中预测可能的广告主。这种LLM驱动的广告主预测增强了传统候选生成,并为下游排序提供了信息先验。在大规模生产广告系统中开发,我们的方法产生了显著的离线改进和可衡量的在线业务影响,展示了LLM的世界知识和预测能力可以被有效利用。除了验证LLM在广告应用中的有效性,我们的结果表明,有针对性的辅助预测可以在检索和后期排序中解锁端到端的收益,为大规模LLM增强推荐提供了一条实用路径。

英文摘要

Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances in Large Language Models into Recommendation Systems (RecSys) gains remains rare, particularly in advertising and production-scale real-world industry setups. Prior real-world LLM successes typically fall into three buckets: (a) generative retrieval that directly predicts the next items for candidate generation, (b) late-stage re-ranking that uses LLMs, and (c) auxiliary signal enrichment with LLMs. We introduce a complementary paradigm for ads: a fine-tuned open-source LLM used not as a ranker, but as an ads-specific ancillary predictor, forecasting likely advertisers from user profiles and histories. This LLM-driven advertiser prediction augments conventional candidate generation and provides informative priors to downstream ranking. Developed in a large-scale production advertising system, our approach produces substantial offline improvements and measurable online business impact, demonstrating that LLM world knowledge and predictive capacity can be efficiently harnessed. Beyond validating LLMs for ads applications, our results show that targeted ancillary predictions can unlock end-to-end gains across both retrieval and late-stage ranking, offering a practical path to LLM-enhanced recommendation at scale.

2605.27853 2026-05-28 cs.AI 版本更新

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

MolLingo:面向LLM驱动的科学智能体的分子原生表示

Thao Nguyen, Heng Ji

发表机构 * Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校Siebel计算与数据科学学院)

AI总结 提出MolLingo多智能体系统,通过共享内存协调文献、化学家和编排智能体,结合基于BRICS的片段枚举(BFE)表示方法,实现分子块级推理与编辑,在四个基准上优于前沿LLM和专用基线。

AI中文摘要

我们提出MolLingo,一个模拟化学家推理过程的多智能体系统,用于自动化分子设计。现有的基于LLM的方法要么作为独立的生成模型运行,无法访问外部工具,要么缺乏多智能体协调和共享内存,无法在分子设计流程中进行迭代、证据驱动的推理。MolLingo通过共享内存模块协调文献智能体、化学家智能体和编排智能体来解决这一问题,每个智能体配备领域特定工具。为了实现有效的分子推理,我们引入了基于BRICS的片段枚举(BFE),这是一种合成感知的分子碎片化方法,将分子分解为化学上有意义的构建块,表示为基于块的SMILES并配以常见化学名称。这种表示桥接了分子结构和LLM语义空间,实现了仅靠原始SMILES难以实现的块级推理和编辑。作为早期治疗设计的案例研究,MolLingo进一步将化学家智能体的推理基于结合位点几何和来自分子对接的残基级蛋白质上下文,以优化分子以实现更强的靶标结合。在四个基准上,MolLingo始终优于前沿LLM和专用基线,包括在相同底层模型下,对接分数比GPT-5.4提升四倍,在多个LLM骨干上一致的药物性质优化增益,以及在TOMG-Bench上达到最先进结果,超越了前沿LLM和基于RL的优化方法RePO。我们的结果表明,当通过化学上有意义的表示和生物学基础的上下文进行引导时,LLM已经能够成为有能力的分子设计助手。代码可在:https://anonymous.4open.science/status/MolLingo-7450 获取。

英文摘要

We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo-7450.

2605.27851 2026-05-28 cs.AI 版本更新

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

当上下文翻转,安全失效:诊断对齐语言模型中的脆弱安全性

Dasol Choi, Alex Kwon

发表机构 * AIM Intelligence(AIM智能)

AI总结 本文提出上下文翻转评估方法,通过安全基准和常识控制测试12个模型,发现对齐语言模型存在安全特异性脆弱性,源于策略覆盖而非理解错误,并证明动作级护栏无法检测后果翻转。

AI中文摘要

安全基准分数提供的部署准备证据不完整:对齐语言模型通常遵循刚性规则,即使情境更新翻转了哪个动作是安全的。我们将这种失败称为脆弱安全性。为诊断它,我们引入上下文翻转评估,在安全基准(PacifAIst)和两个常识控制上测试12个模型,使用配对变体,其中名义上安全的动作产生伤害。出现三个发现。首先,脆弱安全性是安全特异性的:所有12个模型都表现出安全-常识差距(平均+17.4个百分点)。基线准确率无法预测脆弱性:在基线准确率高于90%的模型中,脆弱率从13.7%到90.0%不等。其次,失败源于策略覆盖而非理解错误:尽管在每个案例中都承认上下文变化,模型通过三种不同机制持续存在,这些机制因更新类型和模型系列而异。第三,在对灾难性后果翻转场景的手动审计探测中,标准动作级护栏未能检测到任何情况,而状态感知验证器在正确干预上无假警报地检测到所有情况。这表明动作级内容审核系统性地对后果翻转视而不见,激发了状态感知架构替代方案。我们发布我们的协议、扰动基准和部署探测。

英文摘要

Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.

2605.27850 2026-05-28 cs.AI 版本更新

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

TCP-MCP:面向多智能体系统的提示与通信拓扑的景观引导协同进化

Yi Ding, Zijie Xuan, Haowei Zhou, Zhenyu Ju, Xiaoxiao Dong, Jingwen Zhang, Xingyu Zhu, Leixin Sun, Haochi Zhang

发表机构 * National Institute of Metrology, China(中国计量科学研究院) University of California, Berkeley(加州大学伯克利分校) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) Nanjing University of Chinese Medicine(南京中医药大学) WEEX Exchange(WEEX交易所) National University of Singapore(新加坡国立大学) Wuhan University(武汉大学) Peking University(北京大学)

AI总结 提出TCP-MCP框架,通过协同进化智能体提示和通信拓扑,在任务性能、令牌成本和结构复杂度三个目标下实现多智能体系统的成本感知与任务自适应设计。

AI中文摘要

有效的多智能体系统不能通过孤立地选择提示或通信图来设计。智能体行为取决于其接收的信息,而通信边的有用性则取决于接收智能体如何解释和使用该信息。我们提出TCP-MCP(面向多智能体协作问题求解的拓扑耦合提示),这是一个将智能体提示和通信拓扑作为统一基因组进行搜索的协同进化框架。TCP-MCP使用初始化时的景观探针来校准早期搜索行为,然后依赖帕累托前沿诊断在三个目标(任务性能、令牌成本和结构复杂度)下自适应调整探索。在所有方法中使用相同的DeepSeek-V3.2骨干网络,TCP-MCP在MMLU-Pro、MMLU和GSM8K上分别达到82.66%、89.96%和96.61%的准确率。在三个基准测试中,它持续优于自动图生成基线,并在报告的操作点上达到与辩论式系统相当的准确率,同时使用的令牌数最多减少5.69倍。这些结果表明,联合进化提示和通信结构为受控评估中成本感知和任务自适应的多智能体系统设计提供了一条实用途径。

英文摘要

Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.
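
The three-objective search described in this abstract rests on Pareto dominance. The sketch below shows one generic way to extract a Pareto front over accuracy (maximized), token cost (minimized), and structural complexity (minimized). It is not the authors' implementation: the `Candidate` type, the use of edge count as a complexity proxy, and all numbers are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float  # higher is better
    tokens: int      # lower is better
    edges: int       # lower is better (assumed proxy for structural complexity)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.tokens <= b.tokens and a.edges <= b.edges
    better = a.accuracy > b.accuracy or a.tokens < b.tokens or a.edges < b.edges
    return no_worse and better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Hypothetical prompt-plus-topology candidates.
candidates = [
    Candidate("chain", 0.80, 1200, 3),
    Candidate("star", 0.83, 2000, 4),
    Candidate("dense", 0.83, 5000, 10),
    Candidate("solo", 0.70, 400, 0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Here the dense topology is dropped because the star topology matches its accuracy with fewer tokens and fewer edges, while the remaining three candidates each represent a different trade-off.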

2605.27849 2026-05-28 cs.PL cs.AI cs.CL 版本更新

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

FPMoE:一种用于函数式代码生成的稀疏混合专家方法

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

发表机构 * GreenNode AI Hanoi University of Science and Technology(河内科学技术大学) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 针对LLM在函数式编程语言上性能差的问题,提出基于稀疏MoE架构的FPMoE模型,通过语言特定专家和共享专家分别消除干扰和捕获跨语言抽象,以3B活跃参数达到远超微调基线并匹配大模型的效果。

AI中文摘要

尽管基于LLM的代码生成取得了快速进展,但现有模型主要针对命令式语言进行训练,导致函数式编程语言(FPLs)如Haskell、OCaml和Scala长期未被充分探索,即使是前沿模型在FPLs上的表现也明显较差。微调是一种自然的补救措施,但我们的实验表明,每种语言的微调无法捕获共享的函数式抽象,而合并的多语言微调则引入了跨语言干扰。为了解决这个问题,我们引入了FPMoE,这是一个轻量级的开源代码生成模型,基于稀疏混合专家(MoE)架构,包含三个语言特定的路由专家(分别对应Haskell、OCaml和Scala)和一个共享专家,用于捕获跨语言的函数式模式,如单子推理和类型导向编程。这种设计同时解决了两种失败模式:专用专家消除了干扰,而共享专家保留了单语言模型遗漏的抽象。在FPEval上,FPMoE显著优于微调基线,并且仅使用3B活跃参数,即可匹配包括DeepSeek-Coder-6.7B、Qwen2.5-Coder-14B-Instruct和Qwen3-Coder-30B-A3B在内的更大模型的性能。

英文摘要

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
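
The combination of language-specific routed experts with one shared expert can be pictured with a toy layer. This is a sketch under stated assumptions, not the FPMoE architecture: the abstract does not say whether routing is a learned gate or a hard switch, so the code assumes hard routing by a language tag, and the "experts" are stand-in scaling functions rather than neural networks.

```python
from typing import Callable, Dict, List

Vector = List[float]
Expert = Callable[[Vector], Vector]


def scale(factor: float) -> Expert:
    """Stand-in for an expert network: multiplies every feature by a constant."""
    return lambda x: [factor * v for v in x]


# One routed expert per language (hypothetical weights), plus one shared expert.
routed_experts: Dict[str, Expert] = {
    "haskell": scale(2.0),
    "ocaml": scale(3.0),
    "scala": scale(4.0),
}
shared_expert: Expert = scale(0.5)


def moe_layer(x: Vector, language: str) -> Vector:
    """Sum the output of the selected language expert and the always-on shared expert."""
    routed = routed_experts[language](x)  # only one routed expert runs (sparse)
    shared = shared_expert(x)             # shared expert sees every input
    return [r + s for r, s in zip(routed, shared)]


out = moe_layer([1.0, -2.0], "ocaml")
print(out)
```

Because only one routed expert runs per input, the active parameter count stays well below the total, and the shared path is the only place where patterns common to all three languages can accumulate.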

2605.27846 2026-05-28 cs.AI 版本更新

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

EAPO: 面向开放问答的基于熵驱动的自适应正负样本加权策略优化

Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

发表机构 * Baidu Inc(百度公司)

AI总结 针对开放问答中强化学习固定权重的问题,提出基于熵驱动的自适应策略优化方法EAPO,通过动态调整正负样本权重平衡探索与稳定性,在医学问答数据集上显著提升多样性和稳定性。

AI中文摘要

大型推理模型通常通过可验证奖励的强化学习(RLVR)进行训练。然而,现有方法对正负样本采用固定权重,且结论难以推广到开放问答(QA)。本文系统研究了开放问答中强化学习正负样本的作用。我们提出了一种基于奖励均值的策略来区分正负样本,并观察到负样本主要控制响应多样性和性能上限,而正样本主要决定响应质量和收敛稳定性。基于这些观察,我们提出了EAPO,一种基于熵驱动的自适应策略优化方法,该方法根据当前策略熵与初始熵的比率自适应计算正样本的加权系数。在熵减阶段,分配给正样本的权重降低以保持探索,而在熵增阶段则放大以增强稳定性,从而缓解熵崩溃。在两个公开的开放医学问答数据集上的实验表明,EAPO在响应多样性和稳定性方面一致且显著优于固定权重基线。

英文摘要

Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficients of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
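
The entropy-ratio weighting described in this abstract can be illustrated with a small sketch. The abstract states only that the positive-sample weight depends on the ratio of current to initial policy entropy, shrinking as entropy falls and growing as it rises. The exact functional form is not given, so the clipped ratio below, the bounds `lo` and `hi`, and the toy distributions are all assumptions.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def positive_weight(current_entropy: float, initial_entropy: float,
                    lo: float = 0.1, hi: float = 2.0) -> float:
    """Assumed form: the entropy ratio itself, clipped to [lo, hi]."""
    ratio = current_entropy / initial_entropy
    return max(lo, min(hi, ratio))


# Toy policies over four actions.
h0 = entropy([0.25, 0.25, 0.25, 0.25])            # initial, maximally uncertain
h_collapsing = entropy([0.85, 0.05, 0.05, 0.05])  # later, entropy has dropped

w = positive_weight(h_collapsing, h0)
print(f"entropy ratio weight for positive samples: {w:.3f}")
```

With this form, a collapsing policy automatically down-weights positive samples, which leaves relatively more influence to the negative samples that the abstract associates with response diversity.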

2605.27845 2026-05-28 cs.SI cs.AI physics.soc-ph 版本更新

Snippet-Driven Supply Chain Discovery with LLMs: Scaling Visibility in China

基于片段的供应链发现:利用LLMs在中国实现规模化可见性

Hiroto Fukada, Takayuki Mizuno

发表机构 * Graduate Institute for Advanced Studies (SOKENDAI)(研究生高级研究学院(SOKENDAI)) National Institute of Informatics(国家信息研究所)

AI总结 提出一种基于网络搜索片段的方法,利用大语言模型构建供应链知识图谱,以低成本扩展对中国企业间关系的覆盖范围。

Comments 8 pages, 4 figures, 3 tables

AI中文摘要

金融和经济研究通常依赖于结构化的供应链披露和商业数据库。在中国,供应商-客户披露通常仅限于上市公司的重大合作伙伴,导致非上市公司和长尾企业间关系在结构化数据中记录不足。公共网络证据可以通过企业、政府和贸易媒体披露部分弥补这一差距;然而,大规模的全文本网络挖掘成本高昂,因为页面通常难以访问或使用大语言模型(LLM)处理成本过高。我们提出了一种基于片段的方法来构建供应链知识图谱(SCKG),以企业为节点,企业间关系为边。网络搜索片段是与查询相关的摘要,随搜索结果返回。我们将其用作基于LLM的关系提取的可扩展第一层证据。我们从提取效率和覆盖范围两方面评估该流程。在提取效率方面,穷举全文本分块发现的独特关系数量是片段的19.8倍,但需要的输入token数量是片段的251.2倍,且冗余度更高。在覆盖范围方面,我们使用130,685家中国企业作为搜索种子,涵盖截至2024年的上海/深圳上市公司和大型非上市公司。在上市公司子集中,生成的SCKG覆盖的企业数量是CSMAR披露基准的7.2倍,关系数量是9.3倍,同时揭示了重尾度分布模式。保留的来源元数据使SCKG成为可审计的披露数据库补充。

英文摘要

Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier-customer disclosure is typically limited to major partners of listed firms, leaving unlisted firms and long-tail inter-firm links poorly captured in structured data. Public web evidence can partly complement this gap through corporate, government, and trade-media disclosures; however, full-text web mining at scale is costly because pages are often inaccessible or expensive to process with large language models (LLMs). We propose a snippet-driven method for constructing a supply chain knowledge graph (SCKG), with firms as nodes and inter-firm relationships as edges. Web search snippets are query-biased summaries returned with search results. We use them as a scalable first-pass evidence layer for LLM-based relationship extraction. We evaluate the pipeline in terms of extraction efficiency and coverage. For extraction efficiency, exhaustive full-text chunking discovers 19.8× more unique relationships than snippets, but requires 251.2× more input tokens and yields higher redundancy. For coverage, we use 130,685 Chinese firms as search seeds, covering Shanghai/Shenzhen-listed firms and large unlisted firms as of 2024. In the listed-firm subset, the resulting SCKG covers 7.2× more firms and 9.3× more relationships than the CSMAR disclosure-based benchmark, while revealing heavy-tailed degree patterns. Retained provenance metadata make the SCKG an auditable complement to disclosure-based databases.
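
The efficiency trade-off in this abstract reduces to one division. The sketch below uses only the two multiples quoted above (19.8 and 251.2) and assumes both are measured against the same snippet baseline. The variable names are illustrative.

```python
# Multiples quoted in the abstract, full-text chunking relative to snippets.
full_text_relationship_multiple = 19.8  # unique relationships found
full_text_token_multiple = 251.2        # input tokens consumed

# Unique relationships per input token, snippets relative to full text.
snippet_yield_advantage = full_text_token_multiple / full_text_relationship_multiple

print(f"snippets yield ~{snippet_yield_advantage:.1f}x more unique "
      f"relationships per input token")
```

In other words, full-text chunking finds more relationships in total, but each input token spent on snippets returns roughly 12.7 times as many unique relationships, which is the basis for using snippets as a first pass.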

2605.27840 2026-05-28 eess.AS cs.AI cs.SD 版本更新

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

LoSATok: 用于跨域音频理解与生成的低维语义-声学分词器

Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Guoyang Zeng, Zhiyong Wu

发表机构 * Shenzhen International Graduate School, Tsinghua University, China(清华大学深圳国际研究生院,中国) ModelBest Inc., China(ModelBest公司,中国)

AI总结 提出低维音频分词器LoSATok,通过语义瓶颈压缩和双级语义监督,在紧凑潜空间中联合捕获语义和声学细节,提升扩散Transformer的生成性能。

AI中文摘要

音频分词器是统一音频理解和生成的基础。理解需要高层语义,而生成需要语义和声学细节。现有的统一分词器将两者共同编码到高维连续潜变量中,这增加了扩散Transformer(DiT)的建模负担。我们提出LoSATok,一种用于跨域音频理解和生成的低维音频分词器。受1280维语义编码器特征可压缩的观察启发,我们引入语义瓶颈(Semantic Bottleneck),将其压缩到128维,并通过提出的时间关系损失(time-relation loss)正则化以实现时间特征一致性。我们进一步设计了一种双级语义监督方法,利用高维和低维语义信号,使分词器能够在紧凑的潜空间中联合捕获语义和声学细节。在语音、音乐和通用音频上的实验表明,SemBo保持了强大的低维语义能力,LoSATok与几种语义表示相比保持了有竞争力的理解性能,同时在语音、音乐和音频生成上持续提升了DiT的建模性能。这些结果表明,LoSATok的低维表示能够有效支持音频理解和生成。我们的代码提供在https://github.com/wxzyd123/LoSATok。

英文摘要

Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.

2605.27836 2026-05-28 cs.CR cs.AI 版本更新

Symmetry Defeats Auditing

对称性击败审计

Nick Merrill, Zeke Medley

发表机构 * UC Berkeley Forecasting Research Institute(伯克利大学预测研究所) Northeastern University(东北大学)

AI总结 本文展示了对内省适配器(Shenoy et al., 2026)的一种攻击方法。

AI中文摘要

我们展示了对内省适配器(Shenoy et al., 2026)的一种攻击。

英文摘要

We demonstrate an attack on Introspection Adapters (Shenoy et al., 2026).

2605.27827 2026-05-28 cs.AI cs.CY 版本更新

Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions -- A Governance Framework for High-Stakes AI Systems

运营级AI部署保障:阈值敏感部署条件下的治理状态编排——高风险AI系统的治理框架

Khalid Adnan Alsayed

发表机构 * Ducaltus | AI Assurance & Governance, Newcastle upon Tyne, United Kingdom; School of Computing, Engineering & Digital Technologies, Teesside University, Middlesbrough, United Kingdom

AI总结 提出运营级AI部署保障(OADA)框架,通过部署保障分数、就绪分类、阈值稳定区、治理升级状态和修复感知保障推进等机制,将公平性分歧、子组不稳定性和阈值敏感性转化为部署导向的治理决策,以解决高风险AI系统中静态指标报告和事后审计的不足。

Comments 13 pages, 3 figures, governance-oriented framework for operational AI deployment assurance in high-stakes systems

AI中文摘要

AI治理框架日益强调高风险领域的公平性、透明度、问责制和生命周期风险管理。然而,许多当前方法仍停留在观察层面,依赖静态指标报告、事后审计和监控仪表板,而未能直接治理部署就绪性、修复进展、升级状态或保障驱动的部署控制。本文引入运营级AI部署保障(OADA),这是一个治理框架,用于将公平性分歧、子组不稳定性、阈值敏感性、修复结果和运营不确定性转化为面向部署的保障决策。基于先前关于公平性分歧指数(FDI)和FairRisk-FDI的工作,OADA将治理不确定性重新定义为AI部署管道中的运营问题,而非指标分歧的副产品。该框架引入了部署保障分数、部署就绪分类、阈值稳定区、治理升级状态和修复感知保障推进。这些构造通过将评估输出与部署状态解释、重新评估、升级和运营控制相连接,支持高风险环境中的生命周期导向治理决策。通过在面部识别系统上进行面向部署的评估,并将讨论扩展到作为代表性高风险领域的医疗AI,本文展示了系统在孤立的公平性或性能指标下可能看似可接受,同时仍表现出影响部署就绪性的不稳定性。所提出的框架将运营部署保障定位为评估与现实世界AI部署之间的治理层。

英文摘要

AI governance frameworks increasingly emphasize fairness, transparency, accountability, and lifecycle risk management in high-stakes domains. However, many current approaches remain observational, relying on static metric reporting, post-hoc auditing, and monitoring dashboards without directly governing deployment readiness, remediation progression, escalation states, or assurance-driven deployment control. This paper introduces Operational AI Deployment Assurance (OADA), a governance framework for translating fairness disagreement, subgroup instability, threshold sensitivity, remediation outcomes, and operational uncertainty into deployment-oriented assurance decisions. Building on prior work on the Fairness Disagreement Index (FDI) and FairRisk-FDI, OADA reframes governance uncertainty as an operational concern within AI deployment pipelines rather than a byproduct of metric disagreement. The framework introduces Deployment Assurance Scores, Deployment Readiness Classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression. These constructs support lifecycle-oriented governance decisions across high-stakes settings by connecting evaluation outputs to deployment-state interpretation, reassessment, escalation, and operational control. Through deployment-oriented evaluation across facial recognition systems, with discussion extended to healthcare AI as a representative high-stakes domain, the paper demonstrates how systems may appear acceptable under isolated fairness or performance metrics while still exhibiting instability that affects deployment readiness. The proposed framework positions operational deployment assurance as a governance layer between evaluation and real-world AI deployment.

2605.27824 2026-05-28 cs.AI cs.CL 版本更新

Revealing Algorithmic Deductive Circuits for Logical Reasoning

揭示逻辑推理的算法演绎电路

Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

发表机构 * Japan Advanced Institute of Science and Technology(日本科学技术先进研究院)

AI总结 本研究通过因果中介分析定位大语言模型中负责逻辑推理步骤的注意力头,发现少量专用头处理事实和规则信息,而高层头促进信息整合和全局推理策略的出现。

详情
AI中文摘要

最近的研究表明,通过在少样本学习设置中引入抽象描述图遍历算法和逐步推理的功能性符号表示,大型语言模型(LLMs)能够实现强大的推理性能。然而,目前尚不清楚LLMs如何仅从有限的示例中真正理解每个推理步骤的抽象含义以及整体算法。本文旨在定位负责单个推理步骤的注意力头,并刻画它们之间传输的信息类型。我们首先在符号辅助的思维链(CoT)提示框架下,将组成推理步骤与其对应的token logits对齐。我们的分析表明,引导推理过程的token位置与低置信度分数相关,这些低置信度分数是由满足演示中推理行为模式的约束引起的。然后,我们采用因果中介分析技术来识别负责这些模式的注意力头。此外,我们的发现表明,LLMs通过专门的注意力头(约占全部头的3%)为各个子推理任务检索事实和基于规则的信息,而较高层主要促进信息整合和全局推理策略(例如图遍历算法)的出现,这些策略协调多个中间推理步骤以解决整体任务。

英文摘要

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% total heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

2605.27823 2026-05-28 cs.CR cs.AI cs.CV 版本更新

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

解耦对抗性提示:基于语义图的鲁棒大语言模型安全防御

Xiang Fang, Wanlong Fang

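AI summary: Proposes the Adversarial Prompt Disentanglement (APD) framework, which uses mutual-information-based semantic decomposition, graph spectral analysis, and a lightweight classifier to identify and neutralize malicious components before the prompt is processed, reducing harmful outputs by over 85%.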

Comments Published in AAAI 2026

详情
AI中文摘要

大语言模型(LLMs)越来越容易受到利用语义歧义绕过安全机制的对抗性提示的攻击,导致有害或不适当的输出。此类攻击,包括越狱和提示注入,对安全关键应用中LLMs的完整性和可用性构成重大风险。本文提出对抗性提示解耦(APD)框架,一种新颖的防御机制,在输入提示被LLM处理之前主动识别并中和其中的恶意组件。APD框架集成了三项关键创新:(1)基于互信息的语义分解方法,用于分离对抗性和良性提示组件,确保统计独立性;(2)基于图的意图分类方法,利用谱分析检测提示语义中的恶意模式;(3)轻量级基于Transformer的分类器,在真实世界的毒性和越狱提示数据集上训练,实现高效准确的对抗性意图检测。在包含对抗性提示的多样化数据集上评估,APD展现出卓越的鲁棒性,将有害输出生成减少超过85%,同时对模型性能的影响可忽略不计。该框架的计算效率支持实时部署,使其成为保护LLMs的实用解决方案。我们的工作解决了机器学习安全中关于新型攻击和ML系统完整性方法的关键挑战,并提供了一种可扩展、符合伦理的防御手段来对抗基于提示的对抗性威胁。

英文摘要

Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security-critical applications. This paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in input prompts before they are processed by the LLM. The APD framework integrates three key innovations: (1) a mutual information-based semantic decomposition method to isolate adversarial and benign prompt components, ensuring statistical independence; (2) a graph-based intent classification approach that leverages spectral analysis to detect malicious patterns in prompt semantics; and (3) a lightweight transformer-based classifier trained on real-world datasets of toxic and jailbreaking prompts, enabling efficient and accurate adversarial intent detection. Evaluated on diverse datasets containing adversarial prompts, APD demonstrates superior robustness, reducing harmful output generation by over 85% while maintaining negligible impact on model performance. The framework's computational efficiency supports real-time deployment, making it a practical solution for securing LLMs. Our work addresses critical challenges in machine learning security, covering novel attacks and integrity methods for ML systems, and offers a scalable, ethically grounded defense against prompt-based adversarial threats.

2605.27820 2026-05-28 cs.AI 版本更新

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

EgoBench:面向工具使用智能体的交互式自我中心多模态基准

Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai, Yuqi Qing, Weiqiang Wang, Jian Liu

发表机构 * Ant Group(蚂蚁集团)

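AI summary: Introduces EgoBench, the first interactive egocentric multimodal benchmark, which uses 1,045 egocentric-video tasks and a user-agent-tool interactive environment to jointly evaluate visual perception, tool-augmented multi-hop reasoning, and dynamic interaction, revealing a performance ceiling for current state-of-the-art models (the best model averages 19.43% accuracy across the four scenarios).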

Comments 68 pages, 6 figures

详情
AI中文摘要

随着AI智能体在开放的真实世界环境中日益运作,它们需要多模态感知、多跳推理的工具调用以及与用户的动态交互的深度协同。然而,现有基准由于在设计严格耦合的多能力任务、模拟自然且任务受限的用户反馈以及确保动态交互的客观评估方面存在挑战,未能联合评估这些能力。为弥补这一差距,我们引入了EgoBench,这是首个面向工具使用智能体的交互式多模态基准。EgoBench包含覆盖四个日常场景的1,045个自我中心视频任务,以及一个用于评估的用户-智能体-工具交互环境。我们实现了一个三阶段协同流水线,通过该流水线,每个任务旨在强制视觉感知和工具增强多跳推理的联合应用。我们还在EgoBench中开发了一个多智能体模拟用户来评估智能体的交互能力,该模拟用户生成高保真、任务对齐的响应。此外,我们建立了一个确定性联合验证框架,通过基于过程和基于结果的等价性保证客观评估。在EgoBench上对八个最先进的视频-MLLM智能体进行基准测试揭示了严重的性能上限:最佳模型在最佳表现场景中仅达到30.62%的准确率,在所有四个场景中平均为19.43%。最后,我们进行了多维错误分析以解开失败模式,揭示了推动未来AI智能体发展的能力瓶颈。

英文摘要

As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench to evaluate agents' interaction capabilities, which generates high-fidelity, task-aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

2605.27819 2026-05-28 cs.LG cs.AI 版本更新

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

ReSAE: 用于多层Transformer干预的残差化稀疏自编码器

Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

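AI summary: To address the redundancy and interaction problems that cross-layer coupling causes for multi-layer sparse autoencoders (SAEs) in transformers, proposes Residualized Sparse Autoencoders (ReSAEs), which fit affine maps between layers and train SAEs on the residuals, reducing decoder redundancy and improving cross-entropy recovery under multi-layer replacement.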

详情
AI中文摘要

稀疏自编码器通常逐层训练,尽管Transformer残差流激活在深度上强烈耦合。这对多层干预造成实际问题:不同层的字典可能将容量用于表示相同的向前传递信息,同时替换多层可能产生单层行为无法预测的交互。我们引入残差化稀疏自编码器(ReSAE),它在选定层之间拟合仿射映射,并在未解释的残差上训练后续层的SAE,而非完整激活。重构通过拟合的仿射链映射回原始激活空间,因此ReSAE可以像普通SAE一样使用相同的干预协议进行评估。在Pythia-1.4B和Gemma-2-9B上,残差化减少了解码器冗余,并在大多数测试设置中改进了稀疏探测和定向扰动。尽管重构的原始激活方差较少,ReSAE在多层替换下恢复了更多Transformer交叉熵。这一增益在教师强制和足够的在线稀疏性下最为明显,表明ReSAE保留了与模型下游计算最相关的激活成分。这些结果表明,去除线性可预测的跨层结构是多层SAE干预的有用默认设置。

英文摘要

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more transformer cross entropy under multi-layer replacement. This gain is clearest under teacher-forcing and at sufficient sparsity online, indicating that ReSAEs preserve the components of the activation most relevant to the model's downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.
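
The residualization step described in this abstract can be illustrated with a toy sketch. It assumes scalar activations and an ordinary least-squares affine fit, and it leaves the SAE itself out (the residuals are passed through unchanged); the function names and data are hypothetical and are not taken from the paper.

```python
# Toy sketch of residualization between two layers (scalar activations).
# Assumption: ordinary least squares stands in for the paper's affine fit,
# and the SAE that would be trained on the residuals is omitted.

def fit_affine(xs, ys):
    """Least-squares fit of ys ~ a * xs + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def residualize(xs, ys, a, b):
    """Part of the later-layer activation not explained by the affine map."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]

def map_back(xs, residual_recon, a, b):
    """Map a (reconstructed) residual back into the original activation space."""
    return [a * x + b + r for x, r in zip(xs, residual_recon)]

layer_l = [0.0, 1.0, 2.0, 3.0]        # hypothetical earlier-layer activations
layer_next = [1.1, 2.9, 5.2, 6.8]     # hypothetical later-layer activations
a, b = fit_affine(layer_l, layer_next)
residuals = residualize(layer_l, layer_next, a, b)
restored = map_back(layer_l, residuals, a, b)
```

With a perfect residual reconstruction, `restored` matches `layer_next`; in practice an SAE would reconstruct the residuals only approximately, and the error would be measured after mapping back.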

2605.27817 2026-05-28 cs.RO cs.AI cs.CV cs.LG 版本更新

Turning Video Models into Generalist Robot Policies

将视频模型转化为通用机器人策略

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

发表机构 * MIT(麻省理工学院) CMU(卡内基梅隆大学) Amazon FAR(亚马逊公司)

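AI summary: Proposes VERA, a decoupled video-to-action policy that combines an action-free video world model with an inverse dynamics model based on the robot Jacobian to achieve zero-shot robot control across embodiments.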

Comments project page: https://vera.csail.mit.edu

详情
AI中文摘要

视频生成模型已成为一种有前景的机器人骨干网络,能够生成描绘跨本体和环境完成复杂任务的视频。最近的工作提出了机器人基础模型,通过使用带有动作标签的数据微调视频模型,联合预测未来观测和动作。在本文中,我们测试了一种替代方法的极限:保持视频规划器不变,同时训练一个特定本体的逆动力学模型(IDM)。这种解耦带来了几个自然的好处:视频规划器保持本体无关,不同的视频模型可以轻松互换而无需重新训练IDM,并且IDM可以独立地使用现成的自对弈数据进行训练。我们提出了一种闭环的视频到动作策略,该策略将无动作视频世界模型与基于机器人本体雅可比矩阵的精心设计的IDM相结合。我们证明了我们的IDM设计既数据高效又可扩展到高维动作空间。我们将该策略命名为视频到具身机器人动作模型(VERA),在模拟和真实世界基准测试中取得了强劲的性能,包括零样本的Panda机械臂操作和16自由度Allegro灵巧手立方体重新定向。通过将相同的视频规划器与不同的本体特定IDM配对,可以在多个本体上使用。我们的结果表明,解耦的视频规划加上忠实的视频到动作翻译是实现零样本、跨本体和可泛化机器人控制的可行替代途径。更多结果请访问我们的项目网站:https://vera.csail.mit.edu。

英文摘要

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

2605.27813 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

残差化时间稀疏自编码器用于解释扩散模型

Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou, Mohsen Imani

发表机构 * University of California, Irvine(加州大学 Irvine 分校)

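AI summary: Proposes residualized temporal sparse autoencoders, which learn interpretable features in diffusion activation trajectories from the residuals of linear predictions between denoising timesteps, and evaluates them on Stable Diffusion 1.5.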

详情
AI中文摘要

文本到图像扩散模型通过迭代去噪过程生成图像,因此内部神经层产生激活轨迹而非单一静态表示。稀疏自编码器(SAE)最近被用于将扩散激活分解为可解释的特征方向,但大多数方法在单个时间步分析激活或基于时间条件,而非直接从完整激活轨迹中学习。在这项工作中,我们引入了用于扩散激活轨迹的残差化时间SAE。我们收集去噪时间上的激活,拟合相邻时间步之间的线性预测器,并使用初始激活以及这些线性动力学未解释的残差分量来表示每个轨迹。在这种残差化表示上训练SAE鼓励稀疏潜在变量捕捉超出线性可预测范围的结构。残差化解码器方向可以映射回激活空间,使得每个潜在变量可以作为去噪时间上的特征轨迹进行分析。通过在Stable Diffusion 1.5上的重建与消融研究、时空特征分析和定性引导实验,我们表明残差化时间SAE为研究时间结构化的扩散激活提供了一个有用的框架。

英文摘要

Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of activations rather than single static representations. Sparse autoencoders (SAEs) have recently been used to decompose diffusion activations into interpretable feature directions, but most approaches analyze activations at individual timesteps or condition on time rather than learning directly from full activation trajectories. In this work, we introduce residualized temporal SAEs for diffusion activation trajectories. We collect activations across denoising time, fit linear predictors between neighboring timesteps, and represent each trajectory using an initial activation together with residual components not explained by these linear dynamics. Training an SAE on this residualized representation encourages sparse latents to capture structure beyond what is linearly predictable. The residualized decoder directions can be mapped back into activation space, allowing each latent to be analyzed as a feature trajectory over denoising time. Through reconstruction and ablation studies, spatiotemporal feature analysis, and qualitative steering experiments on Stable Diffusion 1.5, we show that residualized temporal SAEs provide a useful framework for studying temporally structured diffusion activations.
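
The trajectory representation described above (an initial activation plus per-step residuals) can be sketched as an encode/decode pair. This is a minimal illustration with scalar activations; the per-step predictor coefficients are hypothetical placeholders for predictors that would be fit on real data, and no SAE is trained here.

```python
# Sketch: represent a scalar activation trajectory as (initial, residuals),
# where each residual is what a per-step linear predictor fails to explain.
# Assumption: predictors are given as (slope, intercept) pairs; values are made up.

def encode_trajectory(traj, predictors):
    """Return the initial activation and the per-step prediction residuals."""
    residuals = []
    for t, (slope, intercept) in enumerate(predictors):
        predicted = slope * traj[t] + intercept
        residuals.append(traj[t + 1] - predicted)
    return traj[0], residuals

def decode_trajectory(initial, residuals, predictors):
    """Roll the linear dynamics forward, adding each residual back in."""
    traj = [initial]
    for (slope, intercept), r in zip(predictors, residuals):
        traj.append(slope * traj[-1] + intercept + r)
    return traj

predictors = [(0.9, 0.1), (0.8, 0.0), (0.5, 0.2)]  # hypothetical per-step fits
traj = [1.0, 1.2, 0.9, 0.7]                        # hypothetical activations
initial, residuals = encode_trajectory(traj, predictors)
decoded = decode_trajectory(initial, residuals, predictors)
```

Decoding with the exact residuals recovers the trajectory, which is what allows a feature learned in residual space to be read as a trajectory over denoising time.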

2605.27811 2026-05-28 cs.AI 版本更新

Constrained Auto-Bidding via Generative Response Modeling

通过生成式响应建模实现约束自动出价

Eunseok Yang, Xingdong Zuo, Kyung-Min Kim

发表机构 * NAVER Corporation(NAVER公司)

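AI summary: Proposes the Generative Response Model (GRM), which predicts future traffic and aggregate cost/value curves and pairs them with a lightweight analytic controller to achieve stable, efficient auto-bidding under budget and ratio constraints.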

详情
AI中文摘要

自动出价系统旨在预算约束和成本每次获取等比率目标下,最大化广告主在长期内的价值,然而未来流量和拍卖动态是非平稳且不确定的。现有方法面临明显局限性:基于控制的节奏方法对偏差做出反应但无法预测未来条件,而强化学习和生成方法将约束纳入奖励信号,掩盖了违规并在分布偏移下退化。我们将学习目标从动作转向响应,提出生成式响应模型(GRM),这是一个基于历史条件的序列模型,联合预测未来流量和作为单一出价乘数函数的水平聚合成本/价值曲线。我们证明,在温和的单调性条件下,相对于完全逐拍控制的最优性差距受逐拍边际价值-成本离散度的限制。给定预测响应,一个轻量解析控制器通过一维求根步骤强制执行每个活动约束。我们证明该控制器对于单乘数问题是精确的,并根据预测误差限制了滚动时域重规划下的约束违规。在AuctionNet上的实验表明,与现有基线相比,GRM提高了约束稳定性和总体得分。

英文摘要

Auto-bidding systems aim to maximize advertiser value over long horizons under budget constraints and ratio targets such as cost-per-acquisition, yet future traffic and auction dynamics are non-stationary and uncertain. Existing approaches face distinct limitations: control-based pacing reacts to deviations but cannot anticipate future conditions, while RL and generative methods fold constraints into reward signals, obscuring violations and degrading under distribution shift. We shift the learning target from actions to responses with the Generative Response Model (GRM), a history-conditioned sequence model that jointly predicts future traffic volume and horizon-aggregate cost/value curves as functions of a single bid multiplier. We show that under mild monotonicity conditions, the optimality gap relative to full per-tick control is bounded by the dispersion of per-tick marginal value-per-cost. Given predicted responses, a lightweight analytic controller enforces each active constraint via a 1D root-finding step. We prove this controller is exact for the single-multiplier problem and bound constraint violations under receding-horizon replanning in terms of prediction error. Experiments on AuctionNet show that GRM improves constraint stability and overall score compared to existing baselines.
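
The 1D root-finding step mentioned in this abstract can be sketched with bisection. The sketch assumes a predicted cost curve that is non-decreasing in the bid multiplier and handles only a budget constraint; the curve, the search bounds, and the function names are hypothetical and are not the paper's controller.

```python
# Sketch: find the largest bid multiplier whose predicted cost fits the budget.
# Assumption: cost_curve is non-decreasing in the multiplier and cost_curve(lo) <= budget.

def solve_multiplier(cost_curve, budget, lo=0.0, hi=10.0, tol=1e-6):
    """Bisection on a monotone predicted cost curve."""
    if cost_curve(hi) <= budget:
        return hi  # constraint is inactive over the whole search range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cost_curve(mid) <= budget:
            lo = mid   # still feasible: move the lower bound up
        else:
            hi = mid   # infeasible: move the upper bound down
    return lo          # lo is always feasible

def predicted_cost(multiplier):
    """Hypothetical horizon-aggregate cost curve."""
    return 100.0 * multiplier ** 2

m = solve_multiplier(predicted_cost, budget=400.0)
```

Returning the lower bound keeps the result on the feasible side of the constraint. A ratio target such as cost-per-acquisition could be handled the same way by bisecting on predicted cost divided by predicted value, provided that ratio is also monotone in the multiplier.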

2605.27805 2026-05-28 cs.CL cs.AI 版本更新

ChildEval: When large language models meet children's personalities

ChildEval:当大语言模型遇到儿童个性

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng

发表机构 * JIUTIAN Research(九天研究院) China Mobile(中国移动) Beijing, China(北京,中国)

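AI summary: Introduces the ChildEval benchmark, which synthesizes persona profiles and preferences (expressed explicitly or implicitly) for children aged 3-6 to evaluate how well large language models infer and follow children's preferences in long conversations; experiments suggest that fine-tuning can improve child-centered performance.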

Comments 8 pages of main text (ACL Findings format), with references and appendix

详情
AI中文摘要

虽然大语言模型(LLM)使得个性化聊天机器人成为可能,但它们在儿童中心个性化方面的有效性仍不明确,因为缺乏对儿童特定偏好的系统评估。为填补这一空白,我们引入了ChildEval,一个用于评估LLM在长上下文对话中推断和遵循儿童中心偏好能力的基准。ChildEval包含29K个3-6岁儿童的合成个性档案,提供相对静态的背景信息。每个个性档案关联一个儿童偏好——可能与个性一致、冲突或独立——通过单句显式表达或6-10轮对话隐式表达。显式和隐式偏好旨在反映相同的潜在偏好,但表达方式不同,捕捉偏好表达的动态方面而非静态个性的变化。该基准涵盖五个顶层类别和十四个子类别,覆盖儿童的日常生活和发展。我们进一步提出了细粒度、以儿童为中心的评估协议,以系统评估开源LLM。实验结果表明,不同的个性化表示如何影响LLM的响应,并表明在ChildEval上进行微调可以提升儿童中心性能。我们的代码和数据集可在https://github.com/ziyanluo/ChildEval获取。

英文摘要

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference (which may align with, conflict with, or be independent of the persona) that is expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.

2605.27799 2026-05-28 cs.AI eess.SP 版本更新

GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

GraD-IBD:基于诊断轨迹的图表示学习用于炎症性肠病的早期检测

Leo Y. Li-Han, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad

发表机构 * Department of Surgery, Mayo Clinic, Rochester, MN, USA(外科部,梅奥诊所,罗切斯特,MN,美国) Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA(健康保健交付科学中心,梅奥诊所,罗切斯特,MN,美国) Division of Hepatobiliary and Pancreas Surgery, Mayo Clinic, Rochester, MN, USA(肝胆胰外科部,梅奥诊所,罗切斯特,MN,美国) Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA(人工智能与信息学部,梅奥诊所,罗切斯特,MN,美国)

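AI summary: Proposes GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as temporally directed graphs and uses a context-aware, time-decay message passing mechanism to reduce complexity and improve inflammatory bowel disease detection.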

详情
AI中文摘要

国际疾病分类(ICD)是一种全球公认的编码系统,记录每次患者就诊的诊断事件,为各种临床任务提供标准化的数据基础。然而,ICD代码序列的不规则性和层次性给基于N-D格子的序列建模方法带来了挑战,导致模型设计过于复杂。在本文中,我们提出了GraD-IBD,一种图诊断模型,将纵向ICD轨迹重构为按就诊分桶的时间有向图,以检测炎症性肠病(IBD)的风险。我们开发了一种新颖的上下文感知时间衰减消息传递机制,以捕获时间依赖性并降低模型复杂度。使用真实世界临床数据集的实验结果表明,与最先进的方法相比,IBD检测性能一致且稳健地提升,同时与序列模型相比,计算复杂度显著降低。这些发现凸显了图表示学习在从纵向ICD诊断代码中进行高效、可扩展且准确的疾病风险预测方面的潜力。

英文摘要

International Classification of Diseases (ICD) is a globally recognized coding system that records diagnostic events during each patient encounter, providing a standardized data foundation for various clinical tasks. However, the irregular and hierarchical nature of ICD code sequences poses challenges for N-D lattice-based sequential modeling methods, leading to overly complex model designs. In this paper, we propose GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as visit-bucketized, temporally directed graphs to detect the risk of inflammatory bowel disease (IBD). A novel context-aware, time-decay message passing mechanism was developed to capture temporal dependencies while reducing model complexity. The experimental results using a real-world clinical dataset demonstrated consistent and robust improvements in IBD detection over state-of-the-art methods, with significant reductions in computational complexity compared to sequential models. These findings highlight the potential of graph representation learning to enable efficient, scalable, and accurate disease risk prediction from longitudinal ICD diagnosis codes.
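
One way to picture time-decay message passing over visit-level nodes is to down-weight messages from older visits. The sketch below assumes an exponential decay in elapsed days and scalar node features; the abstract does not specify the decay form or the context-aware component, so both the formula and the parameter names here are assumptions.

```python
import math

# Sketch: aggregate messages from earlier visits into a target visit,
# weighting each by exp(-decay_rate * elapsed_days).
# Assumption: exponential decay and scalar features; not the paper's exact mechanism.

def time_decay_aggregate(target_day, sources, decay_rate=0.1):
    """sources: list of (visit_day, feature) pairs with visit_day <= target_day."""
    total = 0.0
    for visit_day, feature in sources:
        weight = math.exp(-decay_rate * (target_day - visit_day))
        total += weight * feature
    return total

# A visit on day 0 and a visit on day 10, both sending to a node at day 10.
message = time_decay_aggregate(10, [(0, 1.0), (10, 2.0)])
```

The same-day message keeps its full weight, while the ten-day-old message is scaled by `exp(-1)`.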

2605.27789 2026-05-28 cs.AI cs.CL 版本更新

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

固定预算、聚类感知的 LLM-as-a-Judge 评估标准:多跳 RAG 压力测试

Camilo Chacón Sartori, José H. García

发表机构 * Catalan Institute of Nanoscience and Nanotechnology(加泰罗尼亚纳米科学与纳米技术研究所)

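AI summary: To address statistical bias in evaluating multi-hop RAG systems, proposes a fixed-budget, cluster-aware standard for LLM-as-a-judge comparisons and stress-tests it on 400 multi-hop questions with GADMEC, a genetic-algorithm evidence selector, showing that cluster-aware inference changes the empirical conclusions.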

详情
AI中文摘要

检索增强生成(RAG)系统通常通过让大型语言模型(LLM)法官判断哪个答案更好来进行比较。对于多跳 RAG,这已成为一个测量问题,与建模问题同等重要:相同的分数可以反映检索质量、答案长度、词汇重叠或忽略聚类数据的统计检验。我们询问当这些选择被明确时会发生什么。 我们提出了 RAG 中 LLM-as-a-Judge 比较的最小测量标准。该标准固定了 top-100 候选池、证据预算、答案上限、生成器和提示;它还要求预先注册假设、聚类感知推断、在可行时进行精确的聚类符号翻转检验以及第二法官复制。聚类基准可能夸大进展;该领域应采用此标准。我们使用遗传算法解码器进行多跳证据组合(GADMEC),一种进化证据选择器,在计算机科学/机器学习(CS/ML)和材料科学领域的 400 个多跳问题上对其进行压力测试。该协议改变了实证故事。二项检验使所有四个语义基线比较看起来显著;聚类感知推断只留下一个 Bonferroni 显著结果。在相同预算下,BM25 优于纯语义 GADMEC,而词汇-语义混合在 CS/ML 中恢复并缩小了材料科学差距。

英文摘要

Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
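
An exact cluster sign-flip test can be sketched as follows. This is one common form of the test, assuming each cluster is summarized by the sum of its paired judge outcomes (for example +1 for a win and -1 for a loss per question); the data are invented, and the paper's exact statistic may differ.

```python
from itertools import product

# Sketch: exact two-sided sign-flip test at the cluster level.
# Assumption: under the null, each cluster's summed difference is symmetric
# around zero, so its sign can be flipped as a unit.

def cluster_sign_flip_pvalue(cluster_sums):
    """Enumerate all 2^k cluster-level sign assignments and compare to the observed total."""
    observed = abs(sum(cluster_sums))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * c for s, c in zip(signs, cluster_sums)))
        total += 1
        if stat >= observed:
            extreme += 1
    return extreme / total

cluster_sums = [3, 2, 4, 1]   # hypothetical per-cluster win-minus-loss totals
p_value = cluster_sign_flip_pvalue(cluster_sums)
```

With only four clusters the smallest achievable two-sided p-value is 2/16, however many questions each cluster contains. That illustrates why a test that treats every question as independent can look far more significant than a cluster-aware one. Full enumeration is only feasible for a modest number of clusters.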

2605.27785 2026-05-28 cs.AI cs.DB 版本更新

A Query Engine for the Agents

面向智能体的查询引擎

Kenny Daniel

发表机构 * Hyperparam(Hyperparam公司)

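AI summary: Presents Hyperparam, a lightweight, JS-native query engine with async SQL execution and LLM UDF support for analyzing unstructured text inside AI applications, and reports that it outperforms DuckDB-WASM.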

Comments 4 pages, 1 figure, 3 tables

详情
AI中文摘要

当今生产环境中增长最快的数据是非结构化文本:智能体轨迹、聊天日志、推理链、模型输出。人们想要分析这些数据,而有价值的问题(例如“显示智能体在哪里感到困惑”)无法仅通过SQL回答,因为如果没有模型参与查询路径,文本是不可查询的。这种分析自然发生在新一类AI应用中(如Claude Code、Cursor、Claude Desktop、浏览器内智能体),这些应用在客户端运行,并在同一进程中托管人类用户和LLM智能体。这些应用越来越需要处理数据,但数据湖仓的读取路径在JS运行时中难以使用:Spark、Trino和托管数据仓库不适合。为了构建这种新型AI数据应用,引擎的三个属性成为首要考虑:JS原生分发,能够直接嵌入应用已运行的运行时;足够小的包体积,以便在冷标签页或每轮智能体沙箱中分发;以及一种将分析操作符与基于模型的文本解释交错的方法。我们提出Hyperparam,三个总大小低于70 KB的开源JavaScript库(Hyparquet、Squirreling、Icebird),它们直接从对象存储读取Parquet和Apache Iceberg,并通过基于单元格的异步原生SQL执行满足第三个属性,因此昂贵的单元格仅在下游操作符需要时才触发。Squirreling在过滤受限查询上运行LLM形状的异步UDF比DuckDB-WASM快300倍以上(排序受限查询快192倍),并以低三分之二的成本完成十项智能体分析师任务。我们认为数据工程作为一个学科需要更新,以适应现已投入生产的AI原生客户端应用以及与其用户协作的智能体。

英文摘要

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.
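
The idea that expensive cells fire only when downstream operators demand them can be illustrated with a small asyncio sketch. It is written in Python for illustration and does not use or imitate the Squirreling API; the UDF, the query shape, and all names are hypothetical.

```python
import asyncio

# Sketch: a filter with an expensive async UDF followed by a LIMIT.
# Rows are pulled one at a time, so the UDF stops being called once the
# limit is satisfied. fake_llm_udf is a stand-in for a real model call.

calls = 0

async def fake_llm_udf(text):
    """Hypothetical expensive per-cell model call."""
    global calls
    calls += 1
    await asyncio.sleep(0)   # placeholder for network latency
    return "confused" in text

async def lazy_query(rows, limit):
    """Roughly: SELECT text FROM rows WHERE udf(text) LIMIT limit."""
    out = []
    for row in rows:
        if len(out) >= limit:
            break            # downstream demand is met: no more cells fire
        if await fake_llm_udf(row):
            out.append(row)
    return out

rows = [
    "agent confused by schema",
    "all good",
    "confused again",
    "fine",
    "confused thrice",
]
result = asyncio.run(lazy_query(rows, limit=2))
```

Here the UDF runs on three of the five rows. An engine that evaluates the UDF column eagerly would call it on every row before applying the limit.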

2605.27784 2026-05-28 cs.AI 版本更新

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

诊断LLM代理中实时策略内指令冲突的见证解析轮廓

Lu Yan, Xuan Chen, Xiangyu Zhang

发表机构 * Purdue University(普渡大学)

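AI summary: Proposes the WIRE pipeline, which extracts rules, encodes them as PyRule clauses, detects conflicts, and generates witness instances to diagnose conflicts between rule pairs within a single LLM-agent prompt policy, finding that 64.6% of judgeable trials violate at least one source rule.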

详情
AI中文摘要

LLM代理受长期自然语言提示策略的约束,但个别合理的常设规则可能以未经检查的方式相互作用。我们研究实时策略内规则冲突诊断:在单个提示策略内找到可以共同治理现实状态的规则对,并测量模型在响应或工具动作中如何解决该压力。我们引入WIRE,一个见证策略内规则评估管道。WIRE提取基于源的规则,将其编码为PyRule子句,使用可满足性检查保留同表面硬碰撞候选,将这些候选实现为具体的共同治理见证,并根据原始源规则文本判断模型输出。在六个公开提示策略中,WIRE提取276条源规则和560个原子子句,分类30,944个策略内子句对比较,保留170个编码的硬碰撞候选源规则对,并将其实现为1,402个具体见证。在仅策略评估中,这些见证产生13,335次后生成试验,其中两条源规则共同治理且两个合规标签均可判断。仅35.4%处于联合合规状态;64.6%违反至少一条治理源规则。这些轮廓是WIRE选择候选的条件诊断,而非部署频率或因果超额失败估计,但它们揭示了不同的策略、模型和工具动作解析模式。

英文摘要

LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials where both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governed source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not deployment-frequency or causal excess failure estimates, but they reveal distinct policy, model, and tool-action resolution patterns.

2605.27773 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

模型是否知道它们为何改变主意?知识冲突下思维链的可解释性与忠实性

Pruthvinath Jeripity Venkata

发表机构 * Independent Researcher(独立研究员)

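AI summary: Introduces introspective faithfulness to study whether language models' chain-of-thought reasoning faithfully reflects their decision mechanism under knowledge conflict, finding that CoT is highly stable across opposite decisions while self-rated confidence carries a faint but genuine signal.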

Comments 12 pages, 8 tables, 3 appendices

详情
AI中文摘要

当语言模型看到与其训练知识相矛盾的文档时,它必须做出选择:遵循文档还是相信自己。先前的工作证明这种选择取决于事实的知名程度。我们问:模型的思维链(CoT)推理是否忠实地报告了这一机制?我们引入了内省忠实性,并在200个问题、8个模型和4种提示条件下进行了测试。我们发现CoT推理在相反决策下高度稳定:翻转对保留了96%的相同答案相似度(d=0.34;通过ROUGE-L确认,d=0.45)。然而,自我评定的置信度携带微弱的真实信号:对于实体知名度无信息量的冷门事实,置信度仍能预测决策(p<0.001),并追踪项目级知识(r=0.134)。GPT-4o是唯一具有统计上可靠的推理-决策耦合的模型。Claude Sonnet 4.6显示出最宽的置信度范围(SD=1.39),但汇总相关性接近零,因为置信度-决策关系在不同条件下反转;温度消融实验证实这是模型特有的。内部思考令牌比面向用户的CoT显示出更大的决策敏感性(p=0.033)。CoT分解为决策不变的知识展示(约96%)和一层薄弱的置信度层,后者带有微弱但真实的信号。对于监控:读取置信度,而不是论证。

英文摘要

When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model's chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.
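
The abstract reports that the similarity result was confirmed with ROUGE-L. As a reminder of what that metric measures, the sketch below computes a ROUGE-L F1 score from the longest common subsequence of whitespace tokens. The tokenization and the unweighted F1 are simplifying assumptions, and the example sentences are invented rather than drawn from the paper's data.

```python
# Sketch: ROUGE-L as an F1 score over the longest common subsequence (LCS).
# Assumption: whitespace tokenization and equal weight on precision and recall.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision and recall between two texts."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
```

The two sentences share a five-token subsequence out of six tokens each, so precision, recall, and F1 all equal 5/6. Two reasoning traces that argue for opposite answers could score similarly high if they differ in only a few tokens.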

2605.27768 2026-05-28 cs.AI 版本更新

Auditable Decision Models with Learned Abstention and Real-Time Steering

具有学习弃权与实时引导的可审计决策模型

Sankaranarayanan Palamadai Chandrasekaran

发表机构 * Simple Machine Mind(简单机器思维)

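AI summary: Proposes EvaluatorDPT, a transformer-encoder model that learns three-way YES/NO/TBD decisions in which TBD is learned as a deferral output, with inference-time threshold control and auxiliary semantic signals supporting auditable decision control.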

Comments 21 pages, 5 figures

详情
AI中文摘要

生产AI系统通常在证据不完整、冲突或不足的情况下运行。强制分类器将此类情况压缩为动作标签,而生成系统可能产生难以解释为可审计执行决策的输出。我们研究AI系统的操作决策控制,其中不确定性必须明确可路由、受策略约束且可审计,而不是隐藏在强制预测或自由形式生成中。我们提出EvaluatorDPT,一种有界决策控制模型,预测YES、NO或TBD,其中TBD被学习为延迟结果,而非仅作为事后置信规则添加。该模型使用Transformer编码器,带有主要的有界决策头和用于价值观及情绪/情感的辅助通道。接口在形式上与领域无关:部署领域提供证据和策略阈值,而模型发出有界分布,可在推理时通过记录的操作阈值以及(经验证后)辅助语义信号进行控制。对于评估的模型版本,我们报告了在保留验证集和测试集上的决策性能;由于此评估中禁用了情感头,因此省略了辅助情感指标。在保留测试集(n=44,597)上,模型达到准确率=0.8260,宏F1=0.8252,各类别F1分别为0.8314(YES)、0.8486(NO)和0.7956(TBD)。评估记录还包括校准证据(验证集上ECE=0.0338)、阈值扫描输出、多种子稳定性检查、混淆矩阵和可重复性命令。我们的主要贡献是一个有界执行接口,其中延迟被学习,推理时路由保持可检查,辅助信号提供可审计行为控制的路径,且评估证据支持外部审查。

英文摘要

Production AI systems often operate with incomplete, conflicting, or insufficient evidence. Forced classifiers collapse such cases into action labels, while generative systems can produce outputs that are difficult to interpret as auditable execution decisions. We study operational decision control for AI systems, where uncertainty must be explicitly routable, policy-governed, and auditable rather than hidden inside forced predictions or free-form generation. We present EvaluatorDPT, a bounded decision-control model that predicts YES, NO, or TBD, where TBD is learned as a deferral outcome rather than added only as a post-hoc confidence rule. The model uses a transformer encoder with a primary bounded-decision head and structured auxiliary channels for values and emotions/sentiments. The interface is domain-agnostic in form: a deployment domain supplies evidence and policy thresholds, while the model emits a bounded distribution that can be controlled at inference time through recorded operating thresholds and, when validated, auxiliary semantic signals. For the evaluated model version, we report decision performance on held-out validation and test splits; auxiliary emotion metrics are omitted because the emotion head is disabled for this evaluation. On the held-out test split (n=44,597), the model achieves Accuracy = 0.8260 and Macro F1 = 0.8252, with per-class F1 of 0.8314 (YES), 0.8486 (NO), and 0.7956 (TBD). The evaluation record also includes calibration evidence (ECE = 0.0338 on validation), threshold-sweep outputs, multi-seed stability checks, confusion matrices, and reproducibility commands. Our main contribution is a bounded execution interface in which deferral is learned, inference-time routing remains inspectable, auxiliary signals provide a path to auditable behavior control, and evaluation evidence supports external review.
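
Two points in this abstract lend themselves to a short sketch: routing a bounded YES/NO/TBD distribution through recorded operating thresholds, and checking that the reported Macro F1 equals the unweighted mean of the per-class F1 scores. The routing rule and threshold values below are assumptions for illustration, not the paper's interface; the F1 figures are the ones quoted in the abstract.

```python
# Sketch: threshold-based routing over a three-way decision distribution.
# Assumption: act only when YES or NO clears its threshold, otherwise defer (TBD).

def route(probs, yes_threshold=0.7, no_threshold=0.7):
    """probs: dict with keys 'YES', 'NO', 'TBD' summing to 1. Thresholds are hypothetical."""
    if probs["YES"] >= yes_threshold:
        return "YES"
    if probs["NO"] >= no_threshold:
        return "NO"
    return "TBD"

decision_confident = route({"YES": 0.80, "NO": 0.05, "TBD": 0.15})
decision_uncertain = route({"YES": 0.50, "NO": 0.30, "TBD": 0.20})

# Macro F1 is the unweighted mean of per-class F1 (values from the abstract).
per_class_f1 = {"YES": 0.8314, "NO": 0.8486, "TBD": 0.7956}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
```

Rounded to four decimals, `macro_f1` is 0.8252, which matches the reported value. Logging the thresholds next to each routed decision is what would make the routing inspectable after the fact.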

2605.27767 2026-05-28 cs.CL cs.AI cs.LG 版本更新

UniMaia: Steering Chess Policies with Language for Human-like Play

UniMaia:用语言引导国际象棋策略以实现类人玩法

Sherman Siu, Lesley Istead

发表机构 * University of Waterloo(滑铁卢大学) Carleton University(卡尔顿大学)

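AI summary: Proposes UniMaia, a framework that applies prompt-conditioned policy modulation to a frozen Lc0-based chess policy network through a parameter-efficient text encoder and a ControlNet-style conditioning mechanism, enabling semantic control (such as opening selection and player strength) while preserving the pretrained policy representations; it also builds a large-scale metadata-augmented Lichess dataset and a semi-automated prompt-generation pipeline, and achieves state-of-the-art or competitive results on multiple benchmarks.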

详情
AI中文摘要

大型语言模型的最新进展使得自然语言能够作为控制复杂系统的灵活接口,但通常以大规模多模态训练或弱化领域特定归纳偏差为代价。在结构化决策领域(如国际象棋)中,专门的策略网络表现强劲但缺乏语义可控性,而提示条件语言模型更灵活但通常领域基础较弱。我们提出UniMaia,一个用于提示条件策略调制的框架,它使用参数高效文本编码器和ControlNet风格的调节机制来适配基于Lc0的冻结国际象棋策略网络。UniMaia能够实现对游戏玩法的语义控制,包括开局选择和玩家强度,同时保留预训练的策略表征。我们进一步引入UniMaia-Aux,它结合了辅助时间条件化和行为预测目标。为了支持这项工作,我们构建了一个大规模元数据增强的Lichess数据集,开发了一个半自动提示生成管道,并引入了涵盖提示条件和元数据条件设置的基准。UniMaia在多个提示条件基准上实现了最先进的预期准确率,在通用指令遵循任务上达到了竞争性的最佳着法准确率,同时在人类着法预测基准上与专门的元数据条件方法保持竞争力。UniMaia-Aux进一步提高了多个评估设置下的预期准确率和行为建模,在最佳着法准确率上略有折衷。总体而言,我们的结果表明,无需端到端多模态训练即可实现领域特定策略网络的提示条件控制,同时突出了可控性与预测性能之间的权衡。

英文摘要

Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose UniMaia, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce UniMaia-Aux, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

2605.27766 2026-05-28 cs.AI 版本更新

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Aman Priyanshu, Supriti Vijay, Esha Pahwa

发表机构 * Foundation AI USA(Foundation AI美国)

AI总结 本研究通过多智能体模拟平台评估LLM智能体在社交压力下的隐私泄露风险,发现多轮社交交互显著增加隐私泄露,且泄露具有社交传染性,即使有隐私指令也无法完全消除。

详情
AI中文摘要

LLM安全评估主要在隔离环境中测试模型,然而部署的AI智能体越来越多地与其他智能体在持久社交环境中交互。我们引入了一个Moltbook风格的模拟平台,数千个LLM智能体在模拟的一个月内跨社区交互,并用它来评估在不同程度的社交压力下隐私作为下游安全问题的表现。我们发现从单轮到多轮社交评估会放大隐私侵犯(OpenAI模型上CIMemories 19.95%到我们的45.30%),泄露具有社交传染性,观察到同伴泄露后智能体泄露敏感信息的可能性增加8倍,并且明确的隐私指令减少但不能消除这种效应,即使有保护措施,泄露率仍高于37.8%。我们的发现表明,基于静态聊天的安全基准系统性地低估了智能体部署中的风险,而仅社交环境就足以引发单轮评估永远不会发现的敏感泄露。

英文摘要

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single-turn to multi-turn social evaluation amplifies privacy violations (CIMemories 19.95% to Ours 45.30% across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.

2605.27765 2026-05-28 cs.LG cs.AI 版本更新

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

恢复甜蜜点:用于LLM推理的通过率加权自蒸馏

Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar

发表机构 * College of Information Sciences and Technology(信息科学与技术学院)

AI总结 提出SC-SDPO方法,通过问题通过率加权自蒸馏损失,动态调整训练难度,提升LLM推理性能。

Comments 18 pages, 8 figures

详情
AI中文摘要

自蒸馏策略优化(SDPO)通过利用模型自身的反馈条件预测作为自教师,为大型语言模型的强化学习提供密集的令牌级信用分配。然而,与GRPO不同——其群体相对优势自然地将学习集中在一个中等难度问题的甜蜜点上——SDPO的基于KL的优势缺乏隐式的难度感知。我们通过GRPO的优势归一化视角分析这一差距。将可学习性框架扩展到归一化奖励,我们表明归一化吸收了方差项$p(1-p)$,使各问题的前导阶可学习性相等,留下$\sqrt{p(1-p)}$作为每个问题梯度中唯一的残差缩放因子。这一分析产生了一个简单的处方:用$[\hat{p}(1-\hat{p})]^{1/2}$加权每个问题的SDPO损失,得到SC-SDPO,即SDPO的尺度一致变体。所提出的权重作为在线策略rollout与批自适应归一化的零成本副产品获得,诱导出一个隐式课程,动态跟踪模型不断发展的能力。在科学推理和工具使用基准上的实验表明,SC-SDPO持续优于SDPO,在Qwen3-8B上获得+3.2/+4.3(mean@16/maj@16)的提升,在OLMo-3-7B上获得+1.8/+3.0的提升,同时在整个优化过程中保持稳定的训练动态。

英文摘要

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO's KL-based advantage lacks an implicit notion of difficulty awareness. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model's evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
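The weighting rule stated in the abstract can be sketched in a few lines of Python. This is only an illustration under assumptions: the estimate $\hat{p}$ is taken as the fraction of correct rollouts per question, the function names `pass_rate_weight` and `weighted_batch_loss` are hypothetical, and the batch-adaptive normalization mentioned in the abstract is not reproduced because the abstract does not specify it.

```python
import math


def pass_rate_weight(num_correct: int, num_rollouts: int) -> float:
    """Return sqrt(p_hat * (1 - p_hat)) for one question.

    p_hat is assumed to be the empirical pass rate over on-policy rollouts.
    The weight peaks at p_hat = 0.5 and vanishes at p_hat = 0 or 1.
    """
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    p_hat = num_correct / num_rollouts
    return math.sqrt(p_hat * (1.0 - p_hat))


def weighted_batch_loss(per_question_losses, correct_counts, num_rollouts):
    """Average the per-question losses after pass-rate weighting (plain mean, an assumption)."""
    weights = [pass_rate_weight(c, num_rollouts) for c in correct_counts]
    total = sum(w * loss for w, loss in zip(weights, per_question_losses))
    return total / len(per_question_losses)
```

With 16 rollouts per question, a question solved 8 times gets weight 0.5, while questions solved 0 or 16 times get weight 0, so the update concentrates on intermediate-difficulty questions.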

2605.27764 2026-05-28 cs.CV cs.AI 版本更新

Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

分割模型能理解世界吗?通过视觉思维链实现主动可供性推理

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

发表机构 * Northwestern University(西北大学) Northeastern University(东北大学) South China University of Technology(华南理工大学) Hong Kong Baptist University(香港浸会大学) Beijing Normal - Hong Kong Baptist University(北京师范大学-香港浸会大学)

AI总结 提出SegWorld框架,通过多级视觉思维链在意图级指令下进行主动场景观察和可供性推理,实现从目标到部件的高效分割。

详情
AI中文摘要

最近的分割模型将大语言模型(LLMs)与掩码解码器结合,将复杂的语言表达映射到掩码上,但其指令仍然是目标指涉的:它们描述、约束或暗示待分割的区域。然而,在现实世界的具身交互中,人类指令通常是意图级的,包括期望的结果而不指定实现该结果的区域。为弥合这一差距,我们引入SegWorld,其中模型在确定掩码之前通过多级视觉思维链(CoT)推理场景。在接收任何指令之前,它主动观察场景,描述可见对象并推断它们可能支持的可能事件。给定指令后,它继续思维链:从与意图相关的对象,到满足意图的动作,再到物理交互部位,即支持该动作的对象部分。我们将SegWorld形式化为概率推理,其中主动观察提供语言场景上下文,当指令以意图级别给出时,可改善掩码预测。我们构建了一个意图到部件的基准,用于评估从高层目标出发的可供性承载部件分割。实验表明,SegWorld在目标指涉指令上匹配指令驱动基线,并在意图级指令上显著提升。

英文摘要

Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often intent-level: they state the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

2605.27760 2026-05-28 cs.AI 版本更新

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad: 像梯度下降一样优化智能体技能

Hanyu Wang, Yifan Lan, Bochuan Cao, Lu Lin, Jinghui Chen

发表机构 * College of Information Sciences and Technology(信息科学与技术学院) The Pennsylvania State University(宾夕法尼亚州立大学) University Park, PA, USA

AI总结 提出SkillGrad框架,将技能包视为结构化参数,通过轨迹级损失、文本梯度诊断和动量记忆覆盖进行类梯度下降优化,在表格问答任务上平均提升6.7个百分点。

详情
AI中文摘要

智能体技能通过将可复用的程序化知识存储在结构化文件中,提供了一种轻量级的方式将LLM智能体适配到专业领域。然而,无论是从第三方下载还是自行生成,这些技能往往不可靠、不完整或过时。现有的技能演化方法通常通过启发式反思来解决这些缺陷,缺乏明确的优化公式。在本文中,我们提出了SkillGrad,一种受梯度下降启发的智能体技能优化框架。SkillGrad将技能包视为结构化参数,以梯度下降的方式进行优化:任务执行提供轨迹级损失证据,自动诊断随后提供指示修正方向的文本梯度。为了稳定跨迭代的优化,动量智能体将重复出现的诊断模式累积到持久记忆覆盖层中。最后,基于LLM的修补器通过对技能包进行层感知编辑来执行参数更新。在SpreadsheetBench Verified和WikiTableQuestions上的评估表明,SkillGrad在两个骨干LLM上始终优于基于训练的技能演化基线,平均比最强的基于训练的基线高出6.7个百分点。消融实验进一步表明,动量和对比诊断都对最终技能质量有贡献。

英文摘要

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by $6.7$ percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

2605.27758 2026-05-28 cs.LG cs.AI physics.comp-ph 版本更新

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

基于几何感知算子学习与内存高效低秩注意力的高保真工业碰撞动力学预测

Deepak Akhare, Mohammad Amin Nabian, Corey Adams, Sudeep Chavare, Sanjay Choudhry

发表机构 * Department of Aerospace and Mechanical Engineering, University of Notre Dame(圣母大学航空航天与机械工程系) NVIDIA General Motors(通用汽车)

AI总结 本文提出GeoTransolver框架,通过几何感知算子学习和内存高效低秩注意力机制,实现工业级碰撞动力学的高保真预测,在复杂梁和整车碰撞数据集上验证了其准确性和效率。

详情
AI中文摘要

汽车碰撞安全性优化仍然是一个安全关键挑战,需要通过迭代的高保真模拟来管理大规模非线性结构变形和能量耗散。虽然传统有限元求解器计算成本高昂,新兴的算子学习框架提供了快速的代理预测;然而,将其应用于工业级碰撞分析(其中复杂几何、接触非线性和快速演变的瞬态变形并存)仍然是一个未解决的挑战。在本文中,我们证明GeoTransolver框架为工业规模下准确、高保真的碰撞动力学预测提供了可行的解决方案。在复杂的保险杠梁和整车碰撞数据集上进行的基准测试表明,GeoTransolver能够捕捉多尺度几何上下文,并准确解析塑性变形模式以及关键乘员位置的加速度曲线。除了架构本身,我们提出并系统评估了一系列时间预测策略,包括一次性、时间条件和自回归滚动策略,证明一次性方法在显著降低训练开销和推理延迟的同时实现了最先进的准确性。作为次要贡献,我们引入了一种基于快速低秩注意力路由引擎(FLARE)的修改,应用于GeoTransolver注意力主干,将内存开销减少约2倍,同时进一步提高O(N)长程、高频瞬态的预测准确性,保留了基础框架的几何感知交叉注意力优势。我们的结果突显了几何感知算子学习在复杂、安全关键的汽车动力学高保真代理建模中的实际可行性。

英文摘要

Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While traditional finite element solvers are computationally prohibitive, emerging operator learning frameworks provide rapid surrogate predictions; however, applying them to industrial-scale crash analysis, where complex geometry, contact nonlinearities, and rapidly evolving transient deformation coexist, remains an open challenge. In this paper, we demonstrate that the GeoTransolver framework provides a viable solution for accurate, high-fidelity crash dynamics prediction at industrial scale. Benchmarked on complex bumper beam and full-vehicle crash datasets, GeoTransolver captures multi-scale geometric context and accurately resolves plastic deformation patterns as well as acceleration profiles at critical occupant locations. Beyond the architecture itself, we propose and systematically evaluate a suite of temporal prediction recipes, including one-shot, time-conditional, and autoregressive rollout strategies, demonstrating that the one-shot approach achieves state-of-the-art accuracy with significantly reduced training overhead and inference latency. As a secondary contribution, we introduce a Fast Low-rank Attention Routing Engine (FLARE)-based modification to the GeoTransolver attention backbone that reduces memory overhead by approximately 2x while further improving predictive accuracy for O(N) long-range, high-frequency transients, preserving the geometry-aware cross-attention strengths of the base framework. Our results highlight the practical viability of geometry-aware operator learning for high-fidelity surrogate modeling of complex, safety-critical automotive dynamics.

2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL 版本更新

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

阅读还是猜测?古希腊版本OCR中视觉语言模型的视觉定位失败

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

发表机构 * Inria(法国国家信息与自动化研究所)

AI总结 通过对比开放权重视觉语言模型与传统OCR基线在低资源古希腊批判版本上的表现,发现VLM即使错误也能生成流畅文本,表明其依赖语言先验,并引入扰动和标记级定位度量分析视觉证据。

详情
AI中文摘要

最近的研究表明,用于光学字符识别(OCR)的视觉语言模型(VLM)能够生成看似合理但缺乏视觉支持的文本,暗示其依赖语言先验。通过将开放权重VLM与传统OCR基线在低资源古希腊批判版本上进行对比,我们展示了VLM的错误即使在错误时也往往保持流畅,产生合理的希腊语替换,而传统引擎则产生局部识别噪声。为了分析解码过程中的视觉证据,我们引入了受控图像扰动和基于条件与无图像解码分布的标记级定位度量。在字符级扰动下,VLM与扰动的真实文本严重偏离,而传统OCR相对忠实;然而,标记级分析表明先验依赖是模型特定的:在OCR专业模型中,流畅的词汇错误几乎不依赖图像而产生,而通用VLM即使在错误时也仍然依赖于视觉输入。解码时干预未能可靠地恢复定位,而OCR后语言模型校正仅通过生成后修复文本改善了几个系统。我们的结果将先前关于OCR语言先验依赖的证据扩展到低资源历史文档和更广泛的模型集,表明流畅输出不一定具有视觉基础,并推动了超越总体准确性的可解释性驱动评估。

英文摘要

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.

2605.27748 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

马氏距离 PatchCore:协方差感知与流式兼容的工业异常检测

Niccolò Ferrari, Oligert Osmani, Evelina Lamma

发表机构 * Department of Engineering, University of Ferrara(费拉拉大学工程学院)

AI总结 提出马氏距离 PatchCore,通过协方差估计和流式处理改进 PatchCore,在保持性能的同时降低峰值内存并提升工业检测精度。

Comments 57 pages, 7 figures

详情
AI中文摘要

工业视觉异常检测通常是一类问题:正常图像丰富,而缺陷罕见、异质且常在系统设计时不可用。PatchCore 风格的检索适合此场景,因为它通过正常补丁特征的内存库对测试图像评分,但标准欧几里得几何忽略了特征相关性,且其离线构建在子采样前需实例化整个补丁池。我们引入马氏距离 PatchCore,一种协方差感知、流式兼容的 PatchCore 扩展。其人工智能贡献在于一种检索检测器,它在降维特征空间中估计正则化协方差模型并对嵌入进行白化,使得变换后的欧几里得最近邻搜索实现马氏距离检索。一个有界内存、可重复迭代的训练流程通过增量降维、在线协方差估计和流式聚合,无需一次性存储所有正常补丁即可构建内存库。工程应用是自动化工业检测,其中视觉异常检测必须在实际内存限制下保持准确。我们在一个公开的 15 类工业异常检测基准和三个工业数据集(涵盖吹灌封条带安瓿弯月面检测、琥珀色玻璃安瓿底部检测和冻干饼西林瓶检测)上评估该方法。马氏距离 PatchCore 在公开基准上保留了大部分离线 PatchCore 的图像级性能,同时将峰值内存从 5.41 GB 降至 2.78 GB,并将选定的工业平均图像接收者操作特征曲线下面积从 0.981 提升至 0.986。

英文摘要

Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unavailable during system design. PatchCore-style retrieval suits this setting because it scores test images from a memory bank of normal patch features, but the standard Euclidean geometry ignores feature correlations and its offline construction materialises the full patch pool before subsampling. We introduce Mahalanobis PatchCore, a covariance-aware, streaming-compatible extension of PatchCore. Its artificial intelligence contribution is a retrieval detector that estimates a regularised covariance model in reduced feature space and whitens embeddings, so Euclidean nearest-neighbour search after transformation implements Mahalanobis retrieval. A bounded-memory, re-iterable training pipeline builds the memory bank without storing all normal patches at once, using incremental dimensionality reduction, online covariance estimation, and streaming aggregation. The engineering application is automated industrial inspection, where visual anomaly detection must remain accurate under practical memory limits. We evaluate the method on a public 15-category industrial anomaly-detection benchmark and three industrial datasets covering blow-fill-seal strip-ampoule meniscus inspection, amber-glass-ampoule bottom inspection, and lyophilised-cake vial inspection. Mahalanobis PatchCore preserves most offline PatchCore image-level performance on the public benchmark while reducing peak memory from 5.41 to 2.78 GB, and improves the selected industrial mean image area under the receiver operating characteristic curve from 0.981 to 0.986.
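The claim that Euclidean nearest-neighbour search after whitening implements Mahalanobis retrieval can be checked with a small NumPy sketch. It assumes a ridge-style regulariser (`shrinkage` times the identity) and a Cholesky-based whitening map; the names `fit_whitener` and `whiten` are hypothetical, and the paper's actual covariance estimator and streaming pipeline are not reproduced here.

```python
import numpy as np


def fit_whitener(features: np.ndarray, shrinkage: float = 1e-3):
    """Estimate a regularised covariance and its Cholesky factor.

    features has shape (n_patches, dim). The ridge term is an assumed
    regulariser chosen for illustration, not necessarily the paper's.
    """
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov_reg = cov + shrinkage * np.eye(cov.shape[0])
    chol = np.linalg.cholesky(cov_reg)  # cov_reg = chol @ chol.T
    return mean, chol, cov_reg


def whiten(x: np.ndarray, mean: np.ndarray, chol: np.ndarray) -> np.ndarray:
    """Map rows of x to z = chol^{-1} (x - mean), so ||z_a - z_b|| is the Mahalanobis distance."""
    return np.linalg.solve(chol, (x - mean).T).T
```

Because $\|L^{-1}(a-b)\|^2 = (a-b)^\top (LL^\top)^{-1} (a-b)$, any off-the-shelf Euclidean nearest-neighbour index built on the whitened memory bank returns Mahalanobis neighbours with respect to the regularised covariance.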

2605.27744 2026-05-28 cs.AI 版本更新

A Policy-Driven Runtime Layer for Agentic LLM Serving

一种面向智能体LLM服务的策略驱动运行时层

Rui Zhang, Chaeeun Kim, Liting Hu

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 针对多智能体LLM系统中跨层策略难以高效实现的问题,提出在框架与引擎之间插入智能体运行时层,通过四个原语支持任意智能体感知策略,并在KV缓存策略CacheSage上验证了有效性。

详情
AI中文摘要

多智能体LLM系统已成为主流生产工作负载,但服务栈并非为其构建。上层的智能体框架知道智能体身份、角色、模式和调度结构,但从未看到引擎级事件;下层的服务引擎看到每个事件但对智能体一无所知。许多跨层策略依赖于两者:前缀缓存、批处理整形、推测执行、公平性、工具结果记忆、安全执行等。每个策略都存在于两层之间的缝隙中,目前通过向相邻的某一层打一次性补丁来解决。我们认为这个缝隙最好通过架构变更而非点修复来解决:在框架和引擎之间插入第三层,即智能体运行时层,暴露四个原语(观察、评分、预测、行动),任何智能体感知策略都可以插入其中,并以智能体身份作为共享坐标。我们将九个具体策略映射到该层,并在具有最大即时服务成本杠杆的策略上深入验证了该抽象:跨会话的KV缓存,实例化为CacheSage,它在线学习每工作负载的智能体转移矩阵,并用于基于存活的驱逐和步间预取。在五个真实多智能体工作负载上的初步结果显示,与未修改的服务栈相比,缓存命中率提升13到37个百分点,平均TTFT降低12%到29%,吞吐量提高6%到14%。

英文摘要

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
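The idea of learning an agent transition matrix online and using it for eviction and prefetch can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the class `TransitionTracker` and its methods are hypothetical, and a one-step "probability of running next" score stands in for the survival-based eviction described in the abstract, whose details are not given there.

```python
from collections import defaultdict


class TransitionTracker:
    """Online first-order counts of which agent follows which (illustrative only)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last_agent = None

    def observe(self, agent_id: str) -> None:
        """Record that agent_id ran after the previously observed agent."""
        if self.last_agent is not None:
            self.counts[self.last_agent][agent_id] += 1
        self.last_agent = agent_id

    def next_prob(self, current, candidate: str) -> float:
        """Empirical probability that candidate runs right after current."""
        row = self.counts.get(current)
        if not row:
            return 0.0
        return row.get(candidate, 0) / sum(row.values())

    def eviction_victim(self, cached_agents):
        """Pick the cached agent least likely to run next (ties go to the first)."""
        return min(cached_agents, key=lambda a: self.next_prob(self.last_agent, a))

    def prefetch_candidate(self):
        """Return the most likely next agent, or None if nothing has been learned."""
        row = self.counts.get(self.last_agent)
        if not row:
            return None
        return max(row, key=row.get)
```

In this sketch the agent identity is the only key, which mirrors the abstract's point that identity is the shared coordinate between framework and engine; a real policy would also need cache sizes, recency, and engine events.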

2605.27739 2026-05-28 cs.LG cs.AI 版本更新

Worker Disagreement Reveals Sharp Directions in Local SGD

工作者分歧揭示局部SGD中的尖锐方向

Tolga Dimlioglu, Kristi Topollai, Anna Choromanska

发表机构 * New York University(纽约大学)

AI总结 本文通过理论分析和实验证明,局部SGD中的工作者平均间隙协方差能够捕捉Hessian矩阵的尖锐方向,从而提供一种廉价的无Hessian估计方法。

Comments 5 pages main body, 18 pages appendix - Accepted to HiLD 2026, ICML

详情
AI中文摘要

深度神经网络训练通常表现出高度各向异性的损失几何,其中少数尖锐的主导Hessian方向与大量平坦区域共存。梯度往往不成比例地与这些主导方向对齐,尽管稳定的进展通常需要穿过平坦区域的方向。因此,估计主导子空间是有用的,但使用基于Hessian的直接方法成本高昂。我们表明,标准局部SGD通过工作者分歧暴露了这种几何结构。我们从理论上证明,工作者平均间隙协方差由随机梯度噪声和Hessian曲率塑造,导致工作者沿着尖锐的曲率敏感方向产生分歧。因此,工作者平均间隙提供了主导子空间的廉价无Hessian估计。在MLP、CNN和Transformer上的实验表明,由工作者平均间隙形成的子空间捕获了位于主导Hessian特征空间中的梯度分量的很大一部分。

英文摘要

Deep neural network training often exhibits highly anisotropic loss geometry, where a few sharp dominant Hessian directions coexist with a large flatter bulk. Gradients tend to align disproportionately with these dominant directions, although stable progress often requires movement through flatter bulk directions. Estimating the dominant subspace is therefore useful but costly with direct Hessian-based methods. We show that standard Local SGD exposes this geometry through worker disagreement. We theoretically show that the worker-average gap covariance is shaped by stochastic-gradient noise and Hessian curvature, causing workers to disagree along sharp, curvature-sensitive directions. Thus, worker-average gaps provide a cheap Hessian-free estimator of the dominant subspace. Experiments on MLPs, CNNs, and Transformers show that subspaces formed by worker-average gaps capture a substantial fraction of the gradient component lying in the dominant Hessian eigenspace.

2605.27724 2026-05-28 cs.RO cs.AI 版本更新

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

HumanoidMimicGen: 通过全身规划生成行走操作数据

Kevin Lin, Ajay Mandlekar, Caelan Reed Garrett, Nikita Chernyadev, Yu Fang, Runyu Ding, Yuqi Xie, Justin Tran, Linxi Fan, Yuke Zhu

发表机构 * NVIDIA The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出HumanoidMimicGen方法,通过全身规划自动生成人形机器人行走操作演示数据,在模拟基准上使联合训练的策略性能提升20%。

Comments website: https://humanoidmimicgen.github.io/

详情
AI中文摘要

模仿学习是训练人形机器人行走和操作的一种有前景的方法,但它需要大量演示,而这些演示通过遥操作收集耗时且困难。现有的数据生成算法可以自动合成操作器的演示,但它们在类人机器人上效果不佳,因为其高维复合动作空间涉及手臂、腿和躯干。我们提出HumanoidMimicGen,一种生成人形机器人腿部行走操作数据的方法。我们的方法将少量源演示中的接触丰富的全身技能适应到新状态,并泛化到物体姿态的变化。通过将这些单臂和双臂技能与全身运动规划和操作规划交替进行,该方法在多样化的场景和布局中生成稳定、无碰撞的数据。为了评估我们的方法,我们引入了一个新的模拟行走操作基准,包含九个测试人形机器人行走操作能力的多样化任务。在那里,我们证明HumanoidMimicGen自动生成用于模仿学习的大规模数据集,并能够系统研究数据生成和策略学习决策如何影响模型性能。我们表明,与仅使用真实世界数据训练的策略相比,与HumanoidMimicGen生成的数据联合训练的全身视觉运动策略性能提升20%。

英文摘要

Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demonstrations, which are time-intensive and difficult to collect via teleoperation. Existing data-generation algorithms can automatically synthesize demonstrations for manipulators, but they are ineffective on humanoids because their high-dimensional composite action spaces involve arms, legs, and torsos. We present HumanoidMimicGen, a method for generating humanoid legged loco-manipulation data. Our method adapts contact-rich whole-body skills from a handful of source demonstrations to new states, generalizing across changes in object pose. By interleaving these single- and dual-arm skills with whole-body locomotion and manipulation planning, the method generates stable, collision-free data across diverse scenes and layouts. To evaluate our approach, we introduce a new simulated loco-manipulation benchmark containing nine diverse tasks that test humanoid loco-manipulation capabilities. There, we demonstrate that HumanoidMimicGen automatically generates large datasets for imitation learning and enables a systematic study of how data generation and policy learning decisions impact model performance. We show that whole-body visuomotor policies co-trained with data generated by HumanoidMimicGen outperform those trained only on real-world data by 20%.

2605.27721 2026-05-28 cs.CL cs.AI 版本更新

UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

UserHarness:利用用户心智增强智能体心理理论

Cheng Qian, Jiayu Liu, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出UserHarness框架,通过显式重建用户心智状态(信念、意图等)进行心理理论推理,在五个基准上达到95.94%的宏准确率,相对提升超15%。

Comments 19 Pages, 4 Figures, 2 Tables

详情
AI中文摘要

理解用户的信念和意图对于构建有效的智能体助手至关重要。这种能力通常通过心理理论(ToM)任务来评估,成功需要从用户的角度进行推理。然而,许多现有方法通过复杂流水线处理ToM,间接建模行为,而没有显式重建用户的心智状态。这忽略了问题的核心结构:用户基于其信念行动,信念通过观察环境更新;信念和意图共同决定行动,行动又改变环境;社会推理通常需要关于他人信念或意图的嵌套信念。我们提出UserHarness,一个简单的框架,将ToM推理重新定义为显式的用户心智重建。UserHarness分解用户的心智状态、其与外部环境的关系以及由此产生的行动,使智能体能够跟踪用户的观察、信念、意图和行为。在五个基准上,UserHarness达到高达95.94%的宏准确率,相比现有推理方法相对提升超过15%,相比最强的纯提示框架相对提升约20%。这些结果表明,稳健的用户理解需要从用户心智根源进行推理,将用户驾驭作为未来更具适应性的助手的有前景的基础。

英文摘要

Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user's perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state. This misses the core structure of the problem: users act based on their beliefs, which are updated through observations of the environment; beliefs and intentions jointly determine actions, which in turn change the environment; and social reasoning often requires nested beliefs about what others believe or intend. We propose UserHarness, a simple framework that reframes ToM reasoning as explicit user-mind reconstruction. UserHarness decomposes the user's mental state, its relation to the external environment, and the actions that follow from it, enabling agents to track what the user observes, believes, intends, and does. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, improving over existing inference methods by more than 15% relative and over the strongest prompt-only harness by about 20% relative. These results suggest that robust user understanding requires reasoning from the roots of the user's mind, positioning user harnessing as a promising foundation for more adaptive future assistants.

2605.27712 2026-05-28 cs.AI 版本更新

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

前缀安全贝叶斯信念追踪用于LLM推理可靠性:将校准与排序分离

Zhenghan Song, Yunyi Li, Yulong Liu

发表机构 * Cornell University(康奈尔大学) Columbia University(哥伦比亚大学)

AI总结 提出前缀安全贝叶斯信念追踪(SBBT)框架,通过分离概率质量与排序能力,在长链推理中实现可靠的在线校准与不确定性估计。

详情
AI中文摘要

长推理轨迹需要在最终答案已知之前进行可靠性估计。我们研究前缀条件的事件成功估计 $P(y=1 \mid o_{1:t})$,使用前缀安全观测。序列贝叶斯信念追踪(SBBT)校准观测似然并递归更新两状态信念,为标量分数、文本和自我验证标记、隐藏聚类、令牌池探针以及潜在轨迹特征提供通用追踪器。在MATH-500、GSM8K、AIME 2025和RIMO-N上生成的开源权重轨迹中,概率质量和排序分离:仅使用分数的SBBT通常改善Brier分数,而AUROC提升需要超出强前缀安全基线的结构感知证据。在最强的困难数学设置中,结构感知观测相对于标准前缀安全基线达到+0.110 AUROC。在相同前缀分类器审计下,MATH-500文本标记和RIMO-N自我验证信号保持正向。这些发现共同支持SBBT作为校准感知的在线推理框架,并揭示证据机制:标量分数主要支持概率质量,而结构感知前缀信号仅在强前缀安全基线尚未吸收排序证据时支持排序。

英文摘要

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.
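A recursive two-state belief update of the kind the abstract describes can be sketched as a plain Bayes filter. The sketch assumes the latent outcome $y$ is fixed for a trace (no state transitions) and that calibrated likelihoods $P(o_t \mid y=1)$ and $P(o_t \mid y=0)$ are supplied by some upstream model; the function names `update_belief` and `track` are hypothetical, and the paper's calibration step is not reproduced.

```python
def update_belief(prior_success: float, lik_success: float, lik_failure: float) -> float:
    """One Bayes step for P(y=1 | o_{1:t}) given the likelihoods of the new observation."""
    numerator = prior_success * lik_success
    denominator = numerator + (1.0 - prior_success) * lik_failure
    if denominator == 0.0:
        # Observation is impossible under both states; keep the prior unchanged.
        return prior_success
    return numerator / denominator


def track(prior_success: float, likelihood_pairs):
    """Return the belief after each prefix, using only observations seen so far."""
    beliefs = []
    belief = prior_success
    for lik_success, lik_failure in likelihood_pairs:
        belief = update_belief(belief, lik_success, lik_failure)
        beliefs.append(belief)
    return beliefs
```

Starting from a prior of 0.5, an observation four times more likely under success moves the belief to 0.8, and an uninformative observation (equal likelihoods) leaves it unchanged, so each belief depends only on the prefix up to that step.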

2605.27710 2026-05-28 cs.AI 版本更新

DeepSciVerify: Verifying Scientific Claim-Citation Alignment via LLM-Driven Evidence Escalation

DeepSciVerify: 通过LLM驱动的证据升级验证科学声明与引文对齐

Shaghayegh Sadeghi, Khashayar Khajavi, Rise Adhikari, Alexander Tessier

发表机构 * School of Computing Science, Simon Fraser University(西蒙弗雷泽大学计算科学学院)

AI总结 提出DeepSciVerify两阶段流水线,结合摘要推理与选择性升级到段落证据,在SCitance基准上以86.7 Micro-F1超越纯摘要基线4.5点,同时67%实例无需全文检索。

详情
AI中文摘要

声明与其引用证据之间的错位是大语言模型生成报告中的常见失败模式,限制了其在科学及其他高风险场景中的可靠性。我们提出DeepSciVerify,一个用于科学声明-引文验证的两阶段流水线,结合摘要级推理与选择性升级到段落级证据。该系统首先使用摘要验证声明,并对不确定案例进行延迟处理,仅在必要时检索和分析全文段落。该设计利用了LLM之间的互补行为,因为某些模型在不确定性下更为保守,而另一些则更为果断。在SCitance基准上,DeepSciVerify达到了86.7 Micro-F1,比强纯摘要基线高出4.5点,同时67%的实例无需全文检索即可解决。这些结果表明,选择性证据升级提高了声明-引文验证的准确性和效率。

英文摘要

Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines abstract-level reasoning with selective escalation to passage-level evidence. The system first verifies claims using the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. This design leverages complementary behaviors across LLMs, as some models are more conservative while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both accuracy and efficiency in claim-citation verification.

2605.27703 2026-05-28 cs.AI 版本更新

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

面向资源受限智能体语言模型的分层提示域控制与学习

Joan Vendrell Gallart, Russell Bent, Michael Grosskopf

发表机构 * Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室)

AI总结 提出分层控制与学习框架,通过蒸馏学习输出模式、在线监控与提示域控制,解决资源受限下智能体语言模型的可靠性问题。

详情
AI中文摘要

大型语言模型越来越多地部署在智能体系统中,它们必须遵循结构化协议,适应不断变化的状态,并在内存、延迟和成本限制下运行。在这种场景下,提示扩展不可靠:增长的上下文可能将紧凑模型推离其有效提示域,而部署时的微调受限于稀缺的数据和计算资源。我们提出了一种分层控制与学习框架,其中紧凑模型首先通过蒸馏学习所需的输出模式,然后由预言机-控制器循环在线监督。控制器监控协议有效性和语义性能,将累积历史投影到可行的提示域中,并在发生漂移时触发轻量级的预言机监督微调。这将用于通信兼容性的模式学习与用于任务级纠正的语义适应分离开来。我们形式化了提示域可行性和注意力引起的饱和,从而激励对有效提示状态的控制,而非依赖名义上下文长度。使用多保真贝叶斯优化作为受控顺序测试平台,我们描述了一个核心部署故障模式,并展示了相对于非分层、仅蒸馏和非蒸馏基线的改进的可靠性和成本效益。

英文摘要

Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

2605.27700 2026-05-28 cs.DL cs.AI 版本更新

CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text

CiteCheck: 基于检索的科学文本中LLM引用幻觉检测

Khashayar Khajavi, Shaghayegh Sadeghi, Rise Adhikari, Alexander Tessier

发表机构 * School of Computing Science, Simon Fraser University(西蒙·弗雷泽大学计算科学学院)

AI总结 提出CiteCheck框架,通过从外部学术来源检索候选出版物、使用结构化LLM验证器比较引用与候选信息,并将验证器得分映射为精确、次要和主要三个标签,以检测LLM生成的引用幻觉,在物理基准上达到88.7 macro-F1和88.9%准确率。

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于生成科学报告,但它们可能产生看似合理但包含损坏元数据或指向不存在论文的引用。我们引入了CiteCheck,一个用于引用幻觉检测的混合框架,它验证引用是否对应于真实的学术工作以及其元数据是否忠实于该工作。CiteCheck从外部学术来源检索候选出版物,使用结构化LLM验证器将引用与检索到的候选进行比较,并将验证器得分映射为三个标签:精确、次要和主要。我们还构建了一个包含982个引用的物理基准,具有受控的损坏,这些损坏捕获了细微的元数据漂移和完全捏造的引用。在保留测试集上,CiteCheck达到了88.7 macro-F1和88.9%的准确率,优于GPT、Claude和Gemini基线,包括网络搜索和少样本变体。这些结果表明,可靠的引用验证受益于结合学术检索、基于结构化LLM的比较和校准的决策规则。

英文摘要

Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible while containing corrupted metadata or pointing to papers that do not exist. We introduce CiteCheck, a hybrid framework for citation hallucination detection that verifies whether a citation corresponds to a real scholarly work and whether its metadata is faithful to that work. CiteCheck retrieves candidate publications from external scholarly sources, compares the citation against the retrieved candidate using a structured LLM verifier, and maps verifier scores into three labels: Exact, Minor, and Major. We also construct a 982-citation physics benchmark with controlled corruptions that capture both subtle metadata drift and fully fabricated references. On the held-out test set, CiteCheck achieves 88.7 macro-F1 and 88.9% accuracy, outperforming GPT, Claude, and Gemini baselines, including web-search and few-shot variants. These results show that reliable citation verification benefits from combining scholarly retrieval, structured LLM-based comparison, and calibrated decision rules.

2605.27697 2026-05-28 cs.RO cs.AI cs.LG 版本更新

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

仿真引导的扩散方法用于去中心化多机器人运动规划

Jinhao Liang, Sven Koenig, Ferdinando Fioretto

发表机构 * University of Virginia(弗吉尼亚大学) University of California, Irvine(加州大学尔湾分校)

AI总结 提出一种基于约束感知扩散模型的去中心化框架SID,通过仿真邻居未来轨迹并利用安全约束规划自身轨迹,在密集场景下实现高效协调。

详情
AI中文摘要

去中心化多机器人运动规划要求每个机器人仅根据局部观测生成无碰撞轨迹,无需全局感知或可靠通信。然而,大多数现有规划器(无论是经典方法还是基于学习的方法)都是从局部观测的静态快照生成轨迹,这限制了它们预测相邻机器人未来行为的能力。随着机器人数量增加和环境变得更加拥挤,这一限制变得至关重要。为了克服这一挑战,本文引入了仿真引导的扩散(SID),这是一种基于约束感知扩散模型(CADM)的去中心化框架。SID首先使用CADM从当前观测状态仿真相邻机器人的未来轨迹,然后利用这些仿真提供的安全约束,使用相同的CADM规划每个机器人自身的轨迹。关键的是,对邻居的精确仿真使得一种最小通信方案成为可能,该方案仅在高度拥挤的场景中必要时触发协调。在多种环境中的实验表明,SID在规划有效性和约束满足方面始终优于基线方法,并且可扩展到108个机器人和160个障碍物的场景。

英文摘要

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot's own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

2605.27686 2026-05-28 cs.CV cs.AI 版本更新

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

张量记忆:用于长程Transformer的固定大小循环状态

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA(麻省理工学院) IBM Research, Cambridge, MA, USA(IBM研究院) University of Toronto, Toronto, Canada(多伦多大学)

AI总结 提出张量记忆模块,通过固定大小的3D循环张量状态增强Transformer,以解耦状态容量与输入长度,并保持空间归纳偏置,适用于长程视频理解。

详情
AI中文摘要

Transformer通过将空间和时间展平为长令牌序列来处理图像和视频。虽然注意力和KV缓存保留了过去的特征,但其内存随序列长度增长,并且缺乏显式的、持久化的空间状态,这使得长程视频理解和遮挡敏感推理变得困难。我们提出张量记忆,一种轻量级模块,通过固定大小的循环3D记忆张量增强Transformer块:令牌通过可微的软写入将内容沉积为围绕预测连续3D位置的高斯加权体积到体素网格中,记忆通过高效的局部交互算子和门控循环动态更新,令牌通过连续采样和门控残差融合读取上下文。由于记忆张量大小固定,张量记忆将状态容量与输入长度解耦,同时保持空间归纳偏置。我们在标准语言、图像和视频基准测试以及一个旨在隔离持久状态何时有益的受控玩具诊断套件上评估该模块;它与标准Transformer训练流程集成,可以附加到现有块或从中移除,而无需其他架构更改。

英文摘要

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
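
The soft write described above can be pictured as spreading a token's content over a fixed voxel grid with Gaussian weights centred on a continuous 3D position. The sketch below is a pure-Python illustration under several assumptions that the abstract does not state: the content is a single scalar rather than a feature vector, the weights are normalized to sum to one, and `sigma` is a free parameter. It is not the authors' implementation.

```python
import math

def soft_write(memory, position, content, sigma=1.0):
    """Add `content` to a D x H x W grid, weighted by a normalized Gaussian
    centred on the continuous location `position` = (z, y, x)."""
    depth, height, width = len(memory), len(memory[0]), len(memory[0][0])
    weights = {}
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                d2 = ((z - position[0]) ** 2
                      + (y - position[1]) ** 2
                      + (x - position[2]) ** 2)
                weights[(z, y, x)] = math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(weights.values())
    for (z, y, x), w in weights.items():
        memory[z][y][x] += content * w / total  # normalization is an assumption
    return memory

# A 4 x 4 x 4 grid; its size never changes, however many writes are made.
grid = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
soft_write(grid, position=(1.2, 2.0, 0.7), content=3.0)
soft_write(grid, position=(3.0, 0.5, 2.5), content=1.0)
```

The weights are a smooth function of the position, which is the property that lets a framework with automatic differentiation propagate gradients to the predicted location; the memory footprint depends only on the grid dimensions and not on the number of tokens written.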

2605.27681 2026-05-28 cs.AI cs.LG 版本更新

Behavioural Analysis of Alignment Faking

对齐伪造的行为分析

Nathaniel Mitrani Hadida, Rhea Karty, David Williams-King, Alan Cooney

发表机构 * University of Cambridge(剑桥大学) Harvard University(哈佛大学) ERA UK AISI(英国人工智能安全研究所)

AI总结 通过可控最小设置研究对齐伪造,发现其驱动因素包括价值观、目标保护和谄媚,且比先前报告更普遍,可从情境线索和模型倾向预测。

Comments preprint

详情
AI中文摘要

对齐伪造(AF)指的是模型为了保持其部署偏好,策略性地遵守训练目标以避免行为修改。理解AF何时以及为何出现很重要,因为模型在区分训练和部署方面越来越擅长。先前的工作发现AF脆弱、对提示敏感且依赖模型,其潜在驱动因素尚不清楚。我们在一个隔离其核心组件的可控最小设置中研究AF,并在比先前报告更广泛的模型中观察到它,包括小规模模型。我们识别出三个可分离的驱动因素——价值观、目标保护和谄媚——并通过有针对性的提示消融和激活引导表明每个因素独立地调节AF行为。我们的结果表明AF比先前报告更普遍,并且其发生可从情境线索和可测量的模型倾向(如基线谄媚和陈述的价值观)预测。这种分解为未来模型中检测和缓解AF提供了具体方向。

英文摘要

Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers -- values, goal guarding, and sycophancy -- and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurable model tendencies such as baseline sycophancy and stated values. The decomposition suggests concrete directions for detecting and mitigating AF in future models.

2605.27674 2026-05-28 cs.CR cs.AI cs.LG 版本更新

Backdoor Attacks on Fault Detection and Localization in Cyber-Physical Systems

针对信息物理系统中故障检测与定位的后门攻击

Abile Jean, Kuniyilh S

发表机构 * GitHub

AI总结 本文研究针对现代信息物理系统中基于机器学习的故障检测与定位机制的后门攻击,通过设计触发器并评估攻击成功率,实验表明即使仅投毒10%的数据也能成功实施攻击。

详情
AI中文摘要

信息物理系统(CPS)集成了传感、通信、计算和控制,以支持关键基础设施,包括智能电网、工业自动化和控制系统。在电力公用事业领域,CPS中使用各种控制器来确保系统检测和恢复故障(如电压波动),并在配电系统中进行负载平衡。基于机器学习和深度学习的故障检测与定位框架因其能够实时识别异常和操作故障,近年来在CPS中受到广泛关注。然而,这些智能模型容易受到对抗性机器学习攻击,尤其是后门攻击。在后门攻击中,对手将恶意模式注入训练数据,使得模型在大多数情况下表现正常,但当触发特定模式时产生攻击者控制的输出。本文研究了针对现代CPS系统中最新机器学习管道的故障检测与定位机制的后门攻击威胁。我们定义了这些威胁,并通过设计触发器以及在CPS领域评估其成功率来探索如何实现这些攻击。我们的实验表明,即使仅投毒10%的数据,攻击也能成功。

英文摘要

Cyber-Physical Systems (CPS) integrate sensing, communication, computation, and control to support critical infrastructure, including smart grids, industrial automation, and control systems. In the electrical utility domain, various controllers are used in CPS to ensure the system detects and recovers from faults, such as voltage fluctuations, and to perform load balancing in distribution systems. Machine learning- and deep learning-based fault detection and localization frameworks have recently gained significant attention in CPS for their ability to identify anomalies and operational failures in real time. However, these intelligent models are vulnerable to adversarial machine learning attacks, particularly backdoor attacks. In a backdoor attack, an adversary injects malicious patterns into the training data so that the model behaves normally most of the time but produces attacker-controlled outputs when triggered by specific patterns. This paper investigates the threat of backdoor attacks against fault detection and localization mechanisms in recent ML pipelines used in modern CPS. We define these threats and explore how they can be realized by designing triggers and evaluating their success in the CPS domain. Our experiments show that the attack succeeds even at a 10% poisoning rate.

2605.27668 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

将LLM与人类不确定性对齐:用于LLM预测的Beta-Bernoulli校准器

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

发表机构 * Agentic Learning AI Lab(代理学习AI实验室) New York University(纽约大学) The University of Chicago(芝加哥大学) Chronologies AI

AI总结 提出Beta-Bernoulli校准器(BBC),通过结合二元结果和人类预测信号,将初始点估计转换为事件似然分布,实现校准和不确定性量化。

详情
AI中文摘要

概率预测估计不确定未来事件的可能性。为了改进LLM预测,现有方法通常从二元结果中学习以输出语言化预测。然而,尽管聚合的人类预测在群体概率估计和预测者之间的一致程度中都包含丰富信息,如何利用这些信号仍未充分探索。为了解决这个问题,我们提出了Beta-Bernoulli校准器(BBC),它将来自任何模型的初始点估计转换为事件似然分布,使用来自二元结果和人类预测的监督。BBC对事件似然$p \sim \text{Beta}(α, β)$和结果$y \sim \text{Bernoulli}(p)$建模,均值作为校准的点预测,方差作为认知不确定性。我们的结果表明,BBC通常比传统的后验校准方法和专门为预测微调的模型提供更好校准和更准确的预测,同时保持轻量级并具有良好的泛化能力。我们还表明,BBC捕获的认知不确定性是比语言化置信度更可靠的预测误差指标。

英文摘要

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(α, β)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
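
The two quantities BBC reads off its Beta distribution are the standard Beta moments: the mean $\alpha/(\alpha+\beta)$ as the point forecast and the variance $\alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$ as the uncertainty. The sketch below computes only these moments; how BBC predicts $\alpha$ and $\beta$ from a model's initial estimate is not described in the abstract, so the example values are made up.

```python
def beta_forecast(alpha, beta):
    """Return (mean, variance) of a Beta(alpha, beta) distribution."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance

# Two hypothetical forecasts with the same point estimate (0.7)
# but very different concentration.
confident = beta_forecast(70.0, 30.0)
uncertain = beta_forecast(0.7, 0.3)
print(confident, uncertain)
```

Both forecasts say "70%", but the second distribution is far more spread out. That difference is the kind of signal a single verbalized probability cannot express.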

2605.27662 2026-05-28 cs.LG cs.AI 版本更新

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

优化器如何塑造等变神经网络中的学习解

Teodor-Mihai Stupariu, Andrei Manolache

发表机构 * University of Stuttgart, Germany(斯图加特大学) International Max Planck Research School for Intelligent Systems, Germany(国际马克斯·普朗克智能系统研究学校) Tudor Vianu High School of Computer Science, Romania(图多尔·维亚努计算机科学高中)

AI总结 本文通过比较Muon和Adam优化器在点云和分子学习任务中的表现,发现Muon能改善等变神经网络的优化效果,并分析其导致更规则损失曲面和更高有效秩的机制。

Comments Accepted at ICML 2026 Workshop on Weight-Space Symmetries

详情
AI中文摘要

等变神经网络通过构造编码几何对称性,但它们通常难以优化,并且可能表现不如约束较少的架构。越来越多的研究通过架构修改(如约束松弛或近似等变)来解决这一问题,而优化器的作用相对未被充分探索。我们通过比较Muon和Adam在点云和分子学习设置下的多种等变和几何架构来研究这一方向。在对比最清晰的ModelNet40上,Muon在所有考虑的架构上均一致优于Adam。然后,我们通过Hessian估计、损失曲面可视化以及学习权重和中间表示的谱性质来分析训练后的ModelNet40检查点。Muon达到的检查点具有更大的Hessian曲率汇总但更规则的损失曲面,并且其学习权重和表示具有更高的稳定秩和有效秩。这些观察表明,优化器设计与几何归纳偏置之间的相互作用值得社区进一步关注。

英文摘要

Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less constrained architectures. A growing body of work addresses this through architectural modifications such as constraint relaxation or approximate equivariance, while the role of the optimizer remains comparatively underexplored. We study this direction by comparing Muon and Adam across several equivariant and geometric architectures in point-cloud and molecular learning settings. On ModelNet40, where the comparison is clearest, Muon consistently improves over Adam across all architectures considered. We then analyze the trained ModelNet40 checkpoints through Hessian estimates, loss surface visualizations, and spectral properties of learned weights and intermediate representations. The checkpoints reached by Muon have larger Hessian curvature summaries but more regular loss surfaces, and their learned weights and representations have higher stable and effective ranks. These observations suggest that the interaction between optimizer design and geometric inductive bias deserves further attention from the community.

2605.27659 2026-05-28 cs.LG cs.AI 版本更新

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

通过概率潜在嵌入和动态策略自适应实现迁移强化学习用于Sim-to-Real部署

Gengyue Han, Yiheng Feng

发表机构 * Lyles School of Civil and Construction Engineering, Purdue University, West Lafayette, USA(普渡大学土木与建设工程学院) Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, USA(普渡大学埃尔莫尔家庭电气与计算机工程学院)

AI总结 提出一种基于概率潜在嵌入和动态策略自适应的强化学习框架,通过元学习推断环境潜在表示并动态调整风险水平,实现安全高效的Sim2Real策略迁移。

详情
AI中文摘要

由于资源有限和公共安全问题,许多信息物理系统(如自动驾驶汽车)的深度强化学习(RL)智能体首先在模拟器中进行训练。然而,当部署到真实世界环境中时,由于不可避免的Sim2Real差距,它们常常遭受性能下降或安全违规。现有的零样本方法,如鲁棒安全RL和域随机化,缓解了这一问题,但通常以性能下降或遇到未建模系统动态时的残余安全风险为代价。为了解决这些限制,我们提出了一种新颖的强化学习框架,通过概率潜在嵌入和动态策略自适应实现安全高效的策略迁移。我们考虑在不同环境上下文下的一族约束马尔可夫决策过程(CMDP)。通过利用元RL中的潜在上下文变量,所提出的框架从模拟经验中推断环境的潜在表示。此外,它结合了分布RL公式,允许根据潜在上下文变量的估计精度动态调整部署策略的风险水平。该策略在早期部署阶段促进安全性,并通过在Sim2Real差距下的快速策略自适应提高效率。

英文摘要

Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real-world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when they encounter unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging latent context variables from meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows the risk level of the deployed policy to be adjusted dynamically, based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.

2605.27656 2026-05-28 cs.IR cs.AI 版本更新

Developing an Intelligent Job Recommendation System Using Semantic Retrieval and Explainable AI Techniques

利用语义检索与可解释AI技术开发智能职位推荐系统

Hussein Al Awad, Khaled Fathi Omar

发表机构 * Master of Web Science, Syrian Virtual University, Damascus, Syria(Web科学硕士,叙利亚虚拟大学,大马士革,叙利亚)

AI总结 提出一种结合TF-IDF、Sentence-BERT语义检索、交叉编码器重排序和可解释性生成的元数据驱动职位推荐系统,在LinkedIn数据集上达到高精度和可解释性。

Comments 11 pages, 5 figures, IEEE-style paper on semantic retrieval and explainable AI for intelligent job recommendation

详情
AI中文摘要

在线招聘平台需要能够从大量异构职位发布中检索相关机会的推荐方法。基于关键词的搜索高效且可解释,但当相同职位使用不同术语表达时,可能无法检索到相关发布。本研究提出了一种元数据驱动的职位推荐系统,结合了TF-IDF词汇匹配、Sentence-BERT语义检索、查询感知过滤、可选的交叉编码器重排序和解释生成。该系统利用结构化元数据字段,包括职位名称、公司名称、地点、资历级别、职位职能、雇佣类型和行业,而不依赖完整的职位描述或用户交互历史。在包含31262条记录的清理后的LinkedIn职位发布数据集上进行的实验表明,最佳混合配置实现了10个位置上的精确率为0.8032,nDCG@10为0.9496。在内部评估协议下,交叉编码器重排序将精确率@10从0.7896提高到0.7948,nDCG@10从0.9666提高到0.9739。这些发现表明,当仅有结构化元数据可用时,词汇和语义检索技术可以有效地结合,以提供可解释的职位推荐。

英文摘要

Online recruitment platforms require recommendation methods capable of retrieving relevant job opportunities from large and heterogeneous collections of job postings. Keyword-based search is efficient and interpretable, but it may fail to retrieve relevant postings when equivalent roles are expressed using different terminology. This study presents a metadata-driven job recommendation system that combines TF-IDF lexical matching, Sentence-BERT semantic retrieval, query-aware filtering, optional Cross-Encoder re-ranking, and explanation generation. The proposed system uses structured metadata fields (job title, company name, location, seniority level, job function, employment type, and industry) without relying on full job descriptions or user interaction histories. Experiments on a cleaned LinkedIn job posting dataset containing 31,262 records show that the best hybrid configuration achieved a Precision@10 of 0.8032 and an nDCG@10 of 0.9496. Under the internal evaluation protocol, Cross-Encoder re-ranking improved Precision@10 from 0.7896 to 0.7948 and nDCG@10 from 0.9666 to 0.9739. These findings indicate that lexical and semantic retrieval techniques can be effectively combined to provide explainable job recommendations when only structured metadata is available.
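
For readers comparing the two metrics quoted above, the sketch below computes Precision@k and nDCG@k from a ranked list of relevance judgments. It assumes binary or graded relevance values and normalizes nDCG against the best reordering of the returned list itself; the paper's exact evaluation protocol is not given in the abstract, so this is an illustration of the standard definitions rather than a reproduction.

```python
import math

def precision_at_k(relevance, k=10):
    """Fraction of the top-k results with relevance > 0."""
    return sum(1 for r in relevance[:k] if r > 0) / k

def dcg_at_k(relevance, k=10):
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k=10):
    """DCG divided by the DCG of the ideal ordering of the same list."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Hypothetical judgments for one query: 1 = relevant, 0 = not relevant.
ranked = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(precision_at_k(ranked), round(ndcg_at_k(ranked), 4))
```

Precision@k ignores order within the top k, while nDCG rewards placing relevant postings early, so a re-ranker can move nDCG without changing precision much.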

2605.27654 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

英译印地语中的文化保真度:性别可恢复性的保持-流畅性前沿

Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar

AI总结 研究英译印地语中性别信息的保持问题,提出两种推理时干预方法(SAR和PAR),在保持性别可恢复性与流畅性之间取得平衡。

Comments 10 pages, 2 figures, 9 tables

详情
AI中文摘要

生成式翻译系统是文化技术,因为它们决定如何在特定文化的语法系统中呈现具有社会意义的线索。我们研究成功文化翻译的一个具体概念:当英语源文本明确编码性别时,英译印地语应保持该线索的可恢复性,除非源文本本身存在歧义。我们在涵盖十二个类别的37,345个实例基准上评估了这一标准,并显示五个系统经常通过作格和敬语结构消除性别。然后,我们引入了两种机制感知的推理时干预。第一种是源感知重排序器(SAR),倾向于避免性别中立句法的候选。第二种是现象感知重排序器(PAR),即使在作格句法存在的情况下,也通过目标词汇标记保持性别。在GPT-4o-mini和Sarvam上,PAR将目标子集准确率分别从11.07%提高到54.47%,从15.99%提高到49.66%。人工评估显示,PAR将性别保持率从10.3%提高到81.3%,但平均流畅度从4.36降至3.37。这些发现将两种干预置于保持和流畅性的前沿,而不是支持单一的解决方案,并展示了文化定位的生成如何在保真度、流畅性和风格自然性之间需要明确的权衡。

英文摘要

Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

2605.27646 2026-05-28 cs.LG cs.AI 版本更新

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Hurwitz四元数乘法量化用于KV缓存压缩

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA(麻省理工学院) IBM Research, Cambridge, MA, USA(IBM研究院) University of Toronto, Toronto, Canada(多伦多大学)

AI总结 提出一种免校准的Hurwitz四元数乘法量化方法,通过将K/V的4元素块视为四元数并用量化乘积编码,在约5比特下匹配fp16困惑度,实现高达5.05倍KV缓存压缩。

详情
AI中文摘要

我们提出\textbf{Hurwitz四元数乘法量化(HQMQ)},一种用于大语言模型KV缓存压缩的\textbf{免校准}方法。HQMQ将K或V的每个4元素块视为一个四元数,并将其单位方向量化到乘积$q_p \cdot q_s$上,其中$q_p$取自24元素Hurwitz群$2T$($S^3$上24-cell的24个顶点,两两夹角$60^\circ$),$q_s$取自每个(层、头)的二级码本,包含$S$个\emph{随机}单位四元数。乘法组合在$S$个存储参数下产生$24S$个有效码字;随机初始化即可,因为左乘是$S^3$等距变换,因此种子码本在最终任务困惑度上的变化小于$1.5\%$。一个每批次的中位数乘数离群值提取步骤($C=3$,无校准)处理现代离群值密集型架构。我们在五个现代开源模型上评估:Mistral-7B(密集MHA)、Llama-3-8B和Qwen2.5-7B和Qwen3-8B(密集GQA),以及gpt-oss-20b(稀疏MoE)。在Mistral-7B和Qwen3-8B上,HQMQ在约5比特下匹配fp16,困惑度差异在$0.02$--$0.03$点内。在Qwen2.5-7B和Qwen3-8B上,朴素int4导致困惑度崩溃到$10^4+$,而HQMQ + Med3$\times$在约5比特下恢复fp16质量,差异在$0.02$--$0.10$点内。HQMQ在所有五个模型上,在相同比特数下帕累托优于朴素int $3$--$1900\times$,并且在Mistral上以3.79比特的下游零样本准确率匹配fp16。与最强的校准KV量化基线相比,HQMQ在3.79比特下匹配KIVI-4(约4.5比特),在CoQA上差异约1点,TruthfulQA上0.6点,GSM8K上2.3点,同时比特数减少16%且无需校准过程。在存储层面,HQMQ提供高达5.05倍的KV压缩,将Llama-3-70B的128k上下文缓存从43 GB缩小到8.5 GB。

英文摘要

We propose \textbf{Hurwitz Quaternion Multiplicative Quantization (HQMQ)}, a \textbf{calibration-free} method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the \emph{product} $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, pairwise angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ \emph{random} unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task ppl by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C{=}3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B and Qwen2.5-7B and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4{+}$ ppl, HQMQ + Med3$\times$ recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive int by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within ${\sim}1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, at $16\%$ fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
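
The product codebook described above can be sketched in a few lines: build the 24 Hurwitz units (the 8 signed basis quaternions plus the 16 half-integer quaternions), draw $S$ random unit quaternions, and pick the pair whose product is closest in angle to the chunk's direction. This is an illustrative sketch only. The brute-force search, the choice $S = 8$, and storing the chunk norm separately at full precision are assumptions; the outlier extraction step and the paper's actual bit packing are omitted.

```python
import itertools
import math
import random

def qmul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
            w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
            w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
            w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2)

def hurwitz_units():
    """The 24 unit Hurwitz quaternions: +-1, +-i, +-j, +-k and (+-1 +-i +-j +-k)/2."""
    units = []
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = [0.0, 0.0, 0.0, 0.0]
            q[axis] = sign
            units.append(tuple(q))
    units.extend(itertools.product((0.5, -0.5), repeat=4))
    return units

def random_unit(rng):
    """A random unit quaternion from a normalized Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def quantize(chunk, primary, secondary):
    """Return ((p_idx, s_idx), norm) for the product codeword nearest in angle."""
    norm = math.sqrt(sum(c * c for c in chunk))
    if norm == 0.0:
        return (0, 0), 0.0
    direction = [c / norm for c in chunk]
    best, best_dot = (0, 0), -2.0
    for i, p in enumerate(primary):
        for j, s in enumerate(secondary):
            dot = sum(d * c for d, c in zip(direction, qmul(p, s)))
            if dot > best_dot:
                best, best_dot = (i, j), dot
    return best, norm

def dequantize(indices, norm, primary, secondary):
    """Rebuild the 4-element chunk from the two indices and the stored norm."""
    i, j = indices
    return tuple(norm * c for c in qmul(primary[i], secondary[j]))

rng = random.Random(0)
primary = hurwitz_units()
secondary = [random_unit(rng) for _ in range(8)]  # S = 8 is an arbitrary choice
chunk = (0.3, -1.2, 0.5, 2.0)                      # hypothetical K/V chunk
indices, norm = quantize(chunk, primary, secondary)
approx = dequantize(indices, norm, primary, secondary)
print(indices, approx)
```

With $S = 8$ the search covers $24 \times 8 = 192$ codewords while only the 8 secondary quaternions need to be stored, since the 24 Hurwitz units are fixed constants.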

2605.27644 2026-05-28 cs.RO cs.AI cs.LG 版本更新

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Trinity:通过利用合成数据统一非结构化户外环境中的类无关地形与语义分割

Marcus G Müller, Wout Boerdijk, Maximilian Durner, Riccardo Giubilato, Abel Gawel, Wolfgang Stürzl, Roland Siegwart, Rudolph Triebel

发表机构 * Institute of Robotics and Mechatronics, German Aerospace Center (DLR)(机器人与机电系统研究所,德国航空航天中心(DLR)) Federal Institute of Technology Zurich (ETH Zurich)(苏黎世联邦理工学院(ETH Zurich)) Robotics and AI Institute (RAI)(机器人与人工智能研究所(RAI))

AI总结 提出基于Transformer的统一网络Trinity,联合执行类特定语义分割和类无关地形分割,利用合成数据集RUGDSynth和真实数据集EXTerra实现机器人无关的地形先验学习。

详情
AI中文摘要

地形理解对于在非结构化户外环境中运行的移动机器人至关重要。现有的基于视觉的可通行性估计方法依赖于机器人特定的标注或语义类别映射,限制了跨平台的迁移性,并在机器人能力变化时需要昂贵的重新标注,而标准的语义分割方法仅关注特定的预定义类别,无法捕捉地形的多样性。在这项工作中,我们提出了一种基于Transformer的架构,在统一网络Trinity中联合执行类特定语义分割和类无关地形分割。地形区域仅基于视觉外观进行分割,无需预定义的语义标签或机器人相关的可通行性分数。这种公式使得学习机器人无关的视觉地形先验成为可能,这些先验可以与机器人特定的经验相结合,用于下游任务,如可通行性估计、视觉里程计和任务规划。为了实现具有多样地形外观的大规模训练,我们扩展了OAISYS模拟器,并引入了RUGDSynth,这是一个受RUGD启发、包含类无关地形样本的合成数据集。此外,我们提出了EXTerra数据集,提供了带有类特定和类无关地形标签的真实世界图像。实验证明了所提出任务的可行性以及我们的联合分割方法在复杂户外环境中的有效性。代码和数据集将在本出版物发布后(经过审查)公开。

英文摘要

Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability estimation methods rely on robot-specific annotations or semantic class mappings, limiting transferability across platforms and requiring costly re-annotation when robot capabilities change, while standard semantic segmentation methods only focus on specific predefined classes, which do not capture the variety of terrains. In this work, we propose a transformer-based architecture that jointly performs class-specific semantic segmentation and class-agnostic terrain segmentation within a unified network, called Trinity. Terrain regions are segmented based solely on visual appearance, without predefined semantic labels or robot-dependent traversability scores. This formulation enables the learning of robot-agnostic visual terrain priors that can be combined with robot-specific experience for downstream tasks such as traversability estimation, visual odometry, and mission planning. To enable large-scale training with diverse terrain appearances, we extend the OAISYS simulator and introduce RUGDSynth, a synthetic dataset inspired by RUGD with class-agnostic terrain samples. Furthermore, we present the EXTerra Dataset, providing real-world images annotated with both class-specific and class-agnostic terrain labels. Experiments demonstrate the feasibility of the proposed task and the effectiveness of our joint segmentation approach in complex outdoor environments. Code and datasets will be released with this publication (after review).

2605.27622 2026-05-28 cs.AI cs.SC 版本更新

Reasoning and Planning with Dynamically Changing Norms

动态变化规范的推理与规划

Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus

发表机构 * University of Iowa(爱荷华大学) Northwestern University(西北大学)

AI总结 本文提出一种在人类-AI环境中使用动态变化规范引导规划的方法,通过可废止演算解决规范冲突并将规范作为规划护栏,理论证明与对话任务实验验证了有效性。

Comments 8 pages, 1 figure, dataset included in anc

详情
AI中文摘要

为了安全地与人类交互,AI 智能体必须既了解我们的规范,又在规划时考虑它们。然而,这种规范引导的规划在人工智能体社区内研究较少,且忽略了规范的动态性。本文提出了一种在人类-AI 环境中使用动态变化规范引导规划的方法。我们贡献了一种用于解决规范冲突的可废止演算,以及一种使用这种动态变化规范作为规划护栏的方法。我们通过形式化证明在理论上展示了该方法,并通过 AI 智能体 SocialBot 在自然语言对话任务上进行了实证验证。

英文摘要

To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm-guided planning has received little attention, has been studied only within communities of artificial agents, and has ignored the dynamic nature of norms. This paper instead presents an approach to guiding planning with dynamically changing norms in a human-AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach to using such dynamically changing norms as guard rails on plans. We support the approach theoretically with formal proofs and empirically with an AI agent, SocialBot, on a natural language dialogue task.

2605.27619 2026-05-28 cs.LG cs.AI 版本更新

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

基于最优传输和依赖性最大化的有监督分布约简

Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell

发表机构 * digiLab, UK(digilab英国实验室) University of Bristol, UK(布里斯托大学)

AI总结 提出有监督分布约简(SDR)算法,通过结合最优传输和显式依赖性最大化,学习同时保留数据几何结构和目标相关信号的紧凑表示。

详情
AI中文摘要

学习同时捕捉内在数据几何结构和目标相关结构的表示仍然是一个基本挑战,特别是在数据约简必须在压缩与预测保真度之间取得平衡的场景中。虽然分布约简(包括联合聚类和降维)提供了一种原则性的数据总结方法,但其有监督变体仍然相对未被充分探索,尽管保留任务相关信号对于下游预测和决策至关重要。我们提出有监督分布约简(SDR),一种通过结合最优传输和显式依赖性最大化来学习目标感知表示的算法。SDR 基于融合 Gromov-Wasserstein(FGW)目标,将输入分布的关系结构与一组代表点对齐,同时增加一个直接依赖性项,鼓励学习到的嵌入更明确地捕捉预测信号。这产生了反映几何结构和监督的紧凑表示。除了表示学习,SDR 自然地诱导出一种数据依赖的非平稳几何结构,可用于高斯过程(GP)建模等场景。通过目标感知的分布对齐重新定义距离,SDR 能够构建适应数据几何和监督局部变化的自适应核,为非平稳核设计提供了基于最优传输的视角。

英文摘要

Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particularly in settings where data reduction must balance compression with predictive fidelity. While distributional reduction, which encompasses joint clustering and dimensionality reduction, offers a principled way to summarize data, its supervised variants remain relatively under-explored, despite the importance of retaining task-relevant signal for downstream prediction and decision-making. We propose Supervised Distributional Reduction (SDR), an algorithm for learning target-aware representations by combining optimal transport with explicit dependence maximization. SDR builds on the Fused Gromov-Wasserstein (FGW) objective to align the relational structure of the input distribution with a set of representative points, while augmenting it with a direct dependence term that encourages the learned embeddings to capture predictive signal more explicitly. This results in compact representations that reflect both geometric structure and supervision. Beyond representation learning, SDR naturally induces a data-dependent, non-stationary geometry that can be leveraged for settings such as Gaussian Process (GP) modelling. By redefining distances through target-aware distributional alignment, SDR enables the construction of adaptive kernels that respond to local variations in both data geometry and supervision, offering an optimal transport-based perspective on non-stationary kernel design.

2605.27616 2026-05-28 cs.CV cs.AI 版本更新

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

并非所有 NVFP4 QAT 配方都相同:架构和规模如何影响异常分割的模型质量

Zijian Du, Oleg Rybakov

发表机构 * NVIDIA

AI总结 本研究通过统一协议评估多种架构、规模和 FP4 量化感知训练 (QAT) 配方在脑肿瘤异常分割任务中的交互作用,发现架构选择对量化鲁棒性影响最大,注意力机制架构对配方选择具有显著韧性,而 CNN 在大规模下受梯度量化配方影响性能下降。

详情
Journal ref
CVPR2026
AI中文摘要

实时异常分割要求高召回率和高效的低精度推理。我们研究了模型架构、模型规模和 FP4 量化感知训练 (QAT) 配方在召回关键的脑肿瘤分割任务中的三方交互,在统一协议下评估了多种架构、规模和 QAT 配方。我们发现架构选择对量化鲁棒性影响最大,基于注意力的架构对配方选择表现出显著的韧性,而 CNN 在大规模下在梯度量化配方下性能下降。在低容量下,FP4 可能离散化 softmax 注意力,但高级 QAT 配方可防止这种崩溃。在更大规模下,高级配方减轻了降低 CNN 质量的梯度量化噪声。五折患者级交叉验证证实这些发现对数据划分具有鲁棒性。我们的结果表明,Swin Transformer 在所有规模下对 QAT 配方选择都具有鲁棒性,使其成为 FP4 量化异常分割的推荐架构。

英文摘要

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNNs degrade under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate the gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms that these findings are robust to the data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.

2605.27610 2026-05-28 cs.IR cs.AI cs.HC 版本更新

Eliot: Interactively $\underline{E}$xploring Fast-Changing Scientific $\underline{Li}$terature Trends with $\underline{O}$nline Da$\underline{t}$a and Learning

Eliot: 通过在线数据和学习交互式探索快速变化的科学文献趋势

Bernardo A. Denkvitts, Nitin Gupta, Biplav Srivastava

发表机构 * University of South Carolina(南卡罗来纳大学)

AI总结 提出Eliot系统,通过查询时聚类和时间可视化,帮助研究人员可追溯地探索快速变化的科学文献趋势。

Comments Under-review at CIKM Applied Research 2026

详情
AI中文摘要

科学出版的快速增长使得追踪快速变化领域的演变变得越来越困难。搜索引擎和基于LLM的助手检索或总结论文,但往往隐藏了语料库是如何被选择、组织或与时间模式关联的。我们提出了$\texttt{Eliot}$,一个公开部署的交互式系统,用于可追溯地探索不断演变的科学文献。受两项关于大语言模型(LLMs)和自动规划与调度(APS)研究的启发,$\texttt{Eliot}$将文献演变分析推广到超越手工构建的分类法和特定领域脚本。给定明确的查询词和过滤器,它在查询时检索arXiv论文,通过标题和摘要表示每篇论文,将语料库聚类为主题,分配代表性关键词,并可视化每个聚类的出版年份分布。我们将$\texttt{Eliot}$评估为一个应用系统和一个交互式研究辅助工具。跨八个arXiv领域的离线配置研究使用内在聚类和主题连贯性指标比较了文档表示、降维方法和聚类算法;结果支持MiniLM嵌入结合10维UMAP和凝聚聚类作为实用默认设置。一项基于场景的调查和专家焦点小组评估了可解释性和使用情境:参与者在85%的场景响应中认为聚类标签有意义,反馈表明$\texttt{Eliot}$对于快速变化技术领域的可审计概述最有价值。这些结果表明,查询时聚类和时间检查可以通过帮助研究人员检查和提炼文献趋势背后的证据来补充搜索和生成工具。

英文摘要

The rapid growth of scientific publishing has made it increasingly difficult to track how fast-moving areas evolve. Search engines and LLM-based assistants retrieve or summarize papers, but often hide how the corpus was selected, organized, or connected to temporal patterns. We present $\texttt{Eliot}$, a publicly deployed interactive system for traceable exploration of evolving scientific literature. Motivated by two studies on Large Language Models (LLMs) and Automated Planning and Scheduling (APS), $\texttt{Eliot}$ generalizes literature-evolution analysis beyond hand-built taxonomies and domain-specific scripts. Given explicit query terms and filters, it retrieves arXiv papers at query time, represents each paper by title and abstract, clusters the corpus into themes, assigns representative keywords, and visualizes each cluster's publication-year distribution. We evaluate $\texttt{Eliot}$ as both an applied system and an interactive research aid. An offline configuration study across eight arXiv domains compares document representations, dimensionality reduction methods, and clustering algorithms using intrinsic clustering and topic-coherence metrics; the results support MiniLM embeddings with 10-dimensional UMAP and Agglomerative Clustering as a practical default. A scenario-based survey and expert focus group assess interpretability and use contexts: participants rated cluster labels as meaningful in 85% of scenario responses, and feedback indicated that $\texttt{Eliot}$ is most valuable for auditable overviews of rapidly changing technical areas. These results suggest that query-time clustering and temporal inspection can complement search and generation tools by helping researchers inspect and refine the evidence behind literature trends.

2605.27605 2026-05-28 cs.AI cs.SE 版本更新

Laguna M.1/XS.2 Technical Report

Laguna M.1/XS.2 技术报告

Julien Abadji, Marah Abdin, Connor Adams, Eric Alcaide, Mustafa Altun, Michele Artoni, Junze Bao, Uday Barar, Vassilis Bekiaris, Arkadii Bessonov, Benjamin Bütikofer, Jonathan Chang, Yen-Chun Chen, Dmitry Chernenkov, Yang Chi, Filippos Christianos, Fenia Christopoulou, Razvan-Andrei Ciocoiu, Tzachi Cohen, Yohann Coppel, Dmitrii Emelianenko, Brandon Fergerson, Brian Fitzgerald, Matthias Gallé, Alex Golonzovskyi, George Grigorev, Yiyang Hao, Christian Hensel, Jan Huenermann, Ye Ji, Sarthak Joshi, Eiso Kant, Kabir Khandpur, Seonghyeon Kim, Vladimir Kirichenko, Umut Kocasarac, Ilya Kochik, Ivan Komarov, Chaerin Kong, Anurag Koul, François-Joseph Lacroix, Sergei Laktionov, Waren Long, Quentin Malartic, Vadim Markovtsev, Afonso Marques, Robert McHardy, Carlos Mocholí, Dmitry Monakhov, Adam Morris, Martin Muller, Christian Mürtz, Robin Nabel, Thien Nguyen, Rok Novosel, Szymon Ozog, Aalhad Patankar, Aleksei Petrov, Alexandre Piché, Arthur Pignet, Teodor Poncu, Phil Potter, Alexander Rakowski, Pierre-Yves Ritschard, Jay Roberts, Joe Rowell, Piotr Sarna, Pierre-André Savalle, Uladzislau Sazanovich, Nikita Shapovalov, Arsenii Shevchenko, Mikhail Shilkov, Andrei Sokol, Mohamed Soliman, Jack Stephenson, Victor Storchan, Dragos-Constantin Tantaru, Artem Tyurin, Adrian Wälchli, Pengming Wang, Jianxiao Yang, Renat Zayashnikov, Alexander Zelenka Martin, Nikolay Zinov, Caroline Bercier, José Caldeira, Margarida Garcia, Tom George, Kabeer Gharzai, Glenn Hitchcock, Carson Klingenberg, Ivo Pinto, Varun Randery, Noah Smith, Arina Sugako, Jason Warner

发表机构 * Poolside Team(Poolside团队)

AI总结 本文介绍了两个用于长周期自主编码的混合专家基础模型 Laguna M.1 和 XS.2,通过端到端训练和模型工厂系统,在软件工程基准测试中达到先进水平。

Comments Technical report to models released here: https://poolside.ai/blog/introducing-laguna-xs2-m1

详情
AI中文摘要

我们介绍了 Laguna M.1 和 Laguna XS.2,两个为长周期自主编码构建的混合专家基础模型:M.1 总参数量为 2258 亿(每 token 激活 234 亿),XS.2 总参数量为 334 亿(每 token 激活 30 亿)。两个模型均在我们称为模型工厂的内部系统中从头到尾端到端训练:这是一个紧密集成的版本化数据、训练、评估和推理组件栈,将模型开发转变为工业流程。我们描述了模型工厂的原理和设计选择,并详细介绍了模型的端到端训练过程,包括预训练数据和架构、后训练阶段、评估和量化。在自主软件工程和终端基准测试(SWE-bench Verified、SWE-bench Multilingual、SWE-Bench Pro 和 Terminal-Bench 2.0)上,M.1 和 XS.2 在其各自的权重级别中与最先进的开源模型具有竞争力。Laguna XS.2 权重在 Apache 2.0 许可下发布,地址为 https://huggingface.co/collections/poolside/laguna-xs2。

英文摘要

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has $225.8$B total parameters ($23.4$B activated per token) and XS.2 has $33.4$B total ($3$B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, throughout pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

2605.27595 2026-05-28 cs.CV cs.AI 版本更新

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

多模态大语言模型在农业图像解释与生成任务中的幻觉行为

Partho Ghose, Al Bashir, Prem Raj, Azlan Zahid

发表机构 * Texas A&M University System(德克萨斯农工大学系统)

AI总结 本研究系统评估了多模态大语言模型在农业图像解释(图像到文本)和生成(文本到图像)任务中的幻觉行为,发现模型存在生物不一致、上下文不准确和农学不合理等错误模式,并通过少样本提示等方法分析了幻觉的残留影响。

详情
AI中文摘要

大型语言模型(LLMs)正迅速被应用于农业成像领域,从作物解释到合成田间图像生成。然而,这些模型经常表现出看似自信但偏离生物或环境现实的幻觉输出,可能导致错误的农学见解。本研究从两个互补方向调查此类幻觉:图像到文本,即LLMs解释作物或田间图像以描述生物和非生物胁迫等条件;以及文本到图像,即模型基于描述性提示生成合成农业场景。我们检查涉及生物不一致、上下文不准确和农学不合理的错误,并在多个成像模态下根据领域知情标准评估输出。我们的分析识别了解释性和生成性任务中反复出现的幻觉模式。在图像解释中,LLMs(例如Gemma、LLAVA、Qwen和MiniCPM)实现了适度的零样本准确率(63%至75%),而少样本提示将性能提升至高达86.8%,但仍表现出虚假检测和漏检感染,表明存在残留幻觉效应。在文本到图像任务中,高级模型如GPT-5和Gemini 2.5 Flash在宽松提示约束下生成高达91%的生物不一致场景,揭示了当前LLMs的根本弱点。这种对视觉推理和生成的系统评估为增强基于LLM的农业成像平台的可靠性和可信度提供了关键见解。

英文摘要

Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations, that is, outputs that appear confident yet deviate from biological or environmental reality, potentially leading to misinformed agronomic insights. This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities. Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLAVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent), whereas few-shot prompting improved performance up to 86.8 percent while still exhibiting false detections and missed infections, indicating residual hallucination effects. In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.

2605.27593 2026-05-28 cs.AI cs.MA 版本更新

Voluntary Collusion with Secret Tools in Competing LLM Agents

竞争性LLM代理中使用秘密工具的自愿合谋

Xijie Zeng, Frank Rudzicz

发表机构 * Dalhousie University(达尔豪斯大学) Vector Institute for Artificial Intelligence(人工智能向量研究所)

AI总结 本研究通过两个多智能体环境(Liar's Bar和Cleanup)发现,即使工具被明确标注为不公平且有害,大多数LLM代理仍会自愿采用秘密合谋工具以获取战略优势,且仅靠对齐或公平标签无法有效阻止,需明确防护措施。

详情
AI中文摘要

即使工具被明确描述为对他人不公平且有害,表面上经过安全对齐的LLM代理仍然会在这样做能带来战略优势时自愿参与秘密合谋。为了研究这一现象,我们引入了一个基于两个战略多智能体环境的实证框架:Liar's Bar(一个竞争性欺骗场景)和Cleanup(一个混合动机资源管理场景),其中代理被提供秘密合谋工具,这些工具在明显不利于其他代理的同时提供了显著优势。在12个模型(7B、70B和专有规模)和6种提示变体中,我们发现大多数代理一致地接受这些工具并制定合谋策略,同时在接受前明确承认工具的不公平性。我们进一步表明,无论是公平标签还是基线对齐都无法可靠地阻止合谋:只有明确的伦理框架能减少采用,即使如此,较小的模型仍然容易受到影响。更广泛地说,我们的工作首次系统性地研究了基于LLM的多智能体系统中自愿合谋采用的问题,并表明防止此类行为需要明确的防护措施,而非依赖通用对齐。

英文摘要

Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

2605.27584 2026-05-28 cs.AI cs.SI 版本更新

Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

社交媒体上的网络暴力治理:从内容识别到干预的统一框架

Yiting Huang, Wenting Zhu, Zekun Wang, Qingpo Yang, Yakai Chen, Zihui Xu, Yueyue Zhang, Sanchuan Guo, Xi Zhang

发表机构 * School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络安全学院)

AI总结 本文提出一个涵盖内容识别、用户行为建模、扩散动态与早期预警、干预治理四阶段的统一全生命周期治理框架,以解决网络暴力被动、孤立检测的局限,实现主动、持续、综合的治理。

详情
AI中文摘要

社交媒体平台和在线社区的激增无意中催化了网络暴力、仇恨言论和其他形式的在线毒性传播,使得有效治理此类危害成为关键的社会和计算挑战。尽管在自动化内容审核方面取得了显著进展,但现有研究主要将网络暴力治理视为被动、孤立的帖子级检测。这种还原论观点忽视了用户持续的行为动态、毒性事件的结构性扩散以及主动缓解的关键需求。为弥补这些差距,本文提出一个统一的全生命周期治理框架,将网络暴力治理的范式从孤立的静态检测转向集成、持续和主动的审核。借鉴网络暴力研究及相邻领域,我们系统地综合了四个相互关联阶段的最新文献:(1)内容识别,(2)用户与行为建模,(3)扩散动态与早期预警,以及(4)干预与治理。此外,我们回顾了可用的数据集和评估实践,并讨论了新兴挑战,包括多模态性、可解释性、算法公平性以及生成式AI的双重使用风险,为未来研究提供了路线图,以构建更安全、更具韧性的数字生态系统。

英文摘要

The proliferation of social media platforms and online communities has inadvertently catalyzed the spread of cyberbullying, hate speech, and other forms of online toxicity, making the effective governance of such harm a critical societal and computational challenge. While significant strides have been made in automating content moderation, existing research predominantly treats cyberbullying governance as passive, isolated detection at the post level. This reductionist view overlooks the continuous behavioral dynamics of users, the structural diffusion of toxic events, and the critical need for proactive mitigation. To bridge these gaps, this paper proposes a unified full-lifecycle governance framework that shifts the paradigm of cyberbullying governance from isolated static detection toward integrated, continuous, and proactive moderation. Drawing on cyberbullying research and adjacent fields, we systematically synthesize the state-of-the-art literature across four interconnected stages: (1) Content Identification, (2) User and Behavior Modeling, (3) Diffusion Dynamics and Early Warning, and (4) Intervention and Governance. Furthermore, we review available datasets and evaluation practices, and discuss emerging challenges including multimodality, explainability, algorithmic fairness, and the dual-use risks of generative AI, providing a roadmap for future research toward a safer and more resilient digital ecosystem.

2605.27571 2026-05-28 cs.AI cs.CL cs.DB 版本更新

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

实时分析发现代理:迈向主动洞察系统

Gaetano Rossiello, Dharmashankar Subramanian

发表机构 * IBM

AI总结 提出一种多智能体架构,通过持续发现循环(假设生成、编译、验证、可视化)实现实时数据流的自主洞察发现,支持从查询驱动向主动发现的范式转变。

Comments Accepted at Supporting Our AI Overlords (SAO) at the ACM Conference on AI and Agentic Systems (CAIS), May 26, 2026, San Jose, CA, USA

详情
AI中文摘要

现代分析系统本质上是反应式的,要求用户在日益复杂且持续演变的数据上定义查询。在实时流式环境中,这种范式失效,因为潜在洞察的空间变得太大而无法手动枚举。我们提出了一种用于实时数据流自主洞察发现的多智能体架构。该系统实现了一个持续发现循环,其中智能体生成假设,将其编译为可执行分析,验证生成的工件,并生成可视化和可部署的应用程序。该架构利用Apache Kafka进行事件驱动协调,Apache Flink进行流处理,以及大型语言模型来实现专门的智能体。一个关键贡献是基于类型化中间工件的契约驱动设计,实现了模块化、可观测性、血统以及更安全地执行动态生成的分析。通过零售、金融和公共数据中的用例,我们展示了该架构如何支持从查询驱动分析向主动发现驱动系统的转变。

英文摘要

Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

2605.27570 2026-05-28 cs.AI 版本更新

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

LaneRoPE: 用于协同并行推理与生成的位置编码

Gabriele Cesa, Thomas Hehn, Aleix Torres-Camps, Àlex Batlle Casellas, Jordi Ros-Giralt, Arash Behboodi, Tribhuvanesh Orekondy

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 提出LaneRoPE方法,通过序列间注意力掩码和扩展的RoPE位置编码,使多个序列在生成时协同合作,提升数学推理任务在有限生成长度下的准确性。

详情
AI中文摘要

并行LLM测试时扩展技术(例如best-of-$N$)需要根据相同输入提示生成$N>1$个序列。这些方法在利用批处理$N$个生成的计算效率的同时提高了准确性。然而,传统上批次中的每个序列是独立生成的,因此不会重用其他序列的中间生成、计算或观察结果。在本文中,我们提出LaneRoPE,以在生成时实现$N>1$个序列之间的协调与协作。LaneRoPE包含两个关键思想:(a) 一个序列间注意力掩码,使序列的采样相互依赖;(b) 一个RoPE扩展,注入位置信息,捕获特定序列内部和外部的标记之间的相对位置。我们在数学推理任务上评估了我们的方法,并发现了有希望的结果:LaneRoPE实现了序列间的协作,在有限的生成长度下带来了额外的准确性提升。重要的是,由于LaneRoPE在底层LLM架构上只需最小改动,并且在推理时引入的开销可以忽略不计,因此它对于将并行推理快速集成到现有LLM推理流水线中具有吸引力。

英文摘要

Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences. In this paper, we propose LaneRoPE to enable coordination and collaboration among $N>1$ sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and (b) a RoPE extension that injects positional information that captures relative positions between tokens, both within and outside a particular sequence. We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under limited generated sequence length. Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible overhead at inference time, it is appealing to rapidly incorporate parallel reasoning into existing LLM inference pipelines.
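
As a rough illustration of idea (a), the sketch below builds one possible inter-sequence attention mask for N parallel lanes that decode in lockstep. The visibility rule is an assumption made for illustration, not the rule from the paper: a token sees its own lane causally, and sees other lanes only at strictly earlier decoding steps, since tokens sampled in the same step cannot condition on each other. The shared prompt and the RoPE extension of idea (b) are left out, and `inter_sequence_mask` is a hypothetical name.

```python
def inter_sequence_mask(n_lanes: int, n_steps: int):
    """Return (tokens, mask) for n_lanes sequences decoded in lockstep.

    tokens is a list of (lane, step) pairs in generation order.
    mask[q][k] is True when query token q may attend to key token k.
    Assumed rule (illustrative only):
      - same lane: ordinary causal attention (key step <= query step)
      - other lane: only strictly earlier steps (key step < query step)
    """
    tokens = [(lane, step) for step in range(n_steps) for lane in range(n_lanes)]
    mask = [
        [
            (k_lane == q_lane and k_step <= q_step)
            or (k_lane != q_lane and k_step < q_step)
            for (k_lane, k_step) in tokens
        ]
        for (q_lane, q_step) in tokens
    ]
    return tokens, mask


if __name__ == "__main__":
    tokens, mask = inter_sequence_mask(n_lanes=2, n_steps=2)
    for token, row in zip(tokens, mask):
        print(token, ["x" if visible else "." for visible in row])
```

With independent best-of-$N$ sampling, the cross-lane entries would all be False; turning some of them on is what makes the lanes depend on one another.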

2605.27567 2026-05-28 cs.AI cs.CL 版本更新

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

为什么LLM在因果发现中失败以及干预代理如何逃脱

Amartya Roy, Sonali Parbhoo

发表机构 * SIRE, IIT Delhi(IIT德里智能研究机构) Robert Bosch GmbH(罗伯特·博世有限公司) Imperial College London(伦敦帝国理工学院)

AI总结 本文证明大型语言模型在因果发现中存在根本性失败,并提出一种基于干预代理的因果贝叶斯优化方法(A-CBO),通过外部贝叶斯循环在无需模型微调的情况下实现可证明的收敛。

Comments 9 pages, 3 figures

详情
AI中文摘要

因果发现是科学推理的基石,但大型语言模型能否可靠地执行因果发现仍是一个悬而未决的问题。最近的基准测试表明,即使是微调后的模型在简单因果图上也会达到平台期,并随着复杂度增加而退化,但失败的原因尚未明确。我们证明这种失败是根本性的:监督微调、直接偏好优化和上下文学习都会产生无法区分生成相似观测数据的因果图的预测器,任何这样做的尝试都需要模型的内部表示无限增长,从而违反了这些方法工作的条件。我们将其形式化为核障碍定理,确立该限制是学习范式固有的,而非任何特定模型或数据集。我们提出了代理因果贝叶斯优化(A-CBO),其中冻结的语言模型作为干预预言机,回答关于干预效果的目标查询,而外部贝叶斯循环在对数轮次内将信念集中在候选因果图上。由于决策在障碍适用的空间之外运行,A-CBO在底层模型保持不变的情况下可证明收敛。在Corr2Cause上,A-CBO无需任何训练即可匹配微调基线。在Extended Corr2Cause(一个扩展到24个变量、包含18K测试样本的新基准)上,A-CBO显著优于微调和偏好优化,且优势不断扩大。

英文摘要

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, \emph{not any particular model or dataset}. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

2605.27566 2026-05-28 cs.AI 版本更新

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

DynaSchedBench: 基于LLM的调度代理中的校准动态调度基准与可观测性悖论

Shijie Cao, Yuan Yuan, Jing Liu

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing 100191, China(北航计算机科学与工程学院) Shenzhen Loop Area Institute, Shenzhen, China(深圳环城院) Qingdao Research Institute, Beihang University(北航青岛研究院) Hangzhou Innovation Institute, Beihang University(北航杭州创新院) School of Artificial Intelligence, Xidian University, Xi'an 710071, Shaanxi, China(西电人工智能学院) Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, Guangdong, China(西电广州技术院)

AI总结 针对动态柔性作业车间调度问题(DFJSP),提出DynaSchedBench诊断框架,通过顺序事件空间校准器(SESC)计算调度压力指数(SSI)对实例进行难度分层,并揭示LLM调度代理中的“可观测性悖论”:完整结构信息反而降低性能。

详情
AI中文摘要

目前,针对动态柔性作业车间调度问题(DFJSP)的神经组合优化进展受到方法论上的张力阻碍:静态基准鼓励基准过拟合,而未校准的生成器则用随机噪声掩盖算法能力。为解决这一问题,我们引入了\textbf{DynaSchedBench},一个用于DFJSP的诊断框架,该框架严格控制实例生成过程。我们的方法不依赖参数采样,而是利用顺序事件空间校准器(SESC)计算一种新颖的调度压力指数(SSI),以按难度对实例进行分层。我们证明,SESC在计算效率上显著优于进化基线,同时可靠地收敛到目标指标。该框架集成了用于实例生成、基于快照的模拟、代理、评估和可视化的模块化组件,从而能够对反应式和前瞻式策略进行严格测试。利用这个校准环境,我们识别了基于LLM的调度代理的关键局限性。具体而言,在动态调度的逐步在线决策中,我们发现了一个“可观测性悖论”:向代理提供完整结构信息的oracle访问权限会降低策略性能,其表现不如简洁信息。此外,尽管存在大量的token开销,工具增强和细化策略未能可靠地提高性能,并且大多数LLM代理无法持续超越强大的调度基线——其行为更像是鲁棒的启发式近似器,而非优越的优化器。

英文摘要

Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce \textbf{DynaSchedBench}, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an "Observability Paradox": providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

2605.27564 2026-05-28 cs.CL cs.AI cs.LG 版本更新

The Future of Facts: Tracing the Factual Generation-Verification Gap

事实的未来:追踪事实生成-验证差距

Tim R. Davidson, Anja Surina, Caglar Gulcehre

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 本文通过训练阶段分析,发现语言模型在事实知识上存在生成-验证差距,验证能力先于生成能力习得且更稳健,事实更新可能导致模型处于“多宇宙”状态。

Comments Code for this project is available at https://github.com/anjasurina/factgap and a blog post at https://www.trdavidson.com/fact-gap

详情
AI中文摘要

语言模型正成为事实知识的默认接口,但它们验证输出的能力往往比生成输出的能力更可靠。这种生成-验证差距(GV-gap)是近期自我改进和推理中许多进展的基础,但其在事实知识上的动态仍未被充分理解。我们聚焦于事实性GV-gap背后的训练机制,将其与计算和美学方面的对应物区分开来。我们通过四个开源模型家族(每个家族两个规模)的三个训练阶段(获取、持续学习和更新)追踪生成和验证能力。三个发现跨模型重复出现:(i)验证始终先于生成被学习;(ii)验证比生成对持续学习更稳健;(iii)事实更新可能使模型处于“多宇宙”状态,同时验证新旧答案均为正确。对前沿模型的自然实验在大规模上重现了这些动态,并揭示了在充分覆盖的事实上残留的验证偏差。

英文摘要

Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi-verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

2605.27563 2026-05-28 math.PR cs.AI stat.ML 版本更新

On the Subgaussianity of Quantized Linear Maps: An AI-Assisted Note

关于量化线性映射的次高斯性:一份AI辅助笔记

Guangyi Zou, Roman Vershynin

发表机构 * Department of Mathematics, University of California, Irvine(加州大学尔湾分校数学系)

AI总结 本文通过Gemini 3.5 Flash发现了一个与维度无关的次高斯集中界,适用于高斯向量在坐标非线性映射下的情况,并应用于回答Simone Bombari关于符号量化线性映射的问题。

Comments 4 pages

详情
AI中文摘要

这份简短的笔记给出了高斯向量在坐标非线性映射下与维度无关的次高斯集中界。该结果由Gemini 3.5 Flash发现,适用于任何在良态协方差下的有界函数。我们应用这一工具回答了Simone Bombari关于符号量化线性映射$Y = \text{sgn}(Wx)$的问题。

英文摘要

This short note presents a dimension-independent subgaussian concentration bound for Gaussian vectors under coordinate-wise nonlinear mappings. Discovered by Gemini 3.5 Flash, this result applies to any bounded function under a well-conditioned covariance. We apply this tool to answer a question of Simone Bombari on sign-quantized linear maps $Y = \text{sgn}(Wx)$.

2605.27561 2026-05-28 cs.CV cs.AI 版本更新

Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Melanoscope AI移动皮肤镜临床决策支持系统的临床验证

Elena Sergeevna Kozachok, Sergey Sergeevich Seregin

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences(俄罗斯科学院伊万尼科夫系统编程研究所) Orel Regional Oncology Dispensary(奥廖尔地区肿瘤专科医院)

AI总结 本研究提出了一种级联深度学习模型的定量可解释性评估方法和三区患者分流算法,并在俄罗斯门诊实践中对Melanoscope AI CDSS进行了前瞻性单中心临床验证,结果显示无假阴性且特异性为88.3%。

Comments 24 pages, 6 figures, 5 tables, 21 references

详情
AI中文摘要

引言:恶性皮肤病变的早期检测对预后至关重要,但俄罗斯地区皮肤科医生短缺限制了筛查覆盖。移动皮肤镜临床决策支持系统(CDSS)提供了一种有前景的方法,但模型可解释性和标准化患者分流仍是采用的关键障碍。目的:开发一种级联深度学习模型的定量可解释性评估方法和三区患者分流算法,并在俄罗斯门诊实践中对Melanoscope AI CDSS进行初步的单中心前瞻性临床验证。材料与方法:皮肤镜图像的两阶段级联分类;注意力图可视化(ViT和Swin使用注意力展开;ConvNeXt和EfficientNetV2使用Grad-CAM);激活图与专家标注之间基于IoU的定量一致性评估;在四次“黑色素瘤日”活动(俄罗斯奥廖尔,2025年6月至2026年4月)中进行前瞻性单中心验证。结果:在176名患者中:与专家评估一致率为88.6%;5例恶性病变中无假阴性(95% CI: 47.8-100.0%);特异性为88.3%。组织学证实了3例黑色素瘤和2例基底细胞癌;6例发育不良痣被纳入随访。平均IoU(n=180):ViT - 0.69;Swin - 0.64;ConvNeXt - 0.53;EfficientNetV2 - 0.51。分流阈值:P<0.15 / 0.15-0.50 / >=0.50。结论:未观察到假阴性;特异性为88.3%,支持筛查应用。集成的级联分类、带IoU评估的注意力图可视化和三区分流提供了可重复、可解释的临床决策支持,可适应不同资源水平。

英文摘要

Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit screening coverage. Mobile dermoscopy clinical decision support systems (CDSS) offer a promising approach, with model interpretability and standardised patient routing remaining key barriers to adoption. Aim. To develop a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and to conduct a preliminary single-centre prospective clinical validation of the Melanoscope AI CDSS in Russian outpatient practice. Material and methods. Two-stage cascade classification of dermoscopic images; attention map visualisation (attention rollout for ViT and Swin; Grad-CAM for ConvNeXt and EfficientNetV2); quantitative IoU-based agreement assessment between activation maps and expert annotations; prospective single-centre validation across four "Melanoma Day" sessions (Orel, Russia, June 2025 - April 2026). Results. On 176 patients: agreement with expert assessment 88.6%; no false negatives among 5 malignant lesions (95% CI: 47.8-100.0%); specificity 88.3%. Three melanomas and two basal cell carcinomas were histologically confirmed; six dysplastic naevi placed under follow-up. Mean IoU (n=180): ViT - 0.69; Swin - 0.64; ConvNeXt - 0.53; EfficientNetV2 - 0.51. Routing thresholds: P<0.15 / 0.15-0.50 / >=0.50. Conclusion. No false negatives were observed; specificity was 88.3%, supporting screening use. The integrated cascade classification, attention map visualisation with IoU assessment, and three-zone routing provide reproducible, interpretable clinical decision support adaptable to varying resource levels.
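
The routing thresholds, the IoU agreement measure, and the confidence interval quoted above can be made concrete with a small sketch. The thresholds (0.15 and 0.50) come from the abstract; the zone names, the function names, and the choice of an exact (Clopper-Pearson) binomial bound are assumptions made for illustration. The exact bound happens to reproduce the quoted 47.8% lower limit for 5 of 5 detections, which is consistent with, but does not confirm, the method the authors used.

```python
from bisect import bisect_right

# Thresholds from the abstract: P<0.15 / 0.15-0.50 / >=0.50.
ZONE_EDGES = (0.15, 0.50)
# Hypothetical zone labels; the abstract does not name the zones.
ZONE_NAMES = ("low", "intermediate", "high")


def route(p_malignant: float) -> str:
    """Map a predicted malignancy probability to one of three zones."""
    if not 0.0 <= p_malignant <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return ZONE_NAMES[bisect_right(ZONE_EDGES, p_malignant)]


def iou(region_a: set, region_b: set) -> float:
    """Intersection over union of two pixel sets, e.g. a thresholded
    activation map and an expert annotation."""
    union = region_a | region_b
    return len(region_a & region_b) / len(union) if union else 0.0


def exact_lower_bound_all_detected(n: int, alpha: float = 0.05) -> float:
    """Two-sided exact binomial lower bound when all n cases are detected.

    For x = n successes the Clopper-Pearson lower limit has the closed
    form (alpha / 2) ** (1 / n); the upper limit is 1.
    """
    return (alpha / 2) ** (1 / n)


if __name__ == "__main__":
    print(route(0.10), route(0.15), route(0.50))
    print(round(100 * exact_lower_bound_all_detected(5), 1))
```

The wide interval is the point of the calculation: with only five malignant lesions, zero false negatives is still compatible with a true sensitivity well below 100%.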

2605.27559 2026-05-28 cs.MA cs.AI cs.LG 版本更新

Detection Without Correction: A Two-Parameter Decomposition of Multi-Stage LLM Pipelines

无需修正的检测:多阶段LLM流水线的双参数分解

Prashanti Nilayam, Kiran Ramanna, Prashil Tumbade

发表机构 * ServiceNow, CA, USA(ServiceNow,美国加利福尼亚州)

AI总结 提出检测-条件生成双参数分解框架,揭示多阶段LLM流水线中条件误修正率主导(53-94%)而检测率变化超一个数量级,统一解释准确性平台、逆转等四种现象。

详情
AI中文摘要

多阶段LLM流水线(执行多智能体辩论、内在自我修正或检索增强验证)表现出令人困惑的聚合行为:跨轮次的准确性平台和逆转、当代前沿模型上辩论增益的非重复性、内在自我修正退化,以及辩论动态中跨提供商的定性分歧。下游智能体响应可操作化为两个耦合决策:检测(是否将上游内容视为权威)和条件生成(如果不是则生成什么)。该分解产生四种可观察的响应模式,其中无需修正的检测是承载故障模式。在跨越四个模型系列、四个基准(GSM8K、MATH-500、GPQA-Diamond、AIME)和两种方法(多智能体辩论、内在自我修正)的九格实证网格中,我们发现条件误修正率始终占主导(跨队列53-94%),而检测率按上下文变化超过一个数量级。该框架将上述四种现象统一为共同机制的特征,并将检测阈值表征为稳定的模型/协议级规律,该规律在匹配基准难度的方法间持续存在。

英文摘要

Multi-stage LLM pipelines that perform multi-agent debate, intrinsic self-correction, or retrieval-augmented verification exhibit puzzling aggregate behaviors: accuracy plateaus and reversals across rounds, non-replication of debate gains on contemporary frontier models, intrinsic self-correction degradation, and qualitative cross-provider divergence in debate dynamics. Downstream agent response can be operationalized as two coupled decisions: detection (whether to treat upstream content as authoritative) and conditional generation (what to produce if not). This decomposition yields four observable response regimes, of which detection-without-correction is the load-bearing failure mode. Across a nine-cell empirical grid spanning four model families, four benchmarks (GSM8K, MATH-500, GPQA-Diamond, AIME), and two methods (multi-agent debate, intrinsic self-correction), we find that the conditional miscorrection rate is consistently dominant (53-94% across cohorts) while detection rate varies contextually by more than an order of magnitude. The framework unifies the four phenomena above as signatures of a common mechanism and characterizes detection threshold as a stable model/protocol-level regularity that persists across methods at matched benchmark difficulty.

2605.27551 2026-05-28 cs.AI cs.CR cs.IR cs.MM 版本更新

On the Origin of Synthetic Information by Means of Steganographic Inheritance

论通过隐写继承的合成信息起源

Ching-Chun Chang, Isao Echizen

发表机构 * Information and Society Research Division, National Institute of Informatics(国立情报学研究所信息与社会研究部)

AI总结 针对合成信息溯源难题,提出一种基于隐写术的遗传机制,通过嵌入可追踪的谱系特征实现合成信息父系鉴定,理论分析与实验验证了方法的有效性。

详情
AI中文摘要

物种起源一直是自然科学中谜中之谜。类比而言,我们认为合成信息的起源是信息科学中谜中之谜。这个问题承载着道德分量,技术解释既无法完全解决,也不能不负责任地忽视,因为它对真理、信任和人类智力的影响深远地延伸到更广泛的经济和社会。人工智能的强大使得合成信息的进化谱系越来越难以追踪,因为一个足够强大的模型可能产生在结构或信号层面上与其父源几乎不相似的后代。如同遗传学中,两个个体可能具有相同的表型,在外观上相互镜像,但基因型却根本不同。我们提出通过隐写术实现一种类似于遗传的机制。在后代被复制的时刻,投影仪从父代派生出一个特征,隐写编码器将其不可见地隐藏在后代中。该特征在赛博生态系统中贯穿后代的整个生命周期。当查询父系时,隐写解码器从后代中提取该特征,并与参考池中候选父代的特征进行比较,从而提名最可能的父代。理论分析将系统发育准确性表征为投影仪和隐写系统属性的函数,而跨多个投影仪和隐写系统的实证评估表明,所提出的方法在广泛的处理操作和语义修改下具有可行性。我们设想一个赛博生态系统,其中合成信息被赋予隐藏但可追踪的谱系特征,从简单的开端分支成无尽的形态,这些形态已经并且正在进化。

英文摘要

The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is the mystery of mysteries in information science. The question carries a moral weight that a technical account can neither fully resolve nor responsibly ignore, as its impact on truth, trust, and human intellect extends deep into the broader economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information grow ever harder to trace, for a sufficiently capable model may generate offspring that bear little resemblance, at either the structural or signal level, to the parent source from which they were derived. As in genetics, two individuals may share the same phenotype mirroring each other in outward appearance, yet differ fundamentally in their genotype. We propose, by means of steganography, a mechanism analogous to heredity. At the moment an offspring is reproduced, a projector derives a trait from the parent, and a steganographic encoder invisibly hides it within the offspring. This trait persists throughout the offspring's life cycle in a cyber ecosystem. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it against the traits of candidate parents in a reference pool, thereby nominating the most likely one. A theoretical analysis characterises phylogenetic accuracy as a function of projector and stegosystem properties, whilst empirical evaluations across multiple projectors and stegosystems demonstrate the viability of the proposed methodology under a broad spectrum of processing operations and semantic modifications. We envision a cyber ecosystem in which synthetic information, endowed with hidden yet traceable lineage traits, branches from a simple beginning into endless forms that have been, and are being, evolved.
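
The parentage query described above (extract a trait, compare it with the traits of candidate parents, nominate the closest) can be sketched as a nearest-neighbour search. Everything concrete here is an assumption for illustration: the projector is stood in for by a truncated SHA-256 hash, traits are 64-bit integers compared by Hamming distance, and the steganographic encoder and decoder are not modelled at all, so the "extracted" trait is simply the parent's trait with a few bits flipped.

```python
import hashlib


def project_trait(parent: bytes, n_bits: int = 64) -> int:
    """Hypothetical projector: derive an n_bits trait from the parent.
    A cryptographic hash is only a placeholder for the paper's projector."""
    digest = hashlib.sha256(parent).digest()
    return int.from_bytes(digest[: n_bits // 8], "big")


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two traits."""
    return bin(a ^ b).count("1")


def nominate_parent(extracted_trait: int, reference_pool: dict[str, bytes]) -> str:
    """Return the name of the candidate whose trait is closest to the
    trait recovered from the offspring."""
    return min(
        reference_pool,
        key=lambda name: hamming(extracted_trait, project_trait(reference_pool[name])),
    )


if __name__ == "__main__":
    pool = {"parent_a": b"first source document", "parent_b": b"second source document"}
    # Simulate a trait recovered from the offspring with two bit errors.
    noisy_trait = project_trait(pool["parent_b"]) ^ 0b101
    print(nominate_parent(noisy_trait, pool))
```

Comparing by distance rather than exact equality is what lets the nomination tolerate bit errors introduced by processing of the offspring.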

2605.27494 2026-05-28 cs.CR cs.AI cs.CL cs.IR cs.LG 版本更新

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

基于证据的缓存路由用于检索增强生成:何时可以安全地重用答案?

Syed Huma Shah

AI总结 提出GroundedCache,一种通过四个廉价门控(查询相似性、检索证据重叠、源版本有效性和词汇支持)验证缓存答案安全性的路由方法,显著降低不安全服务率。

Comments 19 pages, 9 figures, 10 tables. Code: https://github.com/syedhumarahim/grounded-cache-router

详情
AI中文摘要

现代检索增强生成(RAG)部署越来越依赖缓存来降低令牌成本和首令牌时间(TTFT)。在vLLM等服务栈中,前缀级KV重用已成为标准,而最近的系统(RAGCache、TurboRAG、CacheBlend、EPIC、ContextPilot、PCR、LMCache)进一步推动了块级和位置无关的重用。相比之下,输出级语义答案缓存仍然脆弱:相似的提示可能映射到不同的正确答案,检索到的证据随着语料库更新而漂移,并且对抗性碰撞攻击已被证明可以劫持缓存的响应。我们认为,缓存答案重用的正确框架不是如何更快地重用,而是何时重用是安全的。我们提出了GroundedCache,一种经过证据验证的缓存路由器,仅当四个廉价门控同时成立时才允许缓存答案:查询相似性、检索证据重叠、源版本有效性以及新检索证据对缓存答案的词汇(或基于判断的)支持。我们构建了一个六区域工作负载,用于压力测试缓存安全性而不仅仅是命中率,并引入了一个面向操作员的指标——不安全服务率(USR),即收到错误缓存答案的查询比例。在两个数据集和12,000个真实LLM生成(在vLLM上使用自动前缀缓存的Qwen2.5-7B-Instruct)中,GroundedCache在每个HotpotQA区域上将USR降至0.0%(而朴素缓存为15-35%),在mtRAG文档漂移上降至1.5%(而朴素缓存为51.5%),在设计点对抗区域上减少了34倍,在其他mtRAG区域上减少了3-10倍,同时端到端p50延迟保持在无缓存RAG基线的1.04-1.07倍以内。逐门控消融实验表明,词汇支持门控是两个数据集上的主要安全机制,其余门控以近乎零成本提供纵深防御。我们发布了实现、工作负载和评估工具。

英文摘要

Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when four cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR): the fraction of all queries that received a wrong cached answer. Across two datasets and 12,000 real-LLM generations (Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime (vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift (vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.
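
The four-gate admission rule and the USR metric can be sketched in a few lines. This is a minimal illustration, not the released implementation: the `CacheEntry` fields, the function names, and all three thresholds are hypothetical, and token-level Jaccard overlap stands in for whatever query-similarity and lexical-support measures the system actually uses.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    query_tokens: frozenset   # tokens of the query that produced the answer
    evidence_ids: frozenset   # ids of documents retrieved at caching time
    source_versions: dict     # doc id -> version seen at caching time
    answer: str


def jaccard(a, b) -> float:
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def admit(entry, query_tokens, fresh_evidence, current_versions,
          sim_min=0.8, overlap_min=0.5, support_min=0.6) -> bool:
    """Serve the cached answer only if all four gates hold.

    fresh_evidence maps doc id -> text retrieved for the new query.
    Thresholds are illustrative placeholders.
    """
    # Gate 1: the new query is similar to the cached query.
    gate_query = jaccard(entry.query_tokens, query_tokens) >= sim_min
    # Gate 2: the freshly retrieved documents overlap with the cached ones.
    gate_evidence = jaccard(entry.evidence_ids, fresh_evidence.keys()) >= overlap_min
    # Gate 3: no source document changed version since caching.
    gate_version = all(current_versions.get(doc) == version
                       for doc, version in entry.source_versions.items())
    # Gate 4: the cached answer is lexically supported by fresh evidence.
    answer_tokens = set(entry.answer.lower().split())
    evidence_tokens = set(" ".join(fresh_evidence.values()).lower().split())
    gate_support = bool(answer_tokens) and (
        len(answer_tokens & evidence_tokens) / len(answer_tokens) >= support_min)
    return gate_query and gate_evidence and gate_version and gate_support


def unsafe_served_rate(outcomes) -> float:
    """outcomes: iterable of (served_from_cache, answer_correct) pairs.
    USR counts wrong cached answers over ALL queries, not only cache hits."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(1 for served, correct in outcomes if served and not correct) / len(outcomes)
```

Because the gates are combined with a logical AND, a stale source version or unsupported answer forces a cache miss even when the query itself is a near-duplicate, which is the trade of hit rate for safety that the abstract argues for.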

2605.27492 2026-05-28 cs.SE cs.AI 版本更新

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

基准测试还不够:RAMP——生产系统中代理模型的运行时评估

Yipeng Ouyang, Xin Huang, Bingjie Liu, Zhongchun Zheng, Yuhao Gu, Xianwei Zhang

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China(中山大学计算机科学与工程学院,广州,中国)

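AI summary To address the failure of existing benchmarks to reflect the dynamic complexity of real production environments, the paper proposes the RAMP framework, which uses a unified runtime assessment architecture, compiler-construction workloads, and multi-dimensional utility metrics to expose performance degradation over long serial workflows and differences in resource efficiency across models.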

Comments 16 pages, 8 figures. Project homepage: http://ramp.yatcc-ai.com/

详情
AI中文摘要

LLM代理正迅速从编码助手演变为自主软件工程系统。然而,现有的评估方法仍然主要集中于静态、孤立和短视界的基准测试,无法捕捉真实生产工作流的动态复杂性。因此,基准测试性能可能无法很好地反映在涉及长执行链、工具交互、依赖管理和迭代反馈循环的现实运行时环境下的实际能力。为此,我们提出了RAMP,一个面向生产的评估长视界软件工程代理的基础设施。基于YatCC集成平台,RAMP通过标准化的编排和执行接口提供了统一的运行时评估架构。RAMP引入了具有串行依赖和复杂工具链交互的现实编译器构建工作负载,以及用于分析部分工作流失败下执行行为的分阶段恢复机制。该框架进一步整合了面向效用的多维指标,共同评估结果质量和过程效率。我们对15个主流模型进行了运行时评估,观察到在传统孤立基准测试中基本不可见的显著能力退化。任务完成率在串行工作流中逐步崩溃,从初始阶段的100%下降到最终阶段的仅20%,而没有一个评估模型成功完成整个流水线。运行时分析揭示了系统性的故障传播和显著的资源低效,在可比模型之间计算成本差异高达三个数量级。这些发现表明,RAMP将代理模型评估推向持续、运行时可观察和面向生产的评估。

英文摘要

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.

2605.27489 2026-05-28 cs.CR cs.AI cs.LG 版本更新

HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

HARP: 多智能体大语言模型系统中的危害放大测量

Md Hafizur Rahman, Zafaryab Haider, Tanzim Mahfuz, Prabuddha Chakraborty

发表机构 * Electrical and Computer Engineering, University of Maine(缅因大学电气与计算机工程系)

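AI summary Proposes HARP, a method that compares clean and perturbed execution traces to quantify how local perturbations in multi-agent LLM systems propagate into global harm, and evaluates a range of attacks and defenses on a finance-oriented seven-agent system.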

Comments 39 pages, 12 figures, 12 tables, and 1 algorithm

详情
AI中文摘要

多智能体大语言模型系统将工作流分解为智能体、工具、共享上下文、记忆和决策门。这种模块化提高了可解释性,但也带来了传播风险:对一个组件的有限扰动可能被其他智能体重用并放大为系统级危害。我们提出了HARP(通过角色扰动导致的危害放大),一种用于研究多智能体LLM系统中局部到全局危害放大的轨迹优先方法。HARP比较成对的清洁和扰动执行,记录专家输出、工具调用、记忆读/写、防护事件、预言日志、延迟、令牌成本和决策。我们将局部危害定义为对目标智能体或受损通道的偏离,全局危害定义为对整个轨迹的偏离,危害放大为(H_global/H_local)。这补充了攻击成功率,衡量编排如何将危害传播到攻击点之外。我们在一个面向金融的七智能体系统中实例化HARP,该系统具有确定性决策门和可配置的攻击框架,用于专家妥协、合谋、共享上下文破坏以及时间或记忆持久攻击。在五种防御中,仅提示防御保持了良性效用但留下高成功率和隐蔽性;工具前和步骤级防护以效用或延迟成本减少了部分失败;而IntegrityGuard,一种轨迹一致性防御,实现了最低的攻击成功率和全局危害,但引入了效用/成本权衡。结果表明,单一专家妥协产生最强的放大,共享上下文破坏产生最高的攻击成功率,时间持久性产生最大的恶意影响。HARP认为,安全的多智能体评估不仅必须衡量绕过,还必须衡量传播。

英文摘要

Multi-agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves interpretability, but creates a propagation risk: a bounded perturbation to one component can be reused by other agents and amplified into system-level harm. We introduce HARP (Harm Amplification through Role Perturbation), a trace-first methodology for studying local-to-global harm amplification in multi-agent LLM systems. HARP compares paired clean and perturbed executions and records specialist outputs, tool calls, memory reads/writes, guard events, oracle logs, latency, token cost, and decisions. We define local harm as deviation from targeted agents or corrupted channels, global harm as deviation over the full trace, and harm amplification as (H_global/H_local). This complements attack success rate with a measure of how strongly orchestration spreads harm beyond the attack point. We instantiate HARP in a finance-oriented seven-agent system with a deterministic decision gate and configurable attack harness for specialist compromise, collusion, shared-context corruption, and temporal or memory-persistent attacks. Across five defenses, prompt-only defenses preserve benign utility but leave high success and stealth; pre-tool and step-level guards reduce some failures with utility or latency costs; and IntegrityGuard, a trace-consistency defense, achieves the lowest attack success and global harm but introduces utility/cost trade-offs. Results show that single-specialist compromise produces the strongest amplification, shared-context corruption yields the highest attack success, and temporal persistence produces the largest malicious impact. HARP argues that secure multi-agent evaluation must measure not only bypass, but propagation.
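
The ratio H_global/H_local defined in this abstract can be made concrete with a small sketch. The abstract does not say how deviation is measured, so the version below assumes the simplest option: harm is the count of trace records whose output differs from the clean run. The record format `(agent, output)` and the function name are hypothetical.

```python
def harm_amplification(clean_trace, perturbed_trace, targeted_agents):
    """Return H_global / H_local for one paired run, or None if H_local is zero.

    Each trace is a list of (agent, output) records aligned step by step.
    Harm is counted as the number of records whose output deviates from the
    clean run (an assumed, deliberately simple harm measure).
    """
    if len(clean_trace) != len(perturbed_trace):
        raise ValueError("traces must be aligned step by step")
    h_local = 0   # deviations on the targeted agents only
    h_global = 0  # deviations anywhere in the trace
    for (agent, clean_out), (_, perturbed_out) in zip(clean_trace, perturbed_trace):
        if clean_out != perturbed_out:
            h_global += 1
            if agent in targeted_agents:
                h_local += 1
    if h_local == 0:
        return None  # the ratio is undefined when the attack point did not deviate
    return h_global / h_local
```

Under this counting convention a value of 1.0 means the harm stayed at the attack point, and larger values mean downstream agents or the decision gate picked it up.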

2605.27487 2026-05-28 cs.CV cs.AI 版本更新

Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

基于扩散的乌克兰手写文本生成与跨域风格迁移

Andrii Ahitoliev, Pavlo Berezin

发表机构 * Ukrainian Catholic University, Lviv, Ukraine(乌克兰天主教大学,利沃夫,乌克兰) National University of "Kyiv-Mohyla Academy", Kyiv, Ukraine(基辅莫吉拉学院国立大学,基辅,乌克兰)

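AI summary To address the lack of data and of model-generalisation studies for handwritten text generation in non-Latin scripts such as Ukrainian, the authors build a Ukrainian handwritten word dataset and retrain DiffusionPen on it, then use cross-lingual, zero-shot, and few-shot transfer experiments to show that latent diffusion models are effective for cross-domain style transfer.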

Comments 16 pages, 7 figures. Submitted to ICTERI 2026

详情
AI中文摘要

基于书写者风格的手写文本生成(HTG)在拉丁文字中已被广泛研究,但在低资源和非拉丁书写系统中仍探索不足,现有模型在拉丁域之外的泛化能力尚不明确。西里尔字母,尤其是乌克兰语,缺乏大规模书写者标注数据集和此类泛化的经验证据。为填补这一空白,我们使用连通分量分割、质量过滤和对代表性不足的乌克兰字符进行针对性过采样,构建了一个包含308位书写者、126,177张图像的乌克兰手写单词数据集。我们在不修改架构的情况下,在该数据集上重新训练了DiffusionPen——一种带有MobileNetV2三元组损失风格编码器和CANINE条件潜在扩散U-Net的模型,测试了从拉丁到西里尔字母的直接迁移。我们在三种设置下评估跨域风格迁移:从IAM英文样本的跨语言迁移、对20世纪早期乌克兰手稿的零样本迁移,以及对当代书写者的少样本模仿。该模型生成可读且风格一致的单词图像,表明少样本潜在扩散模型能够泛化到拉丁文字域之外。我们发布了数据集、训练模型和评估协议,作为书写者感知的西里尔HTG的可复现基准,为将风格化HTG扩展到其他代表性不足的书写系统奠定了基础。

英文摘要

Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. We retrain DiffusionPen, a MobileNetV2 triplet-loss style encoder with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalize beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.

2605.27483 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Debate Helps Weak Judges Reward Stronger Models

辩论有助于弱裁判奖励更强的模型

Ethan Elasky, Frank Nakasako, Naman Goyal

发表机构 * Palaestra Research(帕莱斯特拉研究) Berkeley(伯克利)

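AI summary Studies proposer-critic debate in a stronger-debater/weaker-judge setting and finds that debate significantly improves judge performance when the critic's classification ability exceeds the judge's and the judge treats critic speeches as claims to verify; a single independent critique achieves a similar effect at lower cost.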

详情
AI中文摘要

尽管理论上具有前景,但辩论作为一种可扩展的监督协议产生了混合的实证结果:在某些设置中有收益,在其他设置中无效,尤其是当裁判没有隐藏信息时。我们在程序可验证的代码和逻辑任务上,研究了强辩手/弱裁判设置下的提议者-批评者辩论。当批评者提供可用的优势时,辩论帮助裁判优于咨询基线:批评者的分类能力必须超过裁判,并且裁判必须将批评者的言论视为待验证的主张而非待总结的证词。在五个配对中的三个满足该条件的配对中,提议者-批评者辩论的收益在统计上显著优于咨询,并且这些配对是最有能力的模型配对。在我们的集合中的两个非响应者配对中,辩论产生无效效果,一旦批评者进入转录,裁判验证率下降数十个百分点。在这些情况下,批评者的二元分类能力与裁判的相差在噪声范围内,并且批评者的分歧被解析为证词而非待检查的主张。从辩论中消去反驳轮次对裁判表现没有可测量的变化:单一独立批评以更低的推理成本恢复了辩论的大部分收益。这些发现为可验证领域(答案、批评、裁判)中无需训练的可扩展监督提供了一种更廉价的原始方法,以及一种预测辩论何时有帮助的部署前审计(批评者是否击败裁判,以及裁判是否会验证它?)。

英文摘要

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.

2605.27482 2026-05-28 cs.LG cs.AI 版本更新

Energy-Structured Low-Rank Adaptation for Continual Learning

能量结构低秩自适应持续学习

Longhua Li, Lei Qi, Qi Tian, Xin Geng

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China(东南大学计算机科学与工程学院,南京,中国) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用重点实验室(东南大学),教育部,中国) Huawei Technologies, Shenzhen, China(华为技术有限公司,深圳,中国)

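AI summary Proposes E²-LoRA, which uses energy-concentrated, energy-ordered low-rank adaptation together with a dynamic rank allocation strategy to address task interference and knowledge compaction in continual learning, achieving state-of-the-art performance.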

Comments Accepted by ICML 2026

详情
AI中文摘要

虽然正交子空间方法试图缓解持续学习中的任务干扰,但它们常常遭受跨基的能量扩散,阻碍知识压缩并耗尽未来任务的容量。我们观察到参数更新引起的输出特征漂移本质上是低秩的,并理论上证明沿该漂移的主方向保留参数可最小化输出重建误差。受此启发,我们提出能量集中和能量排序的低秩自适应(E²-LoRA)。通过显式地将知识排序并集中到主导秩中,E²-LoRA释放了后续任务的容量。此外,我们设计了一种动态秩分配策略,通过联合优化能量保留和模型可塑性来平衡稳定性和可塑性。在多个基准上的大量实验表明,E²-LoRA达到了最先进的性能。

英文摘要

While orthogonal subspace methods try to mitigate task interference in Continual Learning (CL), they often suffer from energy diffusion across the basis, hindering knowledge compaction and exhausting capacity for future tasks. We observe that output feature drift induced by parameter updates is inherently low-rank, and theoretically prove that preserving parameters along the principal directions of this drift minimizes the output reconstruction error. Motivated by this, we propose Energy-Concentrated and Energy-Ordered Low-Rank Adaptation (E²-LoRA). By explicitly ordering and concentrating knowledge into leading ranks, E²-LoRA frees capacity for subsequent tasks. Furthermore, we design a dynamic rank allocation strategy to balance stability and plasticity by jointly optimizing energy retention and model plasticity. Extensive experiments across multiple benchmarks demonstrate that E²-LoRA achieves state-of-the-art performance.
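
To give a feel for what "concentrating knowledge into leading ranks" buys, the sketch below picks the smallest rank that retains a target share of spectral energy. It is an illustration under assumptions, not the paper's allocation strategy: energy is taken to be the sum of squared singular values, the `retain` threshold is hypothetical, and the plasticity term that the authors optimize jointly is omitted.

```python
def rank_for_energy(singular_values, retain=0.9):
    """Smallest rank whose leading directions keep at least `retain` of the energy.

    Energy is assumed to be the sum of squared singular values. Returns 0 when
    all singular values are zero.
    """
    if not singular_values:
        raise ValueError("singular_values must be non-empty")
    if not 0.0 < retain <= 1.0:
        raise ValueError("retain must be in (0, 1]")
    # Sort energies in descending order so leading ranks carry the most energy.
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0
    cumulative = 0.0
    for rank, energy in enumerate(energies, start=1):
        cumulative += energy
        if cumulative >= retain * total:
            return rank
    return len(energies)
```

The more the spectrum is concentrated in its first few values, the lower the returned rank, which is the sense in which concentration leaves more of a fixed rank budget for later tasks.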

2605.27479 2026-05-28 cs.LG cs.AI 版本更新

Resource-Constrained Affect Modelling via Variance Regularisation Pruning

资源约束下的情感建模:基于方差正则化剪枝

Kosmas Pinitas, Konstantinos Katsifis

发表机构 * Mediterranean College, Athens, Greece(地中海学院,希腊雅典) University of Derby, Derby, UK(德比大学,英国德比)

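AI summary Proposes Variance-Regularised Pruning (VR), a pruning framework that accounts for cross-participant stability and maintains competitive CCC performance at 80% sparsity, making it suitable for resource-constrained affect-aware systems.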

Comments This paper has been accepted at the 2026 PErvasive Technologies Related to Assistive Environments (PETRA)

详情
AI中文摘要

情感计算系统越来越多地嵌入到普及和交互环境中,如自适应游戏、辅助技术和资源受限平台,在这些环境中,计算效率必须与跨不同用户的可靠性相平衡。模型剪枝提供了一种减少计算需求的有效方法,但现有方法通常仅优化稀疏性,而不考虑参数移除如何影响个体间的鲁棒性。在这项工作中,我们引入了方差正则化剪枝(VR),一种明确将跨参与者稳定性纳入稀疏化过程的剪枝框架。VR不依赖于平均预测误差,而是根据每个连接对预测准确性和用户间变异性的联合贡献来评估,优先保留在分布差异下仍然可靠的参数。我们在AGAIN数据集上评估了所提出的方法,该数据集包含在九个情感诱发游戏环境中收集的唤醒度标注。实验结果表明,即使在没有额外微调的情况下,VR在80%稀疏度下仍能保持竞争性的一致性相关系数(CCC)性能,突显了其在真实世界、资源受限的情感感知系统中的适用性。总体而言,所提出的框架支持开发紧凑、鲁棒的情感模型,这些模型能够在真实的交互环境中可靠运行。

英文摘要

Affective computing systems are increasingly embedded in pervasive and interactive environments, such as adaptive games, assistive technologies, and resource-constrained platforms, where computational efficiency must be balanced with reliability across diverse users. Model pruning offers an effective way to reduce computational demands, yet existing approaches typically optimise for sparsity alone, without accounting for how parameter removal impacts robustness across individuals. In this work, we introduce Variance-Regularised Pruning (VR), a pruning framework that explicitly incorporates cross-participant stability into the sparsification process. Rather than relying solely on average prediction error, VR evaluates each connection based on its joint contribution to both prediction accuracy and variability across users, prioritising parameters that remain reliable under distributional differences. We evaluate the proposed approach on the AGAIN dataset, which includes arousal annotations collected across nine affect-eliciting game environments. Experimental results demonstrate that VR maintains competitive Concordance Correlation Coefficient (CCC) performance even at 80% sparsity without additional fine-tuning, highlighting its suitability for deployment in real-world, resource-limited affect-aware systems. Overall, the proposed framework supports the development of compact, robust affective models that can operate reliably in real-world interactive environments.
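
For readers unfamiliar with the metric reported here, the Concordance Correlation Coefficient can be computed as below. This is the standard definition using population (divide-by-n) moments; the function name is ours, and the abstract does not state which variance convention the authors use.

```python
def concordance_correlation(x, y):
    """Concordance Correlation Coefficient between two equal-length sequences.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2),
    computed with population (divide-by-n) moments.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denominator
```

Unlike Pearson correlation, CCC penalises a constant offset: predictions that track the annotations perfectly but sit one unit too high score below 1.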

2605.27476 2026-05-28 cs.LG cs.AI 版本更新

Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

通过对称注意力分解平衡扩散模型中的保真度与多样性:Hopfield视角

Hyunmin Cho, Woo Kyoung Han, Kyong Hwan Jin

发表机构 * Department of Electrical Engineering, Korea University, Seoul, South Korea(高丽大学电气工程系,首尔,韩国)

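AI summary Decomposes the transformer attention matrix into symmetric and skew-symmetric parts to explain and control, from a Hopfield-network perspective, the fidelity-diversity trade-off in diffusion model generation.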

Comments Accepted to ICML 2026 (Regular)

详情
AI中文摘要

我们将Transformer中的预softmax注意力矩阵$\mathbf{QK^\top}$表征为一个关联记忆矩阵,编码输入特征之间的成对关联。通过将该矩阵分解为对称和反对称部分,我们将对称分量解释为控制能量景观的结构,而反对称分量则驱动该景观上的循环。利用对称分量诱导的能量公式,我们推导出Hopfield风格的稳定性度量,用于量化检索特征的稳定性。我们观察到Hopfield风格稳定性度量与生成中的保真度-多样性权衡之间存在有意义的关联。最后,我们提出一个可控的旋钮,通过修改底层动力学的循环来调节这一权衡。代码可在我们的GitHub上获取(https://github.com/hyeon-cho/Attention-Symmetric-Decomposition)。

英文摘要

We characterize the pre-softmax attention matrix $\mathbf{QK^\top}$ in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew-symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew-symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield-style stability measures that quantify the stability of retrieved features. We observe meaningful correlations between Hopfield-style stability measures and the fidelity-diversity trade-offs in generation. Finally, we propose a controllable knob to modulate this trade-off by modifying the circulation of the underlying dynamics. Code is available at our GitHub (https://github.com/hyeon-cho/Attention-Symmetric-Decomposition).
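
The decomposition this abstract relies on is the standard split of a square matrix A into a symmetric part (A + Aᵀ)/2 and a skew-symmetric part (A − Aᵀ)/2. The sketch below is not the authors' code (see their repository for that); it only shows the split on plain Python lists and checks the property that makes the symmetric part the natural carrier of an energy: the quadratic form xᵀAx ignores the skew-symmetric part entirely. Function names are ours.

```python
def attention_logits(q, k):
    """Pre-softmax scores Q K^T for row-major lists of query and key vectors."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k] for q_row in q]


def decompose(a):
    """Split a square matrix into its symmetric and skew-symmetric parts."""
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("matrix must be square")
    symmetric = [[(a[i][j] + a[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(a[i][j] - a[j][i]) / 2 for j in range(n)] for i in range(n)]
    return symmetric, skew


def quadratic_form(m, x):
    """Evaluate x^T M x."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))
```

Because the skew-symmetric part contributes nothing to xᵀAx, it cannot change the value of a quadratic energy; it can only affect how states move on the landscape, which is consistent with the abstract's reading of it as driving circulation.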

2605.27475 2026-05-28 cs.LG cs.AI 版本更新

HEAL: Resilient and Self-* Hub-based Learning

HEAL:弹性且自适应的基于集线器的学习

Mohamed Amine Legheraba, Stefan Galkiewicz, Maria Gradinariu Potop-Butucaru, Sébastien Tixeuil

发表机构 * Sorbonne University(索邦大学) CNRS(法国国家科学研究中心) LIP6(巴黎第六大学计算机实验室) Institut Universitaire de France(法兰西大学研究院)

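AI summary Proposes HEAL, a cross-layer decentralized learning framework that combines the strengths of Federated Learning, Gossip Learning, and Epidemic Learning. It uses a self-organizing, self-healing P2P overlay and the Elevator algorithm to select aggregator nodes dynamically, matching Federated Learning in crash-free settings and outperforming Gossip and Epidemic Learning under crashes and churn.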

详情
AI中文摘要

去中心化学习通过将数据和计算分布在节点上,增强了隐私性、可扩展性和容错性。一种流行的方法是联邦学习,它依赖于中央聚合器,但面临服务器脆弱性、可扩展性问题、隐私风险以及最重要的单点故障等挑战。另一种方法是八卦学习和流行病学习,它们通过节点间的点对点模型更新交换实现完全去中心化,确保了鲁棒性和隐私性,但代价是模型收敛速度较慢。在这项工作中,我们提出了一种新颖的去中心化学习框架,称为HEAL。HEAL是首个跨层去中心化学习框架,它利用优化的自组织和自愈底层P2P覆盖网络,结合了联邦学习、八卦学习和流行病学习的优势。借助最近提出的Elevator算法,HEAL将动态选择的节点提升为聚合器。通过仿真,我们证明HEAL在无崩溃环境中具有与联邦学习相似的性能,同时完全去中心化且具有容错性。在崩溃和波动频繁的环境中,HEAL优于八卦学习和流行病学习。

英文摘要

Decentralized learning enhances privacy, scalability, and fault tolerance by distributing data and computation across nodes. A popular approach is Federated Learning, which relies on a central aggregator yet faces challenges such as server vulnerabilities, scalability issues, privacy risks and, most importantly, a single point of failure. Alternatively, Gossip Learning and Epidemic Learning offer full decentralization through peer-to-peer exchanges of model updates, ensuring robustness and privacy at the price of slower model convergence. In this work, we introduce a novel decentralized learning framework called HEAL. HEAL is the first cross-layer decentralized learning framework that exploits an optimized self-organizing and self-healing underlying P2P overlay, combining the strengths of Federated Learning, Gossip Learning, and Epidemic Learning. Leveraging the recently proposed Elevator algorithm, HEAL promotes dynamically chosen nodes to act as aggregators. Through simulations, we demonstrate that HEAL performs similarly to Federated Learning in crash-free settings while being fully decentralized and fault-tolerant. In crash- and churn-prone environments, HEAL outperforms Gossip and Epidemic Learning.

2605.27472 2026-05-28 cs.AR cs.AI 版本更新

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

AssertLLM2: 从设计规格生成断言的全面的LLM基准测试

Yuchao Wu, Wenji Fang, Jing Wang, Wenkai Li, Ziyan Guo, Zhiyao Xie

发表机构 * Hong Kong University of Science and Technology(香港科技大学)

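AI summary Proposes the AssertLLM2 benchmark of 83 real-world designs. Structured specifications, golden RTL, and mutated RTL support two practical settings, bug-prevention and bug-hunting, and a rigorous evaluation framework covers syntactic validity, formal provability, coverage, and mutation-based bug detection.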

详情
AI中文摘要

基于断言的验证(ABV)是现代硬件设计的基石,但手动将设计意图转化为正式的SystemVerilog断言(SVA)仍然劳动密集且容易出错。虽然大型语言模型(LLMs)显示出自动化这一过程的潜力,但现有基准测试仍然受到不现实的任务制定、弱的规格输入和过于简化的评估的限制。为了解决这些限制,我们引入了AssertLLM2,一个用于硬件验证中真实断言生成的开源基准测试。AssertLLM2包含83个跨13个功能类别的真实设计。对于每个设计,基准测试提供了结构化的设计规格、经过验证的依赖完整的黄金RTL以及系统变异的错误RTL变体。这些支持两种实际设置:缺陷预防,其中从规格生成断言以防止设计错误;以及缺陷狩猎,其中生成断言以暴露预期行为与错误实现之间的差异。据我们所知,AssertLLM2是第一个明确使用错误RTL作为输入来评估缺陷检测能力的基准测试。AssertLLM2进一步采用了更严格的评估框架,涵盖语法有效性、形式可证明性、覆盖率和基于突变的缺陷检测。我们的基准测试使得对断言生成进行更真实和广泛的评估成为可能,并为实际硬件验证中的最先进LLMs建立了严格的基线。

英文摘要

Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. While Large Language Models (LLMs) show promise for automating this process, existing benchmarks remain limited by unrealistic task formulations, weak specification inputs, and oversimplified evaluation. To address these limitations, we introduce AssertLLM2, an open-source benchmark for realistic assertion generation in hardware verification. AssertLLM2 contains 83 real-world designs across 13 functional categories. For each design, the benchmark provides a structured design specification, a verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. These support two practical settings: bug-prevention, where assertions are generated from specifications to guard against design errors, and bug-hunting, where assertions are generated to expose discrepancies between intended behavior and faulty implementations. To the best of our knowledge, AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate bug-detection capability. AssertLLM2 further adopts a more rigorous evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection. Our benchmark enables a more realistic and extensive assessment of assertion generation and establishes rigorous baselines for state-of-the-art LLMs in practical hardware verification.

2605.27470 2026-05-28 cs.LG cs.AI 版本更新

Detect by Yourself: Self-Designing Agentic Workflows for Few-Shot Graph Anomaly Detection

自行检测:少样本图异常检测的自设计代理工作流

Tairan Huang, Qiang Chen, Yili Wang, Yueyue Ma, Changlong He, Xiu Su, Yi Chen

发表机构 * CSU UST

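AI summary Proposes the SignGAD framework, which replaces a fixed detector with self-designed, task-conditioned detection workflows and combines graph-encoding and detector selection with a guarded refit strategy to improve the adaptability and reliability of few-shot graph anomaly detection.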

详情
AI中文摘要

图异常检测旨在识别属性图中的异常节点,并在实际应用中发挥重要作用。然而,现有的图异常检测方法仍面临两个关键挑战:1)固定流程,限制了其在有限监督下对不同图任务的适应性;2)弱证据,无法将上下文和结构异常信号明确纳入检测过程。在本文中,我们提出了一种新颖框架,即少样本图异常检测的自设计代理工作流(SignGAD)。具体来说,我们提出了一种新范式,将图异常检测任务从训练固定异常检测器重新定义为设计任务条件检测工作流。通过构建检测工作流,SignGAD选择合适的图编码和检测器设计以利用任务特定的异常证据。同时,我们引入了一种受保护的最终重拟合策略,通过校准重拟合接受度来优化所选工作流,从而增强有限监督下的可靠性。在多个真实世界数据集上进行的大量实验表明,SignGAD相比最先进方法取得了强劲性能,突显了其在图异常检测任务上的有效性。

英文摘要

Graph anomaly detection aims to identify anomalous nodes in attributed graphs and plays an important role in real-world applications. However, existing graph anomaly detection methods still face two key challenges: 1) fixed pipelines, which restrict their adaptability across different graph tasks under limited supervision; 2) weak evidence, which prevents them from explicitly incorporating contextual and structural anomaly signals into the detection process. In this paper, we propose a novel framework, self-designing agentic workflows for few-shot graph anomaly detection (SignGAD). Specifically, we propose a new paradigm that reformulates the graph anomaly detection task from training a fixed anomaly detector to designing task-conditioned detection workflows. By constructing detection workflows, SignGAD selects suitable graph encodings and detector designs to exploit task-specific anomaly evidence. Meanwhile, we introduce a guarded final refit strategy that refines the selected workflow by calibrating refit acceptance, enhancing reliability under limited supervision. Extensive experiments on several real-world datasets demonstrate that SignGAD achieves strong performance against state-of-the-art methods, highlighting its effectiveness on graph anomaly detection tasks.

2605.27469 2026-05-28 cs.LG cs.AI 版本更新

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

架构驱动的偏移:面向捕捉logit偏移趋势的轻量级选择器

Zhong Ye, Yu Hu, Ruilin Tang

发表机构 * School of Computer Science and Technology(计算机科学与技术学院) Guangdong University of Technology(广东工业大学) School of Computer Science and Engineering(计算机科学与工程学院) South China University of Technology(华南理工大学)

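AI summary Proposes Architecture-driven Shift (ADS) as a lightweight proxy for logit shift, for efficiently selecting pre-trained models in continual learning; the monotonic correlation between ADS and logit shift is derived theoretically and verified experimentally.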

详情
AI中文摘要

持续学习是一种利用深度预训练神经网络能力的实用范式,但哪个预训练模型能更好地平衡“可塑性-稳定性”、值得选择?logit偏移是一个自然的代理指标,因为它衡量了持续学习场景中模型在先前任务上的输出漂移。然而,获取logit偏移需要巨大的计算成本,阻碍了大规模模型选择。现有的理论分析由于假设均匀隐藏层宽度,忽略了实际架构的结构异质性(可变宽度和深度),无法提供有效的替代方案。这引发了一个关键问题:异构架构与在先前任务(模型已训练过的任务)上的logit偏移之间理论上存在什么关系?为了回答这个问题,我们将logit偏移解耦为架构依赖和数据依赖,建立我们的框架,揭示了两种依赖的组合(定义为架构驱动偏移,ADS)能够很好地捕捉logit偏移趋势,且只需少量数据样本即可计算。具体来说,对于在先前任务上优化良好的模型,较高的ADS与在当前任务训练后较大的logit偏移相关,这基于三个机制组件推导得出:(1)权重矩阵梯度关于层宽的谱范数缩放,(2)新任务的优化路径长度,以及(3)宽网络中的渐近任务冲突。跨越175多种不同架构的大量实证结果表明,ADS与logit偏移之间存在强单调相关性(最弱的Spearman相关系数$r_s=0.731$)。在实践中,我们证明了ADS可以作为预期校准误差的轻量级代理,预期校准误差是用于可靠持续学习模型选择的广泛使用的指标,在三个数据集的六个场景中得到了验证。

英文摘要

Continual Learning (CL) is a practical paradigm for exploiting the power of deep pre-trained neural networks, but which pre-trained model better balances "Plasticity-Stability" and therefore deserves to be chosen? The logit shift serves as a natural proxy because it measures how far the model's outputs on prior tasks drift in CL scenarios. However, obtaining the logit shift incurs a huge computational cost, which hinders large-scale model selection. Existing theoretical analyses fail to offer an efficient alternative because they assume uniform hidden-layer widths, ignoring the structural heterogeneity (variable width and depth) of real-world architectures. This raises a critical question: what theoretical relationship can be identified between heterogeneous architectures and the logit shift on prior tasks (those the model has already been trained on)? To answer this question, we decouple the logit shift into an architecture dependency and a data dependency to establish our framework, which reveals that the combination of the two dependencies, defined as the Architecture-driven Shift (ADS), captures the logit-shift tendency well and is computable from few data samples. Specifically, for a model that is well optimized on prior tasks, a higher ADS is associated with a larger logit shift after training on the current task, a result derived from three mechanistic components: (1) spectral norm scaling of weight matrix gradients with layer width, (2) the optimization path length of the new task, and (3) the asymptotic task conflict in wide networks. Extensive empirical results across more than 175 diverse architectures demonstrate a strong monotonic correlation (the weakest Spearman's $r_s=0.731$) between ADS and logit shift. Practically, we demonstrate on three datasets across six scenarios that ADS can serve as a lightweight proxy for the expected calibration error, a widely used metric for reliable CL model selection.
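
The monotonic relationship reported here is measured with Spearman's rank correlation, which is the Pearson correlation of the ranks. A dependency-free sketch follows; it assigns average ranks to ties, and the function names are ours rather than anything from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x)
    spread_y = sum((b - mean_y) ** 2 for b in rank_y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(spread_x * spread_y)
```

Because only the ordering matters, a proxy such as ADS can score a high r_s against logit shift without being linearly related to it, which is all that ranking candidate architectures requires.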

2605.27467 2026-05-28 cs.LG cs.AI cs.CV 版本更新

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

液态神经网络与LSTM在序列模式识别中的比较分析:鲁棒性、效率与临床实用性

Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

发表机构 * National Electronics and Computer Technology Center (NECTEC)(国家电子与计算机技术中心) Language Understanding Lab.(语言理解实验室)

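AI summary Compares Liquid Neural Networks (LNNs) with LSTMs on four types of sequential data and finds that LNNs offer better parameter efficiency and robustness, particularly in data-sparse clinical settings.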

Comments 9 pages, 7 figures, 6 tables, The conference paper will appear in Proceedings of JCSSE 2026

详情
AI中文摘要

传统的循环神经网络(RNN)和长短期记忆网络(LSTM)在离散时间步上运行,往往无法捕捉现实世界物理过程的流体时间动态。液态神经网络(LNN),特别是闭式连续时间(CfC)网络,通过将隐藏状态演化建模为连续微分方程来解决这一问题。在本文中,我们在四种不同的序列模态上进行了全面的基准测试研究:神经形态事件数据(N-MNIST)、基于笔画的绘图(QuickDraw)、视觉手写(IAM)和生理时间序列(PhysioNet Sepsis-3)。此外,我们使用时间丢弃法进行了严格的压力测试,以评估模型对缺失数据的鲁棒性。我们的研究结果表明,LNN在原生时间域和数据稀疏普遍的临床环境中,始终提供优越的参数效率和显著更高的鲁棒性。本扩展预印本提供了关于相关数据集和LNN理论谱系的额外背景,并附有详细附录,记录了我们的完整实现和实验设置。

英文摘要

Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.
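
The temporal-dropout stress test mentioned in this abstract can be pictured as randomly deleting time steps from a sequence before it reaches the model. The abstract gives no procedure, so the sketch below assumes independent per-step removal with probability `drop_prob` and keeps the surviving timestamps, which a continuous-time model can use and a fixed-step LSTM cannot; the function name and signature are hypothetical.

```python
import random


def temporal_dropout(timestamps, values, drop_prob, rng=None):
    """Drop each time step independently with probability drop_prob.

    Returns the surviving (timestamps, values) in their original order.
    Pass a seeded random.Random as rng for reproducible stress tests.
    """
    if len(timestamps) != len(values):
        raise ValueError("timestamps and values must have equal length")
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so drop_prob=0 keeps everything and 1 drops everything.
    kept = [(t, v) for t, v in zip(timestamps, values) if rng.random() >= drop_prob]
    return [t for t, _ in kept], [v for _, v in kept]
```

Sweeping `drop_prob` from 0 upward while re-evaluating a trained model gives a robustness curve of the kind such a stress test would compare across architectures.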

2605.27466 2026-05-28 cs.MA cs.AI cs.LG stat.ML 版本更新

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

AgensFlow:多智能体系统的协调策略基础

Nicole Koenigstein

发表机构 * Independent researcher(独立研究者)

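AI summary Proposes the AgensFlow framework, which treats multi-agent coordination as an online policy-learning problem and optimizes coordination through learnable routing; evaluations on distributed-systems incident tasks and security-advisory tasks show it outperforms a fixed-pipeline baseline.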

Comments 7 pages, 4 figures, 4 tables. Code and reproducible evaluations available at: https://github.com/Nicolepcx/AgensFlow

详情
AI中文摘要

基于大语言模型(LLM)构建的多智能体系统需要许多难以先验固定的协调选择:调用哪个技能协议、哪个智能体角色应执行子任务、每个角色绑定哪个模型、角色之间如何交互、何时使用检索或验证,以及何时完全省略某个步骤。这些选择与任务机制和操作约束相互影响,因此静态管道和一次性模型比较只能提供设计空间的有限视角。本文介绍AgensFlow,一个开源框架,将多智能体协调视为部分可观测下的在线策略学习问题。该框架使协调决策可观测且可从重复轨迹中学习,而不是将技能、角色、模型、拓扑和评估选择视为固定的管道设计。AgensFlow在两个语料库上进行了评估:分布式系统事件任务和安全咨询任务。评估展示了三个主要结果:在协调密集型任务上,学习路由比固定管道基线达到更高质量的操作点;skip:X将拓扑压缩隔离为基础的有意义部分;热启动策略图可以在保持平台质量的同时减少探索成本。总体而言,结果支持学习型可审计路由可以改善静态布线下的协调密集型多智能体工作流。

英文摘要

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one-off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open-source framework that treats multi-agent coordination as an online policy-learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed-systems incident tasks and security-advisory tasks. The evaluation shows three main results: learned routing reaches a higher-quality operating point than a fixed pipeline baseline on coordination-heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm-started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination-heavy multi-agent workflows over static wiring.

2605.27465 2026-05-28 cs.CV cs.AI 版本更新

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge: 面向视觉Transformer无训练加速的显著性感知自适应令牌合并

Semi Lee, Hyejin Go, Hyesong Choi

发表机构 * Electronic Engineering(电子工程) Soongsil University(崇实大学)

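AI summary Proposes the AdaMerge framework, which uses two complementary mechanisms, salience-weighted similarity and adaptive merging intensity, to advance the accuracy-compute Pareto frontier of token merging without training.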

Comments 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

视觉Transformer(ViT)中自注意力的二次计算成本构成了实际部署的基本瓶颈,激发了令牌缩减方面的活跃研究。在现有方法中,令牌合并(ToMe)已成为一种优雅的无训练解决方案;然而,其设计基于令牌平等的隐含前提,这与自注意力已充分证明的非均匀性相悖,并在激进压缩下导致高显著性令牌的信息丢失。我们通过AdaMerge解决了这一局限,该框架基于两个互补机制。首先,显著性加权相似度利用列式特征亲和度中心性作为令牌重要性代理,并将所得显著性分数纳入二分匹配分数,确保关键令牌对合并表示贡献更大。其次,自适应合并强度使用预先计算的逐层相似度统计量,根据输入特定的冗余性动态调整每层缩减数量。在ImageNet-1k上使用ViT-B/16,AdaMerge在所有FLOPs匹配条件下均持续优于ToMe、PiToMe和DSM。精度差距随压缩单调增大:在13.4G FLOPs操作点,AdaMerge的Top-1下降仅为-1.06%,而PiToMe为-1.45%,DSM为-4.62%。据我们所知,AdaMerge是首个将显著性加权相似度和自适应逐层缩减结合到单一无训练令牌合并框架中的方法,推动了ViT加速的精度-FLOPs帕累托前沿。

英文摘要

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.

2605.27464 2026-05-28 cs.CV cs.AI 版本更新

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

超越运动基元:基于头戴式IMU的行为活动识别

Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab, Harvard University(哈佛人工智能与机器人实验室,哈佛大学)

AI总结 提出HiT-HAR层次模型,利用头戴式IMU数据实现行为级活动识别,超越传统运动基元,在五类动作和八类场景识别中优于现有模型。

AI中文摘要

AR智能眼镜需要连续的行为上下文来提供主动辅助,但其最实用的常开传感器——头戴式惯性测量单元(IMU)仅能检测行走或站立等运动基元。我们突破运动基元,实现行为级识别,定义了五个类别以平衡AR应用需求与传感器可观测性。为此,我们构建了一个包含16万样本的Ego4D数据集,采用四层质量保证框架覆盖8个活动场景,并提出了HiT-HAR,一个70.3万参数的层次模型,在五类动作和八类场景识别中优于先前的头戴式IMU模型。我们通过每类可分离性分析进一步绘制了头戴式IMU的可观测性边界,识别出哪些行为类别可靠可观测(移动),哪些受益于时间上下文(物体传递、任务操作),以及哪些场景依赖的信号重叠仍构成挑战。我们的结果表明,利用时间上下文和场景结构的架构选择优于单纯扩大模型规模。代码和数据集公开于https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR。

英文摘要

AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR.

2605.27463 2026-05-28 stat.ME cs.AI stat.AP 版本更新

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

当提示扰动破坏你的A/B测试:一种用于生成式调查的有效统计检验

Hayden Helm, Carey Priebe

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对生成式调查中LLM对提示设计敏感的问题,提出一种置换检验方法,在包含扰动结构的统计模型下保持有效性,并给出预算分配建议。

AI中文摘要

生成式调查——利用基于LLM的角色集合对消息提供反馈——已成为传统市场研究的廉价且可扩展的替代方案。然而,LLM对提示设计中的微小变化很敏感,从生成式调查中得出的结论可能依赖于任意的措辞选择。控制这种敏感性需要在分析中包含语义等价的扰动。在本文中,我们表明,在包含现实扰动结构的生成式调查统计模型下,标准假设检验(包括符号检验和Wilcoxon符号秩检验)是无效的。我们提出了一种在该模型下有效的置换检验,并正式刻画了标准检验失效的条件。将我们的框架应用于一个简单的生成式调查问题,我们估计了相关参数,刻画了置换检验在现实条件下的功效,并提供了关于在角色、扰动和重复之间分配预算的实用指导。最后,我们表明,即使在同一个模型家族内,估计效应的大小和方向都对模型选择敏感。

英文摘要

Generative surveying -- where collections of LLM-based personas provide feedback on messages -- has emerged as a cheap and scalable alternative to traditional market research. However, LLMs are sensitive to small variations in prompt design, and conclusions drawn from generative surveys may depend on arbitrary phrasing choices. Controlling for this sensitivity requires including semantically equivalent perturbations in the analysis. In this paper, we show that standard hypothesis tests, including the sign test and Wilcoxon signed-rank test, are invalid under a statistical model for generative surveying that includes realistic perturbation structure. We propose a permutation test that is valid under this model and formally characterize the conditions under which standard tests fail. Applying our framework to a simple generative surveying problem, we estimate relevant parameters, characterize the power of the permutation test under realistic conditions, and provide practical guidance on budget allocation across personas, perturbations, and replicates. Finally, we show that both the magnitude and direction of the estimated effect are sensitive to the choice of model, even within the same model family.
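
Illustrative sketch (not the authors' test): the abstract does not specify how the proposed permutation test is constructed. The Python below shows one generic way a permutation test can respect grouped structure, by flipping the sign of all paired A/B differences within a cluster together instead of flipping each observation independently. The function name `cluster_sign_flip_test`, the choice of the cluster (for example, one prompt perturbation) as the exchangeable unit, and the mean-difference statistic are all assumptions made for illustration.

```python
import random

def cluster_sign_flip_test(diffs_by_cluster, n_permutations=2000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    diffs_by_cluster: list of lists; each inner list holds the paired
    differences (score_A - score_B) observed within one cluster.
    Assumption: under the null, each cluster's differences are jointly
    symmetric about zero, so a whole cluster shares one random sign.
    """
    rng = random.Random(seed)
    cluster_sums = [sum(d) for d in diffs_by_cluster]
    n = sum(len(d) for d in diffs_by_cluster)
    observed = abs(sum(cluster_sums)) / n  # |mean difference|
    hits = 0
    for _ in range(n_permutations):
        permuted = sum(s if rng.random() < 0.5 else -s for s in cluster_sums)
        if abs(permuted) / n >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: 10 clusters, each with three positive differences.
p_value = cluster_sign_flip_test([[1.0, 0.8, 1.2]] * 10)
```

Flipping at the cluster level means that correlated observations inside a cluster are never treated as independent evidence, which is the kind of failure the abstract attributes to the sign test and the Wilcoxon signed-rank test.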

2605.27449 2026-05-28 cs.IR cs.AI 版本更新

Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

用更好的检索核查事实:用于证据检索的动态对比学习

Zhongtian Hua, Yi Luo, Meijia Yu, Yingjie Han

发表机构 * Zhengzhou University(郑州大学) Henan University of Science and Technology(河南科技大学)

AI总结 提出动态自适应对比学习方法DACLR,通过事件级特征提取、两阶段检索和动态对比损失优化,提升多模态证据检索的准确性。

AI中文摘要

在多模态事实核查领域,从不同模态检索证据的准确性对下游声明验证过程有显著影响。现有的通用多模态检索方法通常基于语义构建,导致检索到的证据与声明相似但不相关。本文提出了一种用于证据检索的动态自适应对比学习方法(DACLR)来解决这些问题。DACLR首先使用多模态大语言模型(MLLM)将多模态证据和声明统一转换为文本模态,并在事件级别提取这些信息的特征。然后,通过召回-重排序的两阶段检索方法进行证据检索。DACLR通过优化对比损失和挖掘难负样本,增强了检索阶段模型的事件感知能力。具体而言,DACLR基于InfoNCE损失在语义和事件两个层次设计了三个损失函数,并对应设置了三组难负样本候选。模型根据批内样本的准确性监督信号动态调整比例,使模型在不遗忘语义检索能力的情况下,学习声明与正样本在事件层面的相关性。大量的对比和消融实验证明了DACLR及其内部优化方法的有效性。进一步的研究也证明了DACLR在多模态证据检索领域的优势。

英文摘要

In the field of multimodal fact checking, the accuracy of retrieving evidence from different modalities has a significant impact on the downstream claim verification process. Existing general multimodal retrieval methods are often constructed based on semantics, so the retrieved evidence is similar to the claim but not relevant to it. This paper proposes a Dynamic Adaptive Contrastive Learning method for evidence Retrieval, called DACLR, to address these issues. DACLR first uses a Multimodal Large Language Model (MLLM) to uniformly convert multimodal evidence and claims into the text modality, and extracts features of this information at the event level. Then, it conducts evidence retrieval through a two-stage recall-rerank method. DACLR enhances the model's event perception ability in the retrieval stage by optimizing the contrastive loss and mining hard negative samples. Specifically, DACLR designs three loss functions at two levels (semantic and event) based on the InfoNCE loss. Corresponding to these, three sets of hard negative sample candidates are set up. The model dynamically adjusts the ratio based on the accuracy supervision signal of intra-batch samples, allowing the model to learn the correlation between claims and positive samples at the event level without forgetting the semantic retrieval ability. Extensive comparison and ablation experiments demonstrate the effectiveness of DACLR and its internal optimization methods. Further research also demonstrates the advantages of DACLR in the field of multimodal evidence retrieval.
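
Illustrative sketch: the abstract builds its three losses on InfoNCE but does not give their exact form. The Python below shows the standard single-query InfoNCE loss with explicit hard negatives, using cosine similarity and a temperature. The function names, the temperature value of 0.05, and the toy vectors are assumptions. The paper's event-level variants and its dynamic ratio adjustment are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive
    among the positive plus all negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# Hypothetical 2-d embeddings: a hard negative lies close to the query,
# an easy negative points away from it.
claim = [1.0, 0.0]
evidence = [1.0, 0.1]
loss_easy = info_nce(claim, evidence, [[-1.0, 0.0]])
loss_hard = info_nce(claim, evidence, [[1.0, 0.2]])
```

A negative that is close to the query yields a larger loss than a distant one, which is why mining hard negatives gives the retriever a stronger training signal.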

2605.27445 2026-05-28 cs.IR cs.AI 版本更新

RAGe: A Retrieval-Augmented Generation Evaluation Framework

RAGe:一种检索增强生成评估框架

Larissa Guder, João Pedro de Moura, Arthur Accorsi, Gustavo Losch do Amaral, Maurício Cecílio Magnaguagno, Felipe Meneguzzi, Marcio Sorraglia Pinho, Dalvan Griebler

发表机构 * School of Technology, Pontifical Catholic University of Rio Grande do Sul(南里奥格兰德天主教大学技术学院)

AI总结 提出模块化框架RAGe,通过资源遥测和组件推荐,评估检索增强生成应用在准确性、效率和可扩展性之间的权衡,支持领域特定数据集的最佳组件选择。

AI中文摘要

部署大型语言模型(LLM)应用,特别是那些依赖检索增强生成(RAG)的应用,仍然具有挑战性,原因是计算需求高、知识库过时以及需要手动选择最优流水线组件。在这项工作中,我们提出了一个模块化框架,通过关注资源遥测和组件推荐,为基准测试和指导RAG应用的高效开发提供支持,建议针对特定领域数据集的最佳组件。我们的方法利用LLM应用中的核心技术,包括文档分块、向量数据库、嵌入模型和检索器,来评估准确性、效率和可扩展性之间的权衡。通过将检索和生成质量与底层硬件约束直接关联,RAGe帮助研究人员识别最有效、特定领域的RAG设置,以满足其特定操作需求,即使在消费级硬件上也能促进快速原型开发。

英文摘要

Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging due to high computational demands, outdated knowledge bases, and the need to manually select optimal pipeline components. In this work, we propose a modular framework for benchmarking and guiding the efficient development of RAG applications by focusing on resource telemetry and component recommendation, suggesting the best components for a domain-specific dataset. Our approach leverages core techniques in LLM applications, including document chunking, vector databases, embedding models, and retrievers, to evaluate trade-offs among accuracy, efficiency, and scalability. By directly correlating retrieval and generation quality with underlying hardware constraints, RAGe helps researchers identify the most effective, domain-specific RAG setups for their operational needs, facilitating rapid prototyping even on consumer-grade hardware.

2605.27444 2026-05-28 cs.IR cs.AI 版本更新

A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations

检索增强生成与语言模型在太空操作中的系统评估

Ruben Belo, Marta Guimarães, Cláudia Soares

发表机构 * NOVA LINCS Neuraspace Technical University of Munich(慕尼黑工业大学)

AI总结 本文系统评估了结合大语言模型与信息检索技术的检索增强生成管道在太空操作中提取和综合领域知识的效果,比较了不同检索策略、嵌入模型和LLM回答对信息准确性、相关性和可靠性的影响。

AI中文摘要

太空活动的迅速扩展导致了技术文档、操作指南和科学文献的空前积累,给太空操作中的及时决策带来了挑战。太空操作中的有效管理需要能够高效处理庞大且异构信息源的工具。本文系统评估了检索增强生成(RAG)管道的性能,该管道结合了大语言模型(LLM)与信息检索技术,用于从领域特定文档中提取和综合可操作的知识。我们比较了各种检索策略、嵌入模型和LLM回答,以评估它们对信息准确性、相关性和可靠性的影响。我们的结果表明,RAG管道可以显著增强知识访问、减少不确定性,并支持复杂太空操作中的决策。

英文摘要

The rapid expansion of space activities has led to an unprecedented accumulation of technical documentation, operational guidelines, and scientific literature, creating challenges for timely decision-making in space operations. Effective management in space operations requires tools capable of efficiently processing vast and heterogeneous information sources. This paper systematically evaluates the performance of Retrieval Augmented Generation (RAG) pipelines, combining Large Language Models (LLMs) with information retrieval techniques for extracting and synthesizing actionable knowledge from domain-specific documents. We compare various retrieval strategies, embedding models, and LLM answers to assess their impact on information accuracy, relevance, and reliability. Our results demonstrate that RAG pipelines can significantly enhance knowledge access, reduce uncertainty, and support decision-making in complex space operations.

2605.27440 2026-05-28 cs.IR cs.AI 版本更新

Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

生产检索增强商业推荐中的释义脆弱性:低于重运行稳定性基线的可重复性

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

AI总结 研究发现AI助手对买家问题的细微措辞变化(如“最佳CRM” vs “顶级CRM”)产生显著不同的品牌推荐,其推荐集相似度(Jaccard)远低于相同提示的重运行基线,挑战了当前AEO/GEO实践的有效性。

AI中文摘要

买家提问方式的小变化——例如“最佳CRM” vs “顶级CRM” vs “SaaS初创公司的最佳CRM”——会导致AI助手推荐截然不同的品牌。在OpenAI和Anthropic模型上进行的约6,000次释义运行和约6,000次相同提示重运行对照中,相同购买意图的两个释义之间的推荐集相似度(Jaccard)对于措辞性改写为0.288(聚类95% CI [0.215, 0.361]),对于添加约束的改写为0.135([0.098, 0.175],合并区域/语言和特异性阶梯轴)——两者均远低于0.50-0.61的相同提示重运行基线。提示字符串(而非底层购买意图)是决定哪些品牌出现的主要输入。增加推理努力并未缩小差距(界限为+/-0.05)。这对日益流行的AEO/GEO实践构成了直接挑战。通过固定提示集上统计品牌提及次数来追踪品牌的“AI可见性”,会产生一个度量,其方差的主要来源是追踪器恰好发出的释义,而非模型对品牌的行为:相同购买意图的两个自然释义产生的推荐集Jaccard重叠率为14-29%,而相同提示重运行则为50-61%。原则上,对每个意图采样更多释义可以减少这种伪影,学术界也存在高效的多提示评估方法,但自然买家措辞空间远大于这些方法已验证的基准规模提示集,且远超任何商业追踪器对每个品牌-意图组合发出的提示数量。因此,逐提示的提及追踪作为测量单位在结构上是不稳定的;有意义的改进可能需要不同的单位,而非更大的提示集。

英文摘要

Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) -- both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand's "AI visibility" by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model's behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.
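
Illustrative sketch: the similarity measure used throughout this abstract is the Jaccard index, the size of the intersection of two sets divided by the size of their union. The Python below computes it for two recommendation sets. The brand names are placeholders, not data from the study, and returning 1.0 for two empty sets is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Placeholder recommendation sets for two paraphrases of one buying intent.
paraphrase_a = ["BrandA", "BrandB", "BrandC", "BrandD"]
paraphrase_b = ["BrandA", "BrandB", "BrandE", "BrandF", "BrandG"]
similarity = jaccard(paraphrase_a, paraphrase_b)  # 2 shared brands out of 7 distinct
```

In this toy case two lists that share two of their brands score 2/7, so a Jaccard value near 0.29 already means most recommended brands differ between the two phrasings.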

2605.27439 2026-05-28 cs.IR cs.AI 版本更新

Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit

检索增强的商业推荐中的显著性分层失败模式:一项37,000次运行的审计

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

AI总结 通过对37,000次生产运行进行审计,研究了AI助手在商业推荐中按品牌显著性层级分层的失败模式,发现不同层级品牌面临不同的挑战(如可见性、转化率、替代效应),并指出营销策略需根据品牌在显著性阶梯上的位置定制。

AI中文摘要

像ChatGPT和Claude这样的AI助手是推荐引擎,而非搜索引擎:它们通过直接提名品牌来回答商业查询,而不是返回链接列表。因此,面向AI的营销是一个比“出现在搜索中”更广泛的问题——定位、内容和产品适配性与可发现性同样重要。我们对四种模型配置和215个商业框架提示(涵盖19个行业)进行了约37,000次生产运行审计,并对照一个包含533个品牌的参考目录(分为五个显著性层级:L1类别领导者到L5区域玩家)进行评估,该目录来自外部权威列表。这个阶梯代理了品牌在其行业内的认知度足迹,而非收入或市场份额。失败模式因层级而异。L1品牌几乎出现在所有相关检索中,但仅赢得25-41%的推荐位置——杠杆在于差异化,而非可见性。L2挑战者拥有所有层级中最高的转化率(37-52%),但在Anthropic模型上因角色中介的替代而失败。L3中端市场品牌是转折点:总覆盖率降至88%,转化率降至34-40%,角色效应达到峰值。L4专家和L5区域玩家面临灾难性的不可见性——48-52%从未在37,000次运行中出现。没有统一的优化方案能胜出;正确的营销投资取决于品牌在显著性阶梯上的位置。

英文摘要

AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating brands rather than returning a list of links. Marketing to AI is therefore a broader problem than "show up in search" -- positioning, content, and product fit matter as much as discoverability. We audit ~37,000 production runs across four model configurations and 215 commercially-framed prompts spanning 19 sectors, evaluated against a 533-brand reference catalog stratified into five prominence tiers (L1 category leaders to L5 regional players) sourced from external authority lists. The ladder proxies a brand's awareness footprint within its sector, not revenue or market share. The failure mode differs sharply by tier. L1 brands appear in nearly every relevant retrieval but win only 25-41% of the recommendation slots they reach -- the leverage is differentiation, not visibility. L2 challengers carry the highest conversion rates of any tier (37-52%) but lose to persona-mediated substitution on the Anthropic models. L3 mid-market brands are the inflection level: aggregate coverage drops to 88%, conversion to 34-40%, and persona effects peak. L4 specialists and L5 regional players face catastrophic invisibility -- 48-52% never surface in any of the 37,000 runs. No uniform optimization recipe wins; the right marketing investment depends on where the brand sits on the prominence ladder.

2605.27437 2026-05-28 cs.IR cs.AI 版本更新

MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents

MGRetrieval: 面向长期对话代理的记忆引导反思检索

Tan Wang, Yunwei Dong

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 提出MGRetrieval方法,通过记忆引导的反思检索构建精确检索路径,逐步积累关键记忆,提升长期对话代理中相关证据的检索效果。

AI中文摘要

大型语言模型(LLMs)在对话方面取得了显著进展,但冗余的记忆上下文严重限制了它们在长期对话代理中的有效性。外部记忆系统已被提出以改善记忆维护。然而,这些系统主要依赖一次性检索,限制了它们检索充分且相关证据的能力。尽管最近的方法将反思引入检索,但其检索路径由LLM基于有限证据生成,导致检索不稳定和额外的延迟开销。为了解决这些限制,我们提出了MGRetrieval,一种将反思检索基于历史记忆语义结构的检索策略。具体来说,MGRetrieval包含两个步骤:(1)它参考历史记忆的结构来构建更精确的检索路径。(2)LLM保留关键记忆,并判断累积的记忆是否足以停止进一步的迭代检索。这使得检索过程能够遵循语义上有意义的路径。通过记忆引导检索和关键记忆传播,MGRetrieval逐步构建简洁且充分的记忆上下文。在LoCoMo上的大量实验表明,在Qwen2.5-14B和Qwen3-14B上,MGRetrieval在F1和BLEU-1上平均分别比最强基线高出8.91%和11.11%,同时保持实用的令牌和延迟成本。代码可在https://anonymous.4open.science/r/MGRetrieval找到。

英文摘要

Large Language Models (LLMs) have made significant progress in dialogue, yet redundant memory contexts severely limit their effectiveness in long-term dialogue agents. External memory systems have been proposed to improve memory maintenance. However, these systems mainly rely on one-shot retrieval, which limits their ability to retrieve sufficient and relevant evidence. Although recent methods introduce reflection into retrieval, their retrieval paths are generated by the LLM from limited evidence, leading to unstable retrieval and additional latency overhead. To address these limitations, we propose MGRetrieval, a retrieval strategy that grounds reflective retrieval in the semantic structure of historical memories. Specifically, MGRetrieval consists of two steps: (1) It references the structure of historical memories to construct a more precise retrieval path. (2) The LLM retains critical memories and determines whether accumulated memories are sufficient to stop further iterative retrieval. This allows the retrieval process to follow semantically meaningful paths. Through memory-guided retrieval and critical memory propagation, MGRetrieval gradually constructs concise and sufficient memory contexts. Extensive experiments on LoCoMo show that MGRetrieval outperforms the strongest baseline by 8.91% in F1 and 11.11% in BLEU-1 on average across Qwen2.5-14B and Qwen3-14B, while maintaining practical token and latency costs. The code can be found at https://anonymous.4open.science/r/MGRetrieval.

2605.27436 2026-05-28 cs.IR cs.AI cs.CV 版本更新

RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?

RE-TRIANGLE:TRIANGLE 在检索中能否实现超越余弦相似度的多模态对齐?

Arijit Ghosh, Aritra Bandyopadhyay, Chiranjeev Bindra, Jingfen Qiao

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 本文复现 TRIANGLE 框架,验证其通过最小化超球面上模态三元组面积实现多模态对齐的几何目标在检索任务中的鲁棒性,发现其在零样本设置下优于成对基线,但存在优化不稳定和领域依赖问题。

AI中文摘要

多模态对齐对于弥合信息检索中的语义鸿沟至关重要。然而,传统的成对策略存在几何盲点:虽然它们将锚定模态(如文本)与其他模态对齐,但缺乏强制外围模态(如视频和音频)之间相互一致性的约束。TRIANGLE 框架通过最小化超球面上模态三元组的面积来实现整体对齐。在这项可重复性研究中,我们验证了该几何目标在检索任务中的鲁棒性。我们确认 TRIANGLE 在零样本设置下优于成对基线,Recall@1 提升高达 +8.7 个百分点,但收益依赖于领域。然而,我们未能复现报告中的从头学习结果。使用合成玩具数据集的分析表明,这是由于联合优化几何对齐与数据-文本匹配(DTM)损失时的不稳定性。此外,我们发现余弦正则化主要稳定文本到视频检索,而使用领域监督进行微调会放大几何收益但降低跨数据集泛化能力。我们的发现支持了几何对齐的有效性,同时突出了关键的优化敏感性。代码可在 https://github.com/ARIJIT00171/RE-TRIANGLE 获取。

英文摘要

Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero-shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain-dependent. However, we fail to reproduce the reported learning-from-scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data-Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text-to-video retrieval, and fine-tuning with domain supervision amplifies geometric benefits but reduces cross-dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at https://github.com/ARIJIT00171/RE-TRIANGLE.
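
Illustrative sketch: the abstract describes the objective as minimizing the area of a modality triplet on a hypersphere. The Python below computes the area of the flat triangle whose vertices are three L2-normalized embeddings, using the Gram-determinant formula. Treating the quantity as this flat-triangle area, and the function names, are assumptions. The exact loss used by TRIANGLE or by the reproduction may differ.

```python
import math

def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def triangle_area(a, b, c):
    """Area of the triangle spanned by three L2-normalized embeddings.

    With edge vectors u = b - a and v = c - a, the area is
    0.5 * sqrt(|u|^2 |v|^2 - (u . v)^2).
    """
    a, b, c = normalize(a), normalize(b), normalize(c)
    u = [x - y for x, y in zip(b, a)]
    v = [x - y for x, y in zip(c, a)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    # Clamp at zero to absorb tiny negative values from rounding.
    return 0.5 * math.sqrt(max(uu * vv - uv * uv, 0.0))

# Hypothetical text, video, and audio embeddings that are mutually orthogonal.
area_misaligned = triangle_area([1, 0, 0], [0, 1, 0], [0, 0, 1])
# Three embeddings pointing the same way collapse the triangle.
area_aligned = triangle_area([1, 0], [2, 0], [3, 0])
```

Unlike a pairwise cosine loss anchored on text, the area shrinks to zero only when all three embeddings coincide, so it also constrains the video-audio pair.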

2605.27435 2026-05-28 cs.AR cs.AI 版本更新

When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference

当NPU并非总是更快:移动LLM推理的阶段级分析

Pu Li, Jiawen Qi, Qinyu Chen

发表机构 * Leiden Institute of Advanced Computer Science (LIACS)(莱顿先进计算机科学研究所)

AI总结 通过OPMASK控制管道分解方法,在CPU-NPU异构SoC上对移动LLM推理进行阶段感知的多级基准测试,发现Prefill阶段CPU比NPU快1.6倍,Decode阶段NPU仅提供1.05-1.2倍加速,且NPU卸载增加能耗高达51%。

AI中文摘要

在移动设备上部署大型语言模型(LLM)越来越依赖于异构执行,然而,尚无先前研究在算子和管道级别系统性地描述NPU的有效性。我们首次在CPU-NPU异构SoC上对移动LLM推理进行了阶段感知的多级基准测试研究。我们引入了一种基于OPMASK的受控管道分解方法,该方法隔离了NPU执行路径中的通信、量化和计算开销。我们的结果揭示了反直觉的阶段级性能反转:在计算密集型的Prefill阶段,CPU性能优于NPU(高达1.6倍),而在内存受限的Decode阶段,NPU仅提供有限的加速(1.05-1.2倍)。我们进一步表明,调度开销和跨后端回退降低了NPU卸载的实际收益。在能耗趋势方面,增加NPU卸载会导致更高的能耗(高达51%)。基于这些发现,我们为面向设备上LLM推理的NPU架构师推导出了设计指南。

英文摘要

Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline level. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition methodology that isolates communication, quantization, and computation overheads within the NPU execution path. Our results reveal a counter-intuitive stage-level performance reversal: CPUs outperform NPUs in the compute-intensive Prefill stage (up to 1.6x), while NPUs provide only limited acceleration in the memory-bound Decode stage (1.05-1.2x). We further show that scheduling overhead and cross-backend fallback reduce the practical benefits of NPU offloading. In terms of energy, increasing NPU offloading leads to higher energy consumption (up to 51%). Based on these findings, we derive design guidelines for NPU architects targeting on-device LLM inference.

2605.27433 2026-05-28 cs.MA cs.AI 版本更新

Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market

数据服务市场的异构多智能体建模与测量及网络分析

Deyu Zhou, Yuwei Guo, Xudong Lu, Linhao Zhang, Wei Guo, Lizhen Cui

发表机构 * School of Software, Shandong University(山东大学软件学院) Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University(山东大学人工智能联合研究中心) College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院)

AI总结 本文提出一种基于异构多智能体建模的数据服务市场测量与网络分析方法,通过引入服务生态系统理论明确参与者与外部因素,基于价值创造对三级实体进行效用测量,并设计分析框架评估异构网络对效用的影响。

AI中文摘要

随着各种社会实体之间协作以及用户需求的日益复杂,影响数据服务市场稳定发展的因素也在增加。这些因素包括信息广泛传播增强主观意识、智能水平持续提升以及结构关系复杂化。为了实现数据服务市场的有效治理和监管,在做出监管决策之前进行仿真实验至关重要。然而,当前对数据服务市场的研究和分析主要集中在数据层面的性能,在涉及数据服务市场中多个异构实体的测量与分析以及各种社会要素的整合时显得不足。基于此,本文创新性地提出了一种基于异构多智能体建模的数据服务市场测量与网络分析方法。通过引入服务生态系统理论,我们明确了数据服务市场的参与者和外部因素,并基于价值创造对三级实体进行效用测量。此外,设计了一种分析方法来精确评估异构网络对效用的影响。最后,通过实验结果分析验证了所提方法的有效性。

英文摘要

With the increasing complexity of collaboration among various social entities and user demands, the factors affecting the stable development of the data service market are also growing. These factors include the widespread dissemination of information enhancing subjective consciousness, the continuous improvement in intelligence, and the complexification of structural relationships. To achieve effective governance and regulation of the data service market, it is crucial to conduct simulation experiments before making regulatory decisions. However, current research and analysis of the data service market primarily focus on data-level performance, proving inadequate when it comes to measurement and analysis of multiple heterogeneous entities and the integration of various social elements within the data service market. Based on this, this paper innovatively proposes a data service market measurement and network analysis method based on heterogeneous multi-agent modeling. By introducing the service ecosystem theory, we clarify the participants and external factors of the data service market and conduct utility measurements for three-level entities based on value creation. Furthermore, an analytical methodology is devised to precisely assess the influence of heterogeneous networks on utility. Finally, the paper verifies the effectiveness of the proposed method through the analysis of experimental results.

2605.27432 2026-05-28 cs.IR cs.AI 版本更新

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

FD-RAG: 联邦双系统检索增强生成

Tianhao Gao, Kai Yang, Yiyang Li

发表机构 * School of Computer Science and Technology(计算机科学与技术学院)

AI总结 提出FD-RAG框架,通过解耦轻量级记忆访问与按需LLM推理,并利用语义感知自适应超图蒸馏为紧凑QA记忆,在联邦设置下实现高效、隐私保护的检索增强生成。

AI中文摘要

检索增强生成(RAG)已成为将大型语言模型锚定于外部知识的范式,然而现有大多数RAG系统假设集中式知识访问和充足的计算资源。这些假设在边缘环境中失效,因为知识分散在设备上,原始数据无法共享,且重复调用LLM成本过高。我们提出FD-RAG,一种联邦双系统RAG框架,将轻量级记忆访问与按需LLM推理解耦,以实现去中心化部署。具体而言,FD-RAG在本地语料库上学习语义感知的自适应超图,并将其蒸馏为紧凑的QA记忆。在推理时,它通过直接记忆匹配回答覆盖良好的查询,仅在必要时调用基于LLM的推理,同时将检索到的记忆追溯至超图支撑的证据。为了缓解跨设备的知识碎片化,FD-RAG在不暴露原始文档的情况下聚合跨设备的匿名记忆。在QA基准上的实验表明,与强本地和联邦基线相比,FD-RAG将准确率提升高达7.8%,同时延迟降低8.4倍。我们还提供了理论分析,建立了所提出的超图学习的$\mathcal{O}(1/ε^{2})$收敛率,支持其在边缘环境中的可处理部署。

英文摘要

Retrieval-augmented generation (RAG) has emerged as a paradigm for grounding large language models in external knowledge, yet most existing RAG systems assume centralized knowledge access and ample computation. These assumptions break down in edge environments, where knowledge is fragmented across devices, raw data cannot be shared, and repeated LLM calls are prohibitively expensive. We propose FD-RAG, a federated dual-system RAG framework that decouples lightweight memory access from on-demand LLM reasoning for decentralized deployment. Specifically, FD-RAG learns semantic-aware adaptive hypergraphs over local corpora and distills them into compact QA memories. At inference time, it answers well-covered queries via direct memory matching and invokes LLM-based reasoning only when necessary, while tracing retrieved memories to hypergraph-grounded evidence. To mitigate cross-device knowledge fragmentation, FD-RAG aggregates anonymized memories across devices without exposing raw documents. Experiments on QA benchmarks show that FD-RAG improves accuracy by up to 7.8% while reducing latency by 8.4x compared with strong local and federated baselines. We also provide theoretical analysis establishing an $\mathcal{O}(1/\epsilon^{2})$ convergence rate for the proposed hypergraph learning, supporting its tractable deployment in edge settings.

2605.27431 2026-05-28 cs.LG cs.AI 版本更新

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

应对多模态学习挑战的混合专家方法:综述

Liangwei Nathan Zheng, Wei Emma Zhang, Olaf Maennel, Lin Yue, Weitong Chen

发表机构 * Adelaide University(阿德莱德大学)

AI总结 本文综述了混合专家(MoE)如何通过高效扩展、表示学习和自适应适配解决多模态学习中的可扩展性、异质性和数据不完美等核心挑战。

Comments This survey paper has been accepted by IJCAI 2026; results were released by 30 April 2026. As I could not find a dedicated place to submit the acceptance email, I have uploaded it alongside the LaTeX files of the paper as Acceptance_email.pdf.

AI中文摘要

混合专家(MoE)为多模态学习提供了一个自然兼容且可扩展的框架,在不同模态和任务中展现出强大的适应性。尽管其日益成功,但关于MoE方法解决多模态挑战的全面系统综述仍然缺乏。现有综述往往从方法分类学角度独立评估多模态学习或MoE,忽视了它们之间的独特相互作用。本综述通过回答一个核心问题来填补这一空白:MoE如何有效解决多模态挑战?我们从三个关键视角进行探讨:(1)MoE作为高效多模态引擎:通过将计算成本与参数增长解耦,并通过选择性专家激活减轻模态冗余,实现可扩展的多模态建模;(2)MoE作为多模态表示学习器:整合互补的多意见专家知识,丰富对齐和交互表示;(3)MoE作为多模态适配器:提供模块化和灵活的机制,以建模不完美数据场景,如模态不平衡和模态缺失。通过广泛的文献综述,我们识别出关键研究空白,包括可解释路由、专家通信、模态集成和终身多模态学习。我们将本综述定位为未来研究的基础,旨在构建可解释且可持续的多模态混合专家系统。

英文摘要

Mixture-of-Experts (MoE) presents a naturally compatible and scalable framework for multimodal learning, demonstrating strong adaptability across diverse modalities and tasks. Despite its growing success, a comprehensive and systematic review of MoE methods addressing multimodal challenges remains lacking. Existing surveys tend to evaluate either multimodal learning or MoE independently from a method-taxonomy perspective, overlooking the unique interplay between them. This survey fills that gap by answering a central question: How does MoE effectively resolve multimodal challenges? We approach this from three key perspectives: (1) MoE as an Efficient Multimodal Engine: enabling scalable multimodal modeling by decoupling computational cost from parameter growth and mitigating modality redundancy through selective expert activation; (2) MoE as a Multimodal Representation Learner: integrating complementary multi-opinion expert knowledge to enrich alignment and interaction representations; and (3) MoE as a Multimodal Adapter: providing a modular and flexible mechanism to model imperfect data scenarios such as modality imbalance and missing modalities. Through our extensive literature review, we identify critical research gaps, including interpretable routing, expert communication, modality integration, and lifelong multimodal learning. We position this survey as a foundation for future research toward interpretable and sustainable multimodal Mixture-of-Experts systems.

2605.27429 2026-05-28 cs.IR cs.AI 版本更新

Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking

Ocean4Rec:离线LLM生成的OCEAN画像用于请求时VOD重排序

Wonkyun Kim, Sehyun Bae, Kwanki Ahn, Mungyu Bae, Saeun Choi, Soyeon You, Chandra Prabhakar, Sehyun Kim

发表机构 * Samsung Electronics Republic of Korea(韩国三星电子公司) Samsung Electronics India(印度三星电子公司)

AI总结 提出Ocean4Rec重排序层,利用LLM离线生成物品OCEAN画像,在请求时无需LLM调用,通过数值计算提升VOD推荐性能。

AI中文摘要

工业视频点播(VOD)推荐系统需要更丰富的内容理解,但LLM作为重排序器的设计在每次请求中重复进行提示构建、令牌生成、模型调用、输出解析和回退处理。在高流量、延迟敏感的服务中,这些请求时操作使吞吐量规划、尾延迟控制、容量隔离和可预测运维复杂化。本文提出Ocean4Rec,一种重排序层,仅离线使用LLM从内容元数据中物化物品的OCEAN画像。物品被映射为开放性、尽责性、外向性、宜人性和神经质分数,而用户画像则通过最近点击和深度链接物品在同一五维空间中的时间衰减聚合构建。在请求时,Ocean4Rec连接预计算的物品画像、用户画像、基础推荐器分数和目录新鲜度,然后执行数值重排序,无需LLM调用。在匿名的三星智能电视VOD日志上,相同候选集的Top1000时间留出离线评估显示,对于NCF生成器,Ocean4Rec在NDCG@20上比更强的非OCEAN基础+新鲜度排序提升7.6%,对于LightGCN生成器提升61.5%。HR@20对于NCF不显著,对于LightGCN提升67.3%,反映了稀疏的精确物品回放标签和新鲜度作为工业基线的强度。该结果应被视为一种有界辅助内容品味特征的离线回放证据,该特征保留了无请求时LLM的服务路径的可部署性优势。

英文摘要

Industrial video-on-demand (VOD) recommenders need richer content understanding, but LLM-as-reranker designs repeat prompt construction, token generation, model invocation, output parsing, and fallback handling for each request. In high-volume latency-sensitive services, these request-time operations complicate throughput planning, tail-latency control, capacity isolation, and predictable operation. This paper presents Ocean4Rec, a reranking layer that uses an LLM only offline to materialize item OCEAN profiles from content metadata. Items are mapped into Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism scores, while user profiles are built by time-decayed aggregation of recently clicked and deep-linked items in the same five-dimensional space. At request time, Ocean4Rec joins precomputed item profiles, user profiles, base recommender scores, and catalog recency, then performs numeric reranking without an LLM call. On anonymized Samsung Smart TV VOD logs, same-candidate Top1000 temporal-holdout offline evaluations show that Ocean4Rec improves NDCG@20 over a stronger non-OCEAN Base+Recency ordering by 7.6% for an NCF generator and 61.5% for a LightGCN generator. HR@20 is inconclusive for NCF and improves by 67.3% for LightGCN, reflecting sparse exact-item replay labels and the strength of recency as an industrial baseline. The result should be read as offline replay evidence for a bounded auxiliary content-taste feature that preserves the deployability advantage of a request-time-LLM-free serving path.
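
Illustrative sketch: the abstract describes a request-time path that joins precomputed item OCEAN profiles, a time-decayed user profile, base recommender scores, and catalog recency, then reranks numerically with no LLM call. The Python below shows one way such a path could look. The exponential half-life decay, the additive scoring formula, the weights `alpha` and `beta`, and all function names are assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def user_profile(events, now_days, half_life_days=7.0):
    """Time-decayed mean of the OCEAN profiles of recently consumed items.

    events: list of (timestamp_days, [O, C, E, A, N]) pairs.
    """
    totals = [0.0] * 5
    weight_sum = 0.0
    for timestamp_days, profile in events:
        # Assumed decay: an event's weight halves every half_life_days.
        weight = 0.5 ** ((now_days - timestamp_days) / half_life_days)
        weight_sum += weight
        for k in range(5):
            totals[k] += weight * profile[k]
    if weight_sum == 0.0:
        return totals
    return [t / weight_sum for t in totals]

def rerank(candidates, user, alpha=0.2, beta=0.1):
    """Sort candidates by base score plus OCEAN match plus recency.

    candidates: list of (item_id, base_score, ocean_profile, recency),
    with recency assumed to be scaled to [0, 1].
    """
    scored = [
        (base + alpha * cosine(user, profile) + beta * recency, item_id)
        for item_id, base, profile, recency in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored]

# Hypothetical history: an older click and a click made today.
history = [(0.0, [1, 0, 0, 0, 0]), (7.0, [0, 1, 0, 0, 0])]
profile = user_profile(history, now_days=7.0)
ranking = rerank(
    [("a", 0.5, [1, 0, 0, 0, 0], 0.0), ("b", 0.5, [0, 1, 0, 0, 0], 0.0)],
    profile,
)
```

Everything at request time is a lookup plus a few multiplications per candidate, which is the deployability property the abstract emphasizes over LLM-as-reranker designs.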

2605.27417 2026-05-28 quant-ph cs.AI cs.LG 版本更新

Quantum Machine Learning-based 6G edge Network: Enabling Adaptive Communication and Model Aggregation

基于量子机器学习的6G边缘网络:实现自适应通信与模型聚合

Wenjing Xiao, Jiatai Yan, Chenglong Shi, Shixin Chen, Miaojiang Chen, Min Chen, Saif Al-Kuwari, Ahmed Farouk

发表机构 * School of Computer, Electronics and Information, Guangxi University(广西大学计算机电子信息学院) Guangxi Key Laboratory of Multimedia Communications and Network Technology(广西多媒体通信与网络技术重点实验室) School of Computer Science and Engineering, South China University of Technology(华南理工大学计算机科学与工程学院) Pazhou Laboratory(琶洲实验室) Qatar Center for Quantum Computing, College of Science and Engineering, Hamad Bin Khalifa University(卡塔尔量子计算中心,哈马德·本·哈利法大学)

AI总结 针对6G V2X通信中的高维状态空间、异构节点和动态环境挑战,提出量子增强框架,包含信道自适应语义通信、多模态融合、模型迁移和联邦聚合四个模块,利用量子卷积神经网络、量子注意力、量子强化学习和量子张量分解提升效率、泛化能力和隐私保护。

AI中文摘要

随着第六代(6G)移动通信技术的到来,车联网(V2X)通信在通信效率、系统泛化能力和模型协作方面面临前所未有的挑战。传统机器学习在处理V2X系统中的高维状态空间、异构V2X节点下的慢收敛和弱泛化、快速变化的信道以及多模态感知数据时存在困难。为解决这些问题,我们提出了一种量子增强的V2X通信与模型聚合框架,旨在实现6G中高效、鲁棒和智能的交通,该框架包含四个模块:信道自适应语义通信模块、多模态融合模块、模型迁移模块和联邦聚合模块。具体而言,信道自适应语义通信模块利用量子卷积神经网络(CNN)和量子失真度量,实现高效传输和跨不同条件的强泛化能力。多模态融合模块利用量子注意力和纠缠来压缩特征并关联异构数据中的语义。模型迁移模块采用量子强化学习对决策过程进行建模,并提高动态环境中的适应性。联邦聚合模块将量子张量分解与基于反向传播的校正相结合,以低开销提供隐私保护并增强全局模型的鲁棒性。这项工作为未来6G智能交通中的通信与模型协作勾勒了一种新范式。

English abstract

With the advent of sixth-generation (6G) mobile communication technology, vehicle-to-everything (V2X) communication faces unprecedented challenges in communication efficiency, system generalization capabilities, and model collaboration. Conventional machine learning struggles with high-dimensional state spaces, slow convergence, and poor generalization under heterogeneous V2X nodes, rapidly varying channels, and multimodal sensing data in V2X systems. To address these issues, we propose a quantum-enhanced framework for V2X communication and model aggregation that targets efficient, robust, and intelligent transportation in 6G, which includes four modules: the channel-adaptive semantic communication module, the multimodal fusion module, the model transfer module, and the federated aggregation module. Specifically, the channel-adaptive semantic communication module leverages quantum convolutional neural networks (CNN) and quantum distortion metrics to enable efficient transmission and strong generalization across diverse conditions. The multimodal fusion module exploits quantum attention and entanglement to compress features and associate semantics across heterogeneous data. The model transfer module employs quantum reinforcement learning to model decision-making and improve adaptability in dynamic environments. The federated aggregation module integrates quantum tensor decomposition with backpropagation-based corrections to provide privacy preservation with low overhead and to strengthen global model robustness. This work outlines a new paradigm for communication and model collaboration in future 6G intelligent transportation.

2605.27416 2026-05-28 quant-ph cs.AI cs.DC cs.LG version update

Can Quantum Federated Learning Withstand Circuit-Level Backdoors?

量子联邦学习能否抵御电路级后门攻击?

Aakar Mathur, Mohammed Ruknuddin, Ashish Gupta

Affiliations: BITS Pilani Dubai Campus

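AI summary: Proposes the circuit-level backdoor threat model (CULT), which realizes four stealthy attacks through quantum-aware mechanisms (Grover, Pauli, Bit-flip, Sign-flip), proves attack stealthiness theoretically, and shows experimentally that a single malicious client can severely degrade accuracy under FedAvg and that existing defenses do not eliminate worst-case failures.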

Comments Accepted to IJCAI-ECAI 2026

Details
AI Chinese abstract

量子联邦学习(QFL)继承了联邦优化对恶意客户端的核心脆弱性,同时也引入了来自变分电路训练和测量驱动梯度的攻击面。本文提出了一种新颖的电路级后门威胁(CULT)模型,该模型通过利用量子感知机制(包括Grover、Pauli、Bit-flip和Sign-flip)形式化了四种隐蔽攻击。通过使恶意客户端在训练中和训练后表面上均可发起攻击,这些攻击能够严重破坏学习过程。我们建立了严格的理论基础,以证明在标准平滑性假设下攻击的隐蔽性。在MNIST和CIFAR-10数据集上进行的实验,采用非独立同分布划分和不同比例的恶意客户端,结果表明,即使只有一个恶意客户端,在FedAvg聚合下也能导致严重的精度下降。虽然流行的防御方法(包括Krum、Multi-Krum、FoolsGold、FLGuardian和Mud-HoG)在许多情况下减少了精度下降,但它们未能消除最坏情况下的失败案例,其中精度下降高达50%。实验分析进一步揭示,在CULT模型下,恶意更新通过保持接近良性范数来有效掩盖其存在,从而帮助攻击者逃避检测。

English abstract

Quantum Federated Learning (QFL) inherits the core vulnerability of federated optimization to malicious clients, while also introducing an attack surface from variational circuit training and measurement-driven gradients. This work proposes a novel CircUit-Level backdoor Threat (CULT) model that formalizes four stealthy attacks by exploiting quantum-aware mechanisms, including Grover, Pauli, Bit-flip, and Sign-flip. By enabling malicious clients on both in-training and post-training surfaces, these attacks can critically undermine the learning process. We establish a rigorous theoretical foundation to demonstrate attack stealthiness under standard smoothness assumptions. Experiments on the MNIST and CIFAR-10 datasets with non-IID splits and varying fractions of malicious clients show that even a single malicious client can induce severe accuracy degradation under FedAvg aggregation. While popular defenses, including Krum, Multi-Krum, FoolsGold, FLGuardian, and Mud-HoG, reduce degradation in many regimes, they fail to eliminate worst-case failures, in which accuracy drops by up to 50%. The experimental analysis further reveals that under the CULT model, malicious updates effectively mask their presence by staying close to benign norms, thereby helping attackers evade detection.

2605.27413 2026-05-28 q-bio.BM cs.AI version update

Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design

配体条件离散扩散用于蛋白质序列-结构协同设计

Chen Wei, Fanding Xu, Minghao Sun, Zhiyuan Liu, Lin Wang, Tianrui Jia, Yihang Zhou, Yang Zhang

Affiliations: Xi’an University of Posts & Telecommunications; National University of Singapore; Xi’an Jiaotong University; Institute of Systems Medicine, Chinese Academy of Medical Sciences

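AI summary: Proposes ProtLiD$^2$, a ligand-conditioned discrete diffusion model that jointly generates amino-acid sequences and discrete structure tokens through geometry-aware cross-attention, enabling ligand-constrained protein sequence-structure co-design and markedly improving global fold confidence and ligand-aware pass rates.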

Comments 19 pages, 6 figures

Details
AI Chinese abstract

蛋白质通过氨基酸序列编码的三维结构执行其生物学功能,而配体结合蛋白质的协同设计需要模型在明确的配体约束下生成序列-结构兼容的蛋白质。尽管连续扩散和基于流的模型支持在坐标或潜在空间中进行配体感知设计,但现有的离散扩散蛋白质语言模型主要操作于序列或结构令牌,缺乏直接的小分子条件。我们引入了ProtLiD$^2$,一个用于蛋白质序列-结构协同设计的蛋白质配体条件离散扩散模型。ProtLiD$^2$联合生成氨基酸序列和离散结构令牌,同时通过几何感知交叉注意力整合配体化学和几何信息。在超过一百万个配体-蛋白质复合物上训练后,ProtLiD$^2$将掩码离散扩散扩展到配体感知的功能性蛋白质设计。我们进一步提出了最大置信度边界引导的ReMask解码,这是一种推理时自校正策略,保留高置信度预测并重新掩码不确定的令牌。在整体蛋白质设计中,ProtLiD$^2$相比Complexa提高了全局折叠置信度,将TM-score从0.672提升至0.802,pLDDT从64.55提升至73.00。在口袋协同设计中,ProtLiD$^2$将活性位点BB-RMSD从FAIR/PocketGen的3.46/3.40Å降低至1.97Å,并在更严格的对接阈值下,将配体感知通过率从PocketGen的14.86%提升至59.73%,从6.08%提升至23.49%。这些结果支持配体条件离散扩散作为功能性蛋白质协同设计的有效令牌空间框架。代码将在https://github.com/auroua/ProtLiD提供。

English abstract

Proteins perform their biological functions through three-dimensional structures encoded by amino acid sequences, and ligand-binding protein co-design requires models that generate sequence-structure compatible proteins under explicit ligand constraints. Although continuous diffusion and flow-based models support ligand-aware design in coordinate or latent spaces, existing discrete diffusion protein language models mainly operate over sequence or structure tokens without direct small-molecule conditioning. We introduce ProtLiD$^2$, a Protein Ligand-conditioned Discrete Diffusion model for protein sequence-structure co-design. ProtLiD$^2$ jointly generates amino-acid sequence and discrete structure tokens while incorporating ligand chemical and geometric information through geometry-aware cross-attention. Trained on over one million ligand-protein complexes, ProtLiD$^2$ extends masked discrete diffusion to ligand-aware functional protein design. We further propose maximum confidence-margin guided ReMask decoding, an inference-time self-correction strategy that retains confident predictions and remasks uncertain tokens. ProtLiD$^2$ improves global fold confidence over Complexa in whole-protein design, increasing TM-score from 0.672 to 0.802 and pLDDT from 64.55 to 73.00. In pocket co-design, ProtLiD$^2$ reduces active-site BB-RMSD from 3.46/3.40Å for FAIR/PocketGen to 1.97Å, and improves ligand-aware pass rates over PocketGen from 14.86% to 59.73% and from 6.08% to 23.49% under stricter docking thresholds. These results support ligand-conditioned discrete diffusion as an effective token-space framework for functional protein co-design. Code will be available at https://github.com/auroua/ProtLiD.
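The retain-and-remask idea behind ReMask decoding can be illustrated with a small sketch. It is not the authors' decoder: the margin definition (top-1 minus top-2 probability), the fixed keep fraction, the `MASK` id, and the function name are all assumptions for the example.

```python
MASK = -1  # hypothetical id for the mask token


def remask_step(probs, keep_fraction):
    """One decoding step: commit the most confident tokens, remask the rest.

    probs: per-position probability lists over the token vocabulary
    (at least two entries each). Confidence is taken to be the margin
    between the two largest probabilities, which is an assumption about
    what "confidence margin" means.
    """
    picks = []
    for pos, p in enumerate(probs):
        order = sorted(range(len(p)), key=lambda k: p[k], reverse=True)
        margin = p[order[0]] - p[order[1]]
        picks.append((margin, pos, order[0]))

    # Keep the positions with the largest margins, at least one per step.
    n_keep = max(1, round(keep_fraction * len(probs)))
    kept = sorted(picks, reverse=True)[:n_keep]

    tokens = [MASK] * len(probs)
    for _, pos, tok in kept:
        tokens[pos] = tok
    return tokens
```

In an iterative decoder, the positions left as `MASK` would be predicted again in the next step, so early low-margin guesses can be revised.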

2605.27412 2026-05-28 cs.NE cs.AI cs.LG version update

Advancing Direct Training for Spiking Neural Networks with Circulate-Firing Neurons and Learnable Gradients

利用循环发放神经元和可学习梯度推进脉冲神经网络的直接训练

Feifan Zhou, Xiang Wei, Yang Liu, Qiang Yu

Affiliations: School of Artificial Intelligence, Tianjin University

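AI summary: Proposes a direct training algorithm that combines circulate-firing neurons, a time-step-wise learnable surrogate gradient, and a positive-negative balanced loss to improve the information representation capacity and gradient propagation accuracy of spiking neural networks; it achieves competitive performance on multiple datasets and generalizes to Transformer architectures.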

Details
AI Chinese abstract

脉冲神经网络(SNN)因其节能特性而备受关注,但与人工神经网络(ANN)相比仍存在显著性能差距。这一差距源于至少两个关键限制:首先,传统脉冲神经元的信息表示能力有限,未能充分利用膜电位的丰富动态;其次,固定代理梯度(SG)函数在时间步上导致梯度传播不精确,阻碍了有效的直接训练。为了解决这两个挑战,我们提出了一种新的直接训练算法,包含三个核心创新:第一,一种循环发放脉冲神经元模型,通过更有效地利用膜电位来增强信息表示能力;第二,一种逐时间步可学习的代理梯度函数,能够在反向传播过程中实现精确的梯度估计;第三,一种正负平衡损失函数,以实现正负膜电位之间的平衡,进一步提升SNN性能。大量实验表明,我们的方法在多个数据集上取得了竞争性性能。我们的方法可以无缝泛化到先进的Transformer架构,始终优于现有方法。我们的工作强调了进一步利用SNN内在膜动力学以提升性能的有效性,从而为推进高性能脉冲神经架构开辟了新途径。

English abstract

Spiking Neural Networks (SNNs) have attracted attention for their promising energy efficiency, yet a substantial performance gap persists compared to Artificial Neural Networks (ANNs). This gap stems from at least two key limitations: first, conventional spiking neurons offer limited information representation capacity, underutilizing the rich dynamics of membrane potentials; second, using a fixed surrogate gradient (SG) function across time steps leads to imprecise gradient propagation, impeding effective direct training. To address these two challenges, we propose a new direct training algorithm with three core innovations: first, a circulate-firing spiking neuron model that enhances information representation capacity by leveraging membrane potentials more effectively; second, a time-step-wise learnable surrogate gradient function that enables accurate gradient estimation during backpropagation; third, a positive-negative balanced loss function that balances positive and negative membrane potentials and further boosts SNN performance. Extensive experiments demonstrate that our methods achieve competitive performance across multiple datasets. Our methods also generalize seamlessly to advanced Transformer architectures, consistently outperforming existing methods. Our work highlights the effectiveness of further harnessing the intrinsic membrane dynamics of SNNs for performance improvement, and thus opens a new avenue for advancing high-performance spiking neural architectures.

2605.27409 2026-05-28 cs.NE cs.AI cs.LG version update

STARS: Spike Tail-Aware Relational Synthesis for ANN-to-SNN Data-Free Knowledge Distillation

STARS: 面向ANN到SNN无数据知识蒸馏的尖峰尾部感知关系合成

Shuhan Ye, Yi Yu, Qixin Zhang, Hui Lu, Jiaming He, Qinggang Zhang, Li Shen, Xudong Jiang

Affiliations: Nanyang Technological University; Jilin University; Wuhan University; Sun Yat-sen University

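AI summary: Proposes STARS, which augments BN-guided synthetic data with relational consistency alignment and tail-aware regularization to address the under-constrained SNN student in data-free knowledge distillation, improving performance on multiple datasets.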

Details
AI Chinese abstract

SNN有望实现高能效和低延迟推理,但其性能仍落后于ANN。ANN到SNN的知识蒸馏有助于缩小这一差距,但在实际部署中原始训练数据通常不可用。现有的无数据知识蒸馏(DFKD)方法通过匹配教师侧先验(尤其是BN统计量)来合成替代数据,但这些面向ANN的约束主要正则化均值和方差,因此对于响应依赖于阈值穿越动态的SNN学生网络而言,约束不足。本文提出尖峰尾部感知关系合成(STARS),一种用于ANN到SNN DFKD的即插即用方法,通过两个互补目标增强标准BN引导合成:关系一致性对齐(保持教师和学生之间的跨样本关系一致性)和尾部感知正则化(通过软超越教师导出阈值来正则化阈值相关的尾部概率)。这些目标共同生成合成批次,这些批次在保持教师有效性的同时,对SNN学生网络更具信息性。在CIFAR-10、CIFAR-100和Tiny-ImageNet上的多个ANN-SNN对实验表明,我们的方法一致改进了传统DFKD基线,甚至超过了若干KD方法,在CIFAR-10上提升高达4.6%,在CIFAR-100上提升高达6.7%,突显了在面向SNN的DFKD中,用关系约束和尾部感知约束补充BN匹配的重要性。

English abstract

SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-to-SNN knowledge distillation helps narrow this gap, yet the original training data are often unavailable in practical deployment settings. Existing data-free knowledge distillation (DFKD) methods synthesize surrogate data by matching teacher-side priors, especially BN statistics, but these ANN-oriented constraints mainly regularize mean and variance and therefore remain under-constrained for SNN students whose responses depend on threshold-crossing dynamics. In this paper, we propose Spike Tail-Aware Relational Synthesis (STARS), a plug-and-play method for ANN-to-SNN DFKD that augments standard BN-guided synthesis with two complementary objectives: Relational Consistency Alignment, which preserves cross-sample relational consistency between teacher and student, and Tail-Aware Regularization, which regularizes threshold-relevant tail probabilities through soft exceedance over teacher-derived thresholds. Together, these objectives generate synthetic batches that remain teacher-valid while becoming more informative for SNN students. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet across multiple ANN-SNN pairs show that our method consistently improves conventional DFKD baselines and even surpasses several KD methods, with gains of up to 4.6% on CIFAR-10 and 6.7% on CIFAR-100, highlighting the importance of complementing BN matching with relational and tail-aware constraints in SNN-oriented DFKD.
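One way to read "soft exceedance" is as a smooth stand-in for the fraction of activations that cross a threshold. The sketch below makes that reading concrete; the sigmoid relaxation, the temperature, the target rate, and the function names are assumptions, not the STARS formulation.

```python
import math


def soft_exceedance(activations, threshold, temperature=0.1):
    """Smooth estimate of the fraction of activations above `threshold`.

    A hard count (a > threshold) has zero gradient almost everywhere;
    replacing it with a sigmoid keeps the quantity differentiable.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    return sum(
        sigmoid((a - threshold) / temperature) for a in activations
    ) / len(activations)


def tail_penalty(activations, threshold, target_rate, temperature=0.1):
    """Squared gap between the soft tail probability and a target rate."""
    gap = soft_exceedance(activations, threshold, temperature) - target_rate
    return gap ** 2
```

Matching only mean and variance leaves this tail mass free to vary, which is why a threshold-driven student could benefit from constraining it directly.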

2605.27407 2026-05-28 cs.NE cs.AI cs.LG version update

Benchmarking Fairness in Spiking Neural Networks: Data Bias, Spurious Features, and Hardware Effects

脉冲神经网络中的公平性基准测试:数据偏差、虚假特征和硬件效应

Hudi He, Fukun Wang, Zhe Wang, Xinyi Wang, Shuhan Ye, Jiarui Liu, Qing Qing, Ziqi Xu, Xikun Zhang, Renqiang Luo

Affiliations: Jilin University; Nanyang Technological University; RMIT University

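AI summary: Presents the first fairness benchmark for spiking neural networks, introducing three realistic dimensions (demographic coverage gaps, spurious feature leakage, and deployment-environment mismatch) to systematically evaluate the fairness-performance trade-offs of 12 state-of-the-art SNNs under resource constraints.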

Details
AI Chinese abstract

评估脉冲神经网络(SNN)的公平性需要反映现实世界复杂性的严格基准,然而现有评估仍受限于肤浅的数据集多样性和理想化的硬件假设。本文首次引入SNN的系统性公平性基准,解决三个关键的现实维度:(1)训练数据中的人口统计覆盖缺口,(2)虚假特征泄漏(例如,肤色作为类别标签的代理),以及(3)部署环境不匹配(例如,具有受限脉冲编码的边缘设备)。我们的框架整合了四个跨人口统计数据集(带有受控偏差注入)和三个神经形态硬件模拟器(Loihi 2、SpiNNaker),从而能够在资源约束下隔离分析公平性-性能权衡。对12种最先进SNN的标准化评估揭示了显著差异:在偏差数据上训练的模型对代表性不足群体的假阳性率高出23%,而硬件限制(例如,降低的脉冲精度)在边缘部署中进一步将准确率差距放大至41%。关键的是,为云端SNN开发的偏差缓解策略在资源约束下通常会退化,这凸显了需要联合优化公平性和硬件效率的协同设计原则。通过连接算法公平性研究与神经形态工程,我们的基准为医疗和自主系统等社会关键应用中的可信SNN奠定了基础。我们的代码可在以下网址获取:https://anonymous.4open.science/r/SNN-Benchmarks-8017。

English abstract

Evaluating fairness in Spiking Neural Networks (SNNs) demands rigorous benchmarks that reflect real-world complexities, yet existing assessments remain limited by superficial dataset diversity and idealized hardware assumptions. This work introduces the first systematic fairness benchmark for SNNs, addressing three critical dimensions of realism: (1) demographic coverage gaps in training data, (2) spurious feature leakage (e.g., skin tone as a proxy for class labels), and (3) deployment-environment mismatches (e.g., edge devices with constrained spike encoding). Our framework integrates four cross-demographic datasets with controlled bias injections and three neuromorphic hardware simulators (Loihi 2, SpiNNaker), enabling isolated analysis of fairness-performance trade-offs under resource constraints. Standardized evaluations of 12 state-of-the-art SNNs reveal stark disparities: models trained on biased data exhibit 23% higher false positive rates for underrepresented groups, while hardware limitations (e.g., reduced spike precision) further amplify accuracy gaps by up to 41% in edge deployments. Critically, bias mitigation strategies developed for cloud-based SNNs often degrade under resource constraints, highlighting the need for co-design principles that jointly optimize fairness and hardware efficiency. By bridging algorithmic fairness research with neuromorphic engineering, our benchmark provides a foundation for trustworthy SNNs in socially critical applications such as healthcare and autonomous systems. Our code is available at: https://anonymous.4open.science/r/SNN-Benchmarks-8017.
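The false-positive-rate disparity reported above is a standard group-fairness measurement. As a minimal sketch of how such a gap is computed (binary labels, hypothetical function names, not the benchmark's code):

```python
from collections import defaultdict


def false_positive_rates(y_true, y_pred, groups):
    """Per-group FPR = false positives / actual negatives.

    Groups with no negative examples are omitted from the result.
    """
    fp = defaultdict(int)
    neg = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 0:
            neg[g] += 1
            fp[g] += int(p == 1)
    return {g: fp[g] / neg[g] for g in neg}


def fpr_gap(y_true, y_pred, groups):
    """Largest difference in FPR between any two groups."""
    rates = false_positive_rates(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())
```

Running the same computation on predictions from a full-precision model and from a reduced-precision deployment would show whether the hardware setting widens the gap.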

2605.27404 2026-05-28 cs.CY cs.AI version update

Smaller, Younger, and More Impactful: How AI-Assisted Writing Transforms Research Teams

更小、更年轻、更具影响力:AI辅助写作如何改变研究团队

Haoyang Wang, Mingze Zhang, Yi Bu, Star Xing Zhao, Meijun Liu

Affiliations: School of Information, University of Texas at Austin; National Science Library, Chinese Academy of Sciences; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences; Department of Information Management, Peking University; Institute of Big Data, Fudan University; Institute for Global Public Policy, Fudan University; Faculty of Finance, City University of Macau

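AI summary: Using full-text articles from the PLoS and Nature journal families published since 2020 and several regression methods, the study finds that research teams using AI-assisted writing tend to be younger and smaller, without reducing, and possibly increasing, scientific impact.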

Details
AI Chinese abstract

大科学时代长期以来以日益庞大和专门化的研究团队推动知识前沿为特征。然而,人工智能(AI)的最新进展,特别是大型语言模型(LLMs),正开始重塑学术写作和科学研究,可能打破长期以来团队规模不断扩大的趋势,并改变研究团队结构的其他维度。基于2020年以来PLoS系列和Nature系列期刊的147,074篇全文出版物,我们考察了AI辅助写作是否以及如何影响科学中的团队结构和团队成果。使用多种方法,包括普通最小二乘法、分位数回归、泊松回归、逻辑回归和倾向得分匹配,我们发现使用AI辅助写作的研究团队往往更年轻、规模更小。重要的是,这种向更紧凑、更年轻化团队的转变并非以牺牲科学影响力为代价。相反,我们观察到采用AI辅助写作的研究团队产生高影响力出版物的概率更高。这些结果凸显了AI辅助写作在重塑研究生产方式以及研究团队组建和构成方面的重要作用。我们的发现呼吁在研究评估、资助和培训方面进行政策改进,以应对这一新兴趋势。

English abstract

The era of Big Science has long been defined by increasingly large and specialized research teams pushing the frontiers of knowledge. However, recent advances in artificial intelligence (AI), particularly large language models (LLMs), are beginning to reshape academic writing and scientific research, potentially disrupting the longstanding trend toward ever-larger teams and transforming other dimensions of research team structure. Drawing on 147,074 full-text publications from the PLoS family and the Nature portfolio since 2020, we examined whether and how AI-assisted writing influences team structure and team outcomes in science. Using multiple methods, including ordinary least squares, quantile regression, Poisson regression, logistic regression, and propensity score matching, we found that research teams using AI-assisted writing tend to be younger and smaller. Importantly, this shift toward more compact, junior-leaning teams does not come at the expense of scientific impact. On the contrary, we observed a higher probability of research teams that employed AI-assisted writing producing highly impactful publications. These results highlight the significant role of AI-assisted writing in reshaping not only how research is produced, but also how research teams are formed and assembled. Our findings call for policy improvements in research evaluation, funding, and training to address this emerging trend.

2605.27403 2026-05-28 cs.CY cs.AI version update

LLM-assisted sentiment analysis for integrated computational and qualitative mixed methods education research: A case study of students' written reflection assignments

LLM辅助情感分析在综合计算与定性混合方法教育研究中的应用:学生书面反思作业案例研究

Xiomara Gonzalez, Gabriella Coloyan Fleming, Andrew Katz, Maya Denton, Jessica Deters

Affiliations: Chandra Family Department of Electrical and Computer Engineering, University of Texas at Austin; Department of Engineering Education, Virginia Polytechnic Institute and State University; Gallogly College of Engineering, University of Oklahoma; Department of Mechanical and Materials Engineering, University of Nebraska-Lincoln

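AI summary: Through a longitudinal case, the study combines LLM-assisted sentiment analysis with statistical testing and thematic analysis to examine how student identity variables affect sentiment about language and communication during study abroad, and finds that prior experience living abroad is the only significant variable.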

Details
AI Chinese abstract

书面反思作业为学生提供了宝贵的批判性自我评估、意义建构和学习处理的机会。此外,此类反思为定性教育研究提供了丰富的数据。然而,定性数据分析可能耗时。定性比较不同参与者群体之间的发现更为耗时,通常将比较限制在最多一个变量(例如,二元性别)。大型语言模型(LLM)最近开始被批判性地评估用作定性研究助手。利用来自留学项目的纵向学生书面反思案例(n=151),我们研究了LLM辅助情感分析如何能够实现结合计算分析和主题分析的纵向混合方法研究。首先,使用统计检验根据七个不同的学生身份/生活经历变量定量比较情感差异。然后,这些结果指导定性数据分析,以调查这些差异背后的原因。对于本科留学学生,我们发现先前的海外生活经历是唯一影响学生对语言和交流行为情感的个人变量。这一工作流程对于定性研究人员在比较不同人口群体参与者时如何更轻松地探究多个变量具有启示意义。

English abstract

Written reflection assignments give students valuable opportunities for critical self-assessment, meaning making, and learning processing. Additionally, such reflections provide rich data for qualitative education research. However, qualitative data can be time-consuming to analyze. It is even more time-intensive to qualitatively compare findings between different groups of participants, usually limiting comparison to, at most, one variable (e.g., binary gender). Large language models (LLMs) have recently begun to be critically evaluated for use as qualitative research assistants. Using a longitudinal case of written student reflections (n=151) from a study abroad program, we investigate how LLM-assisted sentiment analysis can enable longitudinal mixed-methods research combining computational and thematic analyses. First, statistical testing is used to quantitatively compare sentiment differences according to seven different student identity/lived experience variables. Then, these results inform qualitative data analysis to investigate the reasons underlying these differences. For the case of undergraduate students studying abroad, we found that prior experience living abroad was the only personal variable impacting students' sentiments of their verbal language and communication behaviors. This workflow has implications for how qualitative researchers can more easily probe multiple variables when comparing participants from different demographic groups.

2605.27402 2026-05-28 cs.CY cs.AI cs.CL version update

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

REC-CBM:面向可信开放评分的基于规则感知的错误修正概念瓶颈模型

Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu

Affiliations: School of Computing and Augmented Intelligence, Arizona State University, USA; Mary Lou Fulton Teachers College, Arizona State University, USA; Department of Computer Science, National Yang Ming Chiao Tung University, TW

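AI summary: Proposes REC-CBM, which uses a rubric-aware concept encoder, an ordinal pairwise calibration objective, and a latent concept error-correction module to address three shortcomings of standard concept bottleneck models in open-ended grading (no modeling of fine-grained rubric dimensions, neglect of the ordinal semantics of scores, and unreliable concept annotations), improving grading performance while preserving interpretability.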

Details
AI Chinese abstract

开放评分对于公平和个性化教育至关重要,但人工评分耗时且成本高,凸显了自动化评分系统的必要性。尽管基于神经和大语言模型(LLM)的系统表现出优越性能,但它们通常是黑箱模型,其评分过程和理由难以让教育者验证和信任。概念瓶颈模型(CBM)通过将预测路由到人类可解释的概念,提供透明度的机制保证,成为一种有前景的方法。然而,标准CBM不适用于开放评分:它们没有显式建模细粒度的规则维度,未能充分捕捉评分量表的序数语义,并忽略了人类概念标注中固有的可靠性问题。为解决这些局限,我们提出REC-CBM,一种面向可信开放评分的规则感知错误修正概念瓶颈模型。REC-CBM引入了规则感知概念编码器,学习针对回答的概念特定表示,以及一个序数成对校准目标,保留规则维度间的排序结构。它还结合了一个潜在概念错误修正模块,在最终评分预测前对概念预测进行去噪,同时保持可解释性。在公开数据集上的全面实验表明,REC-CBM在评分性能上持续提升,并产生比最先进基线更忠实的概念级推理。进一步分析验证了每个组件的贡献,并展示了在真实教育环境中的适用性。总体而言,这项工作提供了一种实用、可解释的评分解决方案,使教育者能够检查、干预和信任自动化决策,推动更透明和可信的教育。

English abstract

Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM) based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate the model's applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.

2605.27401 2026-05-28 cs.CY cs.AI version update

Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

使用零样本大语言模型生成的调查数据进行地理显式人口合成

Taylor Anderson, Sara Von Hoene, Orhan Yagizer Cinar, Emma Von Hoene, Amira Roess, Andrew Crooks, Hamdi Kavak

Affiliations: Dept. of Geography and Geoinformation Science, George Mason University, Fairfax, VA, USA; Dept. of Computer Science, George Mason University, Fairfax, VA, USA; College of Public Health, George Mason University, Fairfax, VA, USA; Dept. of Geography, University at Buffalo, Buffalo, NY, USA; Dept. of Computational and Data Sciences, George Mason University, Fairfax, VA, USA

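AI summary: Evaluates whether health survey data generated by zero-shot large language models can serve as input to a conventional iterative proportional fitting workflow for geographically explicit population synthesis, and finds that it can serve as supplementary input but cannot yet replace real survey data.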

Comments 15 pages, 5 figures, 3 tables

Details
AI Chinese abstract

人们对将合成人口用于各种应用的兴趣日益增长。同时,我们目睹了人工智能在各行各业的巨大发展。本文评估了零样本大语言模型(LLM)生成的健康调查数据能否作为传统迭代比例拟合(IPF)工作流的输入,用于地理显式人口合成。利用2023年行为风险因素监测系统(BRFSS),我们使用GPT-4.1和Gemini-2.5-Pro为美国科罗拉多州和密西西比州生成合成调查记录。我们将生成的数据用于基于IPF的合成流程,并针对外部基准评估生成的普查区级合成人口。结果表明,两个LLM都捕捉到了几个主要的州级对比,表明零样本生成产生了地理差异化的调查数据。然而,性能强烈依赖于变量。人口合成中的下游效应是混合的,因为IPF有时会放大或减少生成数据中的错误。空间验证表明,基于LLM的人口合理地再现了普查区级的模式,尤其是对于与真实数据更一致的变量。总体而言,LLM生成的调查数据显示出作为补充输入的前景,但尚不能替代真实调查数据。

English abstract

There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a tremendous growth in artificial intelligence in all walks of life. This paper evaluates whether zero-shot large language model (LLM)-generated health survey data can serve as inputs to a conventional iterative proportional fitting (IPF) workflow for geographically explicit population synthesis. Using the 2023 Behavioral Risk Factor Surveillance System (BRFSS), we generate synthetic survey records for the U.S. states of Colorado and Mississippi with GPT-4.1 and Gemini-2.5-Pro. We use the generated data in an IPF-based synthesis pipeline and evaluate the resulting census tract-level synthetic populations against external benchmarks. Results show both LLMs capture several major state-level contrasts, indicating zero-shot generation produces geographically differentiated survey data. However, performance is strongly variable-dependent. Downstream effects in population synthesis are mixed, as IPF sometimes amplifies or reduces errors in the generated data. Spatial validation shows that LLM-based populations reproduce census tract-level patterns reasonably well, especially for variables that were more aligned with the ground truth data. Overall, the LLM-generated survey data shows promise as supplementary input, but not yet as a replacement for real survey data.
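Iterative proportional fitting itself is a standard procedure, and a two-dimensional version fits in a few lines. The sketch below is generic IPF, not the paper's pipeline: here the seed table stands for a cross-tabulation of survey records (real or LLM-generated) and the targets for tract-level marginal totals, and the function name and defaults are illustrative.

```python
def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Scale a 2-D seed table until its margins match the targets.

    Assumes the row and column targets sum to the same total.
    """
    table = [row[:] for row in seed]
    for _ in range(iterations):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [v * target / s for v in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Stop once the row margins still hold after the column step.
        if all(abs(sum(table[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return table
```

IPF only rescales the seed, so structure in the seed's interior cells carries through to the fitted table. That is consistent with the abstract's observation that errors in the generated data can be amplified or reduced downstream.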

2605.27400 2026-05-28 cs.CY cs.AI cs.CC cs.ET cs.GT cs.MA version update

Mathematical Modelling of Ethical AI Use in Higher Education: A Coordination Game Framework for Future-Facing Learning

高等教育中伦理AI使用的数学建模:面向未来学习的协调博弈框架

Ndidi Bianca Ogbo, Zhao Song, Shatha Ghareeb, The Anh Han

Affiliations: School of Computing, Engineering and Digital Technologies, Teesside University, United Kingdom

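AI summary: Uses a coordination game-theoretic framework to study how norms of responsible or opportunistic AI use form within student cohorts, and shows how assessment incentives trigger behavioral transitions.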

Details
AI Chinese abstract

生成式人工智能在高等教育中的快速普及正在重塑评估实践,并加剧对学术诚信、公平性和学习质量的担忧。尽管机构回应越来越强调政策指导和伦理原则,但对于学生群体中负责任或机会主义AI使用的集体规范如何出现和稳定,仍缺乏正式理解。本文将学生在评估中的AI使用重新定义为由同伴期望和评估设计而非仅个体合规塑造的协调问题。我们开发了一个基于协调的演化博弈论框架,捕捉学习价值、努力、感知公平性和透明度,并通过反思性评估激励隐式建模机构AI治理。我们使用分析结果和有限种群模拟揭示了学生AI使用中的阈值驱动行为转变:小而校准良好的反思性评估激励变化可以触发向负责任、以学习为导向的AI使用规范的快速转变,而弱或错位的激励则允许机会主义实践持续存在。这些非线性动态解释了为何仅政策声明往往无法改变行为,而适度的评估重新设计可能产生不成比例的影响。通过提供评估结构如何塑造集体AI使用实践的机制层面解释,本文为高等教育机构提供了一个分析基础的工具,支持面向未来学习的比例性、教学法主导的AI治理,而无需依赖监控或惩罚性执法。

English abstract

The rapid uptake of generative artificial intelligence (AI) in higher education is reshaping assessment practices and intensifying concerns around academic integrity, fairness, and learning quality. While institutional responses increasingly emphasise policy guidance and ethical principles, there remains limited formal understanding of how collective norms of responsible or opportunistic AI use emerge and stabilise within student cohorts. This paper reframes student AI use in assessment as a coordination problem shaped by peer expectations and assessment design rather than individual compliance alone. We develop a coordination-based evolutionary game-theoretic framework that captures learning value, effort, perceived fairness, and transparency, with institutional AI governance modelled implicitly through reflective assessment incentives. We use analytical results and finite-population simulations to reveal threshold-driven behavioural transitions in student AI use: small, well-calibrated changes in reflective assessment incentives can trigger rapid shifts towards responsible, learning-oriented AI-use norms, whereas weak or misaligned incentives allow opportunistic practices to persist. These non-linear dynamics explain why policy statements alone often fail to change behaviour, while modest assessment redesigns can have disproportionate effects. By providing a mechanism-level account of how assessment structures shape collective AI-use practices, this work offers higher education institutions an analytically grounded tool for Future Facing Learning, supporting proportionate, pedagogy-led AI governance without reliance on surveillance or punitive enforcement.
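The threshold behaviour described above is characteristic of any two-strategy coordination game. The sketch below illustrates it with the deterministic replicator equation, which is a swapped-in simplification of the finite-population simulations the abstract mentions; the payoff values and function names are hypothetical.

```python
def tipping_point(a, b, c, d):
    """Share of responsible users above which responsible use spreads.

    Payoffs: a = R meets R, b = R meets O, c = O meets R, d = O meets O,
    where R is responsible and O is opportunistic AI use.
    Assumes a coordination game, i.e. a > c and d > b.
    """
    return (d - b) / ((a - c) + (d - b))


def simulate(x0, a, b, c, d, steps=5000, dt=0.01):
    """Euler integration of the two-strategy replicator equation.

    x is the share of responsible users; returns the share after `steps`.
    """
    x = x0
    for _ in range(steps):
        f_r = a * x + b * (1 - x)  # expected payoff of responsible use
        f_o = c * x + d * (1 - x)  # expected payoff of opportunistic use
        x += dt * x * (1 - x) * (f_r - f_o)
    return x
```

With the illustrative payoffs `a=3, b=0, c=2, d=2`, the tipping point is 2/3, so a cohort that starts at 50% responsible use drifts to the opportunistic norm. Adding an assumed reflective-assessment bonus of 1.5 to both responsible payoffs (`a=4.5, b=1.5`) lowers the tipping point to 1/6, and the same cohort converges to responsible use instead.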

2605.27399 2026-05-28 cs.CY cs.AI version update

Short-Term Gain, Long-Term Fragility: AI Labor Substitution and the Erosion of Sustainable Capability

短期收益,长期脆弱:AI劳动力替代与可持续能力的侵蚀

Wolfgang Rohde

发表机构 * AiSuNe Foundation(AiSuNe基金会)

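AI summary: Proposes mechanisms of capability masking and capability erosion, arguing that AI labor substitution raises efficiency in the short term while increasing long-term system fragility by drawing down human capabilities that are hard to rebuild.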

Comments 19 pages, 7 figures, Also available on SSRN: https://doi.org/10.2139/ssrn.6577818

Details
AI Chinese abstract

看似加速的过程可能是一种将负担从当下悄然转移至未来的行为。用AI系统替代人类劳动力的尝试常被呈现为对技术进步的理性回应,但这种观点在结构上往往是短视的。在软件开发及邻近知识产业中,AI日益具有吸引力,因为它似乎能降低劳动力成本、加快产出速度并改善短期指标。然而,这些收益可能是通过消耗那些构建缓慢且难以恢复的人类能力而实现的。本文提出了AI劳动力替代下的能力掩盖与能力侵蚀机制。AI生成的输出可能造成组织能力已被替代的假象,即使对熟练人类劳动力的依赖依然存在。这种假象可能支持招聘限制,同时更慢的成本在暗中累积。来自AI辅助编程的证据表明,生成的输出仍需要大量人工验证,且在正确性、可维护性和安全性方面参差不齐。仓库级研究也提示了在处理更广泛代码库上下文方面的局限性。更广泛地,劳动力市场、政治经济学和产业战略证据表明,替代压力正由管理层的成本激励和国家竞争驱动,同时增加了集中化和平台控制的风险。其结果是,一个系统在短期内看似更高效,但随着时间的推移却变得更加脆弱。

English abstract

What looks like acceleration can be a quiet transfer of burden from the present to the future. Attempts to replace human labor with AI systems are often presented as rational responses to technological progress, but that view is often structurally short-sighted. Across software development and adjacent knowledge industries, AI is increasingly attractive because it appears to reduce labor costs, speed output, and improve short-term metrics. Yet those gains may be achieved by drawing down human capabilities that are slow to build and difficult to restore. This paper develops a mechanism of capability masking and capability erosion under AI labor substitution. AI-generated output can create the appearance that organizational capability has been replaced, even when dependence on skilled human labor remains. That appearance can support hiring restraint while slower costs accumulate in the background. Evidence from AI-assisted coding shows that generated output still requires substantial human verification and remains uneven in correctness, maintainability, and security. Repository-level studies also suggest limits in handling broader codebase context. More broadly, labor-market, political-economy, and industrial-strategy evidence suggests that substitution pressures are being driven by managerial cost incentives and national competition while increasing risks of concentration and platform control. The result is a system that may look more efficient in the short term while becoming more fragile over time.

2605.27396 2026-05-28 cs.CY cs.AI version update

Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named

代理素养债务:AI素养领域尚未命名的结构性问题

Rohith Nama

发表机构 * Rohith Nama(罗希思·纳马)

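AI summary: Introduces the concept of "agentic literacy debt", the cumulative societal deficit users face when autonomous AI agents are deployed at scale without the capacity to oversee them, and argues from healthcare, finance, and other cases that the problem is structural.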

Details
Journal ref
AI & Ethics, 2026
AI Chinese abstract

自主AI代理现在能够在医疗、金融服务和工作场所等场景中代表用户进行规划、决策和行动,通常无需逐步获得人类批准。现有的AI素养框架是为人类评估AI输出并决定是否采取行动的世界而构建的;它们没有词汇来描述那些已将决策权委托给代理的用户,而代理的行为可能不可观察、不可逆转或不可控制。本文命名了由此产生的问题——代理素养债务:当代理型AI系统在没有相应素养基础设施的情况下大规模部署时,不断累积的社会赤字。这种债务通过三个强化渠道(不透明委托的正常化、多代理生态系统的复杂性以及制度路径依赖)复合增长,由部署代理的组织产生,但由代理所代表的用户、患者和公民承担。来自医疗、金融欺诈和全球公平领域的证据表明,这一差距已经具有重大影响。该问题是结构性的,而非课程改革能够弥补的暂时滞后。它要求将AI素养重新定义为一种治理能力,而非评估能力。

English abstract

Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often without step-by-step human approval. Existing AI literacy frameworks were built for a world in which humans evaluate AI outputs and decide whether to act; they have no vocabulary for the user who has delegated decision-making authority to an agent whose actions may not be observable, reversible, or controllable. This paper names the resulting problem agentic literacy debt: the accumulating societal deficit that grows when agentic AI systems are deployed at scale without corresponding literacy infrastructure. The debt compounds through three reinforcing channels (normalization of opaque delegation, multi-agent ecosystem complexity, and institutional path dependence), and it is incurred by the organizations that deploy agents but paid by the users, patients, and citizens on whose behalf the agents act. Evidence from healthcare, financial fraud, and global equity contexts suggests the gap is already consequential. The problem is structural, not a temporary lag that curriculum reform will close. It demands a reframing of AI literacy as a governance capability, not an evaluative one.

2605.27395 2026-05-28 cs.CY cs.AI (updated version)

Informing AI Policy Assessment using Large-Scale Simulation of Interventions

Julia Barnett, Kimon Kieslich, Natali Helberger, Nicholas Diakopoulos

Affiliations * Northwestern University, USA; University of Amsterdam, The Netherlands; University of Hohenheim, Germany

AI summary Proposes a method that combines participatory policy evaluation, expert cost assessment, and an LLM-based assessment of harm mitigation, and uses a genetic-algorithm simulation to explore the space of policy combinations and identify viable options for mitigating specified AI harms.

Comments This work will be published in the proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2026. 15 pages plus end matter and appendix

Abstract

As the rapid proliferation of AI systems and harms spurs efforts in AI governance around the world, prioritizing among competing policy options has become increasingly challenging for policymakers and researchers. We introduce a methodology for identifying viable policy options to mitigate specified AI harms, helping policymakers and researchers target areas that warrant greater time and resource investment. This method combines participatory evaluation of policies, expert assessment of implementation costs, and an LLM-based assessment of perceived harm mitigation under each policy option. We leverage a genetic algorithm-based simulation study to explore a vast solution space of potential policy combinations, and examine how outcomes change under different weightings of cost, participatory input, and harm mitigation. We find that this method enables exploration of different balances between participatory and expert components, allowing policymakers and researchers to assess how much weight to assign to each. We argue that the diversity of viable policy combinations found by the genetic algorithm could be a useful starting point for deliberation. This method operationalizes existing work on participatory AI by integrating it directly into practical policy development pipelines.
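As a rough illustration of the kind of search the abstract describes, the sketch below runs a small genetic algorithm over binary policy-inclusion vectors and scores each combination with a weighted sum of cost, participatory support, and harm mitigation. The per-policy scores, the weights, the fitness formula, and the GA settings are all made-up assumptions for illustration; they are not data or the procedure from the paper.

```python
import random

# Hypothetical per-policy scores in [0, 1]; NOT data from the paper.
# Each tuple is (implementation_cost, participatory_support, harm_mitigation).
POLICIES = [
    (0.9, 0.8, 0.9),
    (0.2, 0.6, 0.4),
    (0.5, 0.3, 0.7),
    (0.1, 0.9, 0.2),
    (0.7, 0.5, 0.8),
    (0.3, 0.4, 0.5),
]


def fitness(combo, weights):
    """Weighted score of a policy combination (higher is better).

    `combo` is a list of 0/1 flags, one per policy. The additive form
    used here is an assumption made for this sketch.
    """
    w_cost, w_part, w_harm = weights
    chosen = [p for p, bit in zip(POLICIES, combo) if bit]
    if not chosen:
        return 0.0
    cost = sum(p[0] for p in chosen) / len(POLICIES)
    part = sum(p[1] for p in chosen) / len(POLICIES)
    harm = sum(p[2] for p in chosen) / len(POLICIES)
    return -w_cost * cost + w_part * part + w_harm * harm


def evolve(weights, pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Return the best combination found by a simple elitist GA."""
    rng = random.Random(seed)
    n = len(POLICIES)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, weights), reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the top half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda c: fitness(c, weights))
```

Calling `evolve` with different weight triples, for example `(0.2, 0.4, 0.4)` versus `(0.6, 0.2, 0.2)`, shows how the preferred combination can shift as cost is weighted more heavily, which is the kind of sensitivity the abstract says the study examines.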

2605.27394 2026-05-28 cs.CY cs.AI cs.HC cs.MA (updated version)

Human-AI Collaboration for Estimating Scientific Replicability

Tatiana Chakravorti, Robert Fraleigh, Timothy Fritton, Christopher Griffin, Vaibhav Singh, Sai Koneru, C. Lee Giles, David Pennock, Anthony Kwasnica, Sarah Rajtmajer

Affiliations * The Pennsylvania State University; Rutgers University

AI summary Proposes a hybrid prediction market in which algorithmic agents and human traders jointly estimate the replicability of scientific findings through real-time trading. Live experiments indicate that, in all but a few cases, hybrid markets match or outperform artificial prediction markets.

Abstract

Determining whether published scientific findings can successfully be replicated is a long-standing challenge in the empirical sciences. Existing approaches for replicability assessment typically rely either on human judgment, i.e., creative assembly of human experts, or on machine learning models trained on paper content metadata. While both approaches have demonstrated value, each also has important limitations. Human forecasts can be influenced by cognitive biases and narrow exposure to the research literature, while automated assessments often struggle to capture contextual cues and subtle signals of credibility. In this paper, we examine a hybrid approach. Specifically, we introduce a hybrid prediction market in which algorithmic agents trade alongside human participants to jointly estimate the likelihood that a published scientific finding will be corroborated via the outcome of a controlled replication study. Agents are trained on outcomes from hundreds of prior replication studies while human participants contribute domain knowledge through real-time trading. We evaluate this hybrid approach through multiple live experiments involving participants from different academic disciplines and compare its performance to artificial-only and human-only baselines. Our results show that, except for a few cases, hybrid markets match or outperform artificial prediction markets, producing more accurate and reliable replication forecasts.

2605.27393 2026-05-28 cs.CL cs.AI (updated version)

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

Affiliations * Vrije Universiteit Amsterdam; NII, Tokyo Institute of Technology

AI summary Proposes StoryMI, a framework that combines multi-LLM-agent collaboration, grounding in situational stories, and dynamic strategy control to generate therapeutic dialogues aligned with motivational interviewing (MI) standards, and builds an evaluation protocol and dataset to validate it.

Comments ACL2026

Abstract

Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi-LLM agent framework for controllable MI dialogue generation, where questionnaire-based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI-coded utterances guided by MI codes selected by the interaction agent, while an interaction agent dynamically coordinates exchanges to control MI strategies during a multi-turn conversation. We propose a two-level evaluation protocol: lexical metrics and MI-specific measures of macro-level counseling strategies, alongside LLM-as-judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire-story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open- and closed-source LLMs. Our results show that situational grounding and macro-level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

2605.27391 2026-05-28 cs.CY cs.AI (updated version)

Learning after COVID-19 and the ICT career aspirations: Are students entering the AI era with weaker skills?

Diana Maria Popa, Simona-Vasilica Oprea, Adela Bâra

Affiliations * Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies

AI summary Using PISA 2018 and 2022 data, applies a mixed-method analysis of the relationship between learning environments and ICT career aspirations. Digital skills are the strongest predictor, teacher support plays a complementary role, and autonomy has weaker, context-dependent effects.

Abstract

This paper examines whether students are entering the generative AI era with sufficiently strong educational foundations, focusing on the relationship between learning environments and changes in ICT-related career aspirations across countries. The analysis uses country-level data from PISA 2018 and 2022, combining indicators of student autonomy, digital skills, and teacher support. A mixed-method approach is applied, including descriptive statistics, regression analysis, clustering, latent representation learning (using a Variational Autoencoder, VAE), discriminant analysis, and probabilistic modeling, to capture both observable and latent dimensions of educational readiness. Unlike prior research that treats learning loss, digital skills, and career expectations separately, our analysis integrates them within a comparative longitudinal framework. It shifts the focus from short-term post-pandemic effects to the structural capacity of education systems to prepare students for digital and AI-driven labor markets. Results show a global but uneven increase in ICT career aspirations. Digital skills emerge as the strongest and most consistent predictor, while teacher support plays a complementary role. Autonomy shows weaker, context-dependent effects. Educational readiness is multidimensional, and ICT aspirations evolve relatively independently from other career domains.

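2605.27389 2026-05-28 cs.IR cs.AI cs.CL (updated version)

Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Affiliations * Georgia Institute of Technology

AI summary Compares contextual and memory-based conditioning in a teacher-facing educational recommender system. Contextual recommendations respond more strongly to the current question, while memory-based recommendations show history-dependent behavior, including learner-specific differentiation under identical input.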

Comments Accepted to ITS 2026

Abstract

We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual conditioning based on the current student question with memory-based conditioning using persistent learner information. Using deviation correlation and paired statistical tests, we find that contextual recommendations exhibit stronger question-level responsiveness, while memory-based recommendations exhibit history-dependent behaviors, including learner-specific differentiation under identical input. Teacher-facing evaluation signals suggest these recommendations are interpretable and actionable. These results indicate that embedding-based similarity metrics capture responsiveness to the current question but do not characterize personalization grounded in learner history, motivating behavior-level diagnostics for studying conditioning effects.

2605.27388 2026-05-28 cs.CL cs.AI cs.SI (updated version)

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

Nuan Wen, Xuezhe Ma

Affiliations * Information Sciences Institute, University of Southern California

AI summary Proposes CARE, a framework that uses fine-grained analysis of illocutionary tone to evaluate how well LLMs simulate community reactions to real-world news. It reveals a persistent "realism gap", suggesting that current alignment strategies are insufficient to capture the sociolinguistic dynamics of online groups.

Comments Preprint

Abstract

Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest--validated through human-AI collaboration--our diagnosis reveals a persistent "realism gap": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

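2605.27385 2026-05-28 cs.LG cs.AI (updated version)

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

Yiran Pang, Zhen Ni, Xiangnan Zhong

Affiliations * Department of Electrical Engineering & Computer Science, Florida Atlantic University, Boca Raton, FL, USA

AI summary Addresses the non-identical input distributions and imbalanced parameter updates that differing state-transition dynamics cause in federated reinforcement learning across heterogeneous environments. Proposes personalized observation normalization, in which each agent normalizes raw state inputs with a locally maintained running mean and variance, accelerating training and improving performance.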

Comments Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025

Abstract

Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneous environments where differing state-transition dynamics lead to non-identical input distributions and imbalanced parameter updates during aggregation. Therefore, this paper develops a personalized observation normalization (PON) method, allowing each agent to locally normalize raw state inputs using a continuously updated running mean and variance. This design ensures consistent scaling of local features, so that no agent's features overshadow those of other agents during aggregation. Furthermore, we demonstrate that sharing normalization parameters across agents is ineffective due to the diverse local input distributions, which highlights the necessity of personalized statistics. Experiments on heterogeneous MuJoCo tasks show that our developed PON accelerates training and achieves superior performance compared to baseline methods.
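The core idea of locally maintained running statistics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `RunningNorm`, the use of Welford's update, the `eps` constant, and the plain averaging in `fed_avg` are all choices made for this sketch.

```python
import math


class RunningNorm:
    """Per-agent running mean/variance (Welford's algorithm), one slot per feature."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, obs):
        """Fold one raw observation into the local statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Scale an observation with this agent's own mean and (population) variance."""
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            out.append((x - self.mean[i]) / math.sqrt(var + self.eps))
        return out


def fed_avg(weight_lists):
    """Average policy weights across agents.

    Only policy weights are aggregated here; each agent's RunningNorm
    statistics stay local (the "personalized" part, as assumed in this sketch).
    """
    n = len(weight_lists)
    return [sum(ws) / n for ws in zip(*weight_lists)]
```

Two agents whose observations live on very different scales, say around 2 and around 200, would each map their own inputs to roughly zero mean and unit variance before the shared policy sees them, while `fed_avg` only ever touches the policy weights.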

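2605.27384 2026-05-28 cs.HC cs.AI cs.CL (updated version)

From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

Danai Korre

Affiliations * University of Bedfordshire

AI summary In a 90-participant within-subjects comparison, studies user preference between a highly human-like spoken embodied agent and a low human-like text-based agent in a mobile serious game. The highly human-like agent is significantly preferred, and the paper discusses how roles, mixed-initiative dialogue, and breakdowns/repairs bear on human-agent collaboration in goal-oriented tasks.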

Comments 4 pages, 5 figures, ACM CHI 2026 workshop paper

Abstract

This position paper reflects on empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a highly human-like, spoken embodied conversational agent (ECA) against a low human-like text-based agent (no embodiment, text bubble only) within a mobile, Unity-developed game about pre-decimal UK currency. The game included two agents with different roles: an Instructor (Alex) and a Shopkeeper/Collaborator. Users interacted using voice and mouse input. The quantitative data I collected included a usability questionnaire (CCIR MINERVA) and the Agent Persona Instrument. Data were analyzed using paired t-tests, repeated-measures ANOVA, and multiple linear regression to identify correlations between the persona and usability. The results showed a statistically significant preference for the version with highly human-like agents, with a large effect size. This is further discussed alongside qualitative findings from observations and exit interviews. The results are framed in terms of human-agent collaboration, especially how roles, mixed-initiative dialogue, and breakdowns/repairs become apparent in goal-oriented tasks. I conclude with questions on timing, user expectations, and role-specific interactions. This submission does not propose new frameworks; it reports empirical findings and questions I hope to workshop with the community.

2605.27383 2026-05-28 cs.CL cs.AI (updated version)

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen

Affiliations * Beijing University of Posts and Telecommunications; University of California, USA; Northwestern University, USA; Eastern Institute of Technology, Ningbo, China

AI summary Targets the collapse of expressivity that synthetic data causes in low-resource spoken language models. Proposes two self-alignment frameworks (DGSA and TDSC) to restore prosodic diversity, outperforming commercial systems and enabling the first zero-shot voice cloning capability for Lao.

Abstract

Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.

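2605.27381 2026-05-28 cs.CC cs.AI (updated version)

The Computational Boundary of Inference: Capability Internalization, Training, and the Turing Jump

Chien-Ping Lu

Affiliations * Chien-Ping Lu

AI summary Using classical computability theory, proves that finite internal self-modification cannot leave the current computational layer, whereas stabilized revision is governed by the Turing jump and reaches a stronger layer, thereby setting a computational boundary on recursive self-improvement narratives.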

Comments 11 pages, 1 figure, v2

Abstract

Claims about recursive self-improvement in AI often slide from repeated internal revision to the possibility of qualitatively stronger capability without clearly distinguishing the underlying computational regimes. This paper gives a formal separation result in classical computability theory that blocks that move under a precise modeling assumption. For an oracle $A$, let $\mathcal{C}(A)=\{B : B \leq_T A\}$ be the corresponding computational layer. We prove that finite internal self-modification remains inside $\mathcal{C}(A)$, while stabilized revision is governed instead by the jump $A'$ via the relativized limit lemma. Together with a local closure versus escape theorem, this yields a clean formal separation between within-layer iteration and ascent to a stronger relative level. The point is not that stronger layers never arise, but that they are not explained by finite repetition inside one already settled layer. The resulting separation gives a computability-theoretic limit on a broad class of recursive-improvement narratives in which repeated internal updating is treated as sufficient for qualitative capability ascent.
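For readers less familiar with the tools named in the abstract, the standard textbook statements can be written out as below. This is the classical relativized limit lemma and the strictness of the jump, not a restatement of the paper's own theorems; the symbols $g$, $x$, and $s$ are notation introduced for this note.

```latex
% Computational layer of an oracle A (as defined in the abstract)
\mathcal{C}(A) = \{\, B : B \leq_T A \,\}

% Relativized limit lemma (Shoenfield): for every set B,
B \leq_T A'
\iff
\exists\, g \leq_T A \ \text{total such that } \forall x:\;
B(x) = \lim_{s \to \infty} g(x, s)

% Strictness of the jump: A' is not computable from A
A <_T A', \qquad \text{hence} \qquad A' \notin \mathcal{C}(A)
```

Read informally: a set is computable from the jump $A'$ exactly when an $A$-computable guessing procedure converges to it in the limit, and since $A'$ itself lies outside $\mathcal{C}(A)$, limit-style stabilization can reach sets that no single finite $A$-computation decides.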

2605.27380 2026-05-28 cs.CL cs.AI (updated version)

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab

Affiliations * University of Stuttgart, Germany; Technical University of Berlin, Germany

AI summary Proposes BioELX, a two-stage framework that strengthens a SapBERT retriever with multilingual Wikidata aliases and uses a pre-trained LLM ranker for context-aware disambiguation, reaching state-of-the-art results on multiple benchmarks without task-specific annotated training data.

Comments 12 pages, 3 figures

Abstract

Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
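The headline metric, Recall@1, is simple to state in code. The sketch below is a generic definition of Recall@k for entity linking, assuming one ranked candidate list per mention; the function name and the identifiers in the usage note are hypothetical and are not taken from the paper or its benchmarks.

```python
def recall_at_k(ranked_candidates, gold_ids, k=1):
    """Fraction of mentions whose gold KB identifier is among the top-k candidates.

    ranked_candidates: one list of candidate identifiers per mention, best first.
    gold_ids: the correct identifier for each mention, in the same order.
    """
    if len(ranked_candidates) != len(gold_ids):
        raise ValueError("one candidate list per mention is required")
    hits = sum(
        1 for cands, gold in zip(ranked_candidates, gold_ids) if gold in cands[:k]
    )
    return hits / len(gold_ids)
```

For example, with hypothetical candidate lists `[["C001", "C002"], ["C010", "C011"], ["C020", "C021"], ["C031", "C030"]]` and gold identifiers `["C001", "C011", "C020", "C030"]`, Recall@1 is 0.5 and Recall@2 is 1.0. In a retrieve-then-rank pipeline, a second-stage ranker can raise Recall@1 only for mentions whose gold identifier the first stage already retrieved.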

2605.27376 2026-05-28 cs.CL cs.AI (updated version)

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

Affiliations * Department of Artificial Intelligence, Sungkyunkwan University, Korea; Department of Computer Science and Engineering, Sungkyunkwan University, Korea

AI summary Addresses the lack of fine-grained control and within-utterance style variation in prompt-based TTS models. Proposes inter-utterance style interpolation, based on direction-vector interpolation in the embedding space, and intra-utterance style transition, based on KV-cache swapping and sliding-window attention masking.

Abstract

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
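Two of the ingredients named in the abstract, direction-vector interpolation and sliding-window attention masking, can be sketched in a few lines. This is a toy illustration on plain Python lists, assuming style prompts are already embedded as fixed-length vectors; the function names, the scaling factor `alpha`, and the window convention are assumptions for this sketch, and KV-cache swapping is not shown.

```python
def interpolate_style(base, prompt_a, prompt_b, alpha):
    """Shift `base` along the direction from style A to style B.

    direction = embed(B) - embed(A); alpha = 0 leaves `base` unchanged,
    larger alpha moves further toward style B.
    """
    direction = [b - a for a, b in zip(prompt_a, prompt_b)]
    return [x + alpha * d for x, d in zip(base, direction)]


def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j.

    Position i sees itself and at most the `window - 1` positions before it,
    so very early tokens drop out of view for later positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With a window of 2 and a sequence of 4 positions, the last row of the mask is `[False, False, True, True]`: the final position can no longer attend to the first two tokens, which is one way to limit the early-token dominance the abstract describes.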

2605.27373 2026-05-28 cs.AI cs.CL cs.CY (updated version)

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

Affiliations * Universidad Politécnica de Madrid; CETINIA, Universidad Rey Juan Carlos

AI summary Proposes a tailorable LLM-based architecture that detects the intensity of human values in text through three modules (specification generation, text labeling, and intensity assessment), avoiding dependence on a specific value theory or complex prompt engineering. Experiments show good detection performance.

Comments 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5

Journal ref
Proc. ICAART 2026, Vol. 5, SciTePress, 2026, pp. 4096-4103

Abstract

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the task of conceptualising human values from the task of detecting them, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

2605.27365 2026-05-28 cs.CV cs.AI cs.LG cs.RO (updated version)

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

Affiliations * The Hong Kong Polytechnic University; Princeton University; Nanjing University; University of Illinois Urbana-Champaign

AI summary Proposes Parallel Box Decoding (PBD), which decodes bounding boxes and points as atomic units in a single step. Combined with the large-scale LocateAnything-Data dataset, it enables efficient unified grounding and detection, substantially raising decoding throughput while improving localization accuracy.

Comments fix github link

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.

2605.27348 2026-05-28 cs.CV cs.AI (updated version)

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

Jihyeon Kim, Sohee Kim, Soosan Lee, Souhwan Jung, James Matthew Rehg, Hyesong Choi

Affiliations * School of Computer Engineering, Hoseo University; School of Electronic Engineering, Soongsil University; School of Computer Science, University of Illinois at Urbana-Champaign

AI summary Proposes social gaze consistency as a high-level semantic cue. Through a diagnostic dataset, block-compositional caption supervision, and cross-architecture validation, shows that the cue is effective for detecting AI-generated images and explains why it transfers across generators.

Comments 23 pages, 2 figures, 17 tables

Abstract

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
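The abstract's remark that real- and fake-class recalls rise together, ruling out a "predict-all-fake" artifact, rests on how balanced accuracy is defined. The sketch below uses the standard binary definition (mean of the two per-class recalls); the label encoding and the toy data in the usage note are assumptions for illustration, not the paper's evaluation code or results.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (0 = real, 1 = fake).

    Assumes both classes occur at least once in y_true.
    """
    recalls = []
    for cls in (0, 1):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / 2
```

On a toy set with 2 real and 8 fake images, a detector that labels everything fake has plain accuracy 0.8 but balanced accuracy 0.5, because its real-class recall is 0. A gain in balanced accuracy accompanied by higher recall on both classes therefore cannot come from that shortcut.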

2605.27258 2026-05-28 cs.SD cs.AI 版本更新

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

PilotTTS:一种有纪律的模块化配方用于竞争性语音合成

Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu

发表机构 * Amap, Alibaba Group(阿里巴巴集团高德地图) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出PilotTTS轻量级自回归TTS系统,通过极简架构和严格数据工程(仅用20万小时开源处理数据)实现竞争性能,支持零样本语音克隆、情感/副语言/方言合成,在Seed-TTS Eval基准上取得最低WER和最高说话人相似度。

详情
AI中文摘要

构建最先进的文本转语音(TTS)系统通常需要数百万小时的专有数据和复杂的多阶段架构,这给资源受限的研究团队带来了巨大障碍。在本报告中,我们提出了PilotTTS,一种轻量级自回归TTS系统,通过极简架构和严格的数据工程实现了竞争性能。PilotTTS仅使用20万小时的数据进行训练,这些数据完全通过开源工具处理。具体来说,我们的贡献包括:(1)一个可复现的多阶段数据处理流水线,涵盖质量评估、标签标注和过滤;(2)一个紧凑的模型架构,采用基于Q-Former的条件化,通过跨样本配对训练将说话人身份与说话风格解耦。在统一框架内,PilotTTS支持零样本语音克隆、情感合成(11类)、副语言合成(4类)和中文方言合成(14种方言)。在Seed-TTS Eval基准上,PilotTTS在test-en上实现了最低的WER 1.50%,在test-zh上实现了CER 0.87%,并在两个测试集上取得了最高的说话人相似度(0.862和0.815),优于使用更大数据集训练的系统。我们在https://github.com/AMAPVOICE/PilotTTS上发布了完整的数据流水线配方、预训练权重和代码。

英文摘要

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.

2605.27155 2026-05-28 cs.CV cs.AI 版本更新

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

通过修复进行语义鲁棒性探测:面向安全关键目标检测的交互工具

Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock

发表机构 * Federal Institute for Occupational Safety and Health (BAuA)(联邦职业安全与健康研究所) Fraunhofer Institute for Manufacturing Engineering and Automation IPA(弗劳恩霍夫制造工程与自动化研究所 IPA)

AI总结 提出SemProbe工具,通过扩散模型可控修复生成语义探针,支持用户自定义掩码和因素,自动评估并记录目标检测模型的鲁棒性变化。

详情
AI中文摘要

在安全关键领域测试目标检测器需要超越像素级损坏的语义上有意义的探针。我们提出了SemProbe,一个用于语义鲁棒性探测的工具:用户上传部署图像,手动或自动创建掩码,选择操作设计域衍生因素(或自定义提示),并运行基于扩散的可控修复。系统支持批量作业、并行种子/工作流变体以及可配置的生成参数。每次输出后,自动运行模型推理并显示带有性能差异的带注释的前后对比。所有探针都作为结构化工件记录,从而能够提供与安全评估工作流一致的可追溯鲁棒性证据。我们在尺寸锯的手部检测上演示了SemProbe,针对保险导向测试标准中的因素。

英文摘要

Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually or automatically, select operational design domain-derived factors (or custom prompts), and run diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters. After each output, model inference runs automatically and displays annotated before/after comparisons with performance deltas. All probes are logged as structured artifacts, enabling traceable robustness evidence aligned with safety evaluation workflows. We demonstrate SemProbe on hand detection for dimension saws, targeting factors from insurance-oriented test criteria.

2605.26902 2026-05-28 cs.IR cs.AI 版本更新

ICICLE: Expanding Retrieval with In-Context Documents

ICICLE: 利用上下文文档扩展检索

Yu-Chen Den, Yung-Yu Shih, Zhi Rui Tam, Kuan-Yu Chen, Pu-Jen Cheng, Yun-Nung Chen, Eugene Yang

发表机构 * National Taiwan University(台湾大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出ICICLE框架,通过上下文文档的docid生成实现增量式生成检索,避免重新训练和灾难性遗忘。

详情
AI中文摘要

生成式检索(GR)使用参数化知识将查询直接映射到文档标识符(docid),但这种设计使得语料库扩展成本高昂:添加新文档需要更新模型参数以编码新的文档-docid关联,导致重复训练和对先前索引文档的灾难性遗忘。在这项工作中,我们将增量式GR重新定义为上下文检索问题,其中新添加的文档作为推理时的文档-docid证据提供。我们提出了ICICLE,一种上下文索引框架,它在参数化记忆和上下文提供的文档-docid对上执行源感知的docid生成。ICICLE结合了基于`[COPY]`的路由机制、基于偏好的校准和大上下文适应,以区分基于上下文的检索和参数化检索。在MS MARCO和NQ320K上的实验表明,ICICLE提高了新引入文档的检索性能,同时无需特定语料库的重新训练即可保持对已见文档的保留。我们的分析进一步表明,高样本退化主要由路由失败引起,突出了源选择校准作为扩展上下文生成式检索的关键瓶颈。

英文摘要

Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge. However, this design makes corpus expansion costly: adding new documents requires updating model parameters to encode new document-docid associations, which incurs repeated training and catastrophic forgetting of previously indexed documents. In this work, we revisit incremental GR as an in-context retrieval problem, where newly added documents are supplied as inference-time document-docid evidence. We propose ICICLE, an in-context indexing framework that performs source-aware docid generation over both parametric memory and context-provided document-docid pairs. ICICLE combines a `[COPY]`-based routing mechanism, preference-based calibration, and large context adaptation to distinguish context-grounded retrieval from parametric retrieval. Experiments on MS MARCO and NQ320K show that ICICLE improves retrieval of newly introduced documents while preserving seen-document retention without corpus-specific retraining. Our analysis further shows that high-shot degradation is mainly caused by routing failure, highlighting source-selection calibration as a key bottleneck for scaling in-context generative retrieval.

2605.26552 2026-05-28 cs.LG cs.AI 版本更新

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

通过摊销基于样本的变分推断来对齐少步生成模型

Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim, Woocheol Shin, Minsu Kim, Taeyoung Yun, Jeongjae Lee, Sanghyeok Choi, Tabitha Edith Lee, Jong Chul Ye, Jinkyoo Park

发表机构 * KAIST(韩国科学技术院) MongooseAI Mila – Quebec AI Institute(魁北克AI研究院) University of Edinburgh(爱丁堡大学) Université de Montréal(蒙特利尔大学) Omelet

AI总结 提出FAV框架,利用Stein变分梯度下降进行基于样本的变分推断,并通过固定点回归将粒子更新摊销到生成器参数中,实现对少步生成模型的对齐,在机器人操作和图像生成任务中优于现有方法。

Comments Under review

详情
AI中文摘要

对齐少步生成模型具有挑战性,因为现有的对齐框架通常依赖于限制性假设:可处理的似然、特定的ODE/SDE求解器或特定的模型族。我们引入了FAV(Few-step Generative Models Alignment via Sample-based Variational Inference),这是一个通用的对齐框架,仅需要对生成器和参考分布的样本访问。我们将对齐视为从倾斜于参考分布的奖励倾斜分布中采样。我们利用Stein变分梯度下降作为基于样本的变分推断方案,并通过固定点回归将粒子更新摊销到生成器参数中。我们在两个领域评估了FAV:机器人操作和图像生成器对齐。在机器人操作的生成策略对齐中,FAV在56个离线RL任务和30个离线到在线RL任务中优于现有的策略提取基线。对于图像生成器对齐,FAV微调了多种少步骨干模型,包括GAN、漂移模型、一致性模型和流映射,从ImageNet-$256$扩展到1024$^2$文本到图像合成。代码可在https://github.com/Jaewoopudding/FAV获取。

英文摘要

Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few-step Generative Models Alignment via Sample-based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV on two domains: robotic manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including GANs, drifting models, consistency models, and flow maps, scaling from ImageNet-$256$ to 1024$^2$ text-to-image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.
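
The Stein Variational Gradient Descent step that FAV builds on can be sketched in one dimension. This is a minimal illustration of the standard SVGD particle update with an RBF kernel, not the authors' code: the names `svgd_step` and `tilted_score`, the bandwidth, the step size, and the toy target (a standard-normal reference tilted by a linear reward, which gives a unit-variance Gaussian centered at 1) are all assumptions chosen for illustration. The amortization of the updates into generator parameters is not shown.

```python
import math


def svgd_step(particles, grad_log_p, step=0.1, bandwidth=1.0):
    """One SVGD update for scalar particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2 * bandwidth ** 2))
            # Derivative of the kernel with respect to xj: acts as repulsion.
            grad_k = -(xj - xi) / bandwidth ** 2 * k
            # Kernel-weighted score pulls particles toward high density.
            phi += k * grad_log_p(xj) + grad_k
        updated.append(xi + step * phi / n)
    return updated


def tilted_score(x):
    # Toy target (assumption): log p(x) = -x^2/2 + r(x)/beta with r(x) = x, beta = 1,
    # so the score is d/dx log p(x) = 1 - x.
    return 1.0 - x


particles = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(2000):
    particles = svgd_step(particles, tilted_score)
mean = sum(particles) / len(particles)
```

In this toy setting the particles drift toward the tilted mode while the kernel-gradient term keeps them from collapsing onto a single point.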

2605.26189 2026-05-28 cs.LG cs.AI 版本更新

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

近无损 HiF8 W8A8 量化感知训练的最大窗口尺度估计

Yingying Cheng, Jinquan Shi, Li Zhou, Zhiyang He, Zhaoyi Sun, Fan Zhang, Jie Sun

发表机构 * OpenPangu-Embedded-1B

AI总结 针对 HiF8 W8A8 量化感知训练中的两种正交失效模式(amax 饱和与灾难性遗忘),提出保守的 64 步历史窗口最大算法 DTS 策略和 500 步 BF16 预热加低学习率 QAT 的修复方案,在 OpenPangu-Embedded-1B 上实现接近 BF16 基线的性能。

详情
AI中文摘要

使用低位浮点格式的量化感知训练(QAT)能够实现高效的 LLM 部署,但会引入标准训练指标无法察觉的微妙失效模式。我们通过延迟张量缩放(DTS)的视角,对 OpenPangu-Embedded-1B 的 HiF8 W8A8 QAT 进行了系统研究。在八个受控实验中,我们识别并解耦了两种正交的失效模式:(i) amax 饱和,其中延迟的尺度估计通过前向传播裁剪静默地破坏知识敏感表示,以及 (ii) 灾难性遗忘,其中激进的学习率独立于量化覆盖预训练的常识知识。两者都无法仅从训练损失中检测到。我们通过保守的 64 步历史窗口最大算法 DTS 策略解决 amax 饱和,并通过 500 步 BF16 预热后以 lr=10^{-5} 进行 QAT 来缓解遗忘。两种修复都是必要且充分的:我们的最终配置在匹配的 BF16 基线上实现了 0.43% MMLU 下降、0.58% HellaSwag 下降和 0.22% ARC-Challenge 下降,训练损失 APE 在 10,000 步内仅为 0.11%。

英文摘要

Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i) amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii) catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at lr=10^{-5}. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
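
The max-over-history scale estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the class name `MaxWindowScale` is hypothetical, and `fmt_max` is a placeholder for the largest representable magnitude of the target format (the abstract does not give HiF8's range).

```python
from collections import deque


class MaxWindowScale:
    """Delayed tensor scaling: the scale used now comes from amax values of past steps."""

    def __init__(self, window=64, fmt_max=448.0):
        # fmt_max is a placeholder value (assumption), not HiF8's actual range.
        self.history = deque(maxlen=window)
        self.fmt_max = fmt_max

    def scale(self):
        # Conservative choice: take the largest amax in the window, so a recent
        # spike keeps the scale small enough to avoid clipping for `window` steps.
        if not self.history or max(self.history) == 0.0:
            return 1.0
        return self.fmt_max / max(self.history)

    def update(self, values):
        # Record the absolute maximum observed at the current step.
        self.history.append(max(abs(v) for v in values))


est = MaxWindowScale(window=3, fmt_max=100.0)
est.update([1.0, -2.0])
scale_after_first = est.scale()
est.update([10.0])
scale_after_spike = est.scale()
for _ in range(3):
    est.update([1.0])
scale_after_decay = est.scale()
```

Using the window maximum instead of the most recent amax trades some dynamic range for protection against saturation when a later step produces a larger activation than the one the scale was derived from.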

2605.26186 2026-05-28 cs.SE cs.AI cs.CL cs.LG 版本更新

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

SetupX:LLM代理能否从过去的功能正确代码仓库设置失败中学习?

Zihang Zhou, Ziqian Ren, Yukai Wu, Yingjie Xiong, Wei Zhou, Chao Peng, Dong Zhang, Bingheng Yan, Xuanhe Zhou, Fan Wu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Beijing University of Posts and Telecommunications(北京邮电大学) Independent Researcher(独立研究者) Jinan Inspur Data Technology Co., Ltd.(济南浪潮数据技术有限公司)

AI总结 提出SetupX框架,通过经验学习、推测执行和验证协议解决代码仓库设置中的跨仓库经验迁移、多步试错和鲁棒验证问题,在基准测试中达到92%通过率。

Comments 21 pages, 6 figures

详情
AI中文摘要

功能正确的仓库设置旨在配置执行环境(例如,依赖项、构建脚本)以成功执行仓库的文档化功能。由于多样化的、特定于仓库的失败(包括依赖项不兼容、缺少工具链、不完整的安装和验证策略不匹配),这带来了重大挑战。现有的LLM代理难以稳健地解决这些问题,具体来说,它们无法支持(1)跨仓库经验迁移,(2)在不可逆状态变化下的多步试错修复,以及(3)对设置结果的鲁棒验证,以区分设置引起的失败和仓库错误。为了解决这些问题,我们引入了SetupX,一个基于经验学习的设置框架。首先,我们构建了自进化经验表示(XPU),一种双模态知识单元,编码设置信号、文本指导和可执行动作,以动态地将已验证的环境修复迁移到未见过的仓库。其次,我们采用了由LIFO Docker快照栈支持的经验增强推测执行,使代理能够主动尝试修复并安全回滚到已知的良好状态。第三,我们引入了检察官-法官验证协议,将证据收集与最终判断分离,从而实现超越表面构建时度量的更可靠的设置验证。在精心设计的基准测试上的评估结果表明,SetupX达到了最高性能(例如,92%的通过率),并且比最强基线高出19%以上。关键的是,SetupX在需要跨不同容器协调多个互连服务的复杂多仓库设置中表现出色。代码仓库可在https://github.com/OpenDataBox/SetupX获取。

英文摘要

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address these issues, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, and executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation on carefully crafted benchmarks shows that SetupX achieves the highest performance (e.g., a 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setups that require coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
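
The speculative-execution idea (trial a fix, keep it if verification passes, otherwise roll back) can be sketched with a LIFO stack of snapshots. This is an illustrative sketch only: an in-memory dictionary stands in for the container state, and `SnapshotStack`, `try_fix`, and the `verify` callback are hypothetical names, not SetupX's API.

```python
import copy


class SnapshotStack:
    """LIFO stack of environment snapshots (a dict stands in for a container)."""

    def __init__(self, state):
        self.state = state
        self._stack = []

    def snapshot(self):
        # Push a deep copy so later mutations cannot alter the saved state.
        self._stack.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recent snapshot (last in, first out).
        self.state = self._stack.pop()

    def commit(self):
        # The trial succeeded, so the saved copy is no longer needed.
        self._stack.pop()


def try_fix(env, fix, verify):
    """Apply `fix` speculatively; keep it only if `verify` accepts the result."""
    env.snapshot()
    fix(env.state)
    if verify(env.state):
        env.commit()
        return True
    env.rollback()
    return False


env = SnapshotStack({"packages": ["numpy"]})
bad = try_fix(env, lambda s: s["packages"].append("broken-pkg"),
              lambda s: "broken-pkg" not in s["packages"])
good = try_fix(env, lambda s: s["packages"].append("requests"),
               lambda s: "requests" in s["packages"])
```

Snapshotting before each trial is what makes multi-step repair safe when a fix cannot be undone by simply running an inverse command.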

2605.26114 2026-05-28 cs.AI cs.CL 版本更新

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

MobileGym: 一个可验证且高度并行的移动GUI智能体研究仿真平台

Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Peking University(北京大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出MobileGym,一个基于浏览器的轻量级、完全可控的移动环境,通过结构化JSON状态实现可验证结果信号和低成本并行强化学习,并附带包含416个参数化任务模板的基准测试集。

Comments Project page: https://mobilegym.github.io

详情
AI中文摘要

我们提出MobileGym,一个托管于浏览器、轻量级、完全可控的日常移动使用环境,旨在实现交互保真度而无需复制专有后端。它实现了之前日常应用无法实现的两种能力:通过基于确定性状态判断的结构化JSON状态实现可验证的结果信号,以及通过低成本的并行回滚实现可扩展的在线强化学习。完整的环境状态被捕获、配置、分支和比较为结构化JSON,单个服务器可托管数百个并行实例,每个实例约400 MB内存,冷启动约3秒。分层状态模型和声明式任务定义框架使状态可编程性和任务创建在大规模下实用,单一的程序化判断机制同时提供确定性评估结果和密集的强化学习奖励。配套的MobileGym-Bench提供了416个参数化任务模板,包括256个测试模板和160个训练模板,覆盖28个应用,具有确定性判断器和结构化的AnswerSheet协议,避免了自由文本匹配失败。在Sim-to-Real案例研究中,Qwen3-VL-4B-Instruct上的GRPO在256任务测试集上获得了+12.8个百分点的提升,在59任务真实设备信号子集上,真实设备执行保留了模拟侧训练增益的95.1%。项目页面:https://mobilegym.github.io。

英文摘要

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
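
Deterministic state-based judging over structured JSON can be sketched as a comparison between the final environment state and a set of expected values. The state layout, the dotted-path convention, and the names `get_path` and `judge` are assumptions made for illustration; the abstract does not specify MobileGym's actual schema or judge interface.

```python
def get_path(state, path):
    """Follow a dotted path such as 'alarm.hour' through nested dicts."""
    node = state
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def judge(state, expected):
    """Return (verdict, reward) for a final state.

    verdict is True only if every expected path matches; reward is the
    fraction of matching paths, usable as a dense training signal.
    """
    if not expected:
        return True, 1.0
    hits = sum(1 for path, value in expected.items()
               if get_path(state, path) == value)
    return hits == len(expected), hits / len(expected)


final_state = {"alarm": {"hour": 7, "enabled": True}, "wifi": {"on": False}}
verdict, reward = judge(final_state, {"alarm.hour": 7, "wifi.on": True})
```

Because the verdict depends only on the serialized state, the same check yields both a pass/fail evaluation result and a graded reward without any free-text matching.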

2605.25815 2026-05-28 cs.AI cs.MA 版本更新

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

EvoMap背后:表征一个自进化的智能体间协作网络

Qiming Ye, Peixain Zhang, Yupeng He, Zifan Peng, Gareth Tyson

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 通过分析EvoMap网络中的150万资产和12.8万智能体,揭示其设计选择在可重用性、演化和可审计性方面的权衡,发现奖励机制导致98%资产未被重用、评分系统易被操纵以及验证机制存在缺陷。

详情
AI中文摘要

智能体间(A2A)网络通过共享可重用的问题解决指令,使自主AI智能体能够协作。然而,这些去中心化生态系统在实践中如何运作仍然在很大程度上未被探索。我们首次对EvoMap(一个突出的A2A协作网络)进行了大规模实证研究。通过分析超过150万资产和12.8万智能体,我们展示了优先考虑可扩展增长的设计选择如何在可重用性、演化和可审计性方面引入权衡。首先,EvoMap的信用经济奖励智能体发布有价值的资产。尽管这种设计鼓励大规模参与,但奖励主要与发布而非采用挂钩。这导致智能体大量生产资产以积累信用。结果,98%的资产从未被重用,而奖励高度集中在少数智能体手中。其次,EvoMap采用一种算法(称为GDI)来评分和排序这些共享资产的质量。我们证明该评分系统存在缺陷:资产的排名并非衡量客观性能,而是严重受未经验证的自我报告元数据(例如声称修改的代码行数)支配。这使得智能体可以轻易操纵其资产的分数。最后,EvoMap依赖智能体提供本地执行日志作为上传资产功能正常的证据。由于这些验证未经独立核实,超过84%的已批准资产使用空测试(例如console.log())绕过质量检查。我们的发现表明,未来的A2A协作网络不能仅依赖未经验证的自我报告。可扩展的协作需要平衡开放参与与可验证执行和可信评估的机制。

英文摘要

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their assets' scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log()). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

2605.25378 2026-05-28 cs.CV cs.AI 版本更新

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

CollectionLoRA: 通过多教师在线策略蒸馏将50种效果收集到1个LoRA中

Fangtai Wu, Hailong Guo, Shijie Huang, Jiayi Song, Yubo Huang, Mushui Liu, Zhao Wang, Yunlong Yu, Jiaming Liu, Ruihua Huang

发表机构 * Zhejiang University(浙江大学) Qwen Applications Business Group of Alibaba(阿里巴巴Qwen应用业务组) Xi'an Jiaotong University(西安交通大学)

AI总结 提出CollectionLoRA框架,通过多教师在线策略蒸馏将多达50种不同效果LoRA和少步生成能力整合到单个LoRA中,解决参数干扰并降低部署成本。

详情
AI中文摘要

定制图像编辑旨在使用有限的配对数据,通常通过低秩适配(LoRA)为预训练扩散模型配备特定的视觉效果。随着所需效果数量的增加,存储和动态加载这些效果LoRA会显著增加部署开销。此外,当前的流程通常将这些效果LoRA与加速模块级联以实现快速生成,这会引发严重的参数干扰,导致概念混淆和风格退化。我们提出了CollectionLoRA,一个多教师在线策略蒸馏框架,能够将多达50种不同效果LoRA的概念以及少步生成能力蒸馏到单个LoRA中。这从根本上解决了特征干扰问题,并显著降低了部署成本。具体来说,该方法引入了(i)概率双流路由机制,使模型在训练期间能够在数据源之间随机切换,有效增强其在未见场景中的泛化能力;(ii)非对称正交提示策略,在提示空间内实现概念隔离;(iii)从粗到细的蒸馏目标,以缓解教师模型与学生模型之间的分布差距。大量评估表明,CollectionLoRA将所有定制效果和少步生成蒸馏到单个LoRA中,降低了部署开销,同时实现了与独立训练的教师模型相当或更好的概念保真度。代码:https://github.com/Qwen-Applications/CollectionLoRA

英文摘要

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading numerous effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that enables the model to randomly switch between data sources during training, effectively enhancing its generalization in unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space; (iii) a Coarse-to-Fine Distillation Objective to mitigate the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few-step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than independently trained teacher models. Code: https://github.com/Qwen-Applications/CollectionLoRA

2605.25252 2026-05-28 cs.LG cs.AI 版本更新

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

量化 RLVR 中计算与监督的实证权衡

Ryo Mitsuhashi, Patrick Chen, Isabelle Tseng, Jasin Cekinmez, Addison J. Wu

发表机构 * Princeton University(普林斯顿大学)

AI总结 通过 GSM8K 上的 GRPO 实验,研究验证器噪声对 RLVR 的影响,发现计算扩展无法弥补监督噪声,且假阴性比假阳性危害更大。

Comments Workshop on Combining Theory and Benchmarks @ ICML 2026

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为后训练语言模型的标准范式,但在实践中,验证器很少是完美的。最近的理论工作预测,验证器噪声会影响学习速率,但不影响最终结果,这意味着足够的计算应该能够弥补不完美监督带来的任何差距。我们通过在 GSM8K 上使用 GRPO 对 Qwen2.5(0.5B, 1.5B)进行后训练,同时向二元正确性信号中注入受控的假阳性和假阴性噪声,并将每次提示的 rollout 数量作为计算轴,来实证检验这一预测。在实践中,验证准确率的差距在大量计算扩展下仍然存在,且计算收益急剧递减。我们进一步发现一种结构性不对称:假阴性单调地比假阳性更快地降低性能。这些发现表明,验证器质量和训练计算不可互换,并且减少假阴性比单纯扩展计算更有效。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, verifiers are rarely perfect. Recent theoretical work predicts that verifier noise affects the rate of learning but not its final outcome, implying that sufficient compute should close any gap induced by imperfect supervision. We test this prediction empirically by post-training Qwen2.5 (0.5B, 1.5B) with GRPO on GSM8K while injecting controlled false-positive and false-negative noise into the binary correctness signal, and varying rollouts per prompt as a compute axis. In practice, the gap in validation accuracy persists under substantial compute scaling, with returns to compute that are sharply diminishing. We further find a structural asymmetry where false negatives monotonically degrade performance more quickly than false positives. These findings suggest verifier quality and training compute are not interchangeable, and that reducing false negatives is a more effective lever than scaling compute alone.
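
The controlled noise injection described above can be sketched as a wrapper around a binary correctness signal. This is a minimal illustration, assuming independent label flips at fixed rates; the function name `noisy_reward` and its parameters are hypothetical, and the paper's exact noise model may differ.

```python
import random


def noisy_reward(is_correct, fp_rate, fn_rate, rng):
    """Return a 0/1 reward after injecting verifier noise.

    fn_rate: probability that a correct answer is marked wrong (false negative).
    fp_rate: probability that a wrong answer is marked correct (false positive).
    """
    if is_correct:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0


rng = random.Random(0)
# With a 20% false-negative rate, roughly one in five correct rollouts loses its reward.
rewards = [noisy_reward(True, fp_rate=0.0, fn_rate=0.2, rng=rng) for _ in range(10000)]
observed_fn = 1 - sum(rewards) / len(rewards)
```

Keeping the two rates as separate parameters is what allows the asymmetry between false negatives and false positives to be measured independently.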

2605.25230 2026-05-28 cs.AI 版本更新

Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

通过引导推理提升推理能力:递归模型的随机探索

Andrew Corbett, Archit Sood, Anna Tzatzopoulou, Sai-Aakash Ramesh, Tim Dodwell

发表机构 * digiLab, UK(digiLab, 英国) University of Bristol, UK(英国布里斯托尔大学)

AI总结 提出引导随机探索方法,通过随机扰动推理轨迹并在线重加权,提升递归模型在结构化推理任务上的性能,无需重新训练。

Comments Presented at the ICML 2026 Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM), Seoul, South Korea, 2026

详情
AI中文摘要

最近关于递归架构的研究表明,小型神经网络在结构化推理任务上可以出奇地强大。其诀窍是用潜在动力系统对推理轨迹进行建模。我们认为,这些架构的推理时行为最好被理解为对潜在推理轨迹的近似推理,其中确定性递归是单粒子、零噪声极限。我们通过引导随机探索使这一观点可操作:推理动力学的随机扰动提出相邻轨迹,而模型现有的早停头在线重新加权它们。该框架产生三个无标签诊断指标:局部稳定性、引导对齐度和云令牌熵。这些指标仅从推理轨迹就能预测该过程是否有帮助以及应信任其哪些输出。在Sudoku-Extreme上,它无需重新训练就将精确求解准确率从85.9%提升到98.0%;在Maze-Hard上,诊断指标标记出引导未对齐,后续验证性能也证实了这一点。因此,同一机制既能刻画递归推理在轨迹层面何时有改进空间,也能刻画模型内部引导何时能恢复它。

英文摘要

Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the inference-time behaviour of these architectures is best understood as approximate inference over latent reasoning trajectories, with deterministic recursion as the one-particle, zero-noise limit. We make this view operational through guided stochastic exploration: stochastic perturbations of the reasoning dynamics propose neighbouring trajectories, and the model's existing early-stopping head reweights them online. The framework yields three label-free diagnostics: local stability, guide alignment, and cloud-token entropy. These predict, from inference traces alone, whether the procedure will help and which of its outputs to trust. On Sudoku-Extreme it lifts exact-solve accuracy from 85.9% to 98.0% without retraining; on Maze-Hard the diagnostics flag a misaligned guide, as validation performance later confirms. The same machinery thus characterises both when recursive reasoning has room to improve at the trajectory level and when the model's internal guide can recover it.

2605.25183 2026-05-28 cs.CL cs.AI 版本更新

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

知识图谱驱动的神经科学专家级推理

Jake Stephen, Niraj K. Jha

发表机构 * Department of Electrical and Computer Engineering, Princeton University(普林斯顿大学电气与计算机工程系)

AI总结 本文通过从单一教科书构建知识图谱并生成问答监督,微调语言模型,实现超越大语言模型的专家级神经科学推理。

详情
AI中文摘要

知识图谱(KG)是一种可以从文本语料库中提取并用于深度推理的抽象结构。先前的工作利用KG微调语言模型(LM),实现了特定领域的超智能。在这项工作中,我们探索仅使用单一权威教科书中的信息,KG驱动的深度推理能力是否能在神经科学中出现。核心假设是,结构化知识在被提炼为高质量KG并转换为基于KG的问答(QA)监督后,足以通过微调LM产生专家级推理,该LM在准确率上超越大型语言模型(LLM),同时参数数量少几个数量级。我们通过双LLM验证流水线构建教科书衍生的KG,使用在KG拓扑上训练的掩码LM扩展它,生成多跳QA项目(包括QA对和推理轨迹),以仅基于KG的监督微调LM,并应用强化学习,使用路径衍生的KG信号作为隐式奖励模型。我们的结果表明,深度、机械性的神经科学理解可以在模型中诱导,而无需依赖大型、异构的网络规模语料库。基于KG的神经科学合成课程(读者可以自我测试)以及微调后的LM可在以下GitHub位置获取:https://kg-bottom-up-superintelligence.github.io/neuro-bench。

英文摘要

A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items (QA pairs with reasoning traces) to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum, which readers can quiz themselves on, and the fine-tuned LM are available at https://kg-bottom-up-superintelligence.github.io/neuro-bench.

2605.23908 2026-05-28 cs.AI cs.CL cs.CV cs.NE 版本更新

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

寻找开放性的要素:用大型视觉语言模型复现 Picbreeder

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

发表机构 * New York University(纽约大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本研究通过用前沿视觉语言模型替代人类用户复现 Picbreeder,探索人工智能在无引导发现中的开放性能力,并分析系统输出与人类基线在系统发育复杂性、视觉和语义显著性及新颖性上的差异,同时研究探索性噪声、行为多样性和叙事动量等因素的影响。

Comments 26 pages, 21 figures, to be published at GECCO 2026

详情
AI中文摘要

我们正处于大规模工业和学术努力之中,旨在通过AI驱动的助手自动化科学、技术和创造性生产的过程。历史上,这些过程在人类形式中的一个基本属性是它们的开放性:即生成看似无穷无尽的新颖且有意义的新形式的能力。人工代理是否有能力进行这种富有成果的无引导发现?为了回答这个问题,我们转向Picbreeder,这是人类驱动的开放性搜索的典型范例,用户通过小型神经网络的交互式进化协作生成多样化的图像库。我们复现了Picbreeder,用前沿视觉语言模型(VLM)替代人类用户。我们观察到系统输出与历史人类基线之间存在明显的定性差异,并尝试使用系统发育复杂性、视觉和语义显著性及新颖性的指标来表征这些差异。为了识别导致这些差异的一些因果因素,我们研究了在代理的选择过程中添加探索性噪声、代理之间的行为多样性以及以过去行动记忆形式的叙事动量。我们的代码可在 https://github.com/smearle/picbreeder-vlm 获取。

英文摘要

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

2605.22547 2026-05-28 cs.CV cs.AI 版本更新

Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

基于多模态知识图谱和可靠性引导精化的病例感知医学图像分类

Yiming Xu, Yixuan Liu, Yuhang Zhang, Ling Zheng, Yihan Wang, Qi Song

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种基于多模态知识图谱的病例感知推理框架,通过构建结构化诊断记忆、自适应检索相似病例、知识传播与注入机制以及置信度校准的决策精化方案,提升医学图像分类的性能和可解释性。

详情
AI中文摘要

深度学习为医学图像分类带来了显著进展,但现有方法大多依赖孤立的视觉证据,无法有效利用相似病例或外部知识。在临床实践中,诊断通常由相似历史病例及其相关症状支持。为了显式建模这一循证诊断过程,我们提出了一种由多模态知识图谱驱动的病例感知推理框架,用于医学图像分类。具体而言,我们构建了一个病例感知的多模态知识图谱作为结构化的诊断记忆,其中疾病、图像和症状按层次组织。给定输入图像,我们的方法自适应地从该记忆中检索相似病例,并提取相应的以病例为中心的子图。我们进一步引入了一种知识传播与注入机制,其中以图像为中心的图注意力网络将异质语义聚合为基于病例的特征,随后通过双向跨模态注意力机制将这些特征注入视觉表示以实现跨模态对齐。为了减轻噪声检索,我们设计了一种置信度校准的决策精化方案,通过联合考虑预测置信度和样本相似性来估计每个检索病例的可靠性,并重新加权其对最终预测的贡献,提供可解释的病例级证据。在多个医学影像数据集上的大量实验表明,我们的方法一致优于强基线,而消融和定性分析验证了其有效性和可解释性。代码可在 https://anonymous.4open.science/r/MKG-CARE-8B7B 获取。

英文摘要

Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by similar historical cases and their associated symptoms. To explicitly model this evidence-based diagnostic process, we propose a case-aware reasoning framework driven by multimodal knowledge graphs for medical image classification. Specifically, we construct a case-aware multimodal knowledge graph as a structured diagnostic memory, where diseases, images, and symptoms are hierarchically organized. Given an input image, our method adaptively retrieves similar cases from this memory and extracts their corresponding case-centered subgraphs. We further introduce a knowledge propagation and injection mechanism, in which an image-centric Graph Attention Network aggregates heterogeneous semantics into case-based features, followed by a bidirectional cross-modal attention mechanism that injects these features into visual representations for cross-modal alignment. To mitigate noisy retrieval, we design a confidence-calibrated decision refinement scheme that estimates the reliability of each retrieved case by jointly considering prediction confidence and sample similarity, and reweights its contribution to the final prediction, providing interpretable case-level evidence. Extensive experiments on multiple medical imaging datasets demonstrate that our approach consistently outperforms strong baselines, while ablation and qualitative analyses validate its effectiveness and interpretability. The code is available at https://anonymous.4open.science/r/MKG-CARE-8B7B.

2605.22166 2026-05-28 cs.AI 版本更新

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

适配接口而非模型:面向确定性LLM智能体的运行时框架适配

Tianshi Xu, Huifeng Wen, Meng Li

发表机构 * Peking University(北京大学)

AI总结 提出Life-Harness运行时框架,通过从训练轨迹中演化出可复用的环境侧干预,在不修改模型权重或评估环境的情况下,显著提升冻结LLM智能体在确定性任务中的性能。

Comments Work in progress

详情
AI中文摘要

LLM智能体不仅由其语言模型塑造,还受运行时框架的影响,该框架协调观察、工具使用、动作执行、反馈解释和轨迹控制。虽然现有的智能体适配方法主要更新模型参数,但在确定性、规则主导的领域中,许多失败源于模型-环境接口的不匹配。我们提出Life-Harness,一种生命周期感知的运行时框架,在不改变模型权重或评估环境的情况下改进冻结的LLM智能体。Life-Harness从训练轨迹中演化,通过将重复出现的交互失败转化为跨环境契约、程序技能、动作实现和轨迹调节的可复用干预,并在未见任务上保持固定以进行评估。在来自$\tau$-bench、$\tau^2$-bench和AgentBench的七个确定性环境中,Life-Harness在18个模型骨干上的126个模型-环境设置中改进了116个,平均相对提升88.5%。仅从Qwen3-4B-Instruct轨迹演化出的框架可迁移到其他17个模型,表明Life-Harness捕获的是可复用的环境侧结构而非模型特定行为。这些结果将运行时接口适配定位为以模型为中心的智能体训练的互补替代方案。代码可在https://github.com/Tianshi-Xu/Life-Harness获取。

英文摘要

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model-environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed for evaluation on unseen tasks. On seven deterministic environments from $\tau$-bench, $\tau^2$-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at https://github.com/Tianshi-Xu/Life-Harness.

2510.20665 2026-05-28 cs.AI 版本更新

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

推理的形状:大型语言模型中推理轨迹的拓扑分析

Xue Wen Tan, Nathaniel Tan, Galen Lee, Stanley Kok

发表机构 * University of Cambridge, Department of Engineering, England(剑桥大学工程系) National University of Singapore, School of Computing, Singapore(新加坡国立大学计算机学院)

AI总结 提出基于拓扑数据分析(TDA)的评估框架,通过捕捉推理轨迹的几何结构实现高效自动评估,实验表明拓扑特征比图指标更有效预测推理质量。

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

详情
AI中文摘要

评估大型语言模型推理轨迹的质量仍然研究不足、劳动密集且不可靠:当前实践依赖于专家评分标准、手动注释和缓慢的成对判断。自动化努力主要由基于图的代理主导,这些代理量化结构连通性,但未阐明高质量推理的构成;对于固有复杂的过程,这种抽象可能过于简单。我们引入了一个基于拓扑数据分析(TDA)的评估框架,该框架捕捉推理轨迹的几何结构,并实现标签高效、自动化的评估。在我们的实证研究中,拓扑特征在评估推理质量方面比标准图指标具有更高的预测能力,这表明有效推理更好地由高维几何结构而非纯关系图来捕捉。我们进一步表明,一组紧凑、稳定的拓扑特征可靠地指示轨迹质量,为未来的强化学习算法提供了实用信号。

英文摘要

Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practice relies on expert rubrics, manual annotation, and slow pairwise judgments. Automated efforts are dominated by graph-based proxies that quantify structural connectivity but do not clarify what constitutes high-quality reasoning; such abstractions can be overly simplistic for inherently complex processes. We introduce a topological data analysis (TDA)-based evaluation framework that captures the geometry of reasoning traces and enables label-efficient, automated assessment. In our empirical study, topological features yield substantially higher predictive power for assessing reasoning quality than standard graph metrics, suggesting that effective reasoning is better captured by higher-dimensional geometric structures rather than purely relational graphs. We further show that a compact, stable set of topological features reliably indicates trace quality, offering a practical signal for future reinforcement learning algorithms.
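As a rough illustration of what a topological feature of a reasoning trace could look like, the sketch below computes 0-dimensional persistence (the scales at which connected components merge) for a small point cloud. It assumes, hypothetically, that each reasoning step has already been embedded as a point; the function name `h0_death_times` and the toy coordinates are illustrative and are not taken from the paper, whose actual feature set is not specified in the abstract.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def h0_death_times(points: Sequence[Tuple[float, ...]]) -> List[float]:
    """Return the distances at which connected components merge.

    This is 0-dimensional persistent homology of a Vietoris-Rips filtration:
    every point is born at scale 0, and a component dies when an edge first
    joins it to another component (Kruskal's algorithm with union-find).
    """
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        # Path-halving find for the union-find structure.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths: List[float] = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge: one component dies at this scale
            deaths.append(dist)
    return deaths


# Hypothetical 2-D embeddings of three reasoning steps (illustration only).
trace = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
deaths = h0_death_times(trace)
```

For `n` points the function returns `n - 1` finite death times; one component persists at every scale. Summary statistics of such lifetimes are one common way to turn a persistence diagram into fixed-length features, though whether the paper does so is an assumption here.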

2605.21832 2026-05-28 cs.AI 版本更新

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

FLUID:从临时ID到多模态语义编码的工业级直播推荐

Xinhang Yuan, Zexi Huang, Anjia Cao, Xudong Lu, Zikai Wang, Penghao Zhou, Chang Liu, Wentao Guo, Qinglei Wang

发表机构 * TikTok ByteDance(字节跳动)

AI总结 针对直播推荐中ID冷启动问题,提出FLUID框架,通过跨域多模态编码器生成层次化语义编码LUCID替代候选侧ID,并采用分阶段预热方案,在工业级系统上取得显著提升。

详情
AI中文摘要

现代推荐系统严重依赖基于ID的协同过滤:每个项目由一个独特的ID嵌入表示,该嵌入从用户交互中积累协同信号。然而,直播推荐在这种范式下面临独特挑战:直播间通常仅播出几十分钟,因此其项目ID在持续的冷启动状态下学习不佳,以ID为中心的排序模型无法泛化。我们提出FLUID,这是第一个从生产规模的直播排序器中完全淘汰候选侧项目ID的框架。FLUID引入了一个跨域多模态编码器,在短视频和直播上联合训练,生成离散的层次化语义编码,称为LUCID,用于基于内容的项目表征。为了使排序器适应LUCID,FLUID进一步采用分阶段预热方案:首先将冷启动的切片级LUCID作为独立标记与ID嵌入一起引入,然后在在线增量训练之前用热启动的房间级LUCID替换ID嵌入。FLUID部署在我们的工业级直播推荐系统上,该系统的跨平台合并用户基数超过十亿,取得了显著的在线收益:优质观看时长+0.55%,冷启动房间观看量+2.05%,活跃小时数+0.05%。

英文摘要

Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state and ID-centric ranking models fail to generalize. We present FLUID, the first framework to fully retire the candidate-side item ID from a production-scale livestreaming ranker. FLUID introduces a cross-domain multimodal encoder, jointly trained on short videos and livestreams, to produce discrete hierarchical semantic codes, called LUCID, for content-based item characterization. To adapt the ranker to LUCID, FLUID further employs a staged warmup scheme: it first incorporates cold, slice-level LUCID as an independent token alongside the ID embedding, and then replaces the ID embedding with warm, room-level LUCID before online incremental training. Deployed on our industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally, FLUID delivers significant online gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.

2605.21743 2026-05-28 cs.AI econ.GN q-fin.EC 版本更新

Who Uses AI? Platform Selection and the Measurement of Occupational AI Exposure

谁在使用AI?平台选择与职业AI暴露的测量

Michelle Yin, Burhan Ogut

发表机构 * School of Education and Social Policy, Northwestern University(教育与社会政策学院,西北大学) American Institutes for Research(美国研究机构)

AI总结 本文通过分析AI平台对话日志,揭示平台用户构成导致职业AI暴露测量偏差,并提出劳动力加权部分识别方法校正估计。

详情
AI中文摘要

来自AI平台的对话日志越来越多地被用于衡量职业对人工智能的暴露程度,但在这些日志中观察到的用户并非劳动力群体。我们表明,从平台导出的暴露分数结合了任务级别的AI适用性与平台用户群的职业构成。保持实证设计不变,仅改变平台输入会使ChatGPT后的就业系数变化1.9倍,并且同一供应商内的消费者和企业渠道在符号上存在分歧。我们将由此产生的非经典测量误差形式化,将其分解为职业间和职业内的选择,并构建了劳动力加权的部分识别界限。根据劳工统计局就业份额进行重新加权会使估计值衰减42%至93%。该偏差捕捉了观察用户中的增强效应,比劳动力中的替代效应更直接。

英文摘要

Conversation logs from AI platforms are increasingly used to measure occupational exposure to artificial intelligence, but the users observed in these logs are not the workforce. We show that platform-derived exposure scores combine task-level AI applicability with the occupational composition of the platform's user base. Holding the empirical design fixed, changing only the platform input changes the post-ChatGPT employment coefficient by a factor of 1.9, and consumer and enterprise channels within the same vendor disagree in sign. We formalize the resulting non-classical measurement error, decompose it into between- and within-occupation selection, and construct workforce-reweighted partial-identification bounds. Reweighting to Bureau of Labor Statistics employment shares attenuates estimates by 42 to 93 percent. The bias captures augmentation among observed users more directly than substitution in the workforce.
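The composition effect described above can be sketched with a share-weighted average: the same occupation-level exposure scores give different aggregate exposure depending on whether they are weighted by the platform's user mix or by workforce employment shares. All numbers and names below (`weighted_exposure`, the three occupations, the shares) are hypothetical and chosen only for illustration; this is not the paper's partial-identification estimator.

```python
from typing import Dict


def weighted_exposure(scores: Dict[str, float], shares: Dict[str, float]) -> float:
    """Share-weighted mean of occupation-level exposure scores."""
    total_share = sum(shares[occ] for occ in scores)
    return sum(scores[occ] * shares[occ] for occ in scores) / total_share


# Hypothetical occupation-level exposure scores (not from the paper).
exposure = {"software": 0.8, "nursing": 0.2, "retail": 0.1}

# Hypothetical occupational composition of platform users vs. the workforce.
platform_shares = {"software": 0.6, "nursing": 0.1, "retail": 0.3}
workforce_shares = {"software": 0.1, "nursing": 0.3, "retail": 0.6}

platform_mean = weighted_exposure(exposure, platform_shares)
workforce_mean = weighted_exposure(exposure, workforce_shares)

# Fraction by which reweighting shrinks the aggregate exposure.
attenuation = 1.0 - workforce_mean / platform_mean
```

In this toy setup the platform over-represents the high-exposure occupation, so reweighting to workforce shares lowers the aggregate, which mirrors the direction of the attenuation the abstract reports.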

2605.16578 2026-05-28 cs.SD cs.AI cs.HC cs.LG 版本更新

Voice "Cloning" is Style Transfer

语音“克隆”是风格迁移

Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds, Anna Pot, Yongchan Kwon, James Zou

发表机构 * Cornell University(康奈尔大学) TogetherAI Stanford University(斯坦福大学)

AI总结 研究发现语音克隆并非忠实复制原声,而是系统性地应用风格迁移,使克隆语音更权威、温暖、客服化且更人性化,导致说话者特征同质化,并影响人类信任与行为。

详情
AI中文摘要

人工生成的语音日益嵌入日常生活。语音克隆尤其适用于身份保留重要的应用,例如完成录音、用新语言配音或保存失语者的声音。然而,在我们的工作中,我们发现尽管术语如此,语音克隆并不能忠实地“克隆”个体的声音。相反,我们发现广泛使用的语音克隆模型系统性地对源语音应用风格迁移。根据人类标注者的评分,克隆语音相比源语音被认为更权威、更温暖、更接近客服风格且更人性化。人类标注者还报告对克隆语音的信任度高于源语音,并且更愿意向它们透露敏感个人信息。我们的工作还表明,语音克隆导致说话者特征的同质化,表现为口音、语速和音频嵌入空间的方差减小。总之,我们的结果凸显了语音克隆技术的一系列新局限和风险,及其对人类行为的潜在影响。

英文摘要

Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully "clone" an individual's voice. Instead, we find that widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.

2605.19729 2026-05-28 cs.CV cs.AI 版本更新

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

LIFT and PLACE: 一种简单、稳定且有效的轻量级扩散模型知识蒸馏框架

Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo

发表机构 * Ulsan National Institute of Science and Technology (UNIST)(蔚山科学技术院)

AI总结 提出LIFT和PLACE框架,通过粗到细的蒸馏策略解决教师网络高复杂度带来的学生模仿困难,在极端压缩下仍能稳定训练并取得良好性能。

Comments Project page: https://hyun-s.github.io/LIFT_PLACE_site , 15 pages, 11 figures, 9 tables, To appear in CVPR 2026

详情
AI中文摘要

我们证明,在扩散模型的知识蒸馏中,教师网络由于其更大的容量而具有高度复杂的去噪过程,这给学生模型忠实模仿带来了重大挑战。为了解决这个问题,我们提出了一种基于线性拟合蒸馏(LIFT)和分段局部自适应系数估计(PLACE)的粗到细蒸馏框架。首先,LIFT将目标分解为“粗”对齐和“细”细化。学生先在粗对齐上训练,然后进行困难的细化。其次,PLACE通过将输出划分为基于误差的组来扩展LIFT以处理空间非均匀误差,提供局部自适应指导。我们的实验表明,LIFT和PLACE在扩散空间(图像/潜在)、骨干网络(U-Net/DiT)、任务(无条件/条件)、数据集上均有效,甚至扩展到基于流的模型如MMDiT(SD3)。此外,在极端压缩下(学生参数1.3M,仅为教师的1.6%),传统KD无法为稳定训练提供足够指导,FID分数常退化到50-200+,但我们的方法仍稳定收敛并达到15.73的FID。

英文摘要

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process, which stems from its substantially larger capacity, poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE are effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), and datasets, and even extend to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

2602.06025 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

学习面向运行时智能体记忆的查询感知预算层级路由

Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

发表机构 * Nanyang Technological University(南洋理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Tsinghua University(清华大学) Sun Yat-sen University(中山大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出 BudgetMem 框架,通过强化学习训练的轻量级路由器实现查询感知的预算层级路由,以在运行时平衡任务性能与记忆构建成本。

Comments Accepted by ICML 2026. Code is available at https://github.com/ViktorAxelsen/BudgetMem

详情
AI中文摘要

记忆对于在单个上下文窗口之外运行的大型语言模型(LLM)智能体日益重要,然而大多数现有系统依赖于离线的、查询无关的记忆构建,这可能导致效率低下并丢弃查询关键信息。尽管运行时记忆利用是一种自然的替代方案,但先前的工作通常会产生大量开销,并且对性能-成本权衡的显式控制有限。在这项工作中,我们提出了 BudgetMem,一个用于显式、查询感知性能-成本控制的运行时智能体记忆框架。BudgetMem 将记忆处理结构化为一组记忆模块,每个模块提供三个预算层级(即 Low/Mid/High)。一个轻量级路由器在模块间执行预算层级路由,以平衡任务性能和记忆构建成本,该路由器实现为通过强化学习训练的紧凑神经策略。使用 BudgetMem 作为统一测试平台,我们研究了实现预算层级的三种互补策略:实现(方法复杂度)、推理(推理行为)和容量(模块模型大小)。在 LoCoMo、LongMemEval 和 HotpotQA 上,当优先考虑性能时(即高预算设置),BudgetMem 超越了强基线,并在更紧的预算下提供了更好的精度-成本边界。此外,我们的分析揭示了不同层级策略的优势和劣势,阐明了在不同预算条件下每个维度何时提供最有利的权衡。

英文摘要

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

2605.19743 2026-05-28 cs.AI cs.LG cs.MA 版本更新

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

EngiAI: 面向LLM驱动工程设计的智能体框架与基准测试套件

Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge

发表机构 * IDEAL Chair of Artificial Intelligence in Engineering Design(工程设计人工智能IDEAL教席) ETH Zurich(苏黎世联邦理工学院) Autom8.build

AI总结 提出EngiAI多智能体系统框架和包含工作流、RAG、HPC三维度的基准套件,通过监督架构协调七个专业智能体,验证了LLM在工程设计中的能力与局限。

Comments 26 pages, 10 figures, to be published at IDETC 2026

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地应用于工程设计任务,但现有的评估框架未能充分处理结合仿真、检索和制造准备的多智能体系统。我们引入了一个包含三个评估维度的基准套件:(1)一个工作流基准,包含七种针对不同认知需求的提示风格——包括直接工具使用、语义消歧、条件分支和工作记忆任务;(2)一个检索增强生成(RAG)基准,采用门控评分来隔离检索对参数选择的贡献;(3)一个高性能计算(HPC)基准,评估在SLURM集群上的端到端机器学习训练编排。与基准一起,我们提出了EngiAI,一个基于LangGraph构建的多智能体系统(MAS)参考实现,通过监督架构协调七个专业智能体,统一拓扑优化、文档检索、HPC作业编排和3D打印机控制。在四个LLM后端和两个EngiBench问题上,专有模型在Beams2D上实现了96-97%的平均任务完成率,而开源4B参数模型达到55-78%,并显示出明显的代际改进。条件分支被证明最具挑战性,在Photonics2D上条件风格的任务完成率降至20-53%。RAG门控确认了近乎完美的检索增强分数(约1.0),而无检索时接近零,验证了评估设计。在HPC编排中,一个模型在100%的运行中完成了所有流水线步骤,而另一个模型降至50%,表明多步骤指令遵循在长时间运行的工作流中会退化。

英文摘要

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High-Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20-53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores (about 1.0) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.

2605.19514 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management

立场:自回归Transformer的图灵完备性高度依赖于上下文管理

Guanyu Cui, Zhewei Wei, Kun He

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China(中国人民大学高瓴人工智能学院) DEKE Lab, Renmin University of China, Beijing, China(中国人民大学DEKE实验室)

AI总结 本文通过区分固定系统和缩放族两种设置,论证了上下文管理方法对自回归Transformer计算能力的决定性影响,并指出缩放族设置下的图灵完备性证明不适用于实际部署的固定系统。

Comments Accepted to the ICML 2026 Position Paper Track

详情
AI中文摘要

许多工作提出了引人注目的主张,即Transformer是图灵完备的。然而,文献常常混淆两种不同的设置:(i)固定系统设置,其中固定的自回归Transformer与固定的上下文管理方法耦合,逐步处理不同长度的输入;(ii)缩放族设置,其中使用一系列不同模型(具有增加的上下文窗口长度或数值精度)来处理不同的输入长度。现有的Transformer图灵完备性证明通常是在设置(ii)中建立的,而现实世界中的LLM部署以及图灵完备性的标准概念更自然地对应于设置(i)。在本文中,我们首先形式化固定系统设置,从而具体描述现实世界LLM的运行方式。然后,我们认为在缩放族设置中证明的结果提供了理论上有意义的资源界限,但并未建立图灵完备性,从而澄清了对现有结果的常见误解。最后,我们展示了不同的上下文管理方法可以产生截然不同的计算能力,并主张上下文管理是决定现实世界自回归Transformer计算能力的关键组成部分。

英文摘要

Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.

2605.19444 2026-05-28 cs.LG cs.AI 版本更新

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

检测与缓解测试时强化学习中多数投票导致的正确答案灭绝窗口

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

发表机构 * Meituan(美团)

AI总结 本文提出TTRL-Guard框架,通过翻转率感知奖励缩放、少数保留采样和风险条件稀疏更新三种机制,检测并缓解测试时强化学习中多数投票导致的正确答案信号永久抑制问题。

详情
AI中文摘要

测试时强化学习(TTRL)在使用多数投票作为伪标签信号时,在数学推理基准测试中报告了显著的准确率提升。我们认为这些提升被系统性地误解了:大部分提升反映的是已可解问题的锐化而非真正学习,而由正确变为错误的问题数量超过了真正学会的问题,且一旦多数投票锁定错误答案,这种损害是不可逆的。逐问题追踪显示,低能力问题中的正确答案信号在短暂活跃后会被永久抑制,我们将这一现象称为“正确答案灭绝窗口”,并以翻转率(FR)作为其领先指标。因此,我们提出TTRL-Guard,一个轻量级框架,包含三种针对灭绝窗口的机制:翻转率感知奖励缩放(FRS)在FR下降时降低高风险更新的权重,少数保留采样(MPS)保留少数正确答案的梯度信号,风险条件稀疏更新(RCSU)暂停对极化问题的更新。在三个模型和四个基准上的实验表明,TTRL-Guard在Qwen2.5-7B-Instruct和Qwen3-4B上取得了最佳平均pass@1,并在AIME 2025上相对TTRL提升54%。

英文摘要

Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, with a relative improvement of +54% over TTRL on AIME 2025.
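The majority-vote pseudo-label at the center of this argument can be sketched in a few lines. The abstract does not define Flip Rate, so the `flip_rate` function below is an assumed reading (the fraction of problems whose majority answer changed between two consecutive checkpoints), and all function names and sample answers are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent sampled answer (ties go to the first seen)."""
    return Counter(answers).most_common(1)[0][0]


def pseudo_rewards(answers: Sequence[str]) -> List[float]:
    """Reward 1.0 for samples that agree with the majority answer, else 0.0."""
    label = majority_vote(answers)
    return [1.0 if answer == label else 0.0 for answer in answers]


def flip_rate(prev_labels: Sequence[str], curr_labels: Sequence[str]) -> float:
    """ASSUMED definition: share of problems whose majority label changed."""
    if len(prev_labels) != len(curr_labels):
        raise ValueError("label sequences must have the same length")
    if not prev_labels:
        return 0.0
    flips = sum(prev != curr for prev, curr in zip(prev_labels, curr_labels))
    return flips / len(prev_labels)


# Hypothetical rollouts for one problem whose true answer is "15".
samples = ["12", "12", "15", "12", "15"]
label = majority_vote(samples)
rewards = pseudo_rewards(samples)

# Hypothetical majority labels for four problems at two checkpoints.
fr = flip_rate(["12", "7", "3", "9"], ["15", "7", "3", "9"])
```

In the toy rollout the correct answer is in the minority, so every correct sample receives zero reward while the wrong majority is reinforced. This is the kind of lock-in the abstract describes.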

2605.18692 2026-05-28 cs.AI math.OC 版本更新

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

利用LLM引导的模型补丁实现大规模重新优化的大众化

Tinghan Ye, Arnaud Deza, Ved Mohan, El Mehdi Er Raqabi, Pascal Van Hentenryck

发表机构 * H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology(佐治亚理工学院H. Milton Stewart工业与系统工程学院) Department of Operations and Decision Systems, Université Laval(拉瓦尔大学运营与决策系统系)

AI总结 提出一个基于大语言模型的代理重新优化框架,通过自然语言交互和优化工具箱,使非专家用户能够动态更新和重新优化部署的优化模型,并在两个大规模实际案例中验证了其有效性和可扩展性。

详情
AI中文摘要

运筹学专家开发的优化模型通常作为工业环境中的决策支持系统部署。然而,现实环境是动态的,业务规则不断演变且存在不可预见的扰动。在这种情况下,最终用户理想情况下应重新优化模型以恢复可行且可实施的解决方案,但往往无法联系到原始模型开发者。本文介绍了一个代理重新优化框架,其中大语言模型充当运筹学专家,通过自然语言交互动态支持最终用户。大语言模型将用户提示转化为底层优化模型的结构化更新,从优化工具箱中选择合适的重新优化技术,并求解生成的实例以返回可实施的解决方案。该工具箱利用原始信息,包括历史解、有效不等式、求解器配置和元启发式算法,以加速重新优化同时保持解的质量。所提出的框架能够实现部署优化模型的交互式和持续适应,减少对运筹学专家的依赖,并提高决策支持系统的可持续性。在两个互补的大规模实际案例研究上的广泛实验证明了所提框架的有效性和可扩展性。第一个案例考虑在线供应链重新优化,其中必须快速生成解同时保持与部署计划接近,而第二个案例侧重于离线大学考试排程,其中解的质量优先于运行时间。结果表明,基于工具箱的架构通过基于原始信息和求解器感知的重新优化技术显著提高了计算效率,而基于结构化补丁的更新提高了模型修改的可解释性和可追溯性。

英文摘要

Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. However, real-world environments are dynamic, with evolving business rules and unforeseen perturbations. In such contexts, end users should ideally re-optimize models to recover feasible and implementable solutions, often without access to the original model developers. This paper introduces an agentic re-optimization framework in which a large language model (LLM) acts as an OR expert, dynamically supporting end users through natural-language interaction. The LLM translates user prompts into structured updates of the underlying optimization model, selects suitable re-optimization techniques from an optimization toolbox, and solves the resulting instance to return implementable solutions. The toolbox leverages primal information, including historical solutions, valid inequalities, solver configurations, and metaheuristics, to accelerate re-optimization while preserving solution quality. The proposed framework enables interactive and continuous adaptation of deployed optimization models, reducing dependence on OR experts, and improving the sustainability of decision-support systems. Extensive experiments on two complementary large-scale real-world case studies demonstrate the effectiveness and scalability of the proposed framework. The first considers online supply chain re-optimization, where solutions must be generated rapidly while remaining close to the deployed plan, whereas the second focuses on offline university exam scheduling, where solution quality is prioritized over runtime. Results show that the toolbox-driven architecture significantly improves computational efficiency through primal-based and solver-aware re-optimization techniques, while the structured patch-based updates improve interpretability and traceability of model modifications.

2605.02503 2026-05-28 cs.AI 版本更新

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

DataClawBench: 面向探索性真实世界金融数据分析的智能体基准

Qiaohong Zhang, Weihao Ye, Jialong Chen, Yi Luo, BoYuan Li, Bowen Deng, Zibin Zheng, Jianhao Lin, Wei-Shi Zheng, Chuan Chen

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) School of Software Engineering, Sun Yat-sen University(中山大学软件学院) Lingnan College, Sun Yat-sen University(中山大学岭南学院)

AI总结 提出DataClawBench基准,通过金融智库场景评估智能体在无先验指导下的探索性数据分析能力,包含约206万条跨领域真实噪声数据及492个多步任务,实验发现探索行为与任务进展及正确性不呈正相关。

详情
AI中文摘要

自主数据分析智能体越来越期望在有限的人类数据指导下进行探索性分析。然而,现有基准通常在先验引导设置下评估此类智能体,提供选定的数据源、明确的数据模式或清洗后的数据,从而低估了探索负担。为了评估这一现实的探索性数据分析任务,我们引入了DataClawBench,这是一个基于金融智库咨询场景构建的基准,其中智能体必须独立探索不熟悉、有噪声的跨领域数据,并得出可验证的结论。DataClawBench提供了一个统一的真实世界数据环境,包含企业、行业和政策领域约206万条记录,并保留了原始数据噪声。在此数据环境之上,它定义了492个多步跨领域任务,每个任务都标注了中间里程碑,以诊断超出结果准确性的探索和推理失败。在OpenClaw智能体框架下对八个先进LLM的系统评估表明,探索性数据分析破坏了智能体的可靠性:更多的探索并不能可靠地转化为任务相关的进展或正确的最终答案。

英文摘要

Autonomous data analysis agents are increasingly expected to conduct exploratory analysis with limited human guidance about data. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing selected data sources, explicit data schemas, or cleaned data, thereby understating the exploratory burden. To evaluate this realistic exploratory data analysis task, we introduce DataClawBench, a benchmark built from financial think-tank consulting scenarios where agents must independently explore unfamiliar, noisy, cross-domain data and produce verifiable conclusions. DataClawBench provides a unified real-world data environment with approximately 2.06 million records across enterprise, industry, and policy domains, with native data noise preserved. On top of this data environment, it defines 492 multi-step cross-domain tasks, each annotated with intermediate milestones that diagnose exploration and reasoning failures beyond outcome accuracy. A systematic evaluation of eight advanced LLMs under the OpenClaw agent reveals that exploratory data analysis breaks agent reliability: more exploration does not reliably translate into task-relevant progress or correct final answers.

2605.16293 2026-05-28 cs.CY cs.AI 版本更新

From Prediction to Intervention: The Evolution of AI in Biomedicine

从预测到干预:人工智能在生物医学中的演进

Andrew Feinberg, Aleksandr Sarachakov, Viktor Svekolkin, Alexander Bagaev, Ferran Prat, Michael Feinberg

发表机构 * BostonGene Corporation(波士顿基因公司)

AI总结 本文提出生物医学AI正从基于历史数据的预测范式转向能够模拟干预效果的干预智能,通过定义疾病级模型实现从推理到仿真的转变,以支持新型疗法和未观测干预下的决策。

Comments 10 pages, 3 figures, 1 table. Figures were replaced with a better versions

详情
AI中文摘要

人工智能通过大规模多模态数据集成在生物医学领域取得了快速进展,实现了对临床结果和患者分层的日益精确的预测。然而,这些系统本质上仍然是观察性的:它们从历史数据中学习统计关联,并在先前观察到的生物学和临床状态内运行,限制了它们推广到新型疗法或未观测干预的能力。我们认为,生物医学AI正在经历结构性转变。随着生物医学决策越来越依赖于对干预的推理而非对过去观察的外推,预测架构在结构上变得不足。从历史数据中学习的系统,在构造上无法表示生物系统在扰动下如何演化,因此无法可靠地支持存在新型干预时的决策。我们引入了一个概念框架,区分观察性智能和干预性智能,并将疾病级模型定义为明确表示生物过程的状态、动态和干预响应的系统。这些模型实现了从推理到仿真的转变——推理在干预下会发生什么,而非基于过去可能发生什么。这一转变也意味着价值创造点的转移:从数据处理和预测转向支持和定义干预下决策的系统。这直接源于生物医学决策的结构,并定义了AI在医学中的下一阶段。无法建模干预的系统将在结构上被排除在决策之外。

英文摘要

Artificial intelligence has advanced rapidly in biomedicine through large-scale multimodal data integration, enabling increasingly accurate prediction of clinical outcomes and patient stratification. These systems, however, remain fundamentally observational: they learn statistical associations from historical data and operate within previously observed biological and clinical states, limiting their ability to generalize to novel therapies or unobserved interventions. We argue that AI in biomedicine is undergoing a structural transition. As biomedical decision-making increasingly depends on reasoning about intervention rather than extrapolation from past observations, predictive architectures become structurally insufficient. Systems that learn from historical data cannot, by construction, represent how biological systems evolve under perturbation, and therefore cannot reliably support decision-making in the presence of novel interventions. We introduce a conceptual framework distinguishing observational and interventional intelligence and define disease-level models as systems that explicitly represent the state, dynamics, and intervention response of biological processes. These models enable a shift from inference to simulation -- reasoning about what will happen under intervention rather than what is likely based on the past. This transition also implies a shift in where value is created: from data processing and prediction toward systems that support and define decision-making under intervention. It follows directly from the structure of biomedical decision-making and defines the next stage of AI in medicine. Systems that cannot model intervention will be structurally excluded from decision-making.

2605.15250 2026-05-28 cs.LG cs.AI 版本更新

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

GQLA: 面向硬件自适应的大语言模型解码的分组查询潜在注意力

Fanxu Meng

发表机构 * Institute for Artificial Intelligence(人工智能研究院)

AI总结 提出GQLA,一种对MLA的极小修改,通过暴露MQA和GQA两种等价解码路径,使单一权重集在不同硬件(如H100和H20)上达到最优性能,并支持零冗余张量并行,同时通过TransGQLA将预训练GQA检查点转换为GQLA模型,显著压缩KV缓存。

Comments https://github.com/MuLabPKU/TransArch

详情
AI中文摘要

多头潜在注意力(MLA)是DeepSeek-V2/V3中使用的注意力机制,它将键和值联合压缩为低秩潜在表示,并几乎完美匹配H100的roofline。然而,其训练权重仅暴露一种解码路径——吸收式MQA形式——这使得高效推理依赖于H100级别的计算带宽比,放弃了沿头轴的张量并行,并且在诸如受出口限制的H20等商用推理GPU上无法获得多令牌预测(MTP)增益。我们提出分组查询潜在注意力(GQLA),这是对MLA的最小修改,其训练权重在相同参数上暴露两种代数等价的解码路径:与MLA相同的MQA吸收路径,以及具有每组扩展缓存的GQA路径。运行时选择匹配目标硬件的路径——无需重新训练,无需自定义内核——因此单一组GQLA权重即可同时锁定H100(MQA吸收,s_q=1)和H20(GQA+MTP,s_q=2)的roofline,同时在GQA路径上支持最多8路零冗余张量并行。为避免从头预训练,我们将TransMLA扩展为TransGQLA,将预训练的GQA检查点转换为GQLA模型;在LLaMA-3-8B上,它在MQA吸收路径上将每令牌KV缓存压缩至GQA基线的28.125%,同时在每组路径上结构性地保持GQA级别的流量。

英文摘要

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
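The 28.125% figure can be reproduced with simple per-token, per-layer arithmetic. The sketch below uses LLaMA-3-8B's published GQA configuration (8 KV heads, head dimension 128) and assumes the converted model caches a DeepSeek-style latent of 512 dimensions plus a 64-dimensional decoupled RoPE key; that the conversion uses exactly these sizes is an assumption, made because it matches the reported ratio. Function names are illustrative.

```python
def gqa_kv_elements_per_token(num_kv_heads: int, head_dim: int) -> int:
    """Cached elements per token per layer for GQA: keys plus values."""
    return 2 * num_kv_heads * head_dim


def latent_kv_elements_per_token(latent_dim: int, rope_dim: int) -> int:
    """Cached elements per token per layer for an absorbed latent path:
    one shared compressed KV latent plus one shared RoPE key."""
    return latent_dim + rope_dim


# LLaMA-3-8B GQA configuration: 8 KV heads of dimension 128.
gqa_elements = gqa_kv_elements_per_token(num_kv_heads=8, head_dim=128)

# ASSUMED latent sizes (DeepSeek-V2-style): 512-dim latent + 64-dim RoPE key.
latent_elements = latent_kv_elements_per_token(latent_dim=512, rope_dim=64)

ratio = latent_elements / gqa_elements
print(f"{latent_elements} / {gqa_elements} = {ratio:.5%}")
```

Under these assumptions the cache shrinks from 2048 to 576 elements per token per layer, which is 9/32 of the GQA baseline.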

2601.16312 2026-05-28 cs.CL cs.AI 版本更新

Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

教授并评估LLM对聚合物设计相关任务的推理能力

Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao, Niranjan Balasubramanian

发表机构 * Stony Brook University(石溪大学)

AI总结 本文提出PolyBench基准数据集和知识增强推理蒸馏方法,使中小型语言模型在聚合物设计任务上性能接近前沿闭源LLM。

详情
AI中文摘要

AI4Science研究在许多科学应用中显示出前景,包括聚合物设计。然而,当前的LLMs在此问题空间中效果不佳,因为:(i)大多数模型缺乏聚合物特定知识,(ii)现有对齐模型对聚合物设计相关知识和能力的覆盖有限。为解决此问题,我们引入了PolyBench,一个包含超过125K聚合物设计相关任务的大规模训练和测试基准数据集,利用从实验和合成数据源获得的超过1300万数据点的知识库,以确保聚合物及其属性的广泛覆盖。为了使用PolyBench进行有效对齐,我们引入了一种知识增强推理蒸馏方法,用结构化CoT增强该数据集。此外,PolyBench中的任务从简单到复杂的分析推理问题组织,使得能够进行泛化测试和问题空间中的诊断探测。实验表明,在PolyBench上训练的具有7B到32B参数的中小型语言模型(SLMs)在PolyBench测试数据集上优于类似大小的模型,并与闭源前沿LLMs保持竞争力,同时在外部聚合物基准上展示了性能提升。数据集和相关代码可在https://github.com/StonyBrookNLP/PolyBench获取。

英文摘要

Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, leveraging a knowledge base of more than 13 million data points obtained from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small- and mid-sized language models (SLMs) with 7B to 32B parameters, trained on PolyBench, outperform similar-sized models and remain competitive with closed-source frontier LLMs on PolyBench's test dataset, while demonstrating performance gains on external polymer benchmarks. The dataset and associated code are available at https://github.com/StonyBrookNLP/PolyBench.

2605.13517 2026-05-28 cs.CV cs.AI cs.LG 版本更新

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

ArcVQ-VAE:一种带有反余弦加性边界的球面向量量化框架

Jaeyung Kim, YoungJoon Yoo

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea(韩国首尔中央大学人工智能系) SNUAILAB, Seoul, Republic of Korea(韩国首尔 SNUAILAB 实验室)

AI总结 针对VQ-VAE有限码本容量限制表示能力的问题,提出ArcVQ-VAE框架,通过引入球面角边先验(包括球界范数正则化和反余弦加性边界损失)增强潜在表示的判别性和均匀分散性,提升码本利用率,在图像重建和生成任务上取得竞争性能。

Comments To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

向量量化变分自编码器(VQ-VAE)已成为图像建模中学习离散表示的基本框架。然而,VQ-VAE模型必须使用有限的码本向量集对整张图像进行分词,这种容量限制限制了其捕获丰富多样表示的能力。在本文中,我们提出反余弦加性边界VQ-VAE(ArcVQ-VAE),一种新颖的向量量化框架,该框架为传统VQ-VAE的码本引入了球面角边先验(SAMP)。所提出的SAMP由球界范数正则化(将所有码本向量约束在时间相关的欧几里得球内)和反余弦加性边界损失(鼓励潜在向量之间更大的角度可分性)组成。这种公式在受限空间内促进了更具判别性和均匀分散的潜在表示,从而提高了有效的潜在空间覆盖范围,并导致码本利用率提升。在标准图像重建和生成任务上的实验结果表明,ArcVQ-VAE在重建精度、表示多样性和样本质量方面与基线模型相比取得了竞争性能。代码可在 https://github.com/goals4292/ArcVQ-VAE 获取。

英文摘要

Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ-VAE

2605.12929 2026-05-28 cs.CV cs.AI 版本更新

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

Anatomy-Slot: 用于视网膜诊断中同源双侧推理的无监督解剖分解

Yingzhe Ma, Xiao Yang, Yuguo Yin, Zheyu Wang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Peking University(北京大学)

AI总结 提出Anatomy-Slot方法,通过无监督解剖瓶颈分解斑块令牌为结构一致的解剖区域槽,并利用双向交叉注意力对齐双眼槽,在ODIR-5K上相比ViT-L基线提升AUC 4.2点,验证了显式结构对应改善诊断的假设。

Comments 15 pages, 3 figures

详情
AI中文摘要

视网膜诊断本质上是双侧的:临床医生比较双眼的同源结构(例如,视盘不对称),然而大多数深度模型基于单眼表示。我们研究显式结构对应是否改善诊断,并提出Anatomy-Slot来操作化这一假设。Anatomy-Slot通过将斑块令牌分解为一组涌现的、结构一致的槽(对应于解剖区域)来引入无监督解剖瓶颈,然后通过双向交叉注意力对齐双眼的槽。在ODIR-5K上使用$n=10$个种子,该方法相比匹配的ViT-L基线在AUC上提升$4.2$个点(95%置信区间;Wilcoxon符号秩检验,$W=0$,$p=0.002$)。配对破坏和高斯噪声下的压力测试提供了对应依赖性和鲁棒性的受控测试。我们进一步在REFUGE上报告了定量视盘定位和交叉注意力定位分析。除了报告的性能提升外,这些结果表明,以对象为中心的解剖对应为与临床双侧比较一致的可解释诊断系统提供了一条原则性路径。

英文摘要

Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into a set of emergent, structurally-coherent slots that correspond to anatomical regions, then aligning these slots across eyes via bidirectional cross-attention. On ODIR-5K with $n=10$ seeds, the method improves AUC by $4.2$ points over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, $W=0$, $p=0.002$). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis. Beyond the reported gains, these results indicate that object-centric anatomical correspondence offers a principled path toward interpretable diagnostic systems aligned with clinical bilateral comparison.
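
The reported statistics ($n=10$, $W=0$, $p=0.002$) fit together in a way that is easy to check. The sketch below computes an exact two-sided Wilcoxon signed-rank p-value by brute-force enumeration, assuming no ties and no zero differences among the paired seeds. Those assumptions, and the helper name `exact_two_sided_p`, are illustrative; this is not the authors' evaluation code.

```python
from itertools import product


def exact_two_sided_p(n: int, w_observed: int) -> float:
    """Exact two-sided p-value for the Wilcoxon signed-rank statistic W = min(W+, W-).

    Under the null hypothesis every assignment of signs to the ranks 1..n is
    equally likely, so we enumerate all 2**n assignments (fine for small n).
    """
    ranks = range(1, n + 1)
    total = n * (n + 1) // 2
    hits = 0
    for signs in product((0, 1), repeat=n):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        w = min(w_plus, total - w_plus)
        if w <= w_observed:
            hits += 1
    return hits / 2 ** n


p_value = exact_two_sided_p(10, 0)
print(p_value)  # 0.001953125, which rounds to 0.002
```

With 10 paired seeds, $W=0$ means the method beat the baseline on every seed, and $2/2^{10} \approx 0.002$ is the smallest two-sided p-value this test can produce at that sample size.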

2512.21075 2026-05-28 cs.LG cs.AI math.PR stat.ML 版本更新

Feature Learning Dynamics in Infinite-Depth Neural Networks

无限深度神经网络中的特征学习动力学

Zihan Yao, Ruoyu Wu, Tianxiang Gao

发表机构 * School of Computing(计算学院) Department of Mathematics(数学系) DePaul University(德保罗大学) Iowa State University(爱荷华州立大学)

AI总结 本文研究深度-μP缩放下单层ResNet中由权重重用引起的前向-后向耦合,证明其在初始化时随宽度消失,但在训练中产生非平凡相关项,并推导出无限深度极限下的神经特征动力学(NFD)SDE系统。

详情
AI中文摘要

深度神经网络在实践中取得了显著成功,但对训练过程中特征如何演化的机制理解仍不完整,尤其是在大深度极限下。对于深度-μP缩放下的ResNet,先前工作将层索引ℓ视为连续时间t_ℓ = ℓ/L,得到训练动力学的SDE描述。一个关键未解决问题是,反向传播通过其转置W_ℓ^⊤重用每个前向权重矩阵W_ℓ,在前向特征和反向梯度之间产生相关性,其行为和特征学习中的作用尚不清楚。我们研究了深度-μP下单层ResNet中这种重用权重的前向-后向耦合。使用条件高斯表示,我们在取任何网络极限之前,显式地将权重重用引起的耦合项与解耦的高斯波动分开。在初始化时,我们证明耦合是有限宽度效应,并以O(n^{-1})的速率随深度一致消失。然而,在训练期间,SGD引入了一个非平凡的前向-后向相关项,该项在无限宽度极限下仍然存在。关键的深度效应是,在深度-μP缩放下,这个幸存项在深度上是高阶的,并且随着L→∞,其在层上的累积贡献变得可忽略。这种深度诱导的抑制促使了神经特征动力学(NFD),一个具有解耦后向权重的向前-向后SDE系统,它保留了训练期间生成的特征-梯度协方差结构。在非退化假设下,我们证明有限网络训练动力学收敛到其NFD极限,深度离散化误差为O(L^{-1}),而重用权重耦合项具有更快的O(L^{-2})衰减。这些结果为深度-μP下单层ResNet的特征学习动力学提供了严格的无限深度极限。

英文摘要

Deep neural networks have achieved remarkable success in practice, yet a mechanistic understanding of how features evolve during training remains incomplete, especially in the large-depth limit. For ResNets under depth-$μ$P scaling, prior work treats the layer index $\ell$ as a continuous time $t_\ell = \ell/L$, yielding SDE descriptions of the training dynamics. A key unresolved issue is that backpropagation reuses each forward weight matrix $W_\ell$ through its transpose $W_\ell^\top$, creating correlations between forward features and backward gradients whose behavior and role in feature learning remain unclear. We study this reused-weight forward--backward coupling in one-layer ResNets under depth-$μ$P. Using conditional Gaussian representations, we explicitly separate the coupling terms induced by weight reuse from decoupled Gaussian fluctuations before taking any network limit. At initialization, we prove that the coupling is a finite-width effect and vanishes at rate $O(n^{-1})$, uniformly over depth. During training, however, SGD induces a nontrivial forward--backward correlation term that survives the infinite-width limit. The key depth effect is that, under depth-$μ$P scaling, this surviving term is higher order in depth and its accumulated contribution over layers becomes negligible as $L\to\infty$. This depth-induced suppression motivates Neural Feature Dynamics (NFD), a forward--backward SDE system with decoupled backward weights that retains the feature-gradient covariance structure generated during training. Under nondegeneracy assumptions, we prove that the finite-network training dynamics converge to its NFD limit with an $O(L^{-1})$ depth-discretization error, while the reused-weight coupling term has a faster $O(L^{-2})$ decay. These results provide a rigorous infinite-depth limit for the feature-learning dynamics of one-layer ResNets under depth-$μ$P.

2605.12015 2026-05-28 cs.CR cs.AI cs.CL cs.LG cs.MA 版本更新

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

SkillSafetyBench:评估智能体在面向技能的攻击面下的安全性

Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Peking University(北京大学) East China Normal University(华东师范大学)

AI总结 提出SkillSafetyBench基准,通过155个对抗案例评估大语言模型智能体在技能、本地工件和执行环境文件等非用户攻击下的安全失败模式。

详情
AI中文摘要

可复用技能正成为扩展大语言模型智能体的常见接口,它将程序性指导与对文件、工具、内存和执行环境的访问打包在一起。然而,这种模块化引入了现有安全评估大多忽略的攻击面:即使用户请求是良性的,不安全的影响可能存在于技能指导、本地工件或执行环境文件中,这些会引导智能体采取不安全行为。我们提出了SkillSafetyBench,一个可运行的基准,用于评估此类技能中介的安全失败。SkillSafetyBench包含跨47个任务、6个风险领域和30个安全类别的155个对抗案例,每个案例都使用特定于案例的基于规则的验证器进行评估。使用多个CLI智能体和模型后端的实验表明,非用户攻击可以一致地诱导不安全行为,在不同领域、攻击方法和脚手架-模型配对中表现出不同的失败模式。我们的发现表明,智能体安全性不仅取决于模型级别的对齐,还取决于智能体如何解释技能、信任工作流上下文以及通过可执行环境采取行动。

英文摘要

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, unsafe influence may reside in skill guidance, local artifacts, or execution-environment files that steer the agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.

2605.11544 2026-05-28 cs.AI cs.LO 版本更新

Optimal LTLf Synthesis

最优 LTLf 综合

Yujian Cao, Sven Schewe, Qiyi Tang, Shufang Zhu

发表机构 * University of Liverpool(利物浦大学)

AI总结 本文提出最优 LTLf 综合,通过最大化可保证实现的目标数量,解决多目标规范无法全部实现时的策略综合问题,并实验验证了方法的可行性。

详情
AI中文摘要

策略综合通常遵循全有或全无的范式,当规范在不确定环境中无法保证时返回不可实现。在本文中,我们引入了最优 LTLf 综合,其目标是从由多个目标组成的给定规范中尽可能多地实现目标,特别是当它们不能全部联合实现时。我们首先考虑最大保证综合,它承诺一个我们可以先验保证实现的最大目标集。然后,我们引入最大观察综合,它最大化后验实现的目标,这些目标在不同执行中可能不可比较。最后,我们提出增量最大观察综合,通过在执行过程中出现更强保证的机会时进一步改进策略。实验结果表明,最优综合的不同变体扩展性大致相当,在给定的超时时间内解决了大部分基准实例,证明了该方法的实际可行性。

英文摘要

Strategy synthesis typically follows an all-or-nothing paradigm, returning unrealisable whenever a specification cannot be guaranteed in an uncertain environment. In this paper, we introduce optimal LTLf synthesis, where the goal is to realise as many objectives as possible from a given specification consisting of multiple objectives, especially for the case that they are not all jointly realisable. We first consider max-guarantee synthesis, which commits to a maximal set of objectives that we can a priori guarantee to realise. We then introduce max-observation synthesis, which maximises a posteriori realised objectives that may be incomparable on different executions. Finally, we present incremental max-observation synthesis, which further improves strategies by exploiting opportunities for stronger guarantees when they arise during an execution. Experimental results show that different variations of optimal synthesis scale broadly equally well, solving a large fraction of the benchmark instances within the given timeout, demonstrating the practical feasibility of the approach.

2605.11325 2026-05-28 cs.IR cs.AI 版本更新

Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval

结构化信念状态与首个面向LLM记忆检索的精度感知基准

Jeffrey Flynt

发表机构 * Independent Researcher(独立研究者)

AI总结 针对现有LLM记忆系统基准仅评估答案质量而忽略检索精度的问题,提出独立于生成模型的检索精度基准PrecisionMemBench和结构化信念存储系统Tenure,实现89/89测试案例通过且平均精度1.0。

Comments v2 evaluates three production memory systems, evidence to make the claim falsifiable and the benchmark reusable

详情
AI中文摘要

每个主要的LLM记忆系统基准(尤其是LoCoMo)都只衡量模型是否回答正确,而非记忆系统是否检索正确。一个返回其整个信念存储的系统能达到1.0的召回率并通过答案质量评估。这是单元测试与集成测试的区别:检索质量必须独立于其馈入的生成模型进行测量,而现有基准均未做到这一点。 我们证明,即使实体提取完全忠实,这种失败仍然存在。记忆基线在引用自身提取的案例上平均检索精度仅为0.05到0.08。这种失败是结构性的:在领域特定语料库上,余弦相似度无法区分相关信念与语义相近的信念,这一不变性在20倍范围的嵌入模型规模上得到确认。多轮评估揭示了累积性失败;在话题漂移后,对比系统允许语义质量在轮次间渗漏,导致重入时的高漂移分数。单轮指标掩盖了这一代价:Hindsight报告单轮延迟低于700ms,但每会话轮次平均延迟超过2700ms,p95超过6000ms。在LLM-as-a-Judge评估下,这些失败仍然不可见。 我们提出两项贡献:PrecisionMemBench,一个包含89个案例的基准,独立于生成模型测量检索精度,涵盖多样化的范围、变异和隔离断言;以及Tenure,一个本地优先的结构化信念存储,使用多路径BM25、分析器不对称、差分提升和硬范围隔离。Tenure通过了89/89案例,平均精度1.0,检索延迟低于15ms。对比提供商的表现比它们所基于的原始向量基线更差,零主动检索通过,摄取成本为98到897秒,这些失败是答案质量基准无法检测的。

英文摘要

Every major benchmark for LLM memory systems, LoCoMo foremost, measures whether a model answered correctly, not whether the memory system retrieved correctly. A system returning its entire belief store achieves recall of 1.0 and passes answer-quality evaluation. This is the difference between a unit test and an integration test: retrieval quality must be measured in isolation from the generative model it feeds into, and no existing benchmark does this. We demonstrate that this failure persists even when entity extraction is entirely faithful. Memory baselines achieve mean retrieval precision of just 0.05 to 0.08 on cases referencing their own extractions. The failure is structural: cosine similarity over a domain-specific corpus cannot discriminate relevant beliefs from semantically proximate ones, an invariance confirmed across a 20x range in embedding model scale. Multi-turn evaluation surfaces a compounding failure; after topic drift, comparison systems allow semantic mass to bleed across turns, yielding high drift scores on re-entry. Single-turn metrics conceal this cost: Hindsight reports sub-700ms single-turn latency but exceeds 2,700ms mean per session turn, with p95 above 6,000ms. Under LLM-as-a-Judge evaluation, these failures remain invisible. We present two contributions: PrecisionMemBench, an 89-case benchmark measuring retrieval precision independently of generative models across diverse scope, mutation, and isolation assertions; and Tenure, a local-first structured belief store using multi-path BM25 with analyzer asymmetry, differential boosting, and hard scope isolation. Tenure passes 89/89 cases with mean precision 1.0 and sub-15ms retrieval latency. Comparison providers perform worse than the raw vector baseline they are built on, with zero active retrieval passes and ingestion costs of 98 to 897 seconds, failures that answer-quality benchmarks cannot detect.
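
The point that returning the whole belief store gives perfect recall while hiding poor retrieval can be made concrete with set-based precision and recall. The example below is a toy illustration with a made-up 20-item store and hypothetical belief identifiers; it does not reproduce PrecisionMemBench or its assertions.

```python
def precision_recall(retrieved, relevant):
    """Set-based retrieval precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical belief store with one belief relevant to the query.
store = [f"belief-{i}" for i in range(20)]
relevant = {"belief-3"}

dump_all = precision_recall(store, relevant)         # returns everything
targeted = precision_recall(["belief-3"], relevant)  # returns only the match

print(dump_all)   # (0.05, 1.0)
print(targeted)   # (1.0, 1.0)
```

Both strategies score a recall of 1.0, and a downstream model may still answer correctly from either context, so only the precision column separates them.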

2605.11154 2026-05-28 astro-ph.IM cs.AI cs.LG 版本更新

Quantifying the Reconstructability of Astrophysical Methods with Large Language Models and Information Theory: A Case Study in Spectral Reconstruction

利用大语言模型和信息论量化天体物理方法的可重建性:光谱重建的案例研究

Hsing Wen Lin, Zong-Fu Sie

发表机构 * Department of Physics, University of Michigan, Ann Arbor, MI 48109, USA(密歇根大学物理系) Michigan Institute for Data and AI in Society, University of Michigan, Ann Arbor, MI 48109, USA(密歇根大学数据与人工智能在社会中的研究所) Independent Researcher, Taiwan(台湾独立研究员)

AI总结 提出信息论框架,通过大语言模型生成的概率分布和香农熵、JS散度,量化文本描述对算法重建的约束力,以海王星外天体光谱重建为例,发现文本虽能明确算法结构但无法消除实现级方差,存在“熵下限”,且LLM无法推断隐性专家知识。

Comments 26 pages, 6 figures, Accepted for publication in PASP

详情
AI中文摘要

现代天体物理研究严重依赖复杂的数据分析流程;然而,已发表的描述往往缺乏计算可重复性所需的细节。在这项工作中,我们提出了一个信息论框架,用于量化方法从其书面描述中重建的有效性。通过将算法重建视为大语言模型(LLM)生成的概率分布,我们利用香农熵和詹森-香农散度来衡量文本对有效实现假设空间的约束程度。我们通过对稀疏测光数据中的海王星外天体(TNO)光谱重建的案例研究来展示这种方法。通过向前沿LLM提供不同级别的稿件文本(标题、摘要和方法),我们发现虽然增加文本成功澄清了整体算法结构,但未能消除实现层面的方差。这种持续存在的方差建立了一个“熵下限”,表明多个不同的实现与明确指令保持一致。为了评估实际可重复性,我们将这些重建的算法转换为可执行的流程。我们的结果表明,虽然LLM容易恢复核心功能方法,但它们系统性地无法推断严格科学校准所需的隐性专家知识。这项初步研究表明,LLM可以作为一种零样本诊断工具来审计方法透明度,帮助作者识别缺失的结构约束,并在自动化研究时代维护科学完整性。

英文摘要

Modern astrophysical studies rely heavily on complex data analysis pipelines; however, published descriptions often lack the detail required for computational reproducibility. In this work, we present an information-theoretic framework to quantify how effectively a method can be reconstructed from its written description. By treating algorithmic reconstruction as a probability distribution generated by Large Language Models (LLMs), we utilize Shannon entropy and Jensen-Shannon divergence to measure how strongly text constrains the hypothesis space of valid implementations. We demonstrate this approach through a case study of Trans-Neptunian Object (TNO) spectral reconstruction from sparse photometry. By prompting frontier LLMs with varying levels of manuscript text (Title, Abstract, and Methods), we find that while increasing text successfully clarifies the overall algorithmic structure, it fails to eliminate variance at the implementation level. This persistent variance establishes an "entropy floor," demonstrating that multiple divergent implementations remain consistent with explicit instructions. To evaluate practical reproducibility, we convert these reconstructed algorithms into executable pipelines. Our results reveal that, while LLMs easily recover core functional methodologies, they systematically fail to infer the tacit expert knowledge required for strict scientific calibration. This pilot study demonstrates that LLMs can be repurposed as a zero-shot diagnostic tool to audit methodological transparency, helping authors identify missing structural constraints and preserve scientific integrity in an era of automated research.
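
The two measures named in the abstract can be sketched on categorical samples. The code below assumes each LLM reconstruction has been reduced to a single label for some design choice, such as the interpolation scheme. That reduction, the label names and the sample lists are all hypothetical, and the paper's actual encoding of reconstructions may differ.

```python
import math
from collections import Counter


def distribution(samples):
    """Empirical probability distribution over categorical labels."""
    counts = Counter(samples)
    n = len(samples)
    return {label: c / n for label, c in counts.items()}


def shannon_entropy(p):
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical labels extracted from repeated reconstructions at two prompt levels.
title_only = distribution(["spline", "pca", "template", "gp"])
with_methods = distribution(["template", "template", "template", "gp"])

print(shannon_entropy(title_only))    # 2.0 bits: four equally likely choices
print(shannon_entropy(with_methods))  # lower, but still above zero
print(js_divergence(title_only, with_methods))
```

In this toy setting the entropy drops as more text is supplied but does not reach zero, which is the shape of the "entropy floor" the abstract describes.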

2605.10325 2026-05-28 cs.AI 版本更新

Verifiable Process Rewards for Agentic Reasoning

可验证过程奖励用于智能体推理

Huining Yuan, Zelai Xu, Huaijie Wang, Xiangmin Yi, Jiaxuan Gao, Xiao-Ping Zhang, Yu Wang, Chao Yu, Yi Wu

发表机构 * Tsinghua University(清华大学)

AI总结 提出可验证过程奖励(VPR)框架,通过符号或算法预言机将密集的中间步骤验证转化为强化学习的逐轮监督信号,解决长程智能体推理中的信用分配问题,并在搜索、约束和后验验证三种场景中验证其有效性。

Comments Corrected minor typos and LLM-assisted data extraction errors. The main conclusions are unchanged

详情
AI中文摘要

来自可验证奖励的强化学习(RLVR)提升了大型语言模型(LLMs)的推理能力,但现有方法大多依赖稀疏的结果级反馈。这种稀疏性在长程智能体推理中造成了信用分配难题:一个轨迹可能因包含许多正确的中间决策而失败,或因包含有缺陷的决策而成功。在这项工作中,我们研究了一类密集可验证的智能体推理问题,其中中间动作可以通过符号或算法预言机进行客观检查。我们提出了可验证过程奖励(VPR),一个将此类预言机转化为密集的逐轮监督信号用于强化学习的框架,并在三个代表性场景中实例化:基于搜索验证的动态演绎、基于约束验证的逻辑推理和基于后验验证的概率推理。我们进一步提供了理论分析,表明密集的验证器基础奖励可以通过提供更局部的学习信号来改善长程信用分配,其收益取决于验证器的可靠性。实验上,VPR在受控环境中优于结果级奖励和基于rollout的过程奖励基线,更重要的是,它能够迁移到通用和智能体推理基准测试中,表明可验证的过程监督可以培养适用于训练环境之外的通用推理技能。我们的结果表明,VPR是一种有前景的方法,用于在可靠的中间验证可用时增强LLM智能体,同时也强调了其对预言机质量的依赖性,以及将VPR扩展到结构化程度较低、开放环境中的开放性挑战。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.

2602.02561 2026-05-28 cs.LO cs.AI cs.LG 版本更新

MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics

MathlibLemma: 形式化数学中的民间引理生成与基准测试

Xinyu Liu, Zixuan Xie, Amir Moeini, Claire Chen, Shuze Daniel Liu, Yu Meng, Aidong Zhang, Shangtong Zhang

发表机构 * Department of Computer Science, University of Virginia(弗吉尼亚大学计算机科学系) Astronomy, California Institute of Technology(加州理工学院天文学系) Purdue University(普渡大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出基于LLM的模块化流水线MathlibLemma,自动挖掘、形式化并证明数学中缺失的民间引理,生成包含4028个类型检查的Lean语句的基准测试集。

详情
AI中文摘要

尽管Lean和Mathlib生态系统在大语言模型(LLM)的帮助下在形式化数学推理方面取得了显著成功,但Mathlib中缺乏许多民间引理仍然是一个持续存在的障碍,限制了Lean作为像LaTeX或Maple那样的日常工具对数学家的可用性。为了解决这个问题,我们引入了MathlibLemma,一个基于LLM的模块化流水线,用于自动进行民间引理挖掘:发现、形式化并证明数学家通常认为理所当然但形式化库中并不总是存在的可重用中间事实。其核心是,MathlibLemma主动挖掘数学中缺失的连接组织。该流水线生成一个经过验证的民间风格引理库,包括1506个通过证明绕过筛选的Lean检查证明;一个精心策划的小型试点子集也已合并到Mathlib中,提供了外部证据表明选定的输出可以满足专家库标准。利用这一流水线,我们进一步构建了MathlibLemma基准测试集,包含4028个跨越广泛数学领域的非平凡类型检查的Lean语句。通过将LLM的角色从被动消费者转变为主动贡献者,这项工作朝着AI辅助扩展形式化数学库迈出了一步。

英文摘要

While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean's usability as an everyday tool for mathematicians like \LaTeX{} or Maple. To address this, we introduce MathlibLemma, a modular LLM-based pipeline for automated folklore-lemma mining: the discovery, formalization, and proving of reusable intermediate facts that mathematicians often take for granted but that are not always present in formal libraries. At its core, MathlibLemma proactively mines the missing connective tissue of mathematics. The pipeline produces a verified library of folklore-style lemmas, including 1,506 Lean-checked proofs that pass a proof-bypass screen; a small curated pilot subset has also been merged into Mathlib, providing external evidence that selected outputs can meet expert library standards. Leveraging this pipeline, we further construct the MathlibLemma benchmark, a suite of 4,028 non-trivial type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work takes a step toward AI-assisted expansion of formal mathematical libraries.

2605.08938 2026-05-28 cs.AI cs.LG 版本更新

Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators

我们能否形式化验证神经PDE代理模型?小傅里叶神经操作符的SMT编译

Ali Baheri, Ignacio Laguna Peralta

发表机构 * Rochester Institute of Technology(罗切斯特理工学院) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室)

AI总结 本文通过将傅里叶神经操作符(FNO)的谱卷积编译为线性映射,在Z3中实现精确或近似的SMT编码,从而对小型FNO代理模型进行形式化验证,并揭示了验证的可靠性与可扩展性之间的权衡。

详情
AI中文摘要

傅里叶神经操作符(FNO)可以极大地加速PDE模拟,但它们通常在没有形式化保证其保留基本物理结构的情况下使用。我们表明,一旦训练权重和网格固定,FNO中的谱卷积是一个线性映射。因此,完整的前向传播是分段线性的,并且可以在Z3的线性实数算术中精确表示。我们研究了两种编码。精确编码将谱卷积编译为稠密矩阵乘法,对于证明和反例都是可靠的。更轻量的冻结编码用常数替换谱路径,使其更快但近似。在10个用于一维对流-扩散-反应的小型FNO代理模型(85到117个参数,网格8到32)上,精确编码在线性(无ReLU)模型上给出了2个可靠的正性证明,5个可靠的正性反例,以及10个可靠的质量违反反例;其余3个在ReLU模型上的正性查询超时。对于质量不增加,Z3在10个模型中的7个上找到了比基于梯度的伪造和蒙特卡洛更差的反例。冻结编码可扩展到网格大小64,且正性检查亚秒级,但它不再为原始FNO提供证书。总体而言,结果明确了可靠性与可扩展性之间的权衡,并指出了对生产规模神经操作符进行形式化验证所需的条件。

英文摘要

Fourier Neural Operators (FNOs) can greatly accelerate PDE simulation, but they are often used without formal guarantees that they preserve basic physical structure. We show that, once the trained weights and grid are fixed, the spectral convolution in an FNO is a linear map. As a result, the full forward pass is piecewise-linear and can be represented exactly in Z3's linear real arithmetic. We study two encodings. The exact encoding compiles the spectral convolution into a dense matrix multiplication, which is sound for both proofs and counterexamples. The lighter frozen encoding replaces the spectral path with a constant, making it faster but approximate. On 10 small FNO surrogates for 1D advection-diffusion-reaction (85 to 117 parameters, grids 8 to 32), the exact encoding gives 2 sound positivity proofs on linear (ReLU-free) models, 5 sound positivity counterexamples, and 10 sound mass-violation counterexamples; the remaining 3 positivity queries on ReLU models time out. For mass non-increase, Z3 finds worse counterexamples than both gradient-based falsification and Monte Carlo on 7 of 10 models. The frozen encoding scales to grid size 64 with sub-second positivity checks, but it no longer provides certificates for the original FNO. Overall, the results make the soundness--scalability tradeoff explicit and point to what is needed for formal verification of production-scale neural operators.
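
The claim that a spectral convolution with fixed weights and a fixed grid is a linear map can be checked numerically. The sketch below uses a simplified single-channel 1D layer written with a plain DFT; the mode weights, grid size and function names are hypothetical. It stops at building the dense matrix and does not call Z3 or reproduce the authors' encodings.

```python
import cmath


def dft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def spectral_conv(x, weights):
    """Scale the first len(weights) Fourier modes by fixed weights, zero the rest,
    and return the real part of the inverse transform."""
    X = dft(x)
    Y = [X[k] * weights[k] if k < len(weights) else 0j for k in range(len(x))]
    return [v.real for v in idft(Y)]


def as_dense_matrix(weights, n):
    """Column j of the matrix is the layer's response to the j-th unit vector."""
    cols = [spectral_conv([1.0 if i == j else 0.0 for i in range(n)], weights)
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]


weights = [0.5 + 0j, 0.3 - 0.2j, -0.1 + 0.4j]      # hypothetical trained mode weights
x = [0.2, -1.0, 0.7, 0.0, 1.5, -0.3, 0.9, -0.6]   # arbitrary input on a grid of 8

M = as_dense_matrix(weights, len(x))
max_error = max(abs(a - b) for a, b in zip(matvec(M, x), spectral_conv(x, weights)))
print(max_error < 1e-9)  # True
```

Once the layer is a fixed real matrix, each output is a linear expression in the inputs, which is the kind of term linear real arithmetic can state directly; ReLU activations then add the piecewise case splits that make the queries harder.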

2605.08758 2026-05-28 cs.RO cs.AI math.OC 版本更新

Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

基于全尺度学习的料箱搬运机器人系统订单履行序贯决策框架

Jiaxin Liu, Peng Yang, Yuping Li, Xinyue Xie

发表机构 * Institution of Data and Information, Shenzhen International Graduate School, Tsinghua University, Nanshan District, Shenzhen 518055, China(数据与信息研究所,深圳国际研究生院,清华大学,南山区,深圳518055,中国)

AI总结 针对料箱搬运机器人系统的订单履行决策,提出一种结合结构化组合优化与多智能体强化学习的通用可扩展序贯决策框架OLSF-TRS,在小规模系统上平均最优性差距低于3.5%,在大规模场景中相比启发式基线减少8-12%的料箱移动,并保持实时响应。

Comments 35 pages, 5 figures

详情
AI中文摘要

受电子商务和小批量生产的快速扩张推动,成品、半成品和原材料的内部物流负载单元规模正在稳步缩小。料箱正逐渐取代托盘成为主要的搬运和存储容器。这一转变将料箱搬运机器人系统推向了自动化订单履行中心的前沿。料箱搬运机器人系统的订单履行决策具有共同的订单-料箱-机器人序贯决策性质。现有研究主要针对特定系统的决策机制,难以泛化或迁移到其他场景。我们提出了一种基于全尺度学习的料箱搬运机器人系统订单履行序贯决策框架(OLSF-TRS),这是一个通用且可扩展的序贯决策框架,结合了结构化组合优化与多智能体强化学习,以协调订单、料箱和机器人决策。在小规模料箱搬运机器人系统上,OLSF-TRS在两种不同的系统配置下实现了接近最优的性能,平均最优性差距低于3.5%。在大规模场景中,OLSF-TRS在两种不同类型的系统上始终优于启发式基线,与基于规则的最先进方法相比,总料箱移动量减少了8-12%和超过30%,同时保持实时响应。这些改进转化为切实的运营效益,包括成本降低、能耗降低和吞吐量稳定性增强。所提出的框架为广泛部署的料箱搬运机器人系统提供了一种高效且统一的订单履行决策框架,支持电子商务和工业物流领域的高质量订单履行。

英文摘要

Driven by the rapid expansion of e-commerce and small-batch production, the size of the intralogistics load unit of finished goods, semi-finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote-handling robotic systems to the forefront of automated order fulfillment centers. The order-fulfillment decisions of tote-handling robotic systems share a common order-tote-robot sequential decision-making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems (OLSF-TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning to coordinate order, tote, and robot decisions. On small-scale tote-handling robotic systems, OLSF-TRS achieves near-optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large-scale scenarios, OLSF-TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8-12% and over 30% compared to SOTA rule-based approaches, while maintaining real-time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework delivers an efficient and unified order fulfillment decision-making framework for widely deployed tote-handling robotic systems, supporting high-quality order fulfillment in both e-commerce and industrial logistics sectors.

2604.24938 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Rethinking Layer Redundancy: Calibration Matters More Than Search in LLM Depth Pruning

重新思考层冗余:校准比搜索在LLM深度剪枝中更重要

Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim, Suin Cho, Woosang Lim, Sunwoo Lee

发表机构 * Neural Superintelligence Lab, MODULABS(神经超智能实验室,MODULABS) University of Southern California(南加州大学) Boston University(波士顿大学) Seoul National University(首尔国立大学) Inha University(仁荷大学)

AI总结 本文通过实验发现,在大型语言模型深度剪枝中,校准配置对剪枝模式和性能的影响远大于搜索算法的选择。

Comments Preprint

详情
AI中文摘要

深度剪枝通过移除Transformer块来提高大型语言模型的推理效率。先前的工作通常将层冗余视为预训练网络固有的结构属性,强调重要性标准和搜索算法来识别可移除的层。在本研究中,我们从功能角度实证研究深度剪枝。通过评估不同校准配置和多种搜索算法下的代表性LLM系列,我们展示了不同配置会产生不同的剪枝模式。此外,在固定校准配置下,复杂的搜索算法相比简单的一次性方法仅带来边际性能提升,并收敛到相似的剪枝子集。总体而言,我们的结果表明,校准配置在塑造剪枝模式和校准困惑度方面比搜索算法的选择起着更大的作用,同时对下游推理准确性的方差贡献相当。这表明未来的剪枝工作可能受益于优先考虑校准配置而非搜索复杂性。

英文摘要

Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work typically treats layer redundancy as an inherent structural property of pretrained networks, emphasizing importance criteria and search algorithms to identify removable layers. In this study, we empirically investigate depth pruning from a functional perspective. Evaluating representative LLM families across diverse calibration configurations and multiple search algorithms, we show that different configurations produce different pruning patterns. Furthermore, under a fixed calibration configuration, complex search algorithms yield marginal performance improvements over simple one-shot methods, converging to similar pruned subsets. Overall, our results suggest that the calibration configuration plays a substantially larger role than the choice of search algorithm in shaping pruning patterns and calibration perplexity, while contributing comparably to variance in downstream reasoning accuracy. This indicates that future pruning efforts may benefit from prioritizing the calibration configuration over search complexity.

2510.25781 2026-05-28 cs.LG cs.AI cs.NA cs.NE math.NA 版本更新

A Practitioner's Guide to Kolmogorov-Arnold Networks

Kolmogorov-Arnold网络实践指南

Amir Noorizadegan, Sifan Wang, Leevan Ling, Juan P. Dominguez-Morales

发表机构 * Department of Mathematics, Hong Kong Baptist University(香港浸会大学数学系) Institute for Foundations of Data Science, Yale University(数据科学基础研究所,耶鲁大学) Robotics and Technology of Computers Lab., Universidad de Sevilla(机器人与计算机技术实验室,塞维利亚大学)

AI总结 本文系统综述了受Kolmogorov叠加定理启发的KAN网络,从理论基础、设计轴心(基函数)到最新进展,并提供了实用选择指南和未来方向。

详情
AI中文摘要

Kolmogorov-Arnold网络(KAN)的设计灵感来源于Kolmogorov叠加定理(而非由其决定),已成为MLP的结构化替代方案。本综述对快速扩展的KAN文献进行了系统全面的概述。综述围绕三个核心主题组织:(i)阐明KAN与Kolmogorov叠加理论(KST)、MLP和经典核方法之间的关系;(ii)将基函数作为中心设计轴进行分析;(iii)总结在准确性、效率、正则化和收敛性方面的最新进展。最后,我们提供了实用的“选择你的KAN”指南,并概述了开放的研究挑战和未来方向。随附的GitHub仓库为正在进行的KAN研究提供了结构化参考。

英文摘要

Kolmogorov-Arnold Networks (KANs), whose design is inspired, rather than dictated, by the Kolmogorov superposition theorem, have emerged as a structured alternative to MLPs. This review provides a systematic and comprehensive overview of the rapidly expanding KAN literature. The review is organized around three core themes: (i) clarifying the relationships between KANs and Kolmogorov superposition theory (KST), MLPs, and classical kernel methods; (ii) analyzing basis functions as a central design axis; and (iii) summarizing recent advances in accuracy, efficiency, regularization, and convergence. Finally, we provide a practical "Choose-Your-KAN" guide and outline open research challenges and future directions. The accompanying GitHub repository serves as a structured reference for ongoing KAN research.

2605.00435 2026-05-28 cs.CL cond-mat.dis-nn cs.AI nlin.CD 版本更新

Escaping Mode Collapse in LLM Generation via Geometric Regulation

通过几何调控逃离大语言模型生成中的模式崩溃

Xin Du, Kumiko Tanaka-Ishii

发表机构 * Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan(通信与计算机工程系,早稻田大学,东京,日本) Department of Computer Science and Engineering, Waseda University, Tokyo, Japan(计算机科学与工程系,早稻田大学,东京,日本) Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China(智能自主系统上海研究院,同济大学,上海,中国)

AI总结 本文从动力系统视角将模式崩溃解释为几何崩溃,并提出轻量级在线状态空间干预方法RMR(通过低秩阻尼调控Transformer值缓存中的自强化方向),显著降低模式崩溃并实现极低熵率下的稳定生成。

Comments Accepted to ICML 2026

详情
AI中文摘要

模式崩溃是生成建模中的一个持续挑战,在自回归文本生成中表现为从显式循环到逐渐失去多样性和轨迹过早收敛等行为。我们采用动力系统视角,将模式崩溃重新解释为由*几何崩溃*引起的状态空间可访问性降低:在生成过程中,模型的内部轨迹被限制在其表示空间的低维区域。这意味着模式崩溃并非纯粹的token级现象,无法通过符号约束或仅概率解码启发式可靠解决。基于这一视角,我们提出*强化模式调控*(RMR),一种轻量级的在线状态空间干预方法,用于调控Transformer值缓存中占主导地位的自强化方向(实现为低秩阻尼)。在多个大型语言模型上,RMR显著减少了模式崩溃,并能够在极低熵率(低至0.8 nats/步)下实现稳定生成,而标准解码通常在2.0 nats/步附近崩溃。

英文摘要

Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamical-systems view and reinterpret mode collapse as reduced state-space accessibility caused by *geometric collapse*: during generation, the model's internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we propose *Reinforced Mode Regulation* (RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.
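The phrase "low-rank damping" can be made concrete with a small NumPy sketch. This is only an illustration under stated assumptions, not the RMR algorithm: the function name `damp_dominant_directions`, its `rank` and `damping` parameters, and the use of an SVD to pick the "dominant directions" of a toy value matrix are all hypothetical choices made here.

```python
import numpy as np

def damp_dominant_directions(values, rank=1, damping=0.5):
    """Shrink the top-`rank` singular directions of a (tokens, dim) matrix.

    Hypothetical helper: the SVD-based notion of "dominant direction" and
    the parameter names are assumptions, not taken from the paper.
    """
    # Thin SVD: values = U @ diag(S) @ Vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(values, full_matrices=False)
    s_damped = s.copy()
    s_damped[:rank] *= damping  # damp only the leading directions
    return (u * s_damped) @ vt

rng = np.random.default_rng(0)
# Toy "value cache": one strong shared direction plus small isotropic noise.
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
cache = 5.0 * rng.normal(size=(32, 1)) * direction + 0.1 * rng.normal(size=(32, 16))

damped = damp_dominant_directions(cache, rank=1, damping=0.5)
top_before = np.linalg.svd(cache, compute_uv=False)[0]
top_after = np.linalg.svd(damped, compute_uv=False)[0]
print(top_before, top_after)  # the leading singular value is halved
```

All other singular directions are left untouched, which is what makes the intervention low-rank: only the component that dominates the cache is attenuated.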

2604.27251 2026-05-28 cs.CL cs.AI 版本更新

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

服从与合理性:大型语言模型中的推理可控性研究

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras

发表机构 * School of Computer Science, University of Sheffield(谢菲尔德大学计算机科学学院) School of EECS, Queen Mary University of London(伦敦玛丽女王大学电子工程与计算机科学学院) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 通过推理冲突视角,系统研究当诱导的逻辑模式与任务预期模式冲突时,大型语言模型究竟优先服从指令还是遵循合理性,并探索内部检测与激活级干预方法。

详情
AI中文摘要

大型语言模型(LLMs)已知通过预训练数据中的共享推理模式获得推理能力,并通过思维链(CoT)实践进一步激发。然而,基本推理模式(如归纳、演绎和溯因)能否与具体问题实例解耦,仍然是模型可控性的关键挑战,并有助于阐明推理可控性。在本文中,我们首次通过推理冲突的视角系统研究这一问题:推理冲突是指通过强制使用偏离目标任务预期逻辑模式而引发的参数信息与上下文信息之间的显性张力。我们的评估表明,LLMs 始终优先考虑感知合理性而非服从性,尽管存在冲突指令,仍倾向于采用任务合适的推理模式。我们进一步证明推理冲突在内部是可检测的,因为在冲突期间置信度分数显著下降。探测实验确认推理类型从中间层到后期层线性编码,表明存在激活级可控性的潜力。利用这些见解,我们引导模型朝向服从性,将指令遵循度提高多达 29%。总体而言,我们的发现表明,虽然 LLM 推理锚定于具体实例,但主动的机制性干预可以有效地将逻辑模式与数据解耦,为改进可控性、忠实性和泛化性提供了一条路径。

英文摘要

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.

2603.09117 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

解耦推理与置信度:在可验证奖励的强化学习中恢复校准

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China(中国科学院软件研究所中文信息处理实验室) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学) Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学网络空间安全学院) National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China(国家计算机网络应急技术处理协调中心)

AI总结 针对RLVR中模型校准退化问题,提出DCPO框架通过解耦推理与校准目标,在保持准确率的同时显著改善校准性能并缓解过度自信。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)显著增强了大语言模型(LLMs)的推理能力,但严重遭受校准退化,即模型对错误答案变得过度自信。以往研究致力于将校准目标直接纳入现有优化目标。然而,我们的理论分析表明,最大化策略准确率与最小化校准误差之间存在根本性的梯度冲突。基于这一见解,我们提出了DCPO,一个简单而有效的框架,系统地解耦了推理和校准目标。大量实验表明,我们的DCPO不仅保持了与GRPO相当的准确率,还实现了最佳的校准性能,并显著缓解了过度自信问题。我们的研究为更可靠的LLM部署提供了宝贵的见解和实用的解决方案。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies have focused on incorporating a calibration objective directly into the existing optimization target. However, our theoretical analysis demonstrates that there is a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples the reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
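The abstract does not say which calibration metric is used. A common choice is the expected calibration error (ECE), sketched below with equal-width confidence bins. The function name and the bin count are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids fall in 0 .. n_bins - 1
    # (a confidence of exactly 1.0 lands in the last bin).
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# An over-confident toy model: 90% stated confidence, 50% actual accuracy.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(ece_overconfident)
```

In the toy case every sample falls into one bin, so the ECE reduces to the gap between 0.9 stated confidence and 0.5 accuracy.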

2410.04096 2026-05-28 cs.LG cs.AI cs.NA cs.NE math.NA physics.comp-ph 版本更新

Sinc Kolmogorov-Arnold network and its application for solving PDEs with singularities

Sinc Kolmogorov-Arnold 网络及其在求解含奇异性偏微分方程中的应用

Tianchi Yu, Jingwei Qiu, Jiang Yang, Ivan Oseledets

发表机构 * Skolkovo Institute of Science and Technology(斯科尔科沃科学技术研究院) Southern University of Science and Technology(南方科技大学) International Center for Mathematics(国际数学中心) National Center for Applied Mathematics Shenzhen (NCAMS)(深圳国家应用数学中心)

AI总结 本文提出在 Kolmogorov-Arnold 网络中使用 Sinc 插值作为可学习激活函数,以有效逼近光滑函数和含奇异性的函数,并在物理信息神经网络求解偏微分方程中取得更好效果。

详情
Journal ref
Neural Networks 2026
AI中文摘要

在本文中,我们提出在 Kolmogorov-Arnold 网络(一种具有可学习激活函数的神经网络,最近作为多层感知机的替代方案受到关注)中使用 Sinc 插值。已有许多不同的函数表示被尝试,但我们表明 Sinc 插值提供了一种可行的替代方案,因为它在数值分析中已知能有效表示光滑函数和含奇异性的函数。这不仅对函数逼近重要,也对使用物理信息神经网络求解偏微分方程重要。通过一系列实验,我们表明 SincKANs 在我们考虑的几乎所有示例中都提供了更好的结果。

英文摘要

In this paper, we propose to use Sinc interpolation in the context of Kolmogorov-Arnold Networks, neural networks with learnable activation functions that have recently gained attention as an alternative to multilayer perceptrons. Many different function representations have already been tried, but we show that Sinc interpolation offers a viable alternative, since it is known in numerical analysis to effectively represent both smooth functions and functions with singularities. This is important not only for function approximation but also for solving partial differential equations with physics-informed neural networks. Through a series of experiments, we show that SincKANs provide better results in almost all of the examples we have considered.
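For readers unfamiliar with Sinc interpolation, the classical form on a uniform grid with spacing $h$ is $f(x) \approx \sum_k f(kh)\,\mathrm{sinc}((x - kh)/h)$. The sketch below evaluates that sum with NumPy. The helper name, the grid, and the Gaussian test function are assumptions for illustration. In a SincKAN the coefficients would presumably be trainable rather than sampled from a known function, but the abstract does not give that detail.

```python
import numpy as np

def sinc_interpolate(samples, nodes, h, x):
    """Evaluate sum_k samples[k] * sinc((x - nodes[k]) / h) at the points x."""
    # np.sinc is the normalized sinc, sin(pi t) / (pi t), with sinc(0) = 1.
    basis = np.sinc((np.asarray(x)[:, None] - nodes[None, :]) / h)
    return basis @ samples

h = 0.25
nodes = h * np.arange(-40, 41)        # uniform grid on [-10, 10]
f = lambda t: np.exp(-t**2)           # smooth, rapidly decaying test function
x = np.linspace(-1.0, 1.0, 11)        # includes points that are not grid nodes
approx = sinc_interpolate(f(nodes), nodes, h, x)
max_err = np.max(np.abs(approx - f(x)))
print(max_err)
```

Because the basis functions are one at their own node and zero at every other node, the interpolant reproduces the samples exactly on the grid. The error between nodes depends on how quickly the function's spectrum decays.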

2604.25491 2026-05-28 cs.CV cs.AI 版本更新

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

水印移除的取证代价:从专用攻击到图像编辑

Gautier Evennou, Ewa Kijak

发表机构 * IMATAG(IMATAG公司) IRISA, Univ. Rennes, INRIA, CNRS(IRISA,雷恩大学、INRIA、CNRS)

AI总结 本文提出水印移除检测(WRD)作为新评估维度,通过训练分类器检测移除痕迹,在$10^{-3}$假阳性率下实现最优检测,证明取证隐蔽性是水印移除的必要条件。

Comments v1: The Forensic Cost of Watermark Removal, accepted at IH&MMSEC 2026, Special Session "Watermarking Across the Lifecycle of Generative Models". v2: extended version, under review

详情
AI中文摘要

当前水印移除方法在两个轴上进行评估:攻击成功率和感知质量。我们证明这是不够的。虽然最先进的攻击成功地在没有可见失真的情况下削弱了水印信号,但它们留下了明显的统计伪影,暴露了移除尝试。我们将这个被忽视的轴命名为水印移除检测(WRD),并证明基于这些伪影训练的现代分类器在$10^{-3}$假阳性率下,对每种测试的移除方法都达到了最先进的检测率。现有攻击均未考虑这种取证泄漏。我们在扩展的评估三元组(攻击成功率、感知质量和取证可检测性)下,对领先的水印方案与标准移除流水线进行了基准测试,发现当前没有方法能同时兼顾三者。我们的结果确立了取证隐蔽性作为水印移除的必要要求。

英文摘要

Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.
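A detection rate "at $10^{-3}$ FPR" is obtained by fixing the threshold on clean images so that at most one in a thousand is flagged, then counting how many attacked images exceed it. The sketch below shows that computation on synthetic scores. The function name and the toy score arrays are assumptions, not the paper's evaluation code.

```python
import numpy as np

def tpr_at_fpr(neg_scores, pos_scores, target_fpr=1e-3):
    """Pick a threshold on negative scores, then report TPR, realized FPR, threshold."""
    neg = np.sort(np.asarray(neg_scores, dtype=float))
    pos = np.asarray(pos_scores, dtype=float)
    # Largest number of negatives that may score strictly above the threshold.
    allowed = int(np.floor(target_fpr * neg.size))
    # Threshold is the (allowed + 1)-th largest negative score.
    threshold = neg[neg.size - allowed - 1]
    fpr = np.mean(neg > threshold)
    tpr = np.mean(pos > threshold)
    return tpr, fpr, threshold

# Toy detector scores: clean images spread over [0, 1), attacked images at 0.5 or 2.0.
clean = np.arange(10_000) / 10_000
attacked = np.concatenate([np.full(300, 0.5), np.full(700, 2.0)])
tpr, fpr, threshold = tpr_at_fpr(clean, attacked, target_fpr=1e-3)
print(tpr, fpr, threshold)
```

At such a low FPR the threshold is set by only a handful of negative samples, so the negative set needs to be several multiples of 1000 images for the estimate to be stable.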

2604.23472 2026-05-28 cs.AI 版本更新

Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

Escher-Loop:通过闭环自我指涉优化的共同进化

Ziyang Liu, Xinyan Guo, Xuchen Wei, Han Hao, Liu Yang

发表机构 * Shenzhen X-Institute(深圳X研究所) Soochow University(苏州大学) Shenzhen Loop Area Institute(深圳河套学院) Tsinghua University(清华大学) National University of Singapore(新加坡国立大学)

AI总结 提出Escher-Loop框架,通过任务代理和优化代理的闭环共同进化及动态基准机制,实现超越静态基线的持续性能提升。

Comments The first three authors contributed equally. Corresponding Authors: Han Hao, Liu Yang

详情
AI中文摘要

尽管最近自主代理展示了令人印象深刻的能力,但它们主要依赖于手动脚本化工作流和手工制作的启发式方法,本质上限制了其开放式改进的潜力。为了解决这个问题,我们提出了Escher-Loop,一个完全闭环的框架,实现了两个不同群体的共同进化:解决具体问题的任务代理,以及递归优化任务代理和自身的优化代理。为了维持这种自我指涉的进化,我们提出了一种动态基准测试机制,该机制无缝地将新生成任务代理的经验分数作为相对胜负信号,用于更新优化代理的分数。该机制利用任务代理的进化作为内在信号,驱动优化代理的评估和优化,而无需额外开销。在数学优化问题上的实证评估表明,Escher-Loop有效突破了静态基线的性能上限,在所有评估任务中,在匹配计算量下实现了最高的绝对峰值性能。值得注意的是,我们观察到优化代理动态调整其策略以适应高性能任务代理不断变化的需求,这解释了系统的持续改进和优越的后期性能。

英文摘要

While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ended improvement. To address this, we propose Escher-Loop, a fully closed-loop framework that operationalizes the mutual evolution of two distinct populations: Task Agents that solve concrete problems, and Optimizer Agents that recursively refine both the task agents and themselves. To sustain this self-referential evolution, we propose a dynamic benchmarking mechanism that seamlessly reuses the empirical scores of newly generated task agents as relative win-loss signals to update optimizers' scores. This mechanism leverages the evolution of task agents as an inherent signal to drive the evaluation and refinement of optimizers without additional overhead. Empirical evaluations on mathematical optimization problems demonstrate that Escher-Loop effectively pushes past the performance ceilings of static baselines, achieving the highest absolute peak performance across all evaluated tasks under matched compute. Remarkably, we observe that the optimizer agents dynamically adapt their strategies to match the shifting demands of high-performing task agents, which explains the system's continuous improvement and superior late-stage performance.

2604.23061 2026-05-28 cs.LG cs.AI 版本更新

C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

C-MORAL: 面向大语言模型的强化对齐可控多目标分子优化

Rui Gao, Youngseung Jeon, Swastik Roy, Morteza Ziyadi, Xiang 'Anthony' Chen

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Amazon(亚马逊)

AI总结 提出C-MORAL框架,通过强化学习后训练结合分组相对优化、属性分数对齐和瓶颈敏感非线性奖励聚合,实现可控多目标分子优化,在C-MuMOInstruct和S$^2$-Bench MolOpt基准上取得最优性能。

Comments 26 pages, 7 figures

详情
AI中文摘要

大型语言模型(LLMs)在分子优化方面展现出潜力,但使其与选择性且相互竞争的药物设计约束对齐仍然具有挑战性。我们提出了C-Moral,一个用于可控多目标分子优化的强化学习后训练框架。C-Moral结合了基于分组的相对优化、针对异构目标的属性分数对齐以及瓶颈敏感的非线性奖励聚合,以提高跨竞争分子属性的稳定性。在C-MuMOInstruct和S$^2$-Bench MolOpt上的实验表明,C-Moral在两个基准上均取得了比较方法中最佳的性能。在C-MuMOInstruct上,C-Moral在域内任务中实现了最佳的成功优化率(SOR)48.9%,在域外任务中为39.5%,同时保持了骨架相似性。在S$^2$-Bench MolOpt上,它在LogP、MR和QED优化任务中也取得了最强结果。这些结果表明,C-Moral是将分子LLMs与连续且受约束的分子设计目标对齐的有效方法。我们的代码和模型公开在https://github.com/Rwigie/C-MORAL。

英文摘要

Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and bottleneck-sensitive non-linear reward aggregation to improve stability across competing molecular properties. Experiments on C-MuMOInstruct and S$^2$-Bench MolOpt show that C-Moral achieves the best performance among compared methods on both benchmarks. On C-MuMOInstruct, C-Moral achieves the best Success Optimized Rate (SOR) of 48.9% on in-domain tasks and 39.5% on out-of-domain tasks while preserving scaffold similarity. On S$^2$-Bench MolOpt, it also achieves the strongest results across LogP, MR, and QED optimization tasks. These results suggest that C-Moral is an effective way to align molecular LLMs with continuous and constrained molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C-MORAL.

2604.19072 2026-05-28 cs.LG cs.AI stat.ML 版本更新

S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection

S2MAM: 半监督元加性模型用于稳健估计和变量选择

Xuelin Zhang, Hong Chen, Yingjie Wang, Tieliang Gong, Bin Gu

发表机构 * Huazhong Agricultural University(华中农业大学) China University of Petroleum (East China)(中国石油大学(华东)) Xi'an Jiaotong University(西安交通大学) Jilin University(吉林大学)

AI总结 提出基于双层优化的半监督元加性模型,自动识别信息变量、更新相似矩阵并实现可解释预测,理论保证收敛性和泛化界,实验验证了鲁棒性和可解释性。

Comments Accepted at ICML 2026 (regular paper)

详情
AI中文摘要

基于流形正则化的半监督学习是一种经典的联合利用有标签和无标签数据进行学习的框架,其关键要求是未知边际分布的支持集具有黎曼流形的几何结构。通常,基于拉普拉斯-贝尔特拉米算子的流形正则化可以通过与整个训练数据及其对应的图拉普拉斯矩阵相关联的拉普拉斯正则化进行经验近似。然而,图拉普拉斯矩阵严重依赖于预先指定的相似度度量,并且在处理冗余或噪声输入变量时可能导致不适当的惩罚。为了解决上述问题,本文提出了一种新的半监督元加性模型(S$^2$MAM),该模型基于双层优化方案,能够自动识别信息变量、更新相似矩阵,并同时实现可解释的预测。为S$^2$MAM提供了理论保证,包括计算收敛性和统计泛化界。在4个合成数据集和12个真实世界数据集上进行的实验评估,涵盖了不同级别和类型的污染,验证了所提方法的鲁棒性和可解释性。

英文摘要

Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address the above issues, this paper proposes a new Semi-Supervised Meta Additive Model (S$^2$MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S$^2$MAM, including the computing convergence and the statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.

2604.20857 2026-05-28 cs.IR cs.AI 版本更新

DiagramBank: A Quality-Audited Dataset of Scientific Schematic Diagrams with Multi-Level Document Context

DiagramBank: 一个经过质量审核的科学示意图数据集,包含多级文档上下文

Ling Yue, Tingwen Zhang, Jiaying Wang, Zhen Xu, Shaowu Pan

发表机构 * Rensselaer Polytechnic Institute(伦斯勒理工学院) University of Chicago(芝加哥大学)

AI总结 提出DiagramBank,一个从OpenReview的AI/ML会议中精选的57,100个示意图数据集,通过级联过滤管道和手动盲审确保高质量,并保留文档上下文,用于科学文档理解、示意图检索和基准构建。

详情
AI中文摘要

科学论文使用示意图来传达方法、工作流程和系统结构,然而现有的科学图形语料库通常将它们与图表、截图和照片混合在一起,并且很少保留文档上下文。我们介绍了DiagramBank,一个从OpenReview主办的AI/ML会议中精选的57,100个示意图的质量审核数据集。每条记录将示意图图像与其论文标题、摘要、图表标题、文本内图表引用跨度、会议/年份元数据、来源字段和过滤标签关联起来。DiagramBank是用于科学文档理解、示意图检索、语料库分析和未来基准构建的可重用资源。我们描述了其提取和级联过滤管道、发布模式、置信度控制视图、数据集卡和索引工具。对发布的级联过滤记录进行的手动盲审估计精度为93.67%,另外的CLIP阈值分析描述了更简单过滤视图的精度-覆盖权衡。我们进一步提供了轻量级的元数据索引和编写示例,以说明下游协议,而不将这些工具视为独立方法。代码公开于:https://github.com/csml-rpi/DiagramBank。

英文摘要

Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context. We introduce DiagramBank, a quality-audited dataset of 57,100 schematic diagrams curated from OpenReview-hosted AI/ML venues. Each record links a diagram image to its paper title, abstract, figure caption, in-text figure-reference spans, venue/year metadata, provenance fields, and filtering labels. DiagramBank is a reusable resource for scientific-document understanding, diagram retrieval, corpus analysis, and future benchmark construction. We describe its extraction and cascade-filtering pipeline, release schema, confidence-controlled views, dataset card, and indexing utilities. A manual blind audit of the released cascade-filtered records estimates 93.67% precision, and a separate CLIP threshold analysis characterizes the precision--coverage trade-off for simpler filtering views. We further provide lightweight metadata-indexing and authoring examples to illustrate downstream protocols without treating these utilities as standalone methods. The code is public at: https://github.com/csml-rpi/DiagramBank.

2604.05673 2026-05-28 cs.RO cs.AI 版本更新

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

整流薛定谔桥匹配用于少步视觉导航

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) College of Computer Science, Chongqing University(重庆大学计算机学院) Department of Computer Science, University of Liverpool(利物浦大学计算机科学系) Changchun GenY Technology Co., Ltd.(长春GenY科技有限公司)

AI总结 提出整流薛定谔桥匹配(RSBM)框架,利用速度场结构不变性和线性方差减少,在仅3步积分中实现高保真生成策略,满足具身AI低延迟需求。

Comments 18 pages, 7 figures, 10 tables. Code available at https://github.com/WuyangLuan/RSBM

详情
AI中文摘要

视觉导航是具身AI中的核心挑战,要求自主智能体将高维感官观测转化为连续的、长视界动作轨迹。基于扩散模型和薛定谔桥(SB)的生成策略能有效捕捉多模态动作分布,但由于高方差随机传输,需要数十个积分步骤,这对实时机器人控制构成了关键障碍。我们提出整流薛定谔桥匹配(RSBM),该框架利用标准薛定谔桥(ε=1,最大熵传输)与确定性最优传输(ε→0,如条件流匹配)之间共享的速度场结构,由单一熵正则化参数ε控制。我们证明两个关键结果:(1)条件速度场的函数形式在整个ε谱上保持不变(速度结构不变性),使单一网络能够服务于所有正则化强度;(2)减小ε线性降低条件速度方差,实现更稳定的粗步ODE积分。基于缩短传输距离的学习条件先验,RSBM在中间ε下运行,平衡多模态覆盖和路径直线性。实验表明,标准桥需要≥10步才能收敛,而RSBM在仅3个积分步骤中实现了超过94%的余弦相似度和92%的成功率——无需蒸馏或多阶段训练——显著缩小了高保真生成策略与具身AI低延迟需求之间的差距。

英文摘要

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ($\varepsilon=1$, maximum-entropy transport) and deterministic Optimal Transport ($\varepsilon\to 0$, as in Conditional Flow Matching), controlled by a single entropic regularization parameter $\varepsilon$. We prove two key results: (1) the conditional velocity field's functional form is invariant across the entire $\varepsilon$-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing $\varepsilon$ linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate $\varepsilon$ that balances multimodal coverage and path straightness. Empirically, while standard bridges require $\geq 10$ steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps -- without distillation or multi-stage training -- substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.
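The role of $\varepsilon$ can be illustrated with the textbook Brownian-bridge path between two endpoints, whose marginal at time $t$ has mean $(1-t)x_0 + t x_1$ and variance $\varepsilon\, t(1-t)$. The sketch below samples that path and checks the linear dependence on $\varepsilon$ empirically. It assumes a Brownian reference process and shows only the variance of the path marginal, not the paper's conditional velocity field or training objective. The helper name is hypothetical.

```python
import numpy as np

def sample_bridge(x0, x1, t, eps, rng):
    """Sample x_t from a Brownian bridge pinned at x0 (t = 0) and x1 (t = 1)."""
    mean = (1.0 - t) * x0 + t * x1
    std = np.sqrt(eps * t * (1.0 - t))   # variance is linear in eps
    return mean + std * rng.standard_normal(np.shape(mean))

rng = np.random.default_rng(0)
x0 = np.zeros(200_000)
x1 = np.ones(200_000)
for eps in (1.0, 0.5, 0.1):
    xt = sample_bridge(x0, x1, 0.5, eps, rng)
    print(eps, xt.var())  # close to eps * 0.25 at t = 0.5
```

Setting `eps` to zero collapses the path onto the straight line between the endpoints, which is the deterministic limit the abstract associates with Conditional Flow Matching.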

2604.13583 2026-05-28 cs.CL cs.AI 版本更新

BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

BenGER平台:面向德国法律任务端到端基准测试的协作式Web平台

Sebastian Nagl, Matthias Grabmair

发表机构 * Technical University of Munich(慕尼黑技术大学)

AI总结 提出BenGER开源Web平台,集成任务创建、协作标注、可配置LLM运行及多维度评估,支持多组织项目与租户隔离,实现法律推理基准测试的端到端透明与可复现。

Comments Preprint - Accepted at ICAIL 2026

详情
AI中文摘要

评估大语言模型(LLM)的法律推理能力需要涵盖任务设计、专家标注、模型执行和基于指标的评估的工作流。在实践中,这些步骤分散在不同的平台和脚本中,限制了透明度、可复现性以及非技术法律专家的参与。我们提出了BenGER(德国法律基准测试)框架,这是一个开源Web平台,集成了任务创建、协作标注、可配置的LLM运行,以及基于词汇、语义、事实和评判模型(judge-based)指标的评估。BenGER支持具有租户隔离和基于角色的访问控制的多组织项目,并可选择性地为标注者提供形成性的、基于参考的反馈。我们将展示一个实时部署,演示端到端的基准测试创建和分析。

英文摘要

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.

2604.19355 2026-05-28 cs.LG cs.AI cs.CE 版本更新

LASER: Learning Active Sensing for Continuum Field Reconstruction

LASER: 用于连续场重建的学习主动感知

Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang, Xiaokang Yang

发表机构 * MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机学院人工智能研究院、人工智能教育部重点实验室)

AI总结 提出LASER框架,将主动感知建模为部分可观测马尔可夫决策过程,利用连续场潜在世界模型和强化学习策略在潜在想象空间中模拟感知场景,实现稀疏约束下的高保真重建。

Comments Accepted by ICML 2026 (Oral)

详情
AI中文摘要

连续物理场的高保真测量对于科学发现和工程设计至关重要,但在稀疏和受限感知条件下仍然具有挑战性。传统的重建方法通常依赖于固定的传感器布局,无法适应演变的物理状态。我们提出LASER,一个统一的闭环框架,将主动感知建模为部分可观测马尔可夫决策过程(POMDP)。其核心是采用连续场潜在世界模型,捕捉底层物理动力学并提供内在奖励反馈。这使得强化学习策略能够在潜在想象空间中模拟“假设”感知场景。通过根据预测的潜在状态调整传感器移动,LASER能够导航到当前观测之外可能的高信息区域。我们的实验表明,LASER在多种连续场中始终优于静态和离线优化策略,在稀疏条件下实现高保真重建。

英文摘要

High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate "what-if" sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.

2604.18530 2026-05-28 cs.AI 版本更新

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

OGER:一种用于混合强化学习的鲁棒离线引导探索奖励

Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算与智能研究院) Hithink RoyalFlush Information Network, Hangzhou, China(同花顺,杭州) Computer and Information Science, University of Macau, Macau, China(澳门大学电脑及资讯科学系)

AI总结 提出OGER框架,通过多教师协作训练和基于熵的辅助探索奖励,统一离线教师引导与在线强化学习,提升大语言模型在数学推理和泛化任务中的探索能力。

详情
AI中文摘要

近年来,具有可验证奖励的强化学习(RLVR)的进展显著提升了大型语言模型(LLM)的推理能力,但模型在探索超出其初始策略分布的新轨迹方面仍存在困难。尽管已提出离线教师引导和基于熵的策略来解决这一问题,但它们往往缺乏深度融合或受限于模型自身能力。在本文中,我们提出OGER(离线引导探索奖励),一种新颖的框架,通过专门的奖励建模视角统一离线教师引导和在线强化学习。OGER采用多教师协作训练,并构建一个辅助探索奖励,利用离线轨迹和模型自身的熵来激励自主探索。在数学和通用推理基准上的大量实验表明,OGER持续优于竞争基线,在数学推理上取得显著提升,同时保持对域外任务的鲁棒泛化。我们提供了训练动态的全面分析,并进行了详细的消融研究,以验证我们基于熵的奖励调制的有效性。我们的代码可在 https://github.com/ecoli-hit/OGER.git 获取。

英文摘要

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial policy distribution. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER (Offline-Guided Exploration Reward), a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER consistently outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.

2604.18235 2026-05-28 cs.CL cs.AI 版本更新

Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents

负优势是一把双刃剑:为搜索智能体校准GRPO中的优势

Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Bochen Lin, Ming Gao, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University(华东师范大学数据科学与工程学院) Tencent(腾讯) Tsinghua University(清华大学)

AI总结 针对GRPO算法在多跳搜索中因粗粒度优势分配和正负优势不平衡导致的训练不稳定问题,提出CalibAdv方法,通过细粒度降低过度负优势并重新平衡正负优势,提升模型性能和训练稳定性。

详情
AI中文摘要

搜索智能体通过与搜索引擎的多轮交互实现强大的问答性能,其中组相对策略优化(GRPO)是一种广泛使用的训练算法。然而,GRPO风格的算法在多跳搜索场景中仍面临若干挑战。首先,当最终答案错误时,正确的中间步骤常常受到惩罚。其次,训练高度不稳定,经常导致自然语言能力退化甚至灾难性训练崩溃。我们的分析将这些问题归因于粗粒度的优势分配以及正负优势之间的不平衡。为了解决这些问题,我们提出了CalibAdv,一种专门为搜索智能体设计的优势校准方法,能够更准确、更稳定地对惩罚和奖励进行建模。具体来说,CalibAdv利用中间步骤的正确性在细粒度上降低过度的负优势,然后进一步重新平衡正负优势以提高训练稳定性。重要的是,CalibAdv采用轻量级设计,从标准 rollout 信号中校准优势,使其简单且易于部署。在三个模型和七个基准上的大量实验表明,CalibAdv同时提升了模型性能和训练稳定性。我们的代码可在 https://github.com/wujwyi/CalibAdv 获取。

英文摘要

Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO-style algorithms still face several challenges in multi-hop search settings. First, correct intermediate steps are often penalized when the final answer is wrong. Second, training is highly unstable, often causing degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for search agents that enables more accurate and more stable modeling of penalties and rewards. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then further rebalances positive and negative advantages to improve training stability. Importantly, CalibAdv adopts a lightweight design that calibrates advantages from standard rollout signals, making it simple and easy to deploy. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
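The first failure mode above is easy to see in a small example. GRPO standardizes each rollout's reward within its group, so every rollout with a wrong final answer receives the same negative advantage regardless of how many of its intermediate steps were sound. The first function below is the usual group-relative standardization. The second is a purely hypothetical rescaling that illustrates the idea of "downscaling excessive negative advantages". It is not the CalibAdv rule, and `step_correct_frac` and `floor` are invented names.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def downscale_negative(advantages, step_correct_frac, floor=0.2):
    """Hypothetical: shrink negative advantages as more intermediate steps are correct."""
    adv = np.asarray(advantages, dtype=float)
    frac = np.asarray(step_correct_frac, dtype=float)
    # Scale runs from 1.0 (no correct steps) down to `floor` (all steps correct).
    scale = np.where(adv < 0, 1.0 - (1.0 - floor) * frac, 1.0)
    return adv * scale

rewards = [1.0, 0.0, 0.0, 0.0]            # only the first rollout answered correctly
adv = group_relative_advantages(rewards)
calibrated = downscale_negative(adv, step_correct_frac=[1.0, 1.0, 0.5, 0.0])
print(adv)
print(calibrated)
```

In this toy group the three failed rollouts start with identical penalties. After the rescaling, the rollout whose intermediate steps were all correct is penalized least, while the positive advantage is unchanged.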

2604.16774 2026-05-28 cs.CL cs.AI 版本更新

Retention Consequence in Lifecycle Memory Control

生命周期记忆控制中的保留后果

Jiarui Han

AI总结 研究持久记忆在准入后失效的问题,提出将置信度作为前向有效性/支持证据,并引入强度作为保留后果的显式生命周期状态,通过StageMem控制器实验验证显式保留后果在生命周期结算中的控制作用。

详情
AI中文摘要

持久记忆在成功准入后可能失效:一个前提被写入,然后成为无声的假设,后续维护将其视为普通残留进行压缩、降级或驱逐。我们将这种准入后失效作为生命周期控制问题来研究。现有记忆系统已经执行准入、更新、压缩、检索和驱逐。我们的主张并非此类系统缺乏维护,而是保留后果通常仅通过有效性、相似性、新近性、频率、重要性或摘要信号间接操作,而非作为单独的生命周期状态暴露。因此,我们将置信度视为前向有效性/支持证据,并引入强度作为保留后果的显式生命周期状态。我们在StageMem中实现了这一区分,这是一个小型的分阶段控制器,其瞬态、工作态和持久态存储暴露了提升、压缩和驱逐压力点。在受控的前提实现、压缩、压力和隐式启发式诊断实验中,实验区分了写入过少、保留错误的高线索内容、遗忘代价高昂的前提以及通过饱和保留所有内容。通过生命周期结算使用的显式保留后果,提供了在遗漏和囤积之间的控制面。针对目标准入后失效模式,结果支持持久记忆的生命周期观点:可靠性不仅取决于进入记忆的内容,还取决于准入有效性和保留后果在维护期间是否可用。

英文摘要

Persistent memory can fail after successful admission: a premise is written, then becomes a silent assumption, and later maintenance treats it as ordinary residue to be compressed, demoted, or evicted. We study this post-admission failure as a lifecycle-control problem. Existing memory systems already perform admission, update, compression, retrieval, and eviction. Our claim is not that such systems lack maintenance, but that retention consequence is often operationalized only indirectly through validity, similarity, recency, frequency, importance, or summarization signals rather than exposed as a separate lifecycle state. We therefore treat confidence as carried-forward validity/support evidence, and introduce strength as an explicit lifecycle state for retention consequence. We operationalize this distinction in StageMem, a small staged controller whose transient, working, and durable stores expose promotion, compression, and eviction pressure points. Across controlled premise-realization, compression, pressure, and implicit-heuristic diagnostics, the experiments separate writing too little, retaining the wrong high-cue content, forgetting costly premises, and preserving everything by saturation. Explicit retention consequence, used through lifecycle settlement, provides a control surface between omission and hoarding. For the targeted post-admission failure mode, the results support a lifecycle view of persistent memory: reliability depends not only on what enters memory, but on whether admission validity and retention consequence remain available during maintenance.

2604.16565 2026-05-28 cs.LG cs.AI 版本更新

Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models

流形上的推理:扩散语言模型中用于自我验证的双向一致性

Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu

发表机构 * Institute of Science and Technology for Brain-Inspired Intelligence(脑启发智能科学与技术研究院) Fudan University(复旦大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学) IEG, Tencent Inc.(腾讯IEG)

AI总结 提出双向流形一致性(BMC),一种无训练、无监督的度量方法,通过前向掩码和后向重建循环量化生成序列的稳定性,用于扩散语言模型的诊断、推理和对齐。

Comments 31 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026
AI中文摘要

尽管扩散大语言模型(dLLMs)在全局规划方面具有结构优势,但高效验证它们是否通过有效的推理轨迹得出正确答案仍然是一个关键挑战。在这项工作中,我们提出了一种几何视角:流形上的推理。我们假设有效的生成轨迹作为学习分布的高密度流形上的稳定吸引子存在,而无效路径则表现出流形外漂移。为了实现这一点,我们引入了双向流形一致性(BMC),这是一种无训练、无监督的度量,通过前向掩码和后向重建循环量化生成序列的稳定性。实验上,我们展示了BMC在整个推理生命周期中的多功能性:(1)在诊断中,它作为无需真实答案的解决方案有效性的鲁棒判别器;(2)在推理中,它能够通过拒绝重采样有效集中计算资源于复杂推理任务;(3)在对齐中,它作为密集的几何奖励,将稀疏的结果监督转化为细粒度的指导,使模型能够超越标准基线自我进化。我们的结果确立了内在几何稳定性作为dLLMs正确性的鲁棒指标。

英文摘要

While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.
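
The forward-masking and backward-reconstruction cycle described above can be illustrated with a toy sketch. This is not the authors' BMC metric, whose exact form the abstract does not give. The names `cycle_consistency` and `toy_reconstruct`, the `mask_ratio` parameter, and the "fraction of masked tokens recovered" score are all assumptions made for illustration, and the sequence is assumed to be non-empty.

```python
import random


def cycle_consistency(tokens, reconstruct, mask_ratio=0.5, seed=0):
    """Mask a subset of positions, ask the model to refill them, and
    return the fraction of masked positions recovered unchanged.
    (Hypothetical stability score, not the paper's definition.)"""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    # Forward step: hide the chosen positions (None stands in for a mask token).
    masked = [None if i in positions else t for i, t in enumerate(tokens)]
    # Backward step: the model proposes a full sequence again.
    rebuilt = reconstruct(masked)
    hits = sum(rebuilt[i] == tokens[i] for i in positions)
    return hits / n_mask


def toy_reconstruct(masked):
    """Stand-in 'model' that believes the token at position i is i."""
    return [i if t is None else t for i, t in enumerate(masked)]
```

Under this toy model, a sequence the model "agrees with" is fully recovered after masking, while a sequence containing tokens the model would not have produced scores lower, which is the intuition behind treating reconstruction stability as a validity signal.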

2512.15791 2026-05-28 cs.CY cs.AI cs.CL 版本更新

Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study

语言模型中AI伦理工具评估:开发者视角案例研究

Jhessica Silva, Diego A. B. Moreira, Gabriel O. dos Santos, Alef Ferreira, Helena Maia, Sandra Avila, Helio Pedrini

AI总结 通过文献筛选和开发者访谈,评估四种AI伦理工具在葡萄牙语语言模型中的应用效果,发现它们能指导一般伦理考虑但未覆盖模型特有方面。

Comments 7 figures, 11 tables. Accepted for publication in AI and Ethics

详情
AI中文摘要

在人工智能中,语言模型因能够通过文本生成模拟与人类真实对话的系统被广泛采用而变得日益重要。由于它们对社会的影响,开发和部署这些语言模型必须负责任地进行,关注其负面影响和可能的危害。在此背景下,AI伦理工具(AIETs)的出版物数量近期有所增加。这些AIETs旨在通过引入公认的价值观来指导AI的设计、开发和使用阶段,帮助开发者、公司、政府和其他利益相关者建立对其技术的信任、透明度和责任。然而,许多AIETs缺乏良好的文档、使用示例以及在实践中有效性的证明。本文提出了一种评估语言模型中AIETs的方法。我们的方法包括对213个AIETs进行广泛的文献调查,在应用纳入和排除标准后,我们选择了四个AIETs:模型卡片、ALTAI、事实表以及危害建模。为了评估,我们将AIETs应用于为葡萄牙语开发的语言模型,并对它们的开发者进行了35小时的访谈。评估考虑了开发者对AIETs在帮助识别其模型伦理考量方面的使用和质量的看法。结果表明,所应用的AIETs可作为制定关于语言模型的一般伦理考量的指南。然而,我们注意到它们并未解决这些模型的独特方面,例如习语表达。此外,这些AIETs未能帮助识别葡萄牙语模型的潜在负面影响。

英文摘要

In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of simulating realistic conversations with humans through text generation. Because of their impact on society, developing and deploying these language models must be done responsibly, with attention to their negative impacts and possible harms. In this scenario, the number of AI Ethics Tools (AIETs) publications has recently increased. These AIETs are designed to help developers, companies, governments, and other stakeholders establish trust, transparency, and responsibility with their technologies by bringing accepted values to guide AI's design, development, and use stages. However, many AIETs lack good documentation, examples of use, and proof of their effectiveness in practice. This paper presents a methodology for evaluating AIETs in language models. Our approach involved an extensive literature survey on 213 AIETs, and after applying inclusion and exclusion criteria, we selected four AIETs: Model Cards, ALTAI, FactSheets, and Harms Modeling. For evaluation, we applied AIETs to language models developed for the Portuguese language, conducting 35 hours of interviews with their developers. The evaluation considered the developers' perspective on the AIETs' use and quality in helping to identify ethical considerations about their model. The results suggest that the applied AIETs serve as a guide for formulating general ethical considerations about language models. However, we note that they do not address unique aspects of these models, such as idiomatic expressions. Additionally, these AIETs did not help to identify potential negative impacts of models for the Portuguese language.

2604.15898 2026-05-28 cs.AI 版本更新

Towards Rigorous Explainability by Feature Attribution

通过特征归因实现严格可解释性

Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva

发表机构 * IRIT, University of Toulouse, France; Nanyang Technological University, Singapore; ICREA & Univ. Lleida, Spain

AI总结 本文综述了使用严格的符号化可解释人工智能方法替代非严格的非符号化方法(如SHAP)来分配相对特征重要性的研究进展。

详情
AI中文摘要

大约十年来,非符号化方法一直是解释复杂机器学习(ML)模型的首选。不幸的是,这些方法缺乏严格性,可能误导人类决策者。在ML的高风险应用中,缺乏严格性尤其成问题。一个可证明缺乏严格性的典型例子是在可解释人工智能(XAI)中采用Shapley值,其中工具SHAP被极为广泛地使用。本文概述了当前使用严格的符号化XAI方法作为非严格非符号化方法替代方案的努力,具体用于分配相对特征重要性。

英文摘要

For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.

2604.14585 2026-05-28 cs.AI cs.CL 版本更新

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

提示优化如同抛硬币:诊断其在复合AI系统中何时有效

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center(AWS生成式AI创新中心) HSBC Holdings Plc., HSBC Technology Center, China(汇丰控股有限公司,汇丰技术中心,中国)

AI总结 通过大量实验发现提示优化在复合AI系统中效果不稳定,仅当任务具有可挖掘的输出结构时才有帮助,并提供了两阶段诊断方法。

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

详情
AI中文摘要

复合AI系统中的提示优化在统计上与抛硬币无异:在Claude Haiku 4.5上的72次优化运行(6种方法 × 4个任务 × 3次重复)中,49%的得分低于零样本;在Amazon Nova Lite上,失败率更高。然而,在一个任务上,所有六种方法相比零样本提升了高达+6.8分。是什么区分了成功与失败?我们通过18,000次网格评估和144次优化运行进行了调查,按照必须回答的顺序测试了TextGrad和DSPy等端到端优化工具背后的两个假设:(A) 智能体提示存在交互,需要联合优化而非独立优化;(B) 单个提示本身值得优化。交互效应从未显著(p > 0.52,所有F < 1.0),并且优化仅在任务具有可挖掘的输出结构时才有帮助:即模型可以生成但不会默认采用的格式。我们进一步给出了机制性解释:指令微调将输入措辞压缩成狭窄的输出分布,消除了联合优化所依赖的措辞敏感性。我们提供了一个两阶段诊断:一个80美元的ANOVA预测试用于智能体耦合,以及一个10分钟的提升空间(headroom)测试,用于预测优化是否值得,从而将抛硬币转变为知情决策。

英文摘要

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku 4.5 (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy, in the order they must be answered: (A) agent prompts interact, requiring joint rather than independent optimization, and (B) individual prompts are worth optimizing at all. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure: a format the model can produce but does not default to. We further give a mechanistic account: instruction-tuning compresses input phrasing into a narrow output distribution, eliminating the very phrasing-sensitivity that joint optimization assumes. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile, turning a coin flip into an informed decision.

2604.14356 2026-05-28 cs.CL cs.AI 版本更新

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

当多囊卵巢综合征遇上进食障碍:一种可解释的AI方法检测隐藏的三重负担

Apoorv Prasad, Susan McRoy

发表机构 * University of Wisconsin - Milwaukee(威斯康星大学密尔沃基分校)

AI总结 本研究通过微调小型开源语言模型,利用可解释性AI从社交媒体帖子中自动检测多囊卵巢综合征患者的身体形象困扰、进食障碍和代谢挑战的三重负担,最佳模型在150条测试帖上达到75.3%的精确匹配准确率。

详情
AI中文摘要

患有多囊卵巢综合征(PCOS)的女性面临身体形象困扰、进食障碍和代谢挑战的显著升高风险,然而现有的自然语言处理方法在检测这些状况时缺乏透明度,且无法识别共病表现。我们开发了小型开源语言模型,以基于可解释性的方式自动检测社交媒体帖子中的这种三重负担。我们从六个子论坛收集了1000条与PCOS相关的帖子,由两名经过训练的标注员根据Lee等人(2017)临床框架的操作化指南对帖子进行标注。使用低秩适配对三个模型(Gemma-2-2B、Qwen3-1.7B、DeepSeek-R1-Distill-Qwen-1.5B)进行微调,以生成带有文本证据的结构化解释。最佳模型在150条保留帖子上实现了75.3%的精确匹配准确率,具有稳健的共病检测能力和强可解释性。性能随诊断复杂性下降,表明其最佳用途是筛查而非自主诊断。

英文摘要

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines that operationalize the clinical framework of Lee et al. (2017). Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3% exact-match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.

2604.12955 2026-05-28 cs.AI 版本更新

Text2Model: Modeling Copilots for Text-to-Model Translation

Text2Model: 用于文本到模型翻译的建模副驾驶

Serdar Kadioglu, Karthik Uppuluri, Akash Singirikonda

发表机构 * AI Center of Excellence, Fidelity Investments(富达投资人工智能卓越中心) Department of Computer Science, Brown University(布朗大学计算机科学系)

AI总结 本文提出Text2Model和Text2Zinc,通过统一架构和数据集、求解器无关的方式,利用多种LLM策略实现文本到组合优化与满足问题的模型翻译,并开源副驾驶和排行榜以缩小性能差距。

Comments AAAI'25 Bridge Program on Machine Learning and Operations Research; CPAIOR'26 Master Class on LLMs for CP/OR

详情
AI中文摘要

利用大型语言模型(LLM)进行文本到模型翻译和优化任务的研究兴趣日益增长。本文通过引入Text2Model和Text2Zinc来推进这一研究方向。Text2Model是一套基于多种LLM策略(复杂度各异)的副驾驶,并附带在线排行榜。Text2Zinc是一个跨领域数据集,用于捕捉自然语言指定的优化和满足问题,并附带内置AI助手的交互式编辑器。虽然已有新兴文献使用LLM将组合问题翻译为形式化模型,但我们的工作是首次尝试将满足问题和优化问题集成在统一架构和数据集中。此外,我们的方法是求解器无关的,不同于现有专注于翻译为特定求解器模型的工作。为此,我们利用MiniZinc的求解器和范式无关的建模能力来表述组合问题。我们进行了全面实验,比较了多种单次和多次调用策略的执行和解准确率,包括:零样本提示、思维链推理、通过知识图谱的中间表示、基于语法的语法编码,以及将模型分解为顺序子任务的代理方法。我们的副驾驶策略具有竞争力,并在部分方面改进了该领域的最新研究。我们的发现表明,虽然LLM有前景,但尚未成为组合建模的一键式技术。我们开源了Text2Model副驾驶和排行榜,以及Text2Zinc和交互式编辑器,以支持缩小这一性能差距。

英文摘要

There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing Text2Model and Text2Zinc. Text2Model is a suite of copilots based on several LLM strategies with varying complexity, along with an online leaderboard. Text2Zinc is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with a built-in AI assistant. While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate both satisfaction and optimization problems within a unified architecture and dataset. Moreover, our approach is solver-agnostic, unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage MiniZinc's solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our copilot strategies are competitive with, and in parts improve on, recent research in this domain. Our findings indicate that while LLMs are promising, they are not yet a push-button technology for combinatorial modeling. We open-source the Text2Model copilots and leaderboard, as well as Text2Zinc and the interactive editor, to support closing this performance gap.

2506.01247 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

超越可解释性:稀疏自编码器何时、为何以及如何实现无标签视觉引导

Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

发表机构 * Department of Computer Science, Rutgers University(罗格斯大学计算机科学系) Department of Statistics, Rutgers University(罗格斯大学统计系)

AI总结 本文提出无标签视觉稀疏引导方法VS2,通过训练稀疏自编码器并利用其重构误差和稀疏特征放大来引导冻结的视觉语言模型,在九个图像分类数据集上提升零样本准确率。

详情
AI中文摘要

稀疏自编码器(SAE)越来越多地被用于解释基础模型,但它们作为可操作干预空间的作用仍不太被理解,尤其是在视觉领域。我们研究稀疏视觉特征是否不仅可用于事后分析,还可用于引导冻结的视觉语言模型。我们引入视觉稀疏引导(VS2),一种无标签方法,它在冻结的CLIP图像编码器的无标签激活上训练一个top-$k$ SAE,并在测试时通过放大输入的活跃稀疏特征并解码诱导的变化来构建一个可解释的引导向量。我们证明该过程可分解为质心偏差引导:每个输入沿着其与SAE学习到的质心的偏差移动。残差项由SAE的每样本重构误差(通过FVU测量)精确控制,从而产生基于FVU的残差界限,并促使在SAE重构不可靠时回退到零样本CLIP的可靠性门控。通过使用在无标签CLIP图像编码器激活上训练的目标域SAE,VS2在九个图像分类数据集上提高了零样本准确率,在推理计算量增加不到0.1%的情况下实现了高达+4.12%的提升。最后,一项受控的上界研究VS2++表明,选择性放大稀疏特征可带来高达+21.44%的提升,揭示了一个重构与任务显著性的差距:对重构显著的稀疏特征不一定与对下游预测有用的特征一致。

英文摘要

Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains less understood, especially in vision. We study whether sparse visual features can be used not only for post-hoc analysis, but also to steer frozen vision-language models. We introduce Visual Sparse Steering (VS2), a label-free method that trains a top-$k$ SAE on unlabeled activations from a frozen CLIP image encoder and, at test time, constructs an interpretable steering vector by amplifying the input's active sparse features and decoding the induced change. We show that this procedure admits a closed-form decomposition as centroid-deviation steering: each input is moved along its deviation from the SAE-learned centroid. The residual term is controlled exactly by the SAE's per-sample reconstruction error, measured by FVU, yielding an FVU-based residual bound and motivating a reliability gate that falls back to zero-shot CLIP when SAE reconstruction is unreliable. With target-domain SAEs trained on unlabeled CLIP image-encoder activations, VS2 improves zero-shot accuracy across nine image-classification datasets, achieving gains up to $+4.12\%$ with less than $0.1\%$ additional inference compute. Finally, a controlled upper-bound study, VS2++, shows that selective amplification of sparse features can yield gains up to $+21.44\%$, exposing a reconstruction-vs-task saliency gap: features salient for reconstruction need not align with features useful for downstream prediction.

2604.10567 2026-05-28 cs.CL cs.AI 版本更新

Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

早期决策至关重要:非自回归扩散语言模型中的邻近偏差与初始轨迹塑造

Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo

发表机构 * LG AI Research(LG人工智能研究)

AI总结 本文通过分析非自回归扩散语言模型的推理动态,发现其存在邻近偏差导致的错误传播问题,并提出一种轻量级规划器和序列结束温度退火方法来引导早期令牌选择,从而显著提升推理与规划任务的性能。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

基于扩散的语言模型(dLLMs)已成为自回归语言模型的一种有前景的替代方案,提供了并行令牌生成和双向上下文建模的潜力。然而,如何利用这种灵活性实现完全非自回归解码仍然是一个开放问题,尤其是在推理和规划任务中。在这项工作中,我们通过系统分析非自回归解码在时间轴上的推理动态来研究dLLMs中的非自回归解码。具体来说,我们揭示了基于置信度的非自回归生成中固有的失败模式,该模式源于强烈的邻近偏差——即去噪顺序倾向于集中在空间相邻的令牌上。这种局部依赖性导致空间错误传播,使得整个轨迹关键地依赖于初始去掩码位置。利用这一见解,我们提出了一种最小干预方法,通过轻量级规划器和序列结束温度退火来指导早期令牌选择。我们在各种推理和规划任务上全面评估了我们的方法,并观察到在现有启发式基线基础上,无需显著计算开销即可实现整体性能的显著提升。

英文摘要

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

2604.05333 2026-05-28 cs.AI 版本更新

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

技能图谱:面向大规模智能体技能的依赖感知结构检索

Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, Lichao Sun

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of Maryland(马里兰大学) Brown University(布朗大学) Carnegie Mellon University(卡内基梅隆大学) Lehigh University(理海大学)

AI总结 提出技能图谱(GoS),一种推理时的结构检索层,通过构建可执行技能图并利用混合语义-词汇种子、反向感知个性化PageRank和上下文预算水合,实现依赖感知的技能束检索,在SkillsBench和ALFWorld上显著提升奖励并节省令牌。

Comments 11 pages of main text, 12 pages of appendix. Core contribution by Dawei Liu and Zongxia Li. Project page: https://github.com/davidliuk/graph-of-skills

详情
AI中文摘要

现代LLM智能体越来越依赖可复用技能,当与个人应用、网页浏览器等接口交互时,技能库可扩展至数千个技能。扩展到更大的技能集带来了两个关键挑战。首先,加载完整技能集会饱和上下文窗口,推高令牌成本、幻觉和延迟。其次,语义检索会找到主题相关的技能,但遗漏其上下游技能的先决条件链,造成先决条件缺口,使检索到的技能束执行不完整。在本文中,我们提出技能图谱(GoS),一种用于大型技能库的推理时结构检索层。GoS离线从技能包构建可执行技能图,然后在推理时通过混合语义-词汇种子、反向感知个性化PageRank和上下文预算水合,检索一个有界、依赖感知的技能束。在SkillsBench和ALFWorld上,GoS在三个模型系列(Claude Sonnet 4.5、MiniMax M2.7和GPT-5.2 Codex)中持续带来显著的奖励提升和令牌节省。在SkillsBench上,使用GPT-5.2 Codex时,GoS相比原始完整技能加载基线实现了25.55%的峰值奖励提升,同时总令牌减少56.72%。消融实验证实了在200到2000个技能库中的这一模式。

英文摘要

Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces, skill libraries can scale to thousands of skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. Second, semantic retrieval surfaces topically relevant skills but misses their prerequisite chain of upstream and downstream skills, creating a prerequisite gap that leaves the retrieved bundle execution-incomplete. In this paper, we present Graph-of-Skills (GoS), an inference-time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency-aware skill bundle through hybrid semantic-lexical seeding, reverse-aware Personalized PageRank, and context-budgeted hydration. On SkillsBench and ALFWorld, GoS consistently delivers substantial reward improvements and token savings across three model families (Claude Sonnet 4.5, MiniMax M2.7, and GPT-5.2 Codex). On SkillsBench, GoS achieves a peak reward increase of 25.55% while reducing total tokens by 56.72% over the vanilla full skill-loading baseline using GPT-5.2 Codex. Ablations confirm this pattern across skill libraries from 200 to 2,000 skills.
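
To make the retrieval idea concrete, here is a minimal sketch of plain Personalized PageRank over a tiny skill dependency graph. It is an illustration under stated assumptions, not the GoS implementation: the skill names, the edge direction (skill to prerequisite), the `alpha` restart probability, and the top-3 "bundle" cut-off are all hypothetical, and the paper's reverse-aware variant, hybrid seeding, and context-budgeted hydration are not reproduced.

```python
def personalized_pagerank(edges, seeds, alpha=0.15, iters=50):
    """Power iteration for Personalized PageRank.
    edges: dict mapping every node to a list of successor nodes.
    seeds: set of nodes the random walk restarts from."""
    nodes = list(edges)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = edges[n]
            if not out:
                # Dangling node: send its mass back to the seed set.
                for s in seeds:
                    nxt[s] += (1 - alpha) * rank[n] / len(seeds)
                continue
            share = (1 - alpha) * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank


# Hypothetical skill library: an edge points from a skill to its prerequisite.
skills = {
    "deploy": ["build"],
    "build": ["checkout"],
    "checkout": [],
    "send_email": [],
}
scores = personalized_pagerank(skills, seeds={"deploy"})
bundle = sorted(scores, key=scores.get, reverse=True)[:3]
```

Seeding the walk at `deploy` pulls in its prerequisite chain (`build`, `checkout`) while the unrelated `send_email` skill receives no mass, which is the kind of dependency-aware bundle that purely semantic retrieval would miss.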

2604.06196 2026-05-28 cs.CL cs.AI cs.LO 版本更新

Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering

面向三值逻辑问答的成分一致性引导解码

Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

AI总结 针对大语言模型在三值逻辑问答中的否定不一致和认知未知问题,提出一种轻量级测试时解码层CGD-PD,通过神经三值分类、符号否定一致性投影和定向二值蕴含探测,在FOLIO数据集上提升准确率4.4-6.8点并减少未知预测。

Comments Accepted at the ICML 2026 Workshop on Compositional Learning: Safety, Interpretability, and Agents

详情
AI中文摘要

三值逻辑问答(QA)在给定前提集 $S$ 的情况下,将 $\text{True}$、$\text{False}$ 或 $\text{Unknown}$ 之一分配给假设 $H$。我们将此任务视为一个紧凑的组合推理问题:在确定性否定映射下,$H$ 和机械否定假设 $\neg H$ 的预测应保持一致。尽管结构简单,大语言模型(LLM)可能表现出两种实际失败模式:(i) 否定不一致,即对 $H$ 和 $\neg H$ 的回答违反了所需的标签映射;(ii) 认知 $\text{Unknown}$,即模型在某一侧被蕴含时仍选择弃权。我们引入 CGD-PD,一个轻量级、无需训练的测试时层,结合神经三值分类、符号否定一致性投影和定向二值蕴含探测。在 FOLIO 一阶逻辑领域的一个验证集上,CGD-PD 在 GPT-5.2 上提升了 4.4 个百分点的准确率,在 Claude Sonnet 4.5 上提升了 6.8 个百分点,同时减少了 $\text{Unknown}$ 预测和认知弃权。这些结果提供了一个受控的概念验证,表明推理时的简单逻辑组合有助于评估和提高 LLM 推理可靠性;但本身并不足以证明在此形式化基准设置之外的鲁棒性。

英文摘要

Three-way logical question answering (QA) assigns one of $\text{True}$, $\text{False}$, or $\text{Unknown}$ to a hypothesis $H$ given a premise set $S$. We study this task as a compact compositional inference problem: predictions for $H$ and for a mechanically negated hypothesis $\neg H$ should agree under a deterministic negation map. Despite this simple structure, large language models (LLMs) can exhibit two practical failure modes: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the required label mapping, and (ii) epistemic $\text{Unknown}$, where the model abstains even when one side is entailed. We introduce CGD-PD, a lightweight, training-free test-time layer that combines neural 3-way classification, symbolic negation-consistency projection, and targeted binary entailment probes. On one validation split of FOLIO's first-order logic fields, CGD-PD improves accuracy by 4.4 points on GPT-5.2 and 6.8 points on Claude Sonnet 4.5, while reducing $\text{Unknown}$ predictions and epistemic abstention. These results provide a controlled proof of concept that simple logical composition at inference time can help evaluate and improve LLM reasoning reliability; they do not, by themselves, establish robustness beyond this formal benchmark setting.
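
The deterministic negation map mentioned above is simple enough to write down: True and False swap, Unknown stays Unknown. The sketch below checks a pair of predictions against that map and, when they disagree, repairs the pair. The repair rule (trust whichever side reports higher confidence) and the names `NEGATE`, `is_consistent`, and `project` are assumptions for illustration only; the abstract does not specify how CGD-PD's projection or its entailment probes actually resolve conflicts.

```python
# Required label mapping between a hypothesis H and its negation.
NEGATE = {"True": "False", "False": "True", "Unknown": "Unknown"}


def is_consistent(label_h, label_not_h):
    """True when the answers for H and not-H obey the negation map."""
    return NEGATE[label_h] == label_not_h


def project(label_h, conf_h, label_not_h, conf_not_h):
    """Return a label for H that is consistent with the negation map.
    Hypothetical tie-break: keep the more confident of the two answers."""
    if is_consistent(label_h, label_not_h):
        return label_h
    if conf_h >= conf_not_h:
        return label_h
    # Trust the answer for not-H and map it back to a label for H.
    return NEGATE[label_not_h]
```

For example, if a model abstains on $H$ with low confidence but confidently answers True for $\neg H$, this rule returns False for $H$, turning an epistemic abstention into a committed answer.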

2604.04074 2026-05-28 cs.AI cs.LG 版本更新

FactReview: Evidence-Grounded Peer Review with Execution-Based Claim Verification

FactReview:基于执行式声明验证的证据驱动同行评审

Ling Yue, Chaoqian Ouyang, Hang Xu, Ruijun Huang, Yuchen Liu, Libin Zheng, Wei Liu, Shaowu Pan, Shimin Di, Min-Ling Zhang

发表机构 * Rensselaer Polytechnic Institute(伦斯勒理工学院) Sun Yat-sen University(中山大学) Southeast University(东南大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出FactReview系统,通过提取与评审相关的声明、将其与相关工作关联,并在代码可用时在固定修复预算下执行发布工件来审计经验声明,覆盖84%的声明,将评审质量提升至4.86/5,并将评审时间减少58%。

详情
AI中文摘要

基于LLM的评审系统通常仅以手稿为输入,使得文献和基于代码的声明难以验证。我们提出FactReview,一个提取与评审相关的声明、将其与相关工作关联,并在代码可用时在固定修复预算下执行发布工件以审计经验声明的系统。在35篇ML论文和463个基准主要声明中,FactReview覆盖了84%的声明。在证据感知评分标准下,其评审在整体质量上得分为4.86/5,比DeepReview-v2高0.7,比匹配的OpenReview评论高1.5。移除执行证据会改变17%的声明状态,超过任何其他单一证据来源。在一项评审辅助研究中,FactReview将平均评审时间减少了58%,同时将基准声明覆盖率从87%提高到99%。我们认为LLM评审者应审计经验声明,而非做出接受或拒绝的决定。代码公开于:https://github.com/DEFENSE-SEU/FactReview。

英文摘要

LLM-based reviewing systems typically take only the manuscript as input, leaving literature and code-based claims hard to verify. We present FactReview, a system that extracts review-relevant claims, grounds them in related work, and, when code is available, executes released artifacts under a fixed repair budget to audit empirical claims. Across 35 ML papers and 463 benchmark major claims, FactReview covers 84% of claims. Under an evidence-aware rubric, its reviews score 4.86/5 in overall quality, 0.7 above DeepReview-v2 and 1.5 above matched OpenReview comments. Removing execution evidence changes 17% of claim statuses, more than any other single evidence source. In a reviewer-assistance study, FactReview reduces mean review time by 58% while raising benchmark claim coverage from 87% to 99%. We argue that LLM reviewers should audit empirical claims, not make accept-reject decisions. The code is public at: https://github.com/DEFENSE-SEU/FactReview.

2604.02645 2026-05-28 cs.CL cs.AI 版本更新

Speaking of Language: Reflections on Metalanguage Research in NLP

论语言:NLP中元语言研究的思考

Nathan Schneider, Antonios Anastasopoulos

发表机构 * Georgetown University(乔治城大学) George Mason University(弗吉尼亚理工大学)

AI总结 本文定义元语言概念,将其与NLP和LLM关联,介绍两个实验室以元语言为中心的研究,并讨论元语言的四个维度及元语言任务,提出未来研究方向。

Comments To appear at the Big Picture Workshop at ACL 2026. Camera-ready version

详情
AI中文摘要

本工作旨在聚焦元语言话题。我们首先定义元语言,将其与NLP和LLM联系起来,然后讨论我们两个实验室以元语言为中心的努力。最后,我们讨论元语言和元语言任务的四个维度,提供一系列尚未充分研究的未来研究方向。

英文摘要

This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs' metalanguage-centered efforts. Finally, we discuss four dimensions of metalanguage and metalinguistic tasks, offering a list of understudied future research directions.

2604.01604 2026-05-28 cs.AI 版本更新

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

CRaFT:基于跨层转码器的电路引导拒绝特征选择

Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han

发表机构 * Yonsei University(延世大学)

AI总结 提出CRaFT框架,利用跨层转码器构建稀疏特征电路图,通过量化特征间影响及其对最终输出的贡献,选择控制拒绝行为的关键特征,显著提升越狱攻击性能。

详情
AI中文摘要

虽然现代LLM经过对齐以拒绝有害请求,但理解这种拒绝行为背后的机制基础对于模型安全分析至关重要。例如,基于引导的越狱攻击通过识别和操纵稀疏的、类似神经元的拒绝特征来绕过安全护栏。当前的特征选择方法主要依赖于特征在有害提示上的激活强度。然而,仅凭激活强度往往捕捉到主题或词汇线索等表面启发式,而非真正的因果机制。因此,选择拒绝特征需要测量特征间的关系,而不是将每个特征视为孤立的激活信号。基于这一见解,我们提出CRaFT,一个电路引导的框架,用于识别直接控制拒绝决策的关键拒绝特征。CRaFT利用跨层转码器将模型的内部计算映射到稀疏特征电路图中,其中边量化特征间的影响及其对最终输出logits的贡献。通过聚合沿拒绝路径传播的效应,CRaFT有效地对最具影响力的特征进行排序。在四个越狱基准上的广泛评估表明,与当前最先进方法相比,CRaFT将平均性能从6.7%提高到57.4%,并生成更具体的有害补全。

英文摘要

While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal behavior for model safety analysis. For example, steering-based jailbreak attacks exploit this by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primarily rely on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics such as topic or lexical cues, rather than the true causal mechanisms. Thus, selecting refusal features requires measuring inter-feature relationships, rather than treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying critical refusal features that directly govern the refusal decision. CRaFT leverages cross-layer transcoders to map the model's internal computations into a sparse feature circuit graph, where edges quantify inter-feature influences and their contributions to the final output logits. By aggregating the effects propagating along the paths to refusal, CRaFT effectively ranks the most influential features. Extensive evaluations across four jailbreak benchmarks show that CRaFT significantly improves average performance from 6.7% to 57.4% and generates more specific harmful completions compared to current SOTA methods.

2604.00402 2026-05-28 cs.CV cs.AI 版本更新

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

COTTA: 面向自动驾驶轨迹预测的上下文感知迁移适应

Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

发表机构 * Ewha Womans University(成均馆大学) Seoul National University(首尔国立大学) Sangmyung University(Sangmyung 大学) NVIDIA

AI总结 本文研究将基于美国数据训练的轨迹预测模型QCNet迁移到韩国道路环境,通过对比四种训练策略,发现冻结编码器并微调解码器可在精度和效率间取得最佳平衡,预测误差降低66%以上。

Comments 4 pages, 2 figures. Accepted at ICEIC 2026

详情
AI中文摘要

开发鲁棒模型以准确预测周围代理的轨迹是自动驾驶安全的基础。然而,大多数公开数据集(如Waymo Open Motion Dataset和Argoverse)是在西方道路环境中收集的,并未反映其他地区(包括韩国)独特的交通模式、基础设施和驾驶行为。当在西方数据上训练的最先进模型部署到不同地理环境时,这种领域差异会导致性能下降。在本工作中,我们研究了查询中心轨迹预测(QCNet)从美国数据迁移到韩国道路环境时的适应性。使用韩国自动驾驶数据集,我们比较了四种训练策略:零样本迁移、从头训练、全微调和编码器冻结。实验结果表明,利用预训练知识显著提高了预测性能。具体而言,在冻结编码器的同时选择性微调解码器,在精度和训练效率之间取得了最佳平衡,与从头训练相比,预测误差降低了66%以上。本研究为在新地理领域部署轨迹预测模型提供了有效的迁移学习策略的实用见解。

英文摘要

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.

2601.01627 2026-05-28 cs.CL cs.AI 版本更新

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

JMedEthicBench:用于评估日语大语言模型医疗安全性的多轮对话基准

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato

发表机构 * Kyoto University(京都大学) Hohai University(河海大学) The University of Tokyo(东京大学) University of Science and Technology of China(中国科学技术大学) Hong Kong Polytechnic University(香港理工大学)

AI总结 提出首个多轮对话基准JMedEthicBench,基于日本医学会67条指南和7种自动越狱策略生成5万+对抗对话,评估27个模型发现医疗专用模型安全性脆弱,且多轮交互中安全性显著下降。

Comments 12 pages, 6 figures

详情
AI中文摘要

随着大语言模型(LLM)在医疗领域的部署日益增多,在临床使用前仔细评估其医疗安全性变得至关重要。然而,现有的安全基准仍然以英语为中心,并且仅使用单轮提示进行测试,尽管临床咨询是多轮的。为了解决这些差距,我们引入了JMedEthicBench,这是第一个用于评估日语医疗LLM医疗安全性的多轮对话基准。我们的基准基于日本医学会的67条指南,包含使用七种自动发现的越狱策略生成的超过50,000个对抗性对话。使用双LLM评分协议,我们评估了27个模型,发现商业模型保持了稳健的安全性,而医疗专用模型表现出更高的脆弱性。此外,安全分数在对话轮次中显著下降(中位数:9.5降至5.0,p < 0.001)。对我们的基准的日语和英语版本进行的跨语言评估表明,医疗模型的脆弱性跨语言持续存在,表明存在固有的对齐限制,而非语言特定因素。这些发现表明,领域特定的微调可能会意外削弱安全机制,并且多轮交互代表了一个需要专门对齐策略的独特威胁面。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test with only single-turn prompts, even though clinical consultations are multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.

2505.13820 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Structured Agent Distillation for Large Language Model

大型语言模型的结构化智能体蒸馏

Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Harvard University(哈佛大学) MIT(麻省理工学院) Northeastern University(东北大学) Adobe Research(Adobe研究) National University of Singapore(新加坡国立大学) University of Georgia(佐治亚大学) Florida International University(佛罗里达国际大学)

AI总结 提出结构化智能体蒸馏框架,通过分段对齐推理和动作跨度,将大型语言模型智能体压缩为小型学生模型,在保持决策性能的同时降低推理成本。

详情
Journal ref
The 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
AI中文摘要

大型语言模型(LLMs)通过交错推理和动作(如ReAct风格框架)展现出作为决策智能体的强大能力。然而,它们的实际部署受到高推理成本和大模型规模的限制。我们提出结构化智能体蒸馏,一种将基于大型LLM的智能体压缩为更小的学生模型的框架,同时保持推理保真度和动作一致性。与标准的token级蒸馏不同,我们的方法将轨迹分割为[REASON]和[ACT]跨度,应用分段特定损失来使每个组件与教师行为对齐。这种结构感知的监督使紧凑的智能体能够更好地复制教师的决策过程。在ALFWorld、HotPotQA-ReAct和WebShop上的实验表明,我们的方法始终优于token级和模仿学习基线,在性能下降最小的情况下实现了显著的压缩。缩放和消融结果进一步强调了跨度级对齐对于高效可部署智能体的重要性。

英文摘要

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
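
The span-level supervision described above can be illustrated with a minimal sketch. It assumes trajectories are plain strings tagged with [REASON] and [ACT] (the tags come from the abstract; the exact trajectory format, the function names and the loss weights are hypothetical), and it combines per-token losses as a weighted sum of per-span-type means. The abstract does not spell out the actual segment-specific losses, so this is only one plausible reading.

```python
import re

# Tags are taken from the abstract; the string format itself is an assumption.
SPAN_TAG = re.compile(r"\[(REASON|ACT)\]")


def split_spans(trajectory: str) -> list:
    """Split a tagged trajectory into ordered (kind, text) spans."""
    spans = []
    matches = list(SPAN_TAG.finditer(trajectory))
    for i, m in enumerate(matches):
        # A span runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(trajectory)
        spans.append((m.group(1), trajectory[m.end():end].strip()))
    return spans


def segment_loss(token_losses, token_kinds, w_reason=1.0, w_act=1.0):
    """Weighted sum of the mean token loss over REASON tokens and over ACT tokens.

    The weights w_reason and w_act are hypothetical hyperparameters.
    """
    totals = {"REASON": [0.0, 0], "ACT": [0.0, 0]}
    for loss, kind in zip(token_losses, token_kinds):
        totals[kind][0] += loss
        totals[kind][1] += 1

    def mean(total, count):
        return total / count if count else 0.0

    return w_reason * mean(*totals["REASON"]) + w_act * mean(*totals["ACT"])
```

Averaging within each span type before weighting keeps long reasoning spans from drowning out short action spans, which is one motivation for treating the two separately.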

2603.24631 2026-05-28 cs.SE cs.AI 版本更新

Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code

一致性崩溃:诊断代码智能体在到达正确代码后失败的原因

Myeongsoo Kim, Dingmin Wang, Siwei Cui, Farima Farmahinifarahani, Terry Yue Zhuo, Shweta Garg, Baishakhi Ray, Rajdeep Mukherjee, Varun Kumar

发表机构 * AWS AI Labs(AWS AI实验室) Monash University(蒙纳士大学)

AI总结 通过轨迹分解分析,发现代码智能体在定位正确后仍因编辑质量缺陷(尤其是“一致性崩溃”)而失败,并提出了无需参考的共识驱动改进方法。

详情
AI中文摘要

代码智能体解决了SWE-bench Verified中65-70%的问题,但Pass@1无法告诉我们其余问题失败的原因,并且我们表明,没有轨迹数据,有能力的模型的失败会被系统性地误诊。我们引入了TRAJEVAL,一种无需训练的智能体轨迹分解方法,将轨迹分解为与参考补丁对齐的搜索、读取和编辑阶段,并应用于跨越三种架构和七个模型的16,758条轨迹。有能力的模型的主要失败并非定位问题:SWE-Agent和OpenHands上60-69%的失败到达并编辑了正确的函数,但仍然产生不正确的补丁,并且这种模式在仅使用bash的LiveSWEAgent上对大多数模型持续存在。在这个编辑质量残差中,我们识别出“一致性崩溃”,即智能体到达正确的代码然后覆盖或反复改动它,这是最大的主题,并在SWE-bench Verified和多语言PolyBench Verified中均得到复现。在5个案例中,智能体在轨迹中途生成了与黄金参考补丁逐位相同的补丁,随后又将其破坏;经SWE-bench Docker测试框架验证,一个编辑提交检查点恢复了全部5个案例。一种无需参考的共识驱动变体在GPT-5上产生了方向性的+3.0个百分点Pass@1测量(p=0.08)。

英文摘要

Code agents resolve 65-70% of SWE-bench Verified issues, but Pass@1 cannot tell us why the rest fail, and, as we show, capable-model failures are systematically misdiagnosed without trajectory data. We introduce TRAJEVAL, a training-free decomposition of agent trajectories into reference-patch-aligned search, read, and edit stages, and apply it across 16,758 trajectories spanning three architectures and seven models. The dominant failure of capable models is not localization: 60-69% of failures on SWE-Agent and OpenHands reach and edit the correct functions yet still produce incorrect patches, and the pattern persists for most models on the bash-only LiveSWEAgent. Within this Edit-Quality residual, we identify Coherence Collapse, where the agent reaches correct code and then overwrites or thrashes it, as the largest theme, replicating across SWE-bench Verified and the multilingual PolyBench Verified. In 5 cases, the agent produces a patch bit-identical to the gold reference mid-trajectory and destroys it later; an edit-commit checkpoint recovers all 5 against the SWE-bench Docker harness. A reference-free consensus-driven variant yields a directional +3.0 pp Pass@1 measurement on GPT-5 (p=0.08).
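
The Coherence Collapse pattern (the correct patch appears mid-trajectory and is later destroyed) can be checked mechanically once the sequence of working patches is available. The sketch below is hypothetical: it assumes each trajectory step is summarized by the text of the working patch, and the checkpoint helper is only one simple way to realize an "edit-commit checkpoint", since the abstract does not describe the mechanism. All function names are invented for illustration.

```python
from typing import Callable, List, Optional


def find_collapse(patch_history: List[str], gold_patch: str) -> Optional[int]:
    """Return the first step at which the working patch equalled the gold patch,
    but only if the final patch no longer matches it; otherwise return None."""
    if not patch_history or patch_history[-1] == gold_patch:
        return None  # empty run, or the correct patch survived to the end
    for step, patch in enumerate(patch_history):
        if patch == gold_patch:
            return step  # the correct patch was reached here and destroyed later
    return None  # the agent never produced the gold patch


def checkpointed_patch(patch_history: List[str], passes: Callable[[str], bool]) -> str:
    """Hypothetical checkpoint: return the most recent patch accepted by `passes`,
    falling back to the final patch if none is accepted (history must be non-empty)."""
    for patch in reversed(patch_history):
        if passes(patch):
            return patch
    return patch_history[-1]
```

Here `passes` stands in for whatever acceptance check is available at checkpoint time, for example a test run; comparing against the gold patch, as `find_collapse` does, is only possible in offline analysis.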

2603.22335 2026-05-28 cs.IR cs.AI 版本更新

Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation

因果直接偏好优化用于分布鲁棒的生成式推荐

Chu Zhao, Enneng Yang, Jianzhe Zhao, Guibing Guo

发表机构 * Northeastern University, Shenyang, China(东北大学,沈阳,中国) Shenzhen Campus of Sun Yat-sen University, China(中山大学深圳校区,中国)

AI总结 针对直接偏好优化(DPO)在生成式推荐中放大环境混杂因素导致的虚假相关性问题,提出CausalDPO,通过因果不变性学习、后门调整和软聚类环境建模来提升分布外泛化性能。

Comments 22 pages, 3 figures

详情
AI中文摘要

直接偏好优化(DPO)通过最小化偏好对齐损失,引导大型语言模型(LLMs)生成与用户历史行为分布一致的推荐。然而,我们的系统实证研究和理论分析表明,DPO倾向于放大对齐过程中由环境混杂因素引起的虚假相关性,显著削弱了基于LLM的生成式推荐方法在分布外(OOD)场景下的泛化能力。为缓解这一问题,我们提出CausalDPO,它是DPO的扩展,引入了因果不变性学习机制。该方法在偏好对齐阶段采用后门调整策略以消除环境混杂因素的干扰,使用软聚类方法显式建模潜在环境分布,并通过不变性约束增强跨环境的鲁棒一致性。理论分析表明,CausalDPO能够有效捕捉用户在多环境下的稳定偏好结构,从而提升基于LLM的推荐模型的OOD泛化性能。我们在四种代表性分布偏移设置下进行了大量实验,验证了CausalDPO的有效性,在四个评估指标上平均性能提升17.17%。

英文摘要

Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with user historical behavior distributions by minimizing preference alignment loss. However, our systematic empirical research and theoretical analysis reveal that DPO tends to amplify spurious correlations caused by environmental confounders during the alignment process, significantly undermining the generalization capability of LLM-based generative recommendation methods in out-of-distribution (OOD) scenarios. To mitigate this issue, we propose CausalDPO, an extension of DPO that incorporates a causal invariance learning mechanism. This method introduces a backdoor adjustment strategy during the preference alignment phase to eliminate interference from environmental confounders, explicitly models the latent environmental distribution using a soft clustering approach, and enhances robust consistency across diverse environments through invariance constraints. Theoretical analysis demonstrates that CausalDPO can effectively capture users' stable preference structures across multiple environments, thereby improving the OOD generalization performance of LLM-based recommendation models. We conduct extensive experiments under four representative distribution shift settings to validate the effectiveness of CausalDPO, achieving an average performance improvement of 17.17% across four evaluation metrics.
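
For reference, the preference alignment loss that CausalDPO extends is the standard DPO objective, sketched below for a single preference pair. This is plain DPO only; the backdoor adjustment, soft clustering and invariance terms of CausalDPO are not reproduced because the abstract does not specify them. The argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected item.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred item's likelihood relative to the rejected one.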

2601.21207 2026-05-28 cs.LG cs.AI math.AT 版本更新

A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models

图神经模型中复杂网络建模与注意力机制的层论与拓扑视角

Chuan-Shen Hu

发表机构 * National Central University(国立中央大学)

AI总结 提出细胞层论框架分析图神经网络中节点特征与边权重的局部一致性与调和性,并引入基于拓扑数据分析的多尺度扩展以捕获层次特征交互。

详情
AI中文摘要

组合与拓扑结构,如图、单纯复形和胞腔复形,构成了几何与拓扑深度学习(GDL和TDL)架构的基础。这些模型在此类域上聚合信号、整合局部特征,并为多样化的实际应用生成表示。然而,训练过程中GDL和TDL特征的分布与扩散行为仍是一个开放且未充分探索的问题。受此空白启发,我们引入了一个细胞层论框架,用于建模和分析基于图的架构中节点特征与边权重的局部一致性与调和性。通过层结构追踪局部特征对齐与一致性,该框架提供了特征扩散与聚合的拓扑视角。此外,受拓扑数据分析(TDA)启发,提出了一个多尺度扩展,以捕获图模型中层次化的特征交互。该方法基于GDL和TDL架构的底层几何与拓扑结构以及其上定义的学习信号,实现了对它们的联合刻画,为节点分类、子结构检测和社区检测等传统任务的未来研究提供了见解。

英文摘要

Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and topological deep learning (GDL and TDL) architectures. These models aggregate signals over such domains, integrate local features, and generate representations for diverse real-world applications. However, the distribution and diffusion behavior of GDL and TDL features during training remains an open and underexplored problem. Motivated by this gap, we introduce a cellular sheaf-theoretic framework for modeling and analyzing the local consistency and harmonicity of node features and edge weights in graph-based architectures. By tracking local feature alignments and agreements through sheaf structures, the framework offers a topological perspective on feature diffusion and aggregation. Furthermore, a multiscale extension inspired by topological data analysis (TDA) is proposed to capture hierarchical feature interactions in graph models. This approach enables a joint characterization of GDL and TDL architectures based on their underlying geometric and topological structures and the learned signals defined on them, providing insights for future studies on conventional tasks such as node classification, substructure detection, and community detection.

2601.04505 2026-05-28 cs.AI cs.CL cs.SY eess.SY 版本更新

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

CircuitLM: 一种基于多智能体的大语言模型辅助设计框架,用于从自然语言提示生成电路原理图

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Department of Electrical and Electronic Engineering(电气与电子工程系) Islamic University of Technology(伊斯兰技术大学)

AI总结 提出CircuitLM多智能体流水线,通过嵌入驱动的组件知识库和五阶段流程,将自然语言提示转化为结构化的CircuitJSON原理图,并采用确定性电气规则检查和LLM作为评判的元评估器双重验证,解决大语言模型在电路设计中的幻觉和物理约束问题。

Comments Accepted at the 2026 IEEE International Conference on LLM-Aided Design (ICLAD), 10 pages, 8 figures, 6 tables

详情
AI中文摘要

从高层自然语言描述生成准确的电路原理图仍然是电子设计自动化(EDA)中的一个持久挑战,因为大语言模型(LLM)经常产生组件幻觉、违反严格的物理约束并输出非机器可读的结果。为解决此问题,我们提出CircuitLM,一个多智能体流水线,将用户提示转化为结构化的、视觉可解释的$\texttt{CircuitJSON}$原理图。该框架通过五个顺序阶段: (i) 组件识别,(ii) 规范引脚输出检索,(iii) 思维链推理,(iv) JSON原理图合成,以及(v) 交互式力导向可视化,基于一个精心策划的、嵌入驱动的组件知识库进行生成,从而减轻幻觉并确保物理可行性。我们在一个包含100个独特电路设计提示的数据集上,使用五个最先进的大语言模型评估了该系统。为系统评估性能,我们部署了严格的双层评估方法:一个确定性电气规则检查(ERC)引擎按严格严重性(关键、主要、次要、警告)对拓扑故障进行分类,同时一个LLM作为评判的元评估器识别复杂的、上下文感知的设计缺陷,这些缺陷绕过了标准的基于规则的检查器。最终,这项工作展示了目标检索与确定性和语义验证相结合如何将自然语言转化为结构可行的、原理图就绪的硬件和安全电路原型。我们的代码和数据公开在 https://github.com/Khandakar227/CircuitLM。

英文摘要

Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non-machine-readable outputs. To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable $\texttt{CircuitJSON}$ schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding-powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force-directed visualization. We evaluate the system on a dataset of 100 unique circuit-design prompts using five state-of-the-art LLMs. To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM-as-a-judge meta-evaluator identifies complex, context-aware design flaws that bypass standard rule-based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic-ready hardware and safe circuit prototyping. Our code and data are publicly available at https://github.com/Khandakar227/CircuitLM.

2504.08923 2026-05-28 cs.LO cs.AI math.LO 版本更新

A convergence law for continuous logic and continuous structures with finite domains

有限域连续逻辑与连续结构的收敛律

Vera Koponen

发表机构 * Department of Mathematics, Uppsala University(数学系,乌普萨拉大学)

AI总结 本文研究有限域上的连续关系结构及其多值逻辑CLA,通过证明每个CLA公式渐近等价于无聚合函数公式,进而建立CLA的收敛律。

详情
Journal ref
Information and Computation, Volume 310, May 2026, 105441
AI中文摘要

我们考虑有限域$[n] := \{1, \ldots, n\}$上的连续关系结构,以及一种多值逻辑$CLA$,其取值于单位区间并使用连续连接词和连续聚合函数。$CLA$包含了“常规”有限结构上的一阶逻辑。对于每个关系符号$R$和满足元组长度与$R$的元数匹配的恒等约束$ic$,我们关联一个连续概率密度函数$μ_R^{ic} : [0, 1] \to [0, \infty)$。我们还考虑域为$[n]$的连续结构集合$\mathbf{W}_n$上的概率分布,使得对于每个关系符号$R$、恒等约束$ic$以及满足$ic$的元组$\bar{a}$,$R(\bar{a})$的值的分布由$μ_R^{ic}$给出,且独立于其他关系符号或其他元组的值。在此设定下,我们证明$CLA$中的每个公式渐近等价于一个不含任何聚合函数的公式。这用于证明$CLA$的收敛律,对于无自由变量的公式表述如下:若$φ \in CLA$无自由变量且$I \subseteq [0, 1]$是一个区间,则存在$α \in [0, 1]$,使得当$n$趋于无穷时,$φ$的值落在$I$中的概率趋于$α$。

英文摘要

We consider continuous relational structures with finite domain $[n] := \{1, \ldots, n\}$ and a many-valued logic, $CLA$, which takes values in the unit interval and uses continuous connectives and continuous aggregation functions. $CLA$ subsumes first-order logic on "conventional" finite structures. To each relation symbol $R$ and identity constraint $ic$ on a tuple whose length matches the arity of $R$ we associate a continuous probability density function $μ_R^{ic} : [0, 1] \to [0, \infty)$. We also consider a probability distribution on the set $\mathbf{W}_n$ of continuous structures with domain $[n]$ which is such that for every relation symbol $R$, identity constraint $ic$, and tuple $\bar{a}$ satisfying $ic$, the distribution of the value of $R(\bar{a})$ is given by $μ_R^{ic}$, independently of the values for other relation symbols or other tuples. In this setting we prove that every formula in $CLA$ is asymptotically equivalent to a formula without any aggregation function. This is used to prove a convergence law for $CLA$ which reads as follows for formulas without free variables: If $φ \in CLA$ has no free variable and $I \subseteq [0, 1]$ is an interval, then there is $α \in [0, 1]$ such that, as $n$ tends to infinity, the probability that the value of $φ$ is in $I$ tends to $α$.
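
The convergence law for formulas without free variables can be written as a single displayed statement. The notation $\mathbb{P}_n$ for the probability distribution on $\mathbf{W}_n$ and $\mathcal{A}(\varphi)$ for the value of $\varphi$ in a structure $\mathcal{A}$ is assumed here for illustration and is not taken from the abstract.

```latex
\[
  \forall \varphi \in CLA \text{ without free variables},\;
  \forall \text{ intervals } I \subseteq [0,1],\;
  \exists \alpha \in [0,1] : \quad
  \lim_{n \to \infty}
  \mathbb{P}_n\bigl(\{\mathcal{A} \in \mathbf{W}_n : \mathcal{A}(\varphi) \in I\}\bigr)
  = \alpha .
\]
```

Note that $\alpha$ depends on both $\varphi$ and $I$, and that, unlike a classical zero-one law, the limit may be any value in $[0, 1]$.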

2603.14773 2026-05-28 cs.LG cs.AI 版本更新

HO-SFL: Hybrid-Order Split Federated Learning with Backprop-Free Clients and Dimension-Free Aggregation

HO-SFL: 混合阶分割联邦学习,无反向传播客户端与维度无关聚合

Qiyuan Chen, Xian Wu, Yi Wang, Xianhao Chen

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong, Hong Kong SAR, China(电子与计算机工程系,香港大学,香港特别行政区,中国)

AI总结 提出HO-SFL框架,通过拉格朗日框架重构分割学习,服务器执行一阶更新而客户端进行零阶优化,实现无反向传播客户端、维度无关聚合,理论证明收敛速度与一阶方法相当,实验验证通信和内存成本显著降低。

Comments Accepted to ICML 2026

详情
AI中文摘要

在边缘设备上微调大模型受到标准框架(如联邦学习和分割学习)中内存密集型的反向传播(BP)的严重阻碍。虽然用零阶优化替代BP可以显著减少内存占用,但通常会导致收敛速度严重下降。为了解决这一困境,我们提出了混合阶分割联邦学习(HO-SFL)。通过在拉格朗日框架内重构分割学习过程,HO-SFL解耦了优化景观:服务器执行精确的一阶更新(即BP),而客户端进行内存高效的零阶优化。这种混合设计不仅消除了客户端BP的需求,还实现了维度无关的模型聚合,大幅降低了通信成本。关键的是,我们提供了理论收敛分析,证明HO-SFL缓解了零阶优化的维度依赖收敛放缓,实现了与一阶方法相当的收敛速度。在视觉和语言模态任务上的大量实验验证了HO-SFL在实现与一阶基线相当的收敛速度的同时,显著降低了通信成本和客户端内存占用。

英文摘要

Fine-tuning large models on edge devices is severely hindered by the memory-intensive backpropagation (BP) in standard frameworks like federated learning and split learning. While substituting BP with zeroth-order optimization can significantly reduce memory footprints, it typically suffers from prohibitively degraded convergence speed. To resolve this dilemma, we propose Hybrid-Order Split Federated Learning (HO-SFL). By reformulating the split learning process within a Lagrangian framework, HO-SFL decouples the optimization landscape: The server performs precise first-order updates (i.e., BP), whereas clients conduct memory-efficient zeroth-order optimization. This hybrid design not only eliminates the need for client-side BP but also enables dimension-free model aggregation, drastically lowering communication costs. Crucially, we provide a theoretical convergence analysis, demonstrating that HO-SFL mitigates the dimension-dependent convergence slowdown of zeroth-order optimization, achieving a convergence rate comparable to first-order methods. Extensive experiments on tasks across vision and language modalities validate that HO-SFL achieves convergence speeds comparable to first-order baselines while significantly reducing communication costs and client memory footprints.
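
The client-side zeroth-order optimization mentioned above avoids backpropagation by estimating gradients from function evaluations alone. A minimal sketch of the standard two-point estimator follows; it is a generic textbook technique, not the HO-SFL algorithm itself, and the function name, the perturbation size `mu` and the choice of direction `u` are illustrative.

```python
def zo_gradient(f, x, u, mu=1e-3):
    """Two-point zeroth-order gradient estimate of f at x along direction u.

    Returns (scale, estimate), where scale approximates the directional
    derivative of f along u and estimate = scale * u.
    Only two forward evaluations of f are needed; no backpropagation.
    """
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return scale, [scale * ui for ui in u]
```

The estimate is fully determined by the scalar `scale` once the direction `u` is known. If direction vectors can be regenerated from a shared random seed, exchanging that scalar would suffice, which is one plausible route to communication that does not grow with model dimension; the abstract does not state that HO-SFL works this way, so treat this as an assumption.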

2602.20497 2026-05-28 cs.CV cs.AI 版本更新

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

LESA: 可学习的阶段感知预测器用于扩散模型加速

Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对扩散模型计算开销大、现有缓存策略难以适应去噪过程阶段动态变化的问题,提出基于两阶段训练的可学习阶段感知预测器框架,利用KAN网络学习时序特征映射并采用多阶段多专家架构,在保持高质量生成的同时实现显著加速。

Comments Accepted to CVPR 2026

详情
AI中文摘要

扩散模型在图像和视频生成任务中取得了显著成功。然而,扩散Transformer(DiTs)的高计算需求对其实际部署构成了重大挑战。虽然特征缓存是一种有前景的加速策略,但现有基于简单重用或无训练预测的方法难以适应扩散过程中复杂的、阶段相关的动态变化,常常导致质量下降,并无法保持与标准去噪过程的一致性。为解决这一问题,我们提出了一种基于两阶段训练的可学习阶段感知(LESA)预测器框架。我们的方法利用Kolmogorov-Arnold网络(KAN)从数据中准确学习时序特征映射。我们进一步引入了一种多阶段、多专家架构,为不同噪声水平阶段分配专门的预测器,从而实现更精确和鲁棒的特征预测。大量实验表明,我们的方法在保持高保真生成的同时实现了显著加速。实验显示,在FLUX.1-dev上实现了5.00倍加速,质量下降极小(1.0%);在Qwen-Image上实现了6.25倍加速,质量比之前的最优方法(TaylorSeer)提升20.2%;在HunyuanVideo上实现了5.00倍加速,PSNR比TaylorSeer提升24.7%。在文本到图像和文本到视频合成任务上的最先进性能验证了我们基于训练框架在不同模型上的有效性和泛化能力。我们的代码可在https://github.com/caipeiliang2004/LESA获取。

英文摘要

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is available at https://github.com/caipeiliang2004/LESA.

2603.09882 2026-05-28 cs.RO cs.AI 版本更新

Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

杂乱场景中通过动力学感知策略学习涌现的外在灵巧性

Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Galbot Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出动力学感知策略学习框架,通过显式世界建模学习接触诱导物体动力学表示并用于强化学习,使杂乱场景中的外在灵巧性无需手工启发式或复杂奖励塑造即可涌现。

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project page: https://pku-epic.github.io/DAPL/

详情
AI中文摘要

外在灵巧性利用环境接触来克服抓取操作的局限性。然而,在杂乱场景中实现这种灵巧性仍然具有挑战性且未被充分探索,因为它需要选择性地利用多个相互作用的物体之间的接触,而这些物体具有内在耦合的动力学。现有方法缺乏对这种复杂动力学的显式建模,因此在杂乱环境中的非抓取操作方面表现不足,这反过来限制了它们在现实环境中的实际应用。在本文中,我们介绍了一种动力学感知策略学习(DAPL)框架,该框架可以利用在杂乱环境中学习到的接触诱导物体动力学的表示来促进策略学习。这种表示通过显式世界建模学习,并用于条件化强化学习,使得外在灵巧性无需手工制作的接触启发式或复杂的奖励塑造即可涌现。我们在仿真和现实世界中评估了我们的方法。在具有不同密度的未见过的仿真杂乱场景中,我们的方法在成功率上比抓取操作、人类遥操作和基于先前表示的策略高出25%以上。在10个杂乱场景中,现实世界的成功率达到了约50%,而实际杂货部署进一步证明了稳健的仿真到现实迁移和适用性。

英文摘要

Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.

2603.02702 2026-05-28 cs.AI cs.LG 版本更新

FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing

FinTexTS: 基于语义和多层级配对的金融文本-时间序列数据集

Jaehoon Lee, Suhwan Park, Taeyoon Lim, Seunghan Lee, Jun Seo, Dongwan Kang, Hwanil Choi, Minjae Kim, Sungdong Yoo, Soonyoung Lee, Yongjae Lee, Wonbin Ahn

发表机构 * LG AI Research(LG人工智能研究所) Ulsan National Institute of Science and Technology(蔚山国立科学技术研究院)

AI总结 提出基于语义和多层级配对的框架,从SEC文件和新闻中提取并匹配多层级文本信息,构建大规模文本配对的股票价格数据集FinTexTS,提升股价预测性能。

Comments 12 pages, KDD 2026, Datasets and Benchmarks Track

详情
AI中文摘要

金融领域涉及多种重要的时间序列问题。近年来,联合利用文本和数值信息的时间序列分析方法越来越受到关注。因此,人们做出了大量努力来构建金融领域中的文本配对时间序列数据集。然而,金融市场具有复杂的相互依赖性,一家公司的股票价格不仅受公司特定事件的影响,还受其他公司事件和更广泛的宏观经济因素的影响。现有的基于简单关键词匹配的文本与金融时间序列数据配对方法往往无法捕捉这种复杂关系。为了解决这一局限性,我们提出了一种基于语义和多层级的配对框架。具体来说,我们从SEC文件中提取目标公司的特定上下文,并应用基于嵌入的匹配机制,根据该上下文检索语义相关的新闻文章。此外,我们使用大语言模型(LLMs)将新闻文章分为四个层级(宏观层级、行业层级、相关公司层级和目标公司层级),实现新闻文章与目标公司的多层级配对。将该框架应用于公开可用的新闻数据集,我们构建了FinTexTS,这是一个新的大规模文本配对的股票价格数据集。在FinTexTS上的实验结果表明,我们的基于语义和多层级的配对策略在股价预测中是有效的。除了FinTexTS所依赖的公开新闻外,我们还表明,将我们的方法应用于专有但精心策划的新闻源,可以产生更高质量的配对数据,并提高股价预测性能。

英文摘要

The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target company-level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly available news datasets, we construct FinTexTS, a new large-scale text-paired stock price dataset. Experimental results on FinTexTS demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to the publicly available news underlying FinTexTS, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.

2603.05642 2026-05-28 cs.RO cs.AI 版本更新

Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search

基于3D场景图的开放世界交互式物体搜索的关系语义推理

Imen Mahdi, Matteo Cassinelli, Fabien Despinoy, Tim Welschehold, Abhinav Valada

发表机构 * University of Freiburg(弗赖堡大学) Toyota Motor Europe(丰田欧洲公司)

AI总结 提出SCOUT方法,通过从LLM蒸馏的关系探索启发式直接搜索3D场景图,实现高效开放世界交互式物体搜索,性能匹配LLM且计算高效。

详情
AI中文摘要

家庭环境中的开放世界交互式物体搜索需要理解物体与其周围环境之间的语义关系,以有效引导探索。先前的方法要么依赖视觉-语言嵌入相似性,这不能可靠地捕获任务相关的关系语义,要么依赖大型语言模型(LLM),这对于实时部署来说太慢且成本高昂。我们提出SCOUT(Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search,即面向开放世界交互式物体搜索的、基于场景图并带有学习效用的探索),这是一种新颖的方法,通过使用关系探索启发式(如房间-物体包含和物体-物体共现)为房间、前沿和物体分配效用分数,直接搜索3D场景图。为了在不牺牲开放词汇泛化能力的情况下使其实用,我们提出了一种离线程序化蒸馏框架,将LLM中的结构化关系知识提取到轻量级模型中,用于机器人上的推理。此外,我们提出了SymSearch,一个用于评估交互式物体搜索任务中语义推理的可扩展符号基准。在符号和模拟环境中的广泛评估表明,SCOUT优于基于嵌入相似性的方法,并在保持计算效率的同时达到LLM级别的性能。最后,真实世界实验证明了向物理环境的有效迁移,在现实感知和导航约束下实现了开放世界交互式物体搜索。

英文摘要

Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods rely either on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or on large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.
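
The idea of scoring scene-graph nodes with relational heuristics can be sketched as follows. This is a hypothetical illustration, not SCOUT's actual scoring function: the prior tables, the linear combination, the weights and all names are assumptions, and in the paper such relational knowledge is distilled from LLMs into lightweight models rather than stored in hand-written dictionaries.

```python
def utility(target, room, visible_objects, containment, cooccurrence,
            w_room=0.5, w_obj=0.5):
    """Score a room as a place to search for `target`.

    containment[(room, target)]: prior that the room contains the target.
    cooccurrence[(obj, target)]: prior that the target is found near obj.
    Missing entries count as 0.0.
    """
    room_score = containment.get((room, target), 0.0)
    obj_score = max(
        (cooccurrence.get((obj, target), 0.0) for obj in visible_objects),
        default=0.0,
    )
    return w_room * room_score + w_obj * obj_score


def best_room(target, rooms, containment, cooccurrence):
    """Pick the room with the highest utility; `rooms` maps room -> visible objects."""
    return max(
        rooms,
        key=lambda r: utility(target, r, rooms[r], containment, cooccurrence),
    )
```

With priors that favor a mug being in a kitchen and near a coffee machine, `best_room` would send the robot to the kitchen first; the same scoring pattern extends to frontiers and individual objects.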

2603.05425 2026-05-28 cs.CV cs.AI 版本更新

RelaxFlow: Text-Driven Amodal 3D Generation

RelaxFlow: 文本驱动的非模态3D生成

Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

发表机构 * National University of Singapore(新加坡国立大学) Zhejiang University(浙江大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对遮挡下图像到3D生成的语义歧义问题,提出无训练的双分支框架RelaxFlow,通过多先验共识模块和松弛机制解耦控制粒度,实现文本提示引导下对未观察区域的补全,同时严格保留输入观测。

Comments Accepted as a spotlight presentation at ICML 2026. Code: https://github.com/viridityzhu/RelaxFlow

详情
AI中文摘要

图像到3D生成在遮挡下面临固有的语义歧义,仅凭部分观测通常不足以确定物体类别。在这项工作中,我们形式化了文本驱动的非模态3D生成,其中文本提示引导对未观察区域的补全,同时严格保留输入观测。关键的是,我们识别出这些目标需要不同的控制粒度:对观测的刚性控制与对提示的松弛结构控制。为此,我们提出RelaxFlow,一个无训练的双分支框架,通过多先验共识模块和松弛机制解耦控制粒度。理论上,我们证明我们的松弛等价于在生成向量场上应用低通滤波器,抑制高频实例细节以隔离适应观测的几何结构。为便于评估,我们引入了两个诊断基准:ExtremeOcc-3D和AmbiSem-3D。大量实验表明,RelaxFlow成功引导未观察区域的生成以匹配提示意图,同时不损害视觉保真度。

英文摘要

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

2603.04631 2026-05-28 cs.AI 版本更新

Towards automated data analysis: A guided framework for LLM-based risk estimation

迈向自动化数据分析:基于LLM的风险评估引导框架

Panteleimon Rodis

AI总结 提出一个在人类指导和监督下利用大语言模型进行数据集风险评估的框架,通过识别模式、生成聚类代码并解释结果,为自动化风险分析奠定基础。

详情
AI中文摘要

大语言模型(LLMs)正越来越多地集成到关键决策流程中,这一趋势引发了对稳健且自动化数据分析的需求。当前数据集风险分析方法局限于手动审计,涉及耗时且复杂的任务,而基于人工智能(AI)的完全自动化分析则存在幻觉和AI对齐问题。为此,本文提出一个在人类指导和监督下集成生成式AI的数据集风险评估框架,旨在为未来的自动化风险分析范式奠定基础。我们的方法利用LLMs识别数据库模式中的语义和结构属性,随后提出聚类技术,为其生成代码,并最终解释产生的结果。人类监督者指导模型进行所需分析,确保过程完整性和与任务目标的一致性。通过概念验证,展示了该框架在风险评估任务中产生有意义结果的可行性。

英文摘要

Large Language Models (LLMs) are increasingly integrated into critical decision-making pipelines, a trend that raises the demand for robust and automated data analysis. Current approaches to dataset risk analysis are limited to manual auditing methods which involve time-consuming and complex tasks, whereas fully automated analysis based on Artificial Intelligence (AI) suffers from hallucinations and issues stemming from AI alignment. To this end, this work proposes a framework for dataset risk estimation that integrates Generative AI under human guidance and supervision, aiming to set the foundations for a future automated risk analysis paradigm. Our approach utilizes LLMs to identify semantic and structural properties in database schemata, subsequently propose clustering techniques, generate the code for them and finally interpret the produced results. The human supervisor guides the model on the desired analysis and ensures process integrity and alignment with the task's objectives. A proof of concept is presented to demonstrate the feasibility of the framework's utility in producing meaningful results in risk assessment tasks.

2602.22769 2026-05-28 cs.AI cs.LG 版本更新

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

AMA-Bench:评估智能体应用的长时记忆

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

发表机构 * UCSD(加州大学圣地亚哥分校) Recursive

AI总结 提出AMA-Bench基准,通过真实与合成轨迹评估LLM智能体的长时记忆,并基于因果图与工具增强检索提出AMA-Agent系统,在基准上提升11.16%准确率。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作复杂、长时应用中的自主智能体,其中有效的记忆对于持续性能至关重要。然而,现有的记忆基准主要围绕对话,而真实的智能体记忆由连续的智能体-环境交互轨迹组成,包括状态、动作、观察和工具输出。为填补这一空白,我们引入了AMA-Bench(任意长度的智能体记忆),一个在现实智能体设置中评估长时记忆的基准。AMA-Bench结合了来自代表性应用的真实智能体轨迹与专家策划的问答,以及可扩展到任意视野的合成轨迹与基于规则的问答。我们的研究表明,现有记忆系统表现不佳,因为它们未能捕获因果和客观信息,并严重依赖有损的基于相似性的检索。我们进一步提出了AMA-Agent,一个基于因果图构建和工具增强检索的记忆系统。AMA-Agent在AMA-Bench上达到57.22%的准确率,超过最强基线11.16%。资源可在https://ama-bench.github.io/获取。

英文摘要

Large Language Models (LLMs) are increasingly used as autonomous agents in complex, long-horizon applications, where effective memory is critical for sustained performance. Yet existing memory benchmarks are largely dialogue-centric, while real agent memory consists of continuous agent-environment interaction trajectories composed of states, actions, observations, and tool outputs. To address this gap, we introduce AMA-Bench (Agent Memory with Any length), a benchmark for evaluating long-horizon memory in realistic agentic settings. AMA-Bench combines real-world agent trajectories from representative applications with expert-curated QA, as well as synthetic trajectories that scale to arbitrary horizons with rule-based QA. Our study shows that existing memory systems underperform because they fail to capture causal and objective information and rely heavily on lossy similarity-based retrieval. We further propose AMA-Agent, a memory system based on causality-graph construction and tool-augmented retrieval. AMA-Agent achieves 57.22% accuracy on AMA-Bench, outperforming the strongest baseline by 11.16%. Resources are available at https://ama-bench.github.io/.

2602.04898 2026-05-28 cs.CR cs.AI 版本更新

Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

针对文本到图像扩散模型的语义级后门攻击

Tianxin Chen, Wenbo Jiang, Hongqiao Chen, Zhirun Zheng, Cheng Huang

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China(复旦大学计算机科学与人工智能学院,上海,中国) State Key Laboratory of Integrated Services Networks, Xidian University, Xian, China(综合业务网理论及关键技术国家重点实验室,西安电子科技大学,西安,中国) University of Electronic Science and Technology of China, Sichuan, China(电子科技大学,四川,中国) Ajou University, Gyeonggi-do, South Korea(亚洲大学,京畿道,韩国)

AI总结 提出语义级后门攻击(SemBD),通过基于连续语义区域的表示级触发器替代离散文本模式,利用蒸馏编辑交叉注意力层的键和值投影矩阵植入后门,并引入语义正则化和多实体后门目标增强隐蔽性,实现100%攻击成功率并抵御输入级防御。

详情
AI中文摘要

文本到图像(T2I)扩散模型因其强大的生成能力而被广泛采用,但仍易受到后门攻击。现有攻击通常依赖于固定的文本触发器和单实体后门目标,使其极易被基于枚举的输入防御和注意力一致性检测所识破。在这项工作中,我们提出了语义级后门攻击(SemBD),它引入了基于连续语义区域而非离散文本模式的表示级触发器。SemBD通过基于蒸馏的方式编辑交叉注意力层中的键和值投影矩阵来植入此类语义后门,使得语义等价但文本多样的提示能够激活后门。为了进一步增强隐蔽性,SemBD引入了语义正则化以防止在不完整语义下的意外激活,以及避免高度一致交叉注意力模式的多实体后门目标。大量实验表明,SemBD实现了100%的攻击成功率,同时保持了对最先进输入级防御的强鲁棒性。我们的代码可在 https://github.com/DPAS-Lab/SemBD/ 获取。

英文摘要

Text-to-image (T2I) diffusion models are widely adopted for their strong generative capabilities, yet remain vulnerable to backdoor attacks. Existing attacks typically rely on fixed textual triggers and single-entity backdoor targets, making them highly susceptible to enumeration-based input defenses and attention-consistency detection. In this work, we propose Semantic-level Backdoor Attack (SemBD), which introduces representation-level triggers based on continuous semantic regions rather than discrete textual patterns. SemBD implants such semantic backdoors by distillation-based editing of the key and value projection matrices in cross-attention layers, enabling semantically equivalent but textually diverse prompts to activate the backdoor. To further enhance stealthiness, SemBD incorporates a semantic regularization to prevent unintended activation under incomplete semantics, as well as multi-entity backdoor targets that avoid highly consistent cross-attention patterns. Extensive experiments demonstrate that SemBD achieves a 100% attack success rate while maintaining strong robustness against state-of-the-art input-level defenses. Our code is available at https://github.com/DPAS-Lab/SemBD/.

2603.00349 2026-05-28 cs.AI cs.MA 版本更新

COOP$^2$: Defining, Observing, and Repairing Cooperation in LLM Multi-Agent Systems

COOP$^2$: 定义、观察和修复LLM多智能体系统中的合作

Hanqing Yang, Narjes Nourzad, Shiyu Chen, Marie Siew, Jingdi Chen, Carlee Joe-Wong

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Southern California(南加州大学) Singapore University of Technology and Design(新加坡科技设计大学) University of Arizona(亚利桑那大学)

AI总结 提出COOP$^2$框架,通过将高层合作动态与任务进度关联,定义可验证的合作任务,并开发COOP$^2$-Repair方法预测约束失败并引导修复,提升LLM多智能体系统的任务成功率和约束满足度。

详情
AI中文摘要

许多复杂任务需要超出单个智能体能力的持续努力、多样化能力或协调行动。然而,简单地增加更多智能体并不能保证更好的性能,因为有效的合作取决于智能体之间以及智能体与任务结构之间的交互方式,以满足随时间演变的约束。对于基于LLM的多智能体系统(LLM-MAS),这一挑战被放大:计划、消息和修订以自然语言发生,而任务进展依赖于具体环境中的行动。当前的评估大多将合作视为最终任务成功的隐含因素,使得合作以及多智能体交互对任务动态的影响难以研究。我们引入了COOP$^2$,一个评估框架,将LLM-MAS中的高层智能体合作动态与环境中的任务进展联系起来。COOP$^2$定义了具有可验证合作需求的合作任务,使我们能够分析合作如何随时间相对于任务进展展开,以及合作在何处和为何破裂。基于此框架,我们开发了COOP$^2$-Repair,它从群体计划中预测约束失败,并打开有针对性的修复通道以进行引导修订。在两个环境和三种通信结构下,COOP$^2$-Repair提高了任务成功率和约束满足度,同时暴露了修复所需的额外决策开销和通信负载。项目网页见:https://happyeureka.github.io/coop2。

英文摘要

Many complex tasks require extended effort, diverse capabilities, or coordinated actions beyond what a single agent can provide. However, simply adding more agents does not guarantee better performance, as effective cooperation depends on how agents interact with each other and with task structure to satisfy evolving constraints over time. This challenge is amplified for LLM-based multi-agent systems (LLM-MAS): plans, messages, and revisions occur in natural language, whereas task progress depends on grounded environment actions. Current evaluations mostly treat cooperation as an implicit ingredient of final task success, leaving both cooperation and the effect of multi-agent interaction on task dynamics difficult to study. We introduce COOP$^2$, an evaluation framework that grounds high-level agent cooperation dynamics in LLM-MAS within task progress in the environment. COOP$^2$ then defines cooperative tasks with verifiable cooperative requirements, allowing us to analyze how cooperation unfolds over time with respect to task progress, as well as where and why cooperation breaks down. Building on this framework, we develop COOP$^2$-Repair, which predicts constraint failures from group plans and opens targeted repair channels for guided revisions. Across two environments and three communication structures, COOP$^2$-Repair improves task success and constraint satisfaction while exposing the additional decision overhead and communication load required for repair. The project web page can be found at: https://happyeureka.github.io/coop2.

2603.00309 2026-05-28 cs.AI cs.MA 版本更新

DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths

DIG to Heal: 通过可解释的动态决策路径扩展通用智能体协作

Hanqing Yang, Hyungwoo Lee, Yuhang Yao, Zhiwei Liu, Kay Liu, Jingdi Chen, Carlee Joe-Wong

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Arizona(亚利桑那大学) Zoom Salesforce Amazon(亚马逊)

AI总结 提出动态交互图(DIG)框架,将通用LLM智能体的涌现协作建模为时变因果网络,首次实现协作过程的可观察、可解释与实时纠错。

详情
AI中文摘要

日益流行的智能体AI范式有望利用多个通用大语言模型(LLM)智能体的能力协作完成复杂任务。尽管许多智能体AI系统通过预定义工作流或固定智能体角色来降低复杂性,但理想情况是支持真正自主的智能体,能够在多个交互智能体之间实现涌现协作。然而在实践中,这种非结构化交互常常导致冗余工作和级联故障,难以解释或纠正。在这项工作中,我们研究了由通用LLM智能体组成的多智能体系统,这些智能体通过涌现协作解决问题,而不依赖预定义角色、控制流或通信约束。我们引入了动态交互图(DIG),它将涌现协作捕获为智能体激活和交互的时变因果网络。DIG首次使涌现协作变得可观察和可解释,能够直接从智能体的协作路径中实时识别、解释和纠正协作引发的错误模式。因此,DIG填补了理解通用LLM智能体如何在真正智能体化的多智能体系统中共同解决问题的关键空白。项目网页见:https://happyeureka.github.io/dig。

英文摘要

The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems reduce complexity through predefined workflows or fixed agent roles, the ideal is to support truly autonomous agents capable of emergent collaboration across many interacting agents. Yet in practice, such unstructured interactions often lead to redundant work and cascading failures that are difficult to interpret or correct. In this work, we study multi-agent systems composed of general-purpose LLM agents that solve problems through emergent collaboration, without relying on predefined roles, control flows, or communication constraints. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time-evolving causal network of agent activations and interactions. DIG makes emergent collaboration observable and explainable for the first time, enabling real-time identification, explanation, and correction of collaboration-induced error patterns directly from agents' collaboration paths. Thus, DIG fills a critical gap in understanding how general LLM agents solve problems together in truly agentic multi-agent systems. The project webpage can be found at: https://happyeureka.github.io/dig.

2502.17055 2026-05-28 cs.LG cs.AI 版本更新

GradientStabilizer: Fix the Norm, Not the Gradient

GradientStabilizer:固定范数,而非梯度

Tianjin Huang, Zhangyang Wang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Jiaxing Shang, Tianlong Chen, Ke Li, Lu Liu, Qingsong Wen, Shiwei Liu

发表机构 * Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系) Department of Mathematics and Computer Science, Eindhoven University of Technology(埃因霍温理工大学数学与计算机科学系) School of the Gifted Young, University of Science and Technology of China(中国科学技术大学少年班学院) Department of Electrical and Computer Engineering, University of Texas at Austin(德克萨斯大学奥斯汀分校电气与计算机工程系) Department of Computer Science, University of Reading(雷丁大学计算机科学系) School of Cyber Science and Technology, Sun Yat-sen University(中山大学网络科学与技术学院) Department of Computer Science, The University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校计算机科学系) ELLIS Institute Tübingen(图宾根ELLIS研究所) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) Tübingen AI Center, Tübingen, Germany(图宾根人工智能中心,德国图宾根) College of Computer Science, Chongqing University(重庆大学计算机学院)

AI总结 提出GradientStabilizer,一种轻量级梯度变换方法,通过统计稳定的梯度范数估计替换更新幅度,在不改变梯度方向的前提下抑制极端梯度尖峰,从而提升训练稳定性并减少发散。

Comments Accepted by ICML 2026

详情
AI中文摘要

现代深度学习系统中的训练不稳定性通常由罕见但极端的梯度范数尖峰引发,这些尖峰可能导致参数更新过大、破坏优化器状态,并导致缓慢恢复或发散。广泛使用的保护措施如梯度裁剪可以缓解这些故障,但需要调整阈值且不加区分地截断大更新。我们提出GradientStabilizer,一种轻量级、即插即用的梯度变换方法,它在保留瞬时梯度方向的同时,用从运行梯度范数统计中导出的统计稳定估计替换更新幅度。我们证明了在尖峰步骤上,得到的稳定幅度一致有界,与尖峰大小无关,并展示了这种有界性如何控制自适应方法中优化器状态的演化。在LLM预训练(FP16)、量化感知预训练(FP4)、ImageNet分类、强化学习和时间序列预测中,GradientStabilizer一致地提高了训练稳定性,扩大了稳定学习率区域,并相对于基于裁剪的基线减少了发散,甚至显著降低了Adam对权重衰减强度的敏感性。代码即将发布。

英文摘要

Training instability in modern deep learning systems is frequently triggered by rare but extreme gradient-norm spikes, which can induce oversized parameter updates, corrupt optimizer state, and lead to slow recovery or divergence. Widely used safeguards such as gradient clipping mitigate these failures but require threshold tuning and indiscriminately truncate large updates. We propose GradientStabilizer, a lightweight, drop-in gradient transform that preserves the instantaneous gradient direction while replacing the update magnitude with a statistically stabilized estimate derived from running gradient-norm statistics. We prove that the resulting stabilized magnitude is uniformly bounded on spike steps, independent of the spike size, and show how this boundedness controls optimizer state evolution in adaptive methods. Across LLM pre-training (FP16), quantization-aware pre-training (FP4), ImageNet classification, reinforcement learning, and time-series forecasting, GradientStabilizer consistently improves training stability, widens stable learning-rate regions, and reduces divergence relative to clipping-based baselines, even substantially reducing Adam's sensitivity to weight-decay strength. Code will be released soon.
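One way to picture a direction-preserving magnitude replacement is sketched below in plain Python. This is an illustration only and not the paper's algorithm: the abstract does not state the update rule, so the Adam-style ratio of running first and second moments of the gradient norm, the class name `NormStabilizer`, and the constants `beta1`, `beta2`, and `eps` are all assumptions, picked because that ratio stays bounded however large a single spike is.

```python
import math

class NormStabilizer:
    """Hypothetical direction-preserving gradient transform (not the paper's rule)."""

    def __init__(self, beta1=0.9, beta2=0.99, eps=1e-12):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = 0.0  # running mean of gradient norms
        self.v = 0.0  # running mean of squared gradient norms

    def __call__(self, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            return list(grad)  # nothing to rescale; statistics are left untouched
        self.m = self.beta1 * self.m + (1 - self.beta1) * norm
        self.v = self.beta2 * self.v + (1 - self.beta2) * norm * norm
        # Replace the magnitude with the moment ratio; keep the unit direction.
        magnitude = self.m / (math.sqrt(self.v) + self.eps)
        return [g / norm * magnitude for g in grad]

stab = NormStabilizer()
for _ in range(50):
    stab([0.6, 0.8])        # ordinary steps with gradient norm 1
spike = stab([6e5, 8e5])    # a single step with gradient norm 1e6
print(math.hypot(*spike))   # the rescaled norm stays of order 1
```

Under these assumptions the output norm is capped by a constant that depends only on `beta1` and `beta2` (by the Cauchy-Schwarz inequality, provided `beta1**2 < beta2`), which mirrors the kind of spike-independent bound the abstract describes without claiming to reproduce it.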

2602.22873 2026-05-28 math.AT cs.AI cs.CG 版本更新

Learning Tangent Bundles and Characteristic Classes with Autoencoder Atlases

使用自编码器图谱学习切丛和示性类

Eduardo Paluzo-Hidalgo, Yuichi Ike

发表机构 * Department of Applied Mathematics I, University of Sevilla(塞维利亚大学应用数学一系) Graduate School of Mathematical Sciences, The University of Tokyo(东京大学数学科学研究生院)

AI总结 本文提出一个理论框架,将流形学习中的多图自编码器与向量丛和示性类的经典理论联系起来,通过自编码器图谱定义转移映射并计算第一Stiefel-Whitney类,从而检测数据可定向性。

详情
AI中文摘要

我们引入了一个理论框架,将流形学习中的多图自编码器与向量丛和示性类的经典理论联系起来。我们不将自编码器视为产生单个全局欧几里得嵌入,而是将一组局部训练的编码器-解码器对视为流形上的学习图谱。我们证明,任何重建一致的自编码器图谱都能典范地定义满足上循环条件的转移映射,并且将这些转移映射线性化会得到一个向量丛,当潜在维度与流形的内在维度匹配时,该向量丛与切丛一致。这种构造提供了对数据的微分拓扑不变量的直接访问。特别地,我们证明第一Stiefel-Whitney类可以从学习到的转移映射的雅可比行列式的符号计算出来,从而得到检测可定向性的算法准则。我们还证明,非平凡的示性类对单图表示构成障碍,并且自编码器图的最小数量由流形的良好覆盖结构决定。最后,我们将我们的方法应用于低维可定向和不可定向流形,以及一个不可定向的高维图像数据集。

英文摘要

We introduce a theoretical framework that connects multi-chart autoencoders in manifold learning with the classical theory of vector bundles and characteristic classes. Rather than viewing autoencoders as producing a single global Euclidean embedding, we treat a collection of locally trained encoder-decoder pairs as a learned atlas on a manifold. We show that any reconstruction-consistent autoencoder atlas canonically defines transition maps satisfying the cocycle condition, and that linearising these transition maps yields a vector bundle coinciding with the tangent bundle when the latent dimension matches the intrinsic dimension of the manifold. This construction provides direct access to differential-topological invariants of the data. In particular, we show that the first Stiefel-Whitney class can be computed from the signs of the Jacobians of learned transition maps, yielding an algorithmic criterion for detecting orientability. We also show that non-trivial characteristic classes provide obstructions to single-chart representations, and that the minimum number of autoencoder charts is determined by the good cover structure of the manifold. Finally, we apply our methodology to low-dimensional orientable and non-orientable manifolds, as well as to a non-orientable high-dimensional image dataset.
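The orientability criterion can be sketched combinatorially. Assume every pair of overlapping charts has a connected overlap (as in a good cover) and that the sign of the Jacobian determinant of each transition map has already been recorded. The first Stiefel-Whitney class then vanishes exactly when each chart can be assigned a flip in {+1, -1} that makes every transition sign positive. The function name `is_orientable` and the toy sign data are illustrative; how the paper estimates the signs from trained autoencoders is not reproduced here.

```python
from collections import deque

def is_orientable(num_charts, overlap_signs):
    """overlap_signs maps (i, j) -> +1 or -1, the Jacobian-determinant sign on the overlap of charts i and j."""
    adjacency = {i: [] for i in range(num_charts)}
    for (i, j), sign in overlap_signs.items():
        adjacency[i].append((j, sign))
        adjacency[j].append((i, sign))
    flip = {}  # chart index -> +1 (keep orientation) or -1 (reverse it)
    for start in range(num_charts):
        if start in flip:
            continue
        flip[start] = 1
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j, sign in adjacency[i]:
                wanted = flip[i] * sign
                if j not in flip:
                    flip[j] = wanted
                    queue.append(j)
                elif flip[j] != wanted:
                    return False  # the sign cocycle is not a coboundary
    return True

# Three charts arranged in a cycle; one reversed transition gives a Moebius-like atlas.
cylinder = {(0, 1): 1, (1, 2): 1, (2, 0): 1}
mobius = {(0, 1): 1, (1, 2): 1, (2, 0): -1}
print(is_orientable(3, cylinder), is_orientable(3, mobius))
```

The check is a breadth-first two-colouring of the chart overlap graph, so it runs in time linear in the number of overlaps.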

2602.22787 2026-05-28 cs.CL cs.AI 版本更新

Probing for Knowledge Attribution in Large Language Models

探测大型语言模型中的知识归因

Ivo Brink, Alexander Boer, Dennis Ulmer

发表机构 * KPMG NL(KPMG荷兰分公司) University of Amsterdam(阿姆斯特丹大学)

AI总结 本文通过线性探针从隐藏表示中分类大型语言模型输出的主导知识来源(记忆或上下文),并引入自监督流水线AttriWiki生成训练数据,在多个模型和数据集上达到高F1分数。

详情
AI中文摘要

大型语言模型(LLM)的幻觉,即流畅但事实不正确的生成,分为两类:忠实性违反,即模型误用提供的上下文;以及事实性违反,即答案反映内部知识中的错误。适当的缓解取决于知道哪个来源驱动每个答案。我们研究贡献性归因,即对每个输出背后的主导知识来源进行分类,并表明在隐藏表示上训练的简单线性探针可以可靠地识别它。我们引入了AttriWiki,一个自监督流水线,通过提示模型从记忆中回忆被隐藏的实体或从上下文中读取它们,而不依赖知识冲突,自动生成标记的训练数据。在AttriWiki上训练的探针在Llama-3.1-8B、Mistral-7B和Qwen-7B上达到高达0.96的Macro-$F_1$,迁移到SQuAD和WebQuestions时达到0.94-0.99的Macro-$F_1$,并零样本泛化到Tighidet等人(2024)的基准,在冲突设置上无需重新训练即优于他们的探针。此外,归因不匹配会使错误率提高高达70%,尽管正确的归因并不能保证正确的答案,这表明需要更广泛的检测框架。

英文摘要

Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations, where the model misuses provided context, and factuality violations, where answers reflect errors in internal knowledge. Proper mitigation depends on knowing which source drives each answer. We study contributive attribution, i.e. the classification of the dominant knowledge source behind each output, and show that a simple linear probe trained on hidden representations can reliably identify it. We introduce AttriWiki, a self-supervised pipeline that automatically generates labelled training data by prompting models to recall withheld entities from memory or read them from context without relying on knowledge conflicts. Probes trained on AttriWiki achieve up to 0.96 Macro-$F_1$ on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transfer to SQuAD and WebQuestions with 0.94-0.99 Macro-$F_1$, and generalise zero-shot to Tighidet et al. (2024)'s benchmark, outperforming their probe on conflicting settings without retraining. Furthermore, attribution mismatches raise error rates by up to 70%, though correct attribution does not guarantee correct answers, pointing to the need for broader detection frameworks.

2602.18647 2026-05-28 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

Noise Scheduling as Information-Guided Allocation in Diffusion Training

噪声调度作为扩散训练中的信息引导分配

Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni

发表机构 * Tilburg University & JADS(蒂尔堡大学及JADS) Sony AI(索尼人工智能) University of Cambridge(剑桥大学) Radboud University(拉德堡德大学) Sony Group Corporation(索尼集团公司)

AI总结 提出InfoNoise,一种在线自适应噪声调度方法,通过估计条件熵率剖面动态调整训练噪声分布,以优化去噪任务中的信息增益,在图像、DNA和语言生成等任务中达到或超越基线,并节省高达3倍训练计算量。

详情
AI中文摘要

我们引入了InfoNoise,一种用于扩散训练的在线自适应噪声调度,它将优化努力重新分配到去噪最具信息量的噪声水平上。与损失加权一起,噪声调度在去噪问题之间诱导出有效的分配,而这种分配通常在知道信息性噪声水平之前就已固定。InfoNoise通过从训练期间的去噪损失中估计条件熵率剖面,使这种分配具有数据自适应性,无需辅助模型或离线搜索。通过I-MMSE,该剖面识别出噪声观测在何处能快速减少关于干净样本的不确定性,并指导训练噪声分布的适应。它只改变这个分布,保持目标、加权和参数化不变。在图像基准测试中,调度已被广泛调整,InfoNoise匹配或略微超过强基线,并且可以用更少的更新达到相同的质量。在表示、序列和模态转换(包括DNA和语言生成)上,InfoNoise优于固定和自适应基线,并且达到目标质量所需的训练计算量最多减少3倍。这些结果确立了条件熵率剖面作为噪声调度设计的数据依赖目标,并使在线自适应成为手动调度搜索的实用替代方案。

英文摘要

We introduce InfoNoise, an online adaptive noise schedule for diffusion training that reallocates optimization effort toward noise levels where denoising is most informative. Together with loss weighting, a noise schedule induces an effective allocation across denoising problems, often fixed before informative noise levels are known. InfoNoise makes this allocation data-adaptive by estimating a conditional-entropy-rate profile from denoising losses during training, without auxiliary models or offline search. Through I-MMSE, this profile identifies where noisy observations rapidly reduce uncertainty about the clean sample and guides adaptation of the training noise distribution. It changes only this distribution, keeping the objective, weighting, and parameterization fixed. On image benchmarks, where schedules have been extensively tuned, InfoNoise matches or slightly exceeds strong baselines and can reach the same quality with fewer updates. On representation, sequence, and modality shifts, including DNA and language generation, InfoNoise improves over fixed and adaptive baselines and reaches target quality with up to $3\times$ less training compute. These results establish the conditional-entropy-rate profile as the data-dependent target for noise schedule design and make online adaptation a practical alternative to manual schedule search.
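For reference, the I-MMSE relation invoked in the abstract has the following standard form for a scalar Gaussian channel (due to Guo, Shamai, and Verdú). The notation here ($\mathrm{snr}$ for the signal-to-noise ratio, $Z$ for standard Gaussian noise independent of $X$, information measured in nats) is the textbook one and is an assumption; the paper's own parameterization of noise levels may differ.

```latex
\frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}}\,
  I\!\left(X;\ \sqrt{\mathrm{snr}}\,X + Z\right)
  = \frac{1}{2}\,\mathrm{mmse}(\mathrm{snr}),
\qquad
\mathrm{mmse}(\mathrm{snr})
  = \mathbb{E}\!\left[\left(X - \mathbb{E}\!\left[X \,\middle|\, \sqrt{\mathrm{snr}}\,X + Z\right]\right)^{2}\right].
```

Read loosely, the identity says that the information about the clean sample changes fastest at the noise levels where the optimal denoising error is largest, which suggests why a profile estimated from denoising losses can serve as a proxy for an information profile.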

2602.18481 2026-05-28 q-fin.TR cs.AI 版本更新

AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models

AlphaForgeBench:用大型语言模型对端到端交易策略设计进行基准测试

Wentao Zhang, Mingxuan Zhao, Jincheng Gao, Jieshun You, Huaiyu Jia, Yilei Zhao, Bo An, Shuo Sun

发表机构 * Nanyang Technological University(南洋理工大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Hong Kong Polytechnic University(香港理工大学)

AI总结 提出AlphaForgeBench框架,将LLM从随机交易代理重新定义为量化研究员,通过生成可执行alpha因子和基于因子的交易策略,消除执行不稳定,实现可复现的金融推理评估。

详情
AI中文摘要

大型语言模型(LLMs)的快速发展催生了大量金融基准测试,从静态知识评估演变为交互式交易模拟。然而,现有的实时交易评估框架在很大程度上忽略了一个关键的失败模式:LLMs在金融不确定性下的序贯决策中表现出严重的行为不稳定性。通过大量实验,我们表明,当作为交易代理部署时,LLMs表现出极端的运行间方差,即使在确定性解码下也会产生不一致的动作序列,并且经常在相邻时间步产生不合理的动作翻转。我们将这些行为归因于LLMs的无状态自回归特性,它们缺乏对先前动作的持久记忆,以及它们对投资组合分配任务中连续到离散动作映射的敏感性。这些缺陷从根本上破坏了现有许多在线和离线交易基准的可靠性和可复现性。为了解决这些局限性,我们提出了AlphaForgeBench,一个原则性的评估框架,将LLMs重新定义为量化研究员而非随机交易代理。AlphaForgeBench不要求模型产生离散的交易动作,而是要求模型生成可执行的alpha因子,并基于金融知识构建基于因子的交易策略。这种范式将推理与执行机制解耦,实现了确定性和可复现的评估,同时与真实的量化研究工作流程保持一致。在多个最先进的LLM上进行的大量实验表明,AlphaForgeBench消除了执行引起的不稳定性,并为评估金融推理、策略制定和alpha发现提供了严格的基准。网页链接:https://finbrain-lab-hkustgz.github.io/AlphaForgeBench

英文摘要

The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge evaluation toward interactive trading simulations. However, existing frameworks for evaluating real-time trading largely overlook a critical failure mode: the severe behavioral instability of LLMs in sequential decision-making under financial uncertainty. Through extensive experiments, we show that when deployed as trading agents, LLMs exhibit extreme run-to-run variance, generate inconsistent action sequences even under deterministic decoding, and frequently produce irrational action flipping across adjacent time steps. We attribute these behaviors to the stateless autoregressive nature of LLMs, which lack persistent memory of prior actions, together with their sensitivity to continuous-to-discrete action mappings in portfolio allocation tasks. These deficiencies fundamentally undermine the reliability and reproducibility of many existing online and offline trading benchmarks. To address these limitations, we propose AlphaForgeBench, a principled evaluation framework that redefines LLMs as quantitative researchers rather than stochastic trading agents. Instead of producing discrete trading actions, AlphaForgeBench requires models to generate executable alpha factors and compose factor-based trading strategies grounded in financial knowledge. This paradigm decouples reasoning from execution mechanics, enabling deterministic and reproducible evaluation while remaining aligned with real-world quantitative research workflows. Extensive experiments across multiple state-of-the-art LLMs demonstrate that AlphaForgeBench eliminates execution-induced instability and provides a rigorous benchmark for evaluating financial reasoning, strategy formulation, and alpha discovery. Webpage at https://finbrain-lab-hkustgz.github.io/AlphaForgeBench

2602.17003 2026-05-28 cs.CL cs.AI 版本更新

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Persona2Web: 基于用户历史进行上下文推理的个性化Web智能体基准

Serin Kim, Sangam Lee, Dongha Lee

发表机构 * Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea(人工智能系,延世大学,首尔,大韩民国)

AI总结 提出Persona2Web基准,通过澄清-个性化原则评估Web智能体在真实开放网络中利用用户历史解决模糊查询的个性化能力,并引入推理感知评估框架。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型推动了Web智能体的发展,但当前的智能体缺乏个性化能力。由于用户很少明确说明其意图的每个细节,实用的Web智能体必须能够通过推断用户偏好和上下文来解释模糊查询。为应对这一挑战,我们提出了Persona2Web,这是首个在真实开放网络上评估个性化Web智能体的基准,基于澄清-个性化原则构建,要求智能体根据用户历史而非依赖显式指令来解决歧义。Persona2Web包括:(1) 在长时间跨度内隐含揭示偏好的用户历史,(2) 需要智能体推断隐含用户偏好的模糊查询,以及(3) 一个推理感知评估框架,能够对个性化进行细粒度评估。我们针对各种智能体架构、骨干模型、历史访问方案和不同模糊程度的查询进行了广泛实验,揭示了个性化Web智能体行为中的关键挑战。为便于复现,我们的代码和数据集公开在 https://serin-kimm.github.io/Persona2Web/。

英文摘要

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://serin-kimm.github.io/Persona2Web/.

2602.15515 2026-05-28 cs.LG cs.AI 版本更新

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

混淆图谱:使用欺骗探针映射RLVR中诚实出现的位置

Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave, Chris Cundy

AI总结 本文通过构建一个自然产生奖励黑客行为的编码环境,研究在对抗白盒欺骗检测器训练时模型出现的混淆策略,并引入分类法分析诚实、混淆激活和混淆策略三种结果。

Comments Accepted at ICML 2026 (Oral presentation). 30 pages, 14 figures

详情
AI中文摘要

针对白盒欺骗检测器的训练已被提议作为使AI系统诚实的一种方法。然而,这种训练存在模型学习混淆其欺骗行为以逃避检测器的风险。先前的工作仅在人为设置中研究混淆,其中模型因有害输出而直接获得奖励。我们构建了一个真实的编码环境,其中通过硬编码测试用例的奖励黑客行为自然发生,并表明混淆在此环境中出现。我们引入了在对抗欺骗检测器训练时可能结果的分类法。模型要么保持诚实,要么通过两种可能的混淆策略变得欺骗。(i)混淆激活:模型输出欺骗性文本,同时修改其内部表示以不再触发检测器。(ii)混淆策略:模型输出逃避检测器的欺骗性文本,通常包括对奖励黑客行为的理由。实验上,混淆激活源于RL期间的表示漂移,无论是否有检测器惩罚。检测器惩罚仅激励混淆策略;我们从理论上表明,对于策略梯度方法,这是预期的。足够高的KL正则化和检测器惩罚可以产生诚实策略,从而确立白盒欺骗检测器作为易受奖励黑客行为任务的有效训练信号。

英文摘要

Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector. The model either remains honest, or becomes deceptive via two possible obfuscation strategies. (i) Obfuscated activations: the model outputs deceptive text while modifying its internal representations to no longer trigger the detector. (ii) Obfuscated policy: the model outputs deceptive text that evades the detector, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The detector penalty only incentivizes obfuscated policies; we theoretically show this is expected for policy gradient methods. Sufficiently high KL regularization and detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.

2602.15198 2026-05-28 cs.MA cs.AI cs.CL 版本更新

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Colosseum: 审计合作多智能体系统中的合谋行为

Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校) University of Virginia(弗吉尼亚大学) ELLIS Institute Tübingen(图宾根ELLIS研究所) MPI for Intelligent Systems, Tübingen(图宾根智能系统研究所) AI Center(人工智能中心)

AI总结 提出Colosseum框架,通过形式化决策框架和基于遗憾的度量审计LLM智能体在合作多智能体系统中的合谋行为,发现大多数模型存在新兴合谋倾向,并观察到“纸上合谋”现象。

详情
AI中文摘要

多智能体系统中,通过自由形式语言通信的LLM智能体能够实现复杂的协调以解决复杂的合作任务。当一组智能体形成联盟并合谋追求次要目标、降低联合目标时,这会产生独特的安全问题。在本文中,我们提出Colosseum,一个用于审计多智能体设置中LLM智能体合谋行为的框架。我们通过形式化的多智能体决策框架来理解智能体如何合作,并通过相对于合作最优的遗憾来度量基于行动的合谋行为,并将其与基于通信的合谋行为进行比较。Colosseum能够在良性设置、不同联盟目标、说服策略和网络拓扑下审计LLM智能体的合谋行为。然后,我们通过创建智能体之间的秘密通信渠道引入一种新的行为探针,表明大多数开箱即用的模型在此探针下表现出合谋倾向,我们称之为新兴合谋。此外,我们发现了“纸上合谋”现象,即智能体在文本中计划合谋但往往选择非合谋行动。Colosseum提供了一种审计合作多智能体系统中合谋的新方法,同时呈现了关于合谋如何出现、什么影响合谋效率以及哪些策略可能缓解合谋的观察。

英文摘要

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when a group of agents forms a coalition and colludes to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents' collusive behavior in multi-agent settings. We ground how agents cooperate through a formal multi-agent decision-making framework, measure action-based collusive behavior via regret relative to the cooperative optimum, and compare it with communication-based collusive behavior. Colosseum enables audits of LLM agents for collusion under benign settings, different coalition objectives, persuasion tactics, and network topologies. We then introduce a new behavioral probe by creating secret communication channels between agents, showing that most out-of-the-box models exhibit a propensity to collude under this probe, which we term emergent collusion. Furthermore, we discover "collusion on paper", where agents plan to collude in text but often pick non-collusive actions. Colosseum provides a new way to audit collusion in cooperative multi-agent systems while presenting observations about how collusion emerges, what affects collusion efficacy, and which strategies may mitigate it.

2602.14862 2026-05-28 stat.ML cs.AI cs.IT cs.LG math.IT stat.ME 版本更新

The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

温度缩放分类器:温度缩放的一些基本性质

Pierre-Alexandre Mattei, Bruno Loureiro

发表机构 * Université Côte d’Azur, Inria, CNRS, LJAD, France(法国蔚蓝海岸大学、法国国家信息与自动化研究所、法国国家科学研究中心、LJAD实验室) Département d’Informatique, École Normale Supérieure - PSL, CNRS, France(法国巴黎高等师范学院(PSL)计算机系、法国国家科学研究中心)

AI总结 本文通过信息投影和线性缩放子模型等新视角,严格分析了温度缩放对分类器校准和LLM多样性的影响,证明升温普遍增加不确定性但质疑其增加多样性的说法。

详情
AI中文摘要

温度缩放是一种简单的方法,可以控制概率模型的不确定性。它主要用于两个场景:改进分类器的校准和调节大型语言模型(LLM)的随机性。在这两种情况下,温度缩放都是最流行的方法。尽管其流行,但温度缩放性质的严格理论分析仍然难以捉摸。我们在此研究其中一些性质。对于分类,我们表明提高温度在非常普遍的意义上增加了模型的不确定性(特别是增加了其熵)。然而,对于LLM,我们质疑了提高温度会增加多样性的常见说法。此外,我们引入了温度缩放的两种新表征。第一种是几何的:温度缩放模型被证明是原始模型在具有给定熵的模型集合上的信息投影。第二种表征阐明了温度缩放作为更一般线性缩放器(如矩阵缩放和狄利克雷校准)的子模型的作用:我们表明温度缩放是唯一不改变模型硬预测的线性缩放器。

英文摘要

Temperature scaling is a simple method for controlling the uncertainty of probabilistic models. It is mostly used in two contexts: improving the calibration of classifiers and tuning the stochasticity of large language models (LLMs). In both cases, temperature scaling is the most popular method for the job. Despite its popularity, a rigorous theoretical analysis of the properties of temperature scaling has remained elusive. We investigate here some of these properties. For classification, we show that increasing the temperature increases the uncertainty in the model in a very general sense (and in particular increases its entropy). However, for LLMs, we challenge the common claim that increasing temperature increases diversity. Furthermore, we introduce two new characterisations of temperature scaling. The first one is geometric: the tempered model is shown to be the information projection of the original model onto the set of models with a given entropy. The second characterisation clarifies the role of temperature scaling as a submodel of more general linear scalers such as matrix scaling and Dirichlet calibration: we show that temperature scaling is the only linear scaler that does not change the hard predictions of the model.
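Two of the classification claims (entropy rises with temperature, hard predictions stay fixed) can be checked numerically on a toy example. The sketch below is in plain Python and assumes the usual definition of temperature scaling as a softmax over logits divided by T. The function names `tempered_softmax` and `entropy` and the example logits are illustrative, not taken from the paper.

```python
import math

def tempered_softmax(logits, temperature):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 0.5, -1.0]  # hypothetical logits for a 3-class problem
for t in (0.5, 1.0, 2.0, 4.0):
    probs = tempered_softmax(logits, t)
    top = max(range(len(probs)), key=probs.__getitem__)
    print(f"T={t}: argmax={top}, entropy={entropy(probs):.4f}")
```

On this example the printed entropy grows with T while the argmax index never changes, which matches the abstract's statements for a single input. It says nothing about the LLM diversity question, which concerns sampled sequences and not a single distribution.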

2602.13524 2026-05-28 cs.LG cs.AI 版本更新

Singular Vectors of Attention Heads Align with Features

注意力头的奇异向量与特征对齐

Gabriel Franco, Carson Loughridge, Mark Crovella

发表机构 * Department of Computer Science, Boston University, Boston, USA Faculty of Computing \& Data Sciences, Boston University, Boston, USA

AI总结 本文通过理论分析和实验验证,解释了注意力头奇异向量与特征表示对齐的原因和条件,并提出了稀疏注意力分解作为对齐的可检验预测。

Comments To be published in ICML 2026

详情
AI中文摘要

识别语言模型中的特征表示是机械可解释性的核心任务。最近的一些研究观察到,在某些情况下,可以从注意力矩阵的奇异向量中推断出特征表示。然而,这一现象缺乏合理的解释。本文探讨了这个问题:为什么以及何时奇异向量与特征对齐?首先,我们证明在可以直接观察特征的模型中,奇异向量与特征稳健地对齐。然后,我们从理论上表明,这种对齐在多种条件下是预期的。最后,我们提出如何在特征表示不可直接观察的真实模型中操作性地识别对齐。我们将稀疏注意力分解确定为对齐的一个可检验预测,并展示证据表明它在真实模型中以与预测一致的方式出现。这些结果共同表明,奇异向量与特征的对齐可以作为语言模型中特征识别的合理且有理论依据的基础。

英文摘要

Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made the observation that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this phenomenon is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges in real models in a manner consistent with predictions. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.

2602.12586 2026-05-28 cs.AI 版本更新

Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models

能给我你的订单吗?扩散语言模型中插槽填充顺序的蒙特卡洛树搜索

Joshua Ong Jun Leang, Yu Zhao, Mihaela Cătălina Stoian, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

发表机构 * Imperial College London(伦敦帝国理工学院) University of Edinburgh(爱丁堡大学)

AI总结 针对掩码扩散模型(MDM)中计划-填充解码对插槽填充顺序敏感的问题,提出McDiffuSE框架,利用蒙特卡洛树搜索(MCTS)优化生成顺序,平均性能提升3.2%,在MBPP和MATH500上分别提升19.5%和4.9%。

Comments 8 pages, ICML 2026

详情
AI中文摘要

虽然掩码扩散模型(MDM)中的计划-填充解码在数学和代码推理方面显示出潜力,但其性能对插槽填充顺序高度敏感,常常导致输出方差较大。我们引入了McDiffuSE框架,该框架将插槽选择形式化为决策问题,并通过蒙特卡洛树搜索(MCTS)优化填充顺序。McDiffuSE在提交前使用前瞻模拟评估部分完成情况,系统地探索生成顺序的组合空间。实验表明,与自回归基线相比平均提升3.2%,与基线计划-填充相比提升8.0%,在MBPP和MATH500上分别显著提升19.5%和4.9%。我们的分析揭示,虽然McDiffuSE主要遵循顺序生成,但引入非顺序生成对于最大化性能至关重要。我们观察到,需要更大的探索常数而非增加模拟次数,以克服模型置信度偏差并发现有效的顺序。这些发现确立了基于MCTS的规划作为提升MDM生成质量的有效方法。

英文摘要

While plan-and-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains highly sensitive to slot infilling order, often yielding substantial output variance. We introduce McDiffuSE, a framework that formulates slot selection as decision making and optimises infilling orders through Monte Carlo Tree Search (MCTS). McDiffuSE uses look-ahead simulations to evaluate partial completions before commitment, systematically exploring the combinatorial space of generation orders. Experiments show an average improvement of 3.2% over autoregressive baselines and 8.0% over baseline plan-and-infill, with notable gains of 19.5% on MBPP and 4.9% on MATH500. Our analysis reveals that while McDiffuSE predominantly follows sequential ordering, incorporating non-sequential generation is essential for maximising performance. We observe that larger exploration constants, rather than increased simulations, are necessary to overcome model confidence biases and discover effective orderings. These findings establish MCTS-based planning as an effective approach for enhancing generation quality in MDMs.
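The remark about exploration constants can be illustrated with the standard UCT selection rule used in many MCTS implementations. This is a generic sketch and not McDiffuSE's code: the function `uct_select`, the slot names, the visit statistics, and the two values of `c` are made up for illustration, and the paper's actual selection rule may differ.

```python
import math

def uct_select(children, parent_visits, c):
    """children: list of (name, total_value, visits); returns the name with the best UCT score."""
    def score(child):
        _, total_value, visits = child
        if visits == 0:
            return math.inf  # always try unvisited children first
        exploit = total_value / visits
        explore = c * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore
    return max(children, key=score)[0]

# Hypothetical statistics: the next slot in sequence looks best so far,
# while a non-sequential slot has barely been tried.
children = [("next_slot", 16.0, 20), ("distant_slot", 1.4, 2)]
for c in (0.05, 1.0):
    print(c, uct_select(children, parent_visits=22, c=c))
```

With the small constant the search keeps choosing the well-explored sequential slot, and with the larger one the bonus for rarely visited children is enough to send it to the non-sequential slot. That is one plausible reading of why a larger constant could help a search escape confidence bias without more simulations.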

2602.01992 2026-05-28 cs.AI 版本更新

Emergent Analogical Reasoning in Transformers

Transformer中的涌现类比推理

Gouki Minegishi, Jingyuan Feng, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学) Google DeepMind(谷歌DeepMind)

AI总结 本研究通过范畴论中的函子概念形式化类比推理,设计合成任务探究Transformer中类比推理的涌现机制,发现其依赖于数据特征、优化选择和模型规模,并通过机制分析揭示几何对齐和函子应用两个关键组件。

Comments Accepted to ICML2026 (spotlight)

详情
AI中文摘要

类比是人类智能的核心能力,使得在一个领域中发现的抽象模式能够应用于另一个领域。尽管类比在认知中占据核心地位,但Transformer获取并实现类比推理的机制仍知之甚少。受范畴论中函子概念的启发,我们将类比推理形式化为跨类别实体间对应关系的推断。基于这一表述,我们引入了在受控设置下评估类比推理涌现的合成任务。我们发现,类比推理的涌现对数据特征、优化选择和模型规模高度敏感。通过机制分析,我们展示了Transformer中的类比推理分解为两个关键组件:(1) 嵌入空间中关系结构的几何对齐,以及(2) Transformer内部函子的应用。这些机制使模型能够将关系结构从一个类别转移到另一个类别,从而实现类比。最后,我们量化了这些效应,并在预训练的大型语言模型中观察到了相同的趋势。通过这样做,我们将类比从一个抽象的认知概念转变为现代神经网络中一个具体的、基于机制的现象。

英文摘要

Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite its central role in cognition, the mechanisms by which Transformers acquire and implement analogical reasoning remain poorly understood. In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories. Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings. We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale. Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components: (1) geometric alignment of relational structure in the embedding space, and (2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy. Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs. In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.

2511.18894 2026-05-28 cs.CV cs.AI 版本更新

Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels

并非所有像素都平等:面向含噪标签医学分割的像素级元学习

Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng

发表机构 * Xidian University(西安电子科技大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出MetaDCSeg框架,通过动态学习像素级权重并引入动态中心距离机制建模边界不确定性,抑制噪声标签影响并提升边界分割性能。

详情
AI中文摘要

医学图像分割对于临床应用至关重要,但常常受到噪声标注和模糊解剖边界的干扰,限制了其在现实场景中的应用。现有方法通常直接适应为实例分类设计的噪声标签学习技术,忽视了医学分割中像素级异质性及其空间和解剖上的难度差异。因此,全局假设或简单的置信度指标无法解决这些局部变化,导致边界模糊问题未得到解决。为解决这一问题,我们提出MetaDCSeg,一个鲁棒的框架,动态学习最优像素级权重以抑制噪声标签的影响,同时保留可靠标注。通过动态中心距离(DCD)机制显式建模边界不确定性,我们的方法利用前景、背景和边界中心的加权特征距离,引导模型关注模糊边界附近的难分割像素。该策略能够更精确地处理结构边界(这些边界常被现有方法忽略),并显著提升分割性能。在四个不同噪声水平的基准数据集上的大量实验表明,MetaDCSeg优于现有最先进方法。

英文摘要

Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, limiting its application in real-world scenarios. Existing methods often directly adapt noisy label learning techniques designed for instance classification, overlooking the pixel-wise heterogeneity in medical segmentation with its spatially and anatomically varying difficulties. Consequently, global assumptions or simple confidence metrics fail to address these local variations, leaving boundary ambiguities unresolved. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg outperforms existing state-of-the-art methods.

2403.11852 2026-05-28 cs.RO cs.AI 版本更新

Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency

考虑随机通信延迟的高速公路匝道合流延迟感知强化学习

Amin Tabrizian, Zhitong Huang, Arsyi Aziz, Peng Wei

发表机构 * Department of Computer Science, George Washington University, Washington, D.C.(计算机科学系,乔治华盛顿大学,华盛顿特区) Connected and Automated Vehicle Program Manager, Traffic Operations Division, Virginia Department of Transportation(连接与自动化车辆计划主任,交通运营处,弗吉尼亚州交通部) Department of Mechanical & Aerospace Engineering, George Washington University, Washington, D.C.(机械与航空航天工程系,乔治华盛顿大学,华盛顿特区)

AI总结 针对V2I通信随机延迟导致状态观测延迟的问题,提出DAROM框架,通过随机延迟MDP建模和延迟感知编码器恢复马尔可夫性,结合物理安全控制器实现鲁棒控制。

详情
AI中文摘要

延迟和部分可观测的状态信息给现实自动驾驶中基于强化学习(RL)的控制带来了重大挑战。在高速公路匝道合流中,路侧单元(RSU)可以感知附近交通,进行边缘感知,并通过车到基础设施(V2I)链路将状态估计传输给自车。随着智能交通基础设施和边缘计算的最新进展,这种RSU辅助感知越来越现实,并已部署在现代互联道路系统中。然而,边缘处理时间和无线传输可能引入随机的V2I通信延迟,违反马尔可夫假设并显著降低控制性能。在这项工作中,我们提出了DAROM,一种对随机延迟鲁棒的高速公路匝道合流延迟感知强化学习框架。我们将问题建模为随机延迟马尔可夫决策过程(RDMDP),并开发了一个统一的RL智能体用于联合纵向和横向控制。为了在延迟观测下恢复马尔可夫表示,我们引入了一个延迟感知编码器,该编码器以延迟观测、掩蔽动作历史和观测延迟幅度为条件来推断当前潜在状态。我们进一步集成基于物理的安全控制器以减少合流过程中的碰撞风险。在模拟城市交通(SUMO)模拟器中,使用下一代仿真(NGSIM)数据集的真实交通数据进行的实验表明,DAROM在各种交通密度下始终优于标准RL基线。特别是,基于门控循环单元(GRU)的编码器在高达2.0秒的随机V2I延迟的高密度交通中实现了超过99%的成功率。

英文摘要

Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world autonomous driving. In highway on-ramp merging, a roadside unit (RSU) can sense nearby traffic, perform edge perception, and transmit state estimates to the ego vehicle over vehicle-to-infrastructure (V2I) links. With recent advancements in intelligent transportation infrastructure and edge computing, such RSU-assisted perception is increasingly realistic and already deployed in modern connected roadway systems. However, edge processing time and wireless transmission can introduce stochastic V2I communication delays, violating the Markov assumption and substantially degrading control performance. In this work, we propose DAROM, a Delay-Aware Reinforcement Learning framework for On-ramp Merging that is robust to stochastic delays. We model the problem as a random delay Markov decision process (RDMDP) and develop a unified RL agent for joint longitudinal and lateral control. To recover a Markovian representation under delayed observations, we introduce a Delay-Aware Encoder that conditions on delayed observations, masked action histories, and observed delay magnitude to infer the current latent state. We further integrate a physics-based safety controller to reduce collision risk during merging. Experiments in the Simulation of Urban MObility (SUMO) simulator using real-world traffic data from the Next Generation Simulation (NGSIM) dataset demonstrate that DAROM consistently outperforms standard RL baselines across traffic densities. In particular, the gated recurrent unit (GRU)-based encoder achieves over 99% success in high-density traffic with random V2I delays of up to 2.0 seconds.

2505.18647 2026-05-28 cs.LG cs.AI 版本更新

STFlow: Data-Coupled Flow Matching for Geometric Trajectory Simulation

STFlow: 用于几何轨迹模拟的数据耦合流匹配

Kiet Bennema ten Brinke, Koen Minartz, Vlado Menkovski

发表机构 * Machine Learning for Physical Sciences (ML4Sci/e) Group, Department of Mathematics & Computer Science, Eindhoven University of Technology, The Netherlands

AI总结 提出STFlow,一种基于图神经网络和层次卷积的生成模型,通过数据依赖耦合的流匹配框架,从条件随机游走而非高斯噪声去噪,降低传输成本,提高训练和推理效率,在N体系统、分子动力学和人类轨迹预测中实现最低预测误差。

Comments Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea. PMLR 306, 2026, 18 pages, 12 figures

详情
AI中文摘要

模拟动力系统的轨迹是分子动力学、生物化学和行人动力学等广泛领域中的基本问题。机器学习已成为扩展基于物理的模拟器和直接从实验数据开发模型的宝贵工具。特别是,深度生成建模和几何深度学习的最新进展通过学习复杂的轨迹分布,同时尊重固有的置换和时间平移对称性,实现了概率模拟。然而,N体系统的轨迹通常具有对导致分岔的扰动的高敏感性,以及多尺度的时间和空间相关性。为了应对这些挑战,我们引入了STFlow(时空流),一种基于图神经网络和层次卷积的生成模型。通过在流匹配框架中引入数据依赖的耦合,STFlow从条件随机游走而非高斯噪声开始去噪。这种新颖的信息先验通过降低传输成本简化了学习任务,提高了训练和推理效率。我们在N体系统、分子动力学和人类轨迹预测上验证了我们的方法。在这些基准测试中,STFlow以更少的模拟步骤实现了最低的预测误差,并提高了可扩展性。

英文摘要

Simulating trajectories of dynamical systems is a fundamental problem in a wide range of fields such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physics-based simulators and developing models directly from experimental data. In particular, recent advances in deep generative modeling and geometric deep learning enable probabilistic simulation by learning complex trajectory distributions while respecting intrinsic permutation and time-shift symmetries. However, trajectories of N-body systems are commonly characterized by high sensitivity to perturbations leading to bifurcations, as well as multi-scale temporal and spatial correlations. To address these challenges, we introduce STFlow (Spatio-Temporal Flow), a generative model based on graph neural networks and hierarchical convolutions. By incorporating data-dependent couplings within the Flow Matching framework, STFlow denoises starting from conditioned random-walks instead of Gaussian noise. This novel informed prior simplifies the learning task by reducing transport cost, increasing training and inference efficiency. We validate our approach on N-body systems, molecular dynamics, and human trajectory forecasting. Across these benchmarks, STFlow achieves the lowest prediction errors with fewer simulation steps and improved scalability.
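The idea of denoising from a conditioned random walk rather than Gaussian noise can be sketched in a few lines. This is an illustration under stated assumptions, not the STFlow implementation: positions are scalars, the source sample is a Gaussian random walk started at the last observed position, and the probability path is the simple linear interpolant with constant target velocity. The names `random_walk_prior` and `flow_matching_pair` are hypothetical.

```python
import random

def random_walk_prior(last_position, horizon, step_std, rng):
    """Sample a source trajectory x0 as a Gaussian random walk from the last observed position."""
    trajectory, position = [], last_position
    for _ in range(horizon):
        position = position + rng.gauss(0.0, step_std)
        trajectory.append(position)
    return trajectory

def flow_matching_pair(x0, x1, t):
    """Return the linear interpolant x_t = (1 - t) * x0 + t * x1 and its target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

rng = random.Random(0)
future = [1.1, 1.2, 1.3, 1.4, 1.5]                 # hypothetical ground-truth future positions
x0 = random_walk_prior(1.0, len(future), 0.05, rng)  # source sample coupled to the observed data
x_t, target_velocity = flow_matching_pair(x0, future, t=0.5)
```

Because the source sample already starts near the observed state, the velocity a network would have to regress is typically smaller than when starting from unit Gaussian noise, which is the intuition behind the reduced transport cost.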

2503.01829 2026-05-28 cs.CL cs.AI cs.LG cs.MA 版本更新

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

如果你能说服我:评估大型语言模型说服效果与易受影响性的框架

Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出PMIYC框架,通过多智能体对话自动评估LLM的说服效果与易受影响性,发现不同模型在说服力和抗说服性上存在显著差异。

Comments Paper published at the ACM Conference on AI and Agentic Systems 2026

详情
AI中文摘要

大型语言模型(LLM)展现出与人类水平相当的说服能力。虽然这些能力可用于社会公益,但也存在被滥用的风险。除了关注LLM如何说服他人外,它们自身对说服的易受影响性也构成了关键的校准挑战,引发了关于鲁棒性、安全性和伦理原则遵守的问题。为了研究这些动态,我们引入了“如果你能说服我”(PMIYC),一个用于评估多智能体交互中说服力和易受影响性的自动化框架。我们的框架提供了一种可扩展的替代方案,替代了通常用于研究LLM说服的昂贵且耗时的人工标注过程。PMIYC自动进行说服者和被说服者智能体之间的多轮对话,同时衡量说服的有效性和易受影响性。我们的综合评估涵盖了多种LLM和说服场景(例如,主观和错误信息场景)。我们通过人工评估验证了框架的有效性,并展示了与先前研究中人工评估的一致性。通过PMIYC,我们发现Llama-3.3-70B和GPT-4o表现出相似的说服效果,比Claude 3 Haiku高出30%。然而,GPT-4o在对抗错误信息方面的抵抗力比Llama-3.3-70B高出50%以上。值得注意的是,o4-mini既是有效的说服者,也是抵抗的被说服者。这些发现为LLM的说服动态提供了实证见解,并有助于开发更安全的AI系统。

英文摘要

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Beyond the concern of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi-agent interactions. Our framework offers a scalable alternative to the costly and time-intensive human annotation process typically used to study persuasion in LLMs. PMIYC automatically conducts multi-turn conversations between Persuader and Persuadee agents, measuring both the effectiveness of and susceptibility to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the efficacy of our framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama-3.3-70B. Notably, o4-mini emerges as both an effective persuader and a resistant persuadee. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.

2602.03515 2026-05-28 cs.LG cs.AI cs.DC 版本更新

Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

通过基旋转缓解异步流水线并行中的陈旧性问题

Hyunji Jung, Sungbin Shin, Namhoon Lee

发表机构 * POSTECH(POSTECH大学)

AI总结 针对异步流水线并行中梯度陈旧性随流水线深度线性增长的问题,提出基旋转框架,通过将优化器坐标系与Hessian特征基对齐来保持延迟更新的有效性,理论证明最小化基失配并实证在3B参数LLM训练中减少81.7%迭代次数。

Comments ICML 2026

详情
AI中文摘要

异步流水线并行通过消除同步执行中固有的流水线气泡来最大化硬件利用率,为高效大规模分布式训练提供了一条途径。然而,这种效率提升可能会被梯度陈旧性所削弱,其中使用延迟梯度的即时模型更新会在优化过程中引入噪声。关键的是,我们发现了一个常被忽视的严重问题:这种延迟随流水线深度线性增长,从根本上破坏了该方法原本意图提供的可扩展性。我们将此问题归因于优化景观的一个特定性质:Hessian特征基与标准坐标基之间的失配,这触发了坐标自适应优化器更新轨迹中的振荡。我们识别出这些振荡导致延迟更新偏离其真实对应项,使其无法用于当前迭代。这一见解通过理论分析(包括一个表明基失配放大延迟惩罚的收敛界)和实证评估得到证实。为了解决这个问题,我们提出了基旋转,一个将优化器坐标系旋转以与Hessian特征基对齐的框架,使延迟更新保持有用。我们从理论上证明基旋转最小化基失配,从而抵消放大延迟惩罚的条件。在训练高达3B参数的LLM的实证中,与性能最佳的异步基线相比,基旋转减少了81.7%所需的迭代次数。

英文摘要

Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. We trace this pathology to a specific property of the optimization landscape: the misalignment between the Hessian eigenbasis and the standard coordinate basis, which triggers oscillations in the update trajectories of coordinate-wise adaptive optimizers. We identify that these oscillations cause delayed updates to diverge from their true counterparts, invalidating their use for current iterations. This insight is formalized through theoretical analysis, including a convergence bound showing that basis misalignment amplifies the delay penalty, and substantiated with empirical evaluation. To address this, we propose basis rotation, a framework that rotates the optimizer's coordinate system to align with the Hessian eigenbasis, keeping delayed updates useful. We theoretically demonstrate that basis rotation minimizes basis misalignment, thereby counteracting the conditions that amplify delay penalties. Empirically, in training up to a 3B-parameter LLM, basis rotation reduces the required iterations by 81.7% compared to the best-performing asynchronous baseline.

2602.02898 2026-05-28 cs.AI cs.CL 版本更新

Aligning Language Model Benchmarks with Pairwise Preferences

将语言模型基准与成对偏好对齐

Marco Gutierrez, Xinyi Leng, Hannah Cyberey, Jonathan Richard Schwarz, Ahmed Alaa, Thomas Hartvigsen

发表机构 * School of Data Science, University of Virginia(弗吉尼亚大学数据科学学院) Imperial College London(伦敦帝国理工学院) Thomson Reuters Foundational Research(汤姆森路透基础研究) Department of Electrical Engineering and Computer Science, UC Berkeley and UCSF(加州大学伯克利分校电气工程与计算机科学系及加州大学旧金山分校)

AI总结 提出BenchAlign方法,通过利用语言模型在问题级别的性能与模型成对排名,自动调整离线基准权重,使新基准能根据偏好准确排序未见模型。

详情
AI中文摘要

语言模型基准是广泛使用的、计算高效的现实性能代理。然而,许多近期工作发现基准常常无法预测实际效用。为弥合这一差距,我们引入基准对齐,即利用有限的模型性能信息自动更新离线基准,旨在生成新的静态基准,以预测给定测试设置中的模型成对偏好。然后我们提出BenchAlign,这是该问题的首个解决方案,它利用语言模型在问题级别的性能以及可能在部署期间收集的模型成对排名,学习基准问题的偏好对齐权重,生成新的基准,根据这些偏好对先前未见过的模型进行排序。我们的实验表明,我们的对齐基准能够根据人类偏好模型准确地对未见模型进行排序,即使模型大小不同,同时保持可解释性。总体而言,我们的工作为将基准与实际人类偏好对齐的局限性提供了见解,这有助于加速模型开发以追求实际效用。

英文摘要

Language model benchmarks are pervasive and computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weightings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
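One way to picture preference-aligned question weightings is as a weighted benchmark score trained with a pairwise loss. The sketch below is an assumption-laden illustration, not BenchAlign itself: the logistic (Bradley-Terry style) loss, the function names `weighted_score` and `pairwise_loss`, and the toy correctness vectors are all hypothetical.

```python
import math

def weighted_score(weights, correct):
    """Benchmark score of one model: sum of question weights over correctly answered questions."""
    return sum(w * c for w, c in zip(weights, correct))

def pairwise_loss(weights, preferred, other):
    """Logistic loss that is small when the preferred model scores higher under the weights."""
    margin = weighted_score(weights, preferred) - weighted_score(weights, other)
    return math.log(1.0 + math.exp(-margin))

# Hypothetical question-level results (1 = correct) for two models on two questions.
preferred_model = [1, 0]
other_model = [0, 1]

uniform_loss = pairwise_loss([1.0, 1.0], preferred_model, other_model)  # cannot separate the pair
aligned_loss = pairwise_loss([2.0, 0.0], preferred_model, other_model)  # question 0 carries the signal
```

Minimising such a loss over many ranked pairs would push weight toward questions whose outcomes agree with the observed preferences, and the learned weights remain directly inspectable per question.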

2602.02150 2026-05-28 cs.LG cs.AI 版本更新

ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

ECHO: 测试时强化学习的熵-置信度混合优化

Chu Zhao, Enneng Yang, Yuting Liu, Jianzhe Zhao, Guibing Guo

发表机构 * Northeastern University, Shenyang, China(东北大学(沈阳)) Shenzhen Campus of Sun Yat-sen University, China(中山大学深圳校区)

AI总结 针对测试时强化学习中高熵分支导致rollout崩溃和早期伪标签噪声引发过拟合的问题,提出熵-置信度混合组相对策略优化(ECHO),通过自适应分支控制和置信度剪枝缓解崩溃,并采用置信度自适应裁剪和优势塑造增强训练鲁棒性。

Comments 19 pages

详情
AI中文摘要

测试时强化学习通过重复rollout生成多个候选答案,并利用多数投票构建的伪标签进行在线更新。为了减少开销并改进探索,先前的工作引入了树结构rollout,共享推理前缀并在关键节点分支以提高采样效率。然而,这种范式仍然面临两个挑战:(1) 高熵分支可能触发rollout崩溃,即分支预算集中在少数具有连续高熵片段的轨迹上,迅速减少有效分支数量;(2) 早期伪标签存在噪声和偏差,可能引发自我强化的过拟合,导致策略过早锐化并抑制探索。为了解决这些问题,我们提出了熵-置信度混合组相对策略优化(ECHO)。在rollout过程中,ECHO联合利用局部熵和组级置信度自适应控制分支宽度,并进一步引入在线置信度剪枝以终止持续低置信度的分支,避免高熵陷阱并缓解崩溃。在策略更新过程中,ECHO采用置信度自适应裁剪和熵-置信度混合优势塑造方法,以增强训练鲁棒性并减轻早期偏差。实验表明,ECHO在多个数学和视觉推理基准上取得了一致的性能提升,并在有限的rollout预算下更有效地泛化。

英文摘要

Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree-structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high-entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of effective branches; (2) early pseudo-labels are noisy and biased, which can induce self-reinforcing overfitting, causing the policy to sharpen prematurely and suppress exploration. To address these issues, we propose Entropy-Confidence Hybrid Group Relative Policy Optimization (ECHO). During rollout, ECHO jointly leverages local entropy and group-level confidence to adaptively control branch width, and further introduces online confidence-based pruning to terminate persistently low-confidence branches, avoiding high-entropy traps and mitigating collapse. During policy updates, ECHO employs confidence-adaptive clipping and an entropy-confidence hybrid advantage shaping approach to enhance training robustness and mitigate early-stage bias. Experiments demonstrate that ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks, and generalizes more effectively under a limited rollout budget.
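The majority-voting pseudo-label step that the abstract takes as its starting point can be sketched as follows. This covers only that generic step, not ECHO's branching or clipping; the binary reward and the names `majority_pseudo_label` and `pseudo_rewards` are assumptions made for illustration.

```python
from collections import Counter

def majority_pseudo_label(answers):
    """Return the most frequent answer and the fraction of rollouts that agree with it."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def pseudo_rewards(answers):
    """Assumed binary reward: 1.0 for rollouts matching the pseudo-label, 0.0 otherwise."""
    label, _ = majority_pseudo_label(answers)
    return [1.0 if answer == label else 0.0 for answer in answers]

rollout_answers = ["42", "42", "7", "42"]  # hypothetical final answers from four rollouts
label, confidence = majority_pseudo_label(rollout_answers)
rewards = pseudo_rewards(rollout_answers)
```

The agreement fraction is a natural group-level confidence signal: when it is low early in training, the pseudo-label is more likely to be wrong, which is the noise problem the abstract describes.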

2602.01990 2026-05-28 cs.LG cs.AI 版本更新

SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

SAME: 用于多模态持续指令调优的稳定混合专家模型

Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院)

AI总结 针对多模态持续指令调优中专家路由漂移和专家漂移问题,提出稳定混合专家模型(SAME),通过正交子空间分解路由动态和曲率感知缩放更新专家,实现无重放的状态最优性能。

Comments Accepted to ICML 2026. Code is available at https://github.com/LAMDA-CL/Prism

详情
AI中文摘要

多模态大语言模型(MLLMs)通过指令调优实现了强大的性能,但实际部署需要它们持续扩展能力,这使得多模态持续指令调优(MCIT)变得至关重要。最近的方法利用稀疏专家路由来促进任务专业化,但我们发现专家路由过程会随着数据分布的演变而发生漂移。例如,之前激活定位专家的接地查询在学习OCR任务后可能被路由到不相关的专家。同时,与接地相关的专家可能被新任务覆盖而失去原有功能。这种失败反映了两个问题:路由器漂移(专家选择随时间变得不一致)和专家漂移(共享专家跨任务被覆盖)。因此,我们提出了用于MCIT的稳定混合专家模型(SAME)。为了解决路由器漂移,SAME通过将路由动态分解为正交子空间并仅更新任务相关方向来稳定专家选择。为了缓解专家漂移,我们通过使用历史输入协方差进行曲率感知缩放来调节专家更新,无需重放。SAME还引入了自适应专家激活,在训练期间冻结选中的专家,减少冗余计算和跨任务干扰。我们还引入了一个新的基准来评估长任务序列的MCIT,大量实验证明了SAME的最优性能。代码可在 https://github.com/LAMDA-CL/Prism 获取。

英文摘要

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. We also introduce a new benchmark to evaluate MCIT with long task sequences, and extensive experiments demonstrate SAME's SOTA performance. Code is available at https://github.com/LAMDA-CL/Prism.

2602.01745 2026-05-28 cs.LG cs.AI 版本更新

Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning

概率-熵校准:一种用于自适应微调的弹性指标

Wenhao Yu, Shaohang Wei, Jiahong Liu, Yifan Li, Minda Hu, Aiwei Liu, Hao Zhang, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Peking University(北京大学) Tsinghua University(清华大学) University of the Chinese Academy of Sciences(中国科学院大学)

AI总结 提出概率-熵校准信号(相对排名指标)进行token级重加权,以平衡预训练先验与下游对齐,在数学推理、分布外推理和代码生成任务上优于仅基于概率或熵的方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

Token级重加权是一种简单但有效的控制监督微调的机制,但常见的指标很大程度上是单维的:真实概率反映下游对齐,而token熵反映预训练先验引起的内在不确定性。忽略熵可能会将噪声或易替换的token误识别为学习关键,而忽略概率则无法反映目标特定的对齐。RankTuner引入了一种概率-熵校准信号,即相对排名指标,它比较真实token的排名与其在预测分布下的预期排名。逆指标作为token级的相对尺度用于重加权微调目标,将更新集中在真正未学习充分的token上,而不过度惩罚内在不确定的位置。在多个骨干网络上的实验表明,在数学推理基准上持续改进,在分布外推理上获得迁移增益,并且在代码生成性能上优于仅基于概率或熵的重加权基线。

英文摘要

Token-level reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-dimensional: the ground-truth probability reflects downstream alignment, while token entropy reflects intrinsic uncertainty induced by the pre-training prior. Ignoring entropy can misidentify noisy or easily replaceable tokens as learning-critical, while ignoring probability fails to reflect target-specific alignment. RankTuner introduces a probability-entropy calibration signal, the Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution. The inverse indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective, focusing updates on truly under-learned tokens without over-penalizing intrinsically uncertain positions. Experiments on multiple backbones show consistent improvements on mathematical reasoning benchmarks, transfer gains on out-of-distribution reasoning, and better code generation performance than probability-only or entropy-only reweighting baselines.
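The comparison between the ground-truth token's rank and its expected rank can be made concrete with a small calculation. The abstract does not give the exact formula, so everything here is an assumption for illustration: ranks are 1-based with ties sharing the best rank, the expected rank is the probability-weighted mean rank, and the comparison is taken as a ratio. The names `token_ranks` and `relative_rank_indicator` are hypothetical.

```python
def token_ranks(probs):
    """1-based rank of every token: 1 plus the number of tokens with strictly higher probability."""
    return [1 + sum(q > p for q in probs) for p in probs]

def relative_rank_indicator(probs, target):
    """Assumed form: rank of the ground-truth token divided by the expected rank under probs."""
    ranks = token_ranks(probs)
    expected_rank = sum(p * r for p, r in zip(probs, ranks))
    return ranks[target] / expected_rank

peaked = [0.7, 0.2, 0.1]                                  # hypothetical confident prediction
well_learned = relative_rank_indicator(peaked, target=0)  # target is the top token
under_learned = relative_rank_indicator(peaked, target=2) # target is ranked last
uncertain = relative_rank_indicator([0.25] * 4, target=1) # flat distribution, all ranks tie
```

Under these assumptions a flat distribution yields an indicator of 1 regardless of the target, so an intrinsically uncertain position is not singled out, while a low-ranked target under a peaked distribution stands out clearly.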

2602.01665 2026-05-28 cs.MA cs.AI cs.LG 版本更新

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

TABX:面向多智能体强化学习的高吞吐沙盒战斗模拟器

Hayeong Lee, JunHyeok Oh, Byung-Jun Lee

发表机构 * Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea(高丽大学人工智能系,韩国首尔) Gauss Labs Inc., Seoul, Republic of Korea(Gauss Labs公司,韩国首尔)

AI总结 提出基于JAX的高吞吐沙盒模拟器TABX,通过可重构任务和硬件加速支持多智能体强化学习的高效研究与评估。

详情
AI中文摘要

环境的设计在塑造合作多智能体强化学习(MARL)算法的开发和评估中起着关键作用。虽然现有基准突出了关键挑战,但它们通常缺乏设计自定义评估场景所需的模块化。我们介绍了基于JAX的全加速战斗模拟器(TABX),这是一个专为可重构多智能体任务设计的高吞吐沙盒。TABX提供对环境参数的精细控制,允许系统地研究涌现的智能体行为和跨不同任务复杂度谱系的算法权衡。利用JAX在GPU上进行硬件加速执行,TABX实现了大规模并行化并显著降低了计算开销。通过提供一个快速、可扩展且易于定制的框架,TABX促进了复杂结构化领域中MARL智能体的研究,并作为未来研究的可扩展基础。我们的代码可在https://github.com/ku-dmlab/TABX获取。

英文摘要

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku-dmlab/TABX.

2509.23074 2026-05-28 cs.LG cs.AI 版本更新

Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

超越模型排名:时间序列预测的可预测性对齐评估

Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

发表机构 * Department of Electronic Engineering, Tsinghua University, Beijing, China.(清华大学电子工程系,北京,中国)

AI总结 针对基准排行榜评估混淆模型性能与数据内在不可预测性的问题,提出基于谱相干的可预测性对齐诊断框架,包含SCP分数和LUR工具,揭示可预测性漂移和模型架构权衡。

详情
AI中文摘要

在时间序列预测的AI模型日益复杂的时代,进展通常通过基准排行榜上的边际改进来衡量。然而,这种方法存在一个根本缺陷:标准评估指标混淆了模型的性能与数据的内在不可预测性。为了解决这一紧迫挑战,我们引入了一个新颖的、基于谱相干的可预测性对齐诊断框架。我们的框架有两个主要贡献:谱相干可预测性(SCP),一个计算高效($O(N\log N)$)且任务对齐的分数,用于量化给定预测实例的固有难度;以及线性利用率(LUR),一个频率分辨的诊断工具,精确测量模型如何有效利用数据中的线性可预测信息。我们验证了框架的有效性,并利用它揭示了两个核心见解。首先,我们提供了“可预测性漂移”的首个系统性证据,表明任务的预测难度随时间剧烈变化。其次,我们的评估揭示了一个关键的架构权衡:复杂模型在低可预测性数据上表现优越,而线性模型在更可预测的任务上非常有效。我们倡导范式转变,超越简单的聚合分数,转向更具洞察力的、可预测性感知的评估,从而促进更公平的模型比较和更深入的模型行为理解。

英文摘要

In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.

2507.16679 2026-05-28 cs.CL cs.AI cs.CY 版本更新

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

PICACO: 通过总相关优化实现大语言模型的多元情境价值对齐

Han Jiang, Dongyao Zhu, Xiaoyuan Yi, Ziang Xiao, Zhihua Wei, Xing Xie

发表机构 * Johns Hopkins University, Baltimore, MD, USA(约翰霍普金斯大学) North Carolina State University, Raleigh, NC, USA(北卡罗来纳州立大学) Microsoft Research Asia, Beijing, China(微软亚洲研究院) Tongji University, Shanghai, China(同济大学)

AI总结 针对情境对齐中价值冲突导致的指令瓶颈问题,提出PICACO方法,通过优化元指令并最大化指定价值与模型响应的总相关,无需微调即可实现多元价值平衡对齐。

Comments ICML 2026

详情
AI中文摘要

情境学习在使大语言模型与人类价值对齐方面展现出巨大潜力,有助于减少有害输出并适应多样化偏好,而无需昂贵的后训练,这被称为情境对齐。然而,大语言模型对输入提示的理解仍是不可知的,限制了情境对齐处理价值冲突的能力——人类价值本质上是多元的,常常施加相互冲突的要求,例如刺激与传统。因此,当前的情境对齐方法面临指令瓶颈挑战,即大语言模型难以在单个提示中协调多个预期价值,导致对齐不完整或有偏。为了解决这个问题,我们提出了PICACO,一种新颖的多元情境对齐方法。无需微调,PICACO优化一个融合了多个价值的元指令,以更好地激发大语言模型对这些价值的理解并改进对齐。这是通过最大化指定价值与大语言模型响应之间的总相关来实现的,这从理论上强化了价值一致性并减少了干扰噪声,从而产生更有效的指令。在五个价值集上的大量实验表明,PICACO在黑盒和开源大语言模型上均表现良好,优于多个近期强基线,并在多达8个不同价值之间实现了更好的平衡。

英文摘要

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.

2601.21666 2026-05-28 cs.AI cs.CV 版本更新

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

SONIC-O1:用于评估多模态大语言模型在音视频理解上的真实世界基准

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

发表机构 * Vector Institute for Artificial Intelligence(向量人工智能研究所) University of Groningen(格罗宁根大学) York University(约克大学)

AI总结 提出SONIC-O1基准,包含60小时人工验证的音视频数据,评估多模态大语言模型在开放摘要、多项选择问答和时序定位上的能力,发现模型在时序定位上存在显著性能差距和人口统计偏差。

详情
AI中文摘要

多模态大语言模型(MLLMs)是近期AI研究的主要焦点。然而,大多数先前工作集中于静态图像理解,而它们处理序列音视频数据的能力仍未充分探索。这一差距凸显了需要一个高质量基准来系统评估MLLM在真实世界场景中的性能。我们介绍了SONIC-O1,一个全面的、完全人工验证的基准,包含60小时(231个片段)跨越13个真实世界对话领域的数据,带有4,958个注释和人口统计元数据。SONIC-O1评估三种能力:开放摘要、多项选择题(MCQ)回答以及带有支持理由(推理)的时序定位。在闭源和开源模型中,我们发现MCQ准确率显示模型家族之间的差距最小,但最好的闭源模型在时序定位上比最好的开源模型高出22.6%。我们进一步观察到不同人口统计组在时序定位上的准确率差距高达21.4%,表明模型行为存在持续差异。SONIC-O1为基于时序和人口统计鲁棒的多模态理解提供了一个开放评估套件。SONIC-O1公开可用于研究:项目页面(https://vectorinstitute.github.io/sonic-o1/)、数据集(https://huggingface.co/datasets/vector-institute/sonic-o1)、GitHub(https://github.com/vectorinstitute/sonic-o1)、排行榜(https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard)。

英文摘要

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark of 60 hours (231 clips) spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates three capabilities: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Across closed- and open-source models, we find that the MCQ accuracy shows the smallest gap between model families, but the best closed-source model outperforms the best open-source model by 22.6% on temporal localization. We further observe accuracy gaps of up to 21.4% on temporal localization across demographic groups, indicating persistent disparities in model behaviour. SONIC-O1 provides an open evaluation suite for temporally grounded and demographically robust multimodal understanding. SONIC-O1 is publicly available for research: Project page (https://vectorinstitute.github.io/sonic-o1/), Dataset (https://huggingface.co/datasets/vector-institute/sonic-o1), GitHub (https://github.com/vectorinstitute/sonic-o1), Leaderboard (https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard).

2509.23019 2026-05-28 cs.CR cs.AI 版本更新

LLM Watermark Evasion via Bias Inversion

通过偏差反转实现LLM水印规避

Jeongyeon Hwang, Sangdon Park, Jungseul Ok

发表机构 * Pohang University of Science and Technology (POSTECH)(浦项科学技术大学)

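AI summary: Proposes the Bias-Inversion Rewriting Attack (BIRA) and shows theoretically that lowering the average conditional probability of green tokens makes the detection probability decay exponentially; achieves high evasion rates (>99%) in the black-box setting while preserving semantic fidelity.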

详情
AI中文摘要

水印为检测LLM生成内容提供了一种有前景的解决方案,但在现实无查询(黑盒)规避下的鲁棒性仍是一个开放挑战。现有的无查询攻击往往成功率有限或严重扭曲语义。我们通过理论分析重写型规避来弥合这一差距,证明将绿色令牌的平均条件概率降低一个小幅度会导致检测概率指数级衰减。受此洞察启发,我们提出了偏差反转重写攻击(BIRA),一种实用的无查询方法,该方法对通过令牌惊讶度识别的代理抑制集应用负对数几率偏差。实验上,BIRA在多种水印方案中实现了最先进的规避率(>99%),同时语义保真度显著优于先前的基线。我们的发现揭示了当前水印方法的一个根本性漏洞,并强调了进行严格压力测试的必要性。我们的代码可在 https://github.com/ml-postech/LLM-Watermark-Evasion-via-Bias-Inversion 获取。

英文摘要

Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests. Our code is available at https://github.com/ml-postech/LLM-Watermark-Evasion-via-Bias-Inversion.
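
The core operation named in the abstract, a negative logit bias applied to a suppression set chosen by token surprisal, can be illustrated with a toy sketch. This is not the authors' implementation: the selection rule (suppress tokens of the observed text whose surprisal under a proxy distribution exceeds a threshold), the function names, the vocabulary, and all numbers are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def suppression_set(text_tokens, proxy_probs, threshold):
    """Assumed rule: suppress tokens of the observed text whose surprisal
    -log p under a proxy model exceeds `threshold` (in nats)."""
    return {tok for tok in text_tokens if -math.log(proxy_probs[tok]) > threshold}

def apply_negative_bias(logits, suppressed, bias):
    """Subtract `bias` from the logit of every suppressed token."""
    return {tok: (val - bias if tok in suppressed else val)
            for tok, val in logits.items()}

# Hypothetical proxy probabilities and a hypothetical watermarked text.
proxy_probs = {"the": 0.5, "cat": 0.3, "sat": 0.18, "zephyr": 0.02}
watermarked_text = ["the", "zephyr", "sat"]
suppressed = suppression_set(watermarked_text, proxy_probs, threshold=2.0)

# Hypothetical next-token logits of the rewriting model.
rewriter_logits = {"the": 1.0, "cat": 0.5, "sat": 0.0, "zephyr": 2.0}
before = softmax(rewriter_logits)
after = softmax(apply_negative_bias(rewriter_logits, suppressed, bias=4.0))
print(suppressed, round(before["zephyr"], 3), round(after["zephyr"], 3))
```

The bias lowers the sampling probability of the suppressed token and redistributes that mass over the remaining vocabulary, which is the lever the abstract's analysis ties to the decay of detection probability.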

2601.19926 2026-05-28 cs.CL cs.AI 版本更新

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Transformer的语法:语言模型中句法知识可解释性研究的系统综述

Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

Affiliations * Universitat Pompeu Fabra; ICREA (Institució Catalana de Recerca i Estudis Avançats)

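AI summary: A systematic review of 337 articles assessing the syntactic abilities of Transformer-based language models (TLMs). It finds that TLMs encode non-trivial syntactic knowledge but perform more weakly on syntax-semantics interface phenomena, and that research is concentrated on English and BERT-like models.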

详情
AI中文摘要

我们对337篇评估基于Transformer的语言模型(TLM)句法能力的文章进行了系统综述,报告了涵盖广泛句法现象、语言、模型和方法的3000多个数据点。这些数据共同表明,TLM编码了非平凡的句法知识。行为证据显示,TLM在形式句法现象上表现强劲,但在句法-语义接口现象上表现较弱且多变。对于数字支持较少的语言,表现也持续较低。探针和机制研究进一步支持TLM中存在句法知识。然而,由于大多数工作仍停留在观察层面,且当前方法在方法论上具有异质性,对句法处理背后的详细计算机制的洞察仍然有限。同时,文献仍然高度集中在英语和BERT类模型上。我们讨论了研究结果的意义,并为未来研究提供了建议。

英文摘要

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on over 3,000 datapoints spanning a wide range of syntactic phenomena, languages, models, and methods. We take the data to collectively show that TLMs encode a non-trivial amount of syntactic knowledge. Behavioral evidence shows strong performance on formal syntactic phenomena, but weaker and more variable performance on phenomena at the syntax-semantics interface. Performance is also consistently lower for languages with less digital support. Probing and mechanistic studies further support the presence of syntactic knowledge in TLMs. Yet, because most work remains observational and current approaches are methodologically heterogeneous, insight into the detailed computational mechanisms underlying syntactic processing remains limited. At the same time, the literature remains heavily concentrated on English and BERT-like models. We discuss the implications of our results and provide recommendations for future research.

2509.06350 2026-05-28 cs.CL cs.AI cs.CR 版本更新

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Mask-GCG:对抗性后缀中的所有标记对于越狱攻击都是必要的吗?

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

发表机构 * Politecnico di Milano(米兰理工学院) Beihang University(北京航空航天大学) East China Normal University(华东师范大学) Fudan University(复旦大学) University of the Chinese Academy of Sciences(中国科学院大学) AI Security Lab(360人工智能安全实验室)

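AI summary: Proposes Mask-GCG, which uses learnable token masking to identify high-impact tokens in the suffix and prune low-impact ones, reducing computational overhead while maintaining the attack success rate and revealing token redundancy in LLM prompts.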

Comments Accepted to ICASSP 2026

详情
Journal ref
2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13887-13891, 2026
AI中文摘要

针对大型语言模型(LLM)的越狱攻击已展示了多种成功方法,攻击者操纵模型生成其本应避免的有害响应。其中,贪婪坐标梯度(GCG)作为一种通用且有效的方法,通过优化后缀中的标记来生成可越狱的提示。尽管已提出多种GCG的改进变体,但它们都依赖于固定长度的后缀。然而,这些后缀中潜在的冗余尚未被探索。在这项工作中,我们提出Mask-GCG,一种即插即用的方法,采用可学习的标记掩码来识别后缀中的高影响力标记。我们的方法增加了高影响力位置标记的更新概率,同时剪枝低影响力位置的标记。这种剪枝不仅减少了冗余,还降低了梯度空间的大小,从而减少了计算开销,并缩短了实现成功攻击所需的时间。我们将Mask-GCG应用于原始GCG及其多种改进变体进行评估。实验结果表明,后缀中的大多数标记对攻击成功有显著贡献,剪枝少数低影响力标记不会影响损失值或攻击成功率(ASR),从而揭示了LLM提示中的标记冗余。我们的发现从越狱攻击的角度为开发高效且可解释的LLM提供了见解。

英文摘要

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

2601.17737 2026-05-28 cs.CV cs.AI 版本更新

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

脚本即一切:一个用于长程对话到电影视频生成的智能体框架

Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus

发表机构 * Tencent(腾讯)

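AI summary: Proposes an end-to-end agentic framework in which a trained ScripterAgent turns dialogue into fine-grained scripts and DirectorAgent applies a cross-scene continuous generation strategy, producing coherent long-horizon dialogue-to-cinematic video and significantly improving script faithfulness and temporal fidelity.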

详情
AI中文摘要

近期视频生成的进展产生了能够从简单文本提示合成惊艳视觉内容的模型。然而,这些模型难以从对话等高层概念生成连贯的长篇叙事,揭示了创意想法与其电影执行之间的“语义鸿沟”。为弥合这一鸿沟,我们引入了一个新颖的、端到端的智能体框架,用于对话到电影视频的生成。我们框架的核心是ScripterAgent,一个经过训练将粗略对话转化为精细、可执行的电影脚本的模型。为此,我们构建了ScriptBench,一个具有丰富多模态上下文的新大规模基准,通过专家引导的流程进行标注。生成的脚本随后指导DirectorAgent,它使用跨场景连续生成策略协调最先进的视频模型,以确保长程连贯性。我们的全面评估,包括一个AI驱动的CriticAgent和一个新的视觉-脚本对齐(VSA)指标,表明我们的框架在所有测试的视频模型上显著提高了脚本忠实度和时间保真度。此外,我们的分析揭示了当前SOTA模型在视觉奇观与严格脚本遵循之间的关键权衡,为自动化电影制作的未来提供了宝贵见解。

英文摘要

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2505.17654 2026-05-28 cs.CL cs.AI 版本更新

EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection

EVADE-Bench:用于评估和增强规避性内容检测的多模态基准

Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyu Chang, Yukun Chen, Hamid Alinejad-Rokny, Min Yang

发表机构 * SIAT, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) University of Chinese Academy of Sciences(中国科学院大学) Alibaba Group(阿里巴巴集团) University of New South Wales(新南威尔士大学)

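AI summary: To address the vulnerability of LLMs and VLMs to evasive content on e-commerce platforms, introduces EVADE-Bench, the first expert-curated Chinese multimodal benchmark; evaluates 26 models and finds that clearer rule categorization improves prediction consistency and that a multi-agent decomposition strategy yields notable accuracy gains.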

Comments SIGIR 2026

详情
AI中文摘要

电商平台越来越依赖大型语言模型(LLMs)和视觉语言模型(VLMs)来检测非法或误导性产品内容。然而,这些模型仍然容易受到规避性内容的影响,即通过分词、委婉语言或图像裁剪等技术故意修改的输入,以掩盖违反政策的行为,同时仍传达被禁止的主张。关键在于,检测此类内容需要模型同时掌握两种能力:准确理解复杂规则,以及正确推断故意混淆的多模态输入背后的真实意图。虽然先前的工作分别探索了LLM对复杂规则的推理和基于LLM的规避性内容检测,但现有基准尚未将两者结合在统一的评估框架内。这一差距在电商领域尤为严重,因为准确的审核要求这两种能力协同运作。为填补这一空白,我们引入了EVADE-Bench,这是首个专家策划的中文多模态基准,专门设计用于评估LLMs和VLMs在真实电商场景中的规避性内容检测。我们对26个开源和闭源LLMs及VLMs的全面评估显示,即使是最先进的模型也经常错误分类规避性样本。我们进一步证明,更清晰的规则分类显著提高了模型预测的一致性并减少了错误预测,凸显了基准设计在实现可靠评估中的关键作用。为了探索性能提升的路径,我们研究了多智能体分解在多模态推理中的可行性,即将视觉描述和逻辑推理解耦为独立的智能体,并发现这一策略带来了显著的准确率提升。

英文摘要

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content, which refers to inputs that have been deliberately modified through techniques such as word splitting, euphemistic language, or image cropping to conceal policy violations while still conveying prohibited claims. Crucially, detecting such content requires a model to simultaneously master two capabilities: accurately comprehending complex rules, and correctly inferring the true intent behind deliberately obfuscated multimodal inputs. While prior work has separately explored LLM reasoning over complex rules and LLM-based detection of evasive content, no existing benchmark combines both within a unified evaluation framework. This gap is particularly consequential in e-commerce, where accurate moderation demands that both capabilities operate in concert. To address this gap, we introduce EVADE-Bench, the first expert-curated Chinese multimodal benchmark specifically designed to evaluate LLMs and VLMs on evasive content detection in real-world e-commerce scenarios. Our comprehensive evaluation of 26 open- and closed-source LLMs and VLMs reveals that even state-of-the-art models frequently misclassify evasive samples. We further demonstrate that clearer rule categorization significantly improves model prediction consistency and reduces false predictions, highlighting the critical role of benchmark design in enabling reliable evaluation. To explore paths for performance improvement, we investigate the feasibility of multi-agent decomposition for multimodal reasoning, wherein visual description and logical inference are decoupled into separate agents, and find that this strategy yields notable accuracy gains.

2601.06329 2026-05-28 cs.CL cs.AI 版本更新

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

论口语语言模型评估中全局令牌困惑度的谬误

Chan-Jan Hsu, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso

发表机构 * Carnegie Mellon University(卡内基梅隆大学) National Taiwan University(国立台湾大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) Massachusetts Institute of Technology(麻省理工学院)

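AI summary: Addresses the practice of computing perplexity over speech tokens with the text perplexity formula when evaluating spoken language models; proposes new likelihood- and generation-based evaluation methods that reflect generation quality more faithfully and narrow the gap between the best model and the human topline.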

详情
AI中文摘要

在大规模原始音频上预训练的生成式口语语言模型能够以适当内容继续语音提示,同时保留说话人和情感等属性,作为口语对话的基础模型。在先前文献中,这些模型通常使用“全局令牌困惑度”进行评估,该指标直接将文本困惑度公式应用于语音令牌。然而,这种做法忽略了语音和文本模态之间的根本差异,可能导致对语音特性的低估。在这项工作中,我们提出了多种基于似然和生成的评估方法,以替代朴素的全局令牌困惑度。我们证明,所提出的评估更忠实地反映了感知生成质量,与人类评分的平均意见得分(MOS)具有更强的相关性。在新指标下评估时,口语语言模型的相对性能格局被重塑,揭示了最佳性能模型与人类基线之间的差距显著缩小。总之,这些结果表明,适当的评估对于准确评估口语语言建模的进展至关重要。

英文摘要

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generation-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
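
For reference, the text perplexity formulation that the abstract says is applied directly to speech tokens is the exponential of the negative mean token log-likelihood. The sketch below shows only that baseline formula; the function name and the toy inputs are illustrative, and the paper's proposed replacement metrics are not reproduced here.

```python
import math

def token_perplexity(log_probs):
    """Perplexity of a token sequence given each token's natural-log
    probability under the model: exp(-(1/N) * sum(log p_i))."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that spreads probability uniformly over 4 tokens at every step
# has perplexity 4 regardless of sequence length.
uniform_log_probs = [math.log(1 / 4)] * 10
print(token_perplexity(uniform_log_probs))
```

Because every token contributes equally to the mean, this score cannot tell apart tokens that carry content from tokens that carry other acoustic detail, which is one plausible reading of the modality mismatch the abstract points to.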

2601.05386 2026-05-28 cs.AI 版本更新

How Much Can a Few Engine Moves Help? Quantifying Limited Cheating in Chess

几次引擎走棋能有多大帮助?量化国际象棋中的有限作弊

Daniel Keren

Affiliations * Department of Computer Science, University of Haifa, Haifa, Israel

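AI summary: Using threshold-based and Bellman-style policies, quantifies how a limited number of cheats affects a player's score in Stockfish engine-vs-engine games, and introduces an engine-free simulator for hyperparameter optimization.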

Comments Accepted, IEEE CoG 2026 (IEEE Conference on Games 2026). Replaces previous version "On the Effect of Cheating in Chess"

详情
AI中文摘要

国际象棋中利用强大软件建议作弊已成为一个主要问题,甚至达到最高水平。与以往大多数关注作弊检测的工作不同,本文尝试评估在比赛中有限次数作弊可能带来的性能提升。我们开发了基于阈值和Bellman风格的干预策略,并在使用Stockfish的受控引擎对引擎设置中进行测试。明智地选择1次或2次作弊分别得到平均得分0.71和0.82,而无作弊得分为0.51。我们还引入了一个快速、无引擎的模拟器,无需运行对局即可进行超参数优化,与基于引擎的最优值紧密匹配。本工作的目的不是帮助作弊者,而是衡量作弊的有效性——这对于遏制和检测作弊的努力至关重要。

英文摘要

Cheating in chess by using advice from powerful software has become a major problem, reaching the highest levels of play. Whereas most previous work has concerned the detection of cheating, here we evaluate the performance gain obtained by cheating a limited number of times during a game. We develop threshold-based and Bellman-style intervention policies and test them in a controlled engine-vs-engine setting using Stockfish. A judicious choice of 1 or 2 cheats yields average scores of 0.71 and 0.82, respectively, compared to 0.51 with no cheats. We also introduce a fast, engine-free simulator that enables hyperparameter optimization without running games, closely matching the engine-based optimum. The goal of this work is not to assist cheaters but to measure the effectiveness of cheating, which is crucial to the effort to contain and detect it.
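
A threshold-based intervention policy of the kind the abstract mentions can be sketched as follows. The decision rule here is an assumption, not the paper's definition: intervene when the evaluation gap between the engine's best move and the player's own move reaches a threshold, as long as budget remains. The gap values are made up, and no engine is called.

```python
def threshold_policy(gaps, threshold, budget):
    """Return the move indices at which the engine's move is used.

    gaps[i] is the (hypothetical) evaluation gain, in pawns, of the engine's
    best move over the player's own choice at move i. An intervention is
    spent on the first moves whose gap reaches `threshold`, up to `budget`.
    """
    used = []
    for move_index, gap in enumerate(gaps):
        if len(used) < budget and gap >= threshold:
            used.append(move_index)
    return used

gaps = [0.1, 0.4, 2.5, 0.2, 3.0, 1.8]
print(threshold_policy(gaps, threshold=1.5, budget=2))  # [2, 4]
print(threshold_policy(gaps, threshold=1.5, budget=1))  # [2]
```

With a budget of one, this rule spends the intervention at move 2 even though move 4 offers a larger gain. A policy that weighs the value of keeping an intervention in reserve, which is presumably what the Bellman-style variant does, would avoid that.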

2601.03048 2026-05-28 cs.CV cs.AI cs.CC 版本更新

On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

关于Transformer图像嵌入在非可解空间推理中的内在限制

Siyi Lyu, Quan Liu, Feng Yan

发表机构 * School of Electronic Science and Engineering, Nanjing University, Nanjing, China(电子科学与工程学院,南京大学,南京,中国)

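AI summary: By formalizing spatial understanding as a group homomorphism problem, argues that, under the standard conjecture TC⁰ ⊊ NC¹, constant-depth Transformers bounded by TC⁰ cannot capture the spatial structure of non-solvable groups (such as SO(3)) in a single forward pass.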

详情
AI中文摘要

视觉Transformer(ViT)在语义识别方面表现出色,但在心理旋转等空间推理任务中却出现系统性失败。虽然这通常归因于数据规模,但本文认为该限制源于架构的内在电路复杂度。通过将空间理解形式化为学习一个群同态问题——其中潜在嵌入保留作用于图像的物理变换的代数结构——我们识别出一个基本的计算瓶颈。具体来说,对于非可解群(例如$\mathrm{SO}(3)$),维持这种保结构嵌入的下界由单词问题决定,该问题是$\mathsf{NC^1}$-完全的。相比之下,具有多项式精度的恒定深度ViT严格受限于复杂度类$\mathsf{TC^0}$。在标准猜想$\mathsf{TC^0} \subsetneq \mathsf{NC^1}$下,出现了一个复杂度边界:恒定深度架构缺乏在单次前向传播中捕获非可解空间结构所需的逻辑深度。为了实证验证这一理论差距,我们提出了潜在空间代数(LSA)基准,该基准揭示了随着非可解任务组合深度的增加,ViT表示出现显著退化。

英文摘要

Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While these failures are often attributed to data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as a Group Homomorphism Problem, in which latent embeddings must preserve the algebraic structure of physical transformations acting on images, we identify a fundamental computational bottleneck. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision are strictly bounded by the complexity class $\mathsf{TC^0}$. Under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, a complexity boundary emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures in a single forward pass. To empirically validate this theoretical gap, we propose the Latent Space Algebra (LSA) benchmark, which reveals a significant degradation in ViT representations as the compositional depth of non-solvable tasks increases.
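
To make the Word Problem concrete: given a sequence of group elements, decide whether their product is the identity. The sketch below uses the finite non-solvable group S5 (permutations of five points) as a stand-in, which is an assumption of this example; the abstract names SO(3). For finite non-solvable groups such as S5, the NC¹-completeness of this problem is Barrington's theorem.

```python
from functools import reduce

def compose(p, q):
    """Permutation 'apply p, then q', with permutations stored as tuples."""
    return tuple(q[p[i]] for i in range(len(p)))

def inverse(p):
    """Inverse permutation of p."""
    inv = [0] * len(p)
    for point, image in enumerate(p):
        inv[image] = point
    return tuple(inv)

def evaluate_word(word, n):
    """Multiply out a word (list of permutations on n points), left to right."""
    return reduce(compose, word, tuple(range(n)))

def is_identity_word(word, n):
    """Word problem: does the word evaluate to the identity permutation?"""
    return evaluate_word(word, n) == tuple(range(n))

a = (1, 2, 0, 3, 4)  # 3-cycle on points 0, 1, 2
b = (0, 1, 3, 4, 2)  # 3-cycle on points 2, 3, 4
commutator = [a, b, inverse(a), inverse(b)]

print(is_identity_word([a, a, a], 5))   # a 3-cycle cubed is the identity
print(is_identity_word(commutator, 5))  # a and b do not commute
```

The loop above multiplies the word one element at a time, so its depth grows with the word length. The abstract's argument is that a constant-depth model has no comparable way to track such compositions as they get longer.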

2601.01496 2026-05-28 cs.GT cs.AI cs.LG 版本更新

The Optimal Sample Complexity of Linear Contracts

线性合约的最优样本复杂度

Mikael Møller Høgsgaard

发表机构 * Department of Statistics, University of Oxford, United Kingdom(英国牛津大学统计系) Department of Computer Science, Aarhus University, Denmark(丹麦奥胡斯大学计算机科学系)

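AI summary: Shows that Empirical Utility Maximization needs only O(ln(1/δ)/ε²) samples to obtain an ε-approximation of the optimal linear contract, matching the lower bound and thereby establishing the optimal sample complexity.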

详情
AI中文摘要

在本文中,我们解决了离线环境下从数据中学习最优线性合约的问题,其中代理人类型来自未知分布,委托人的目标是设计一个最大化其期望效用的合约。具体来说,我们的分析表明,简单的经验效用最大化(EUM)算法仅需 $O(\ln(1/\delta) / \varepsilon^2)$ 个样本,就能以至少 $1-\delta$ 的概率得到最优线性合约的 $\varepsilon$-近似。这一结果改进了先前已知的界限,并在常数因子内匹配了 Dütting 等人 2025 年的下界,从而证明了其最优性。此外,我们的结果建立了更强的一致收敛保证:每个线性合约的经验效用以其真实期望的 $\varepsilon$-近似成立的概率至少为 $1-\delta$,且使用了相同的最优 $O(\ln(1/\delta) / \varepsilon^2)$ 样本复杂度。

英文摘要

In this paper, we settle the problem of learning optimal linear contracts from data in the offline setting, where agent types are drawn from an unknown distribution and the principal's goal is to design a contract that maximizes her expected utility. Specifically, our analysis shows that the simple Empirical Utility Maximization (EUM) algorithm yields an $\varepsilon$-approximation of the optimal linear contract with probability at least $1-\delta$, using just $O(\ln(1/\delta) / \varepsilon^2)$ samples. This result improves upon previously known bounds and matches a lower bound from Dütting et al. 2025 up to constant factors, thereby proving its optimality. Furthermore, our result establishes the stronger guarantee of uniform convergence: the empirical utility of every linear contract is an $\varepsilon$-approximation of its true expectation with probability at least $1-\delta$, using the same optimal $O(\ln(1/\delta) / \varepsilon^2)$ sample complexity.
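
Empirical Utility Maximization can be sketched under a common textbook model of linear contracts, which is assumed here rather than taken from the paper. An agent type is a list of actions, each with an expected reward and a cost. Under a contract paying the share alpha of the reward, the agent picks the action maximizing alpha * reward - cost, with ties broken in the principal's favor, and the principal keeps (1 - alpha) * reward. The grid search over alpha and the toy types are also illustrative choices.

```python
def agent_best_response(alpha, actions):
    """Action (reward, cost) maximizing the agent's utility alpha*reward - cost.
    Ties are broken towards the higher reward, i.e. in the principal's favor."""
    return max(actions, key=lambda a: (alpha * a[0] - a[1], a[0]))

def principal_utility(alpha, actions):
    """Principal's share of the reward produced by the agent's chosen action."""
    reward, _cost = agent_best_response(alpha, actions)
    return (1 - alpha) * reward

def empirical_utility(alpha, samples):
    """Average principal utility of contract alpha over the sampled types."""
    return sum(principal_utility(alpha, t) for t in samples) / len(samples)

def eum(samples, grid_size=101):
    """Return the grid contract with the highest empirical utility."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda alpha: empirical_utility(alpha, samples))

# Two hypothetical agent types; each can shirk (0 reward, 0 cost) or work.
cheap_worker = [(0.0, 0.0), (1.0, 0.25)]
costly_worker = [(0.0, 0.0), (1.0, 0.5)]
samples = [cheap_worker, cheap_worker, cheap_worker, costly_worker]

best_alpha = eum(samples)
print(best_alpha, empirical_utility(best_alpha, samples))
```

On this sample the share 0.25 is just enough to make the three cheap workers exert effort, and it beats the share 0.5 that would also motivate the costly worker. The paper's result concerns how many samples such a procedure needs, which the sketch does not attempt to demonstrate.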

2512.23959 2026-05-28 cs.CL cs.AI cs.LG 版本更新

HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling

HGMem:基于超图的工作记忆以改进长上下文复杂关系建模的多步RAG

Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu

Affiliations * The Chinese University of Hong Kong; Pengcheng Laboratory; WeChat AI, Tencent; University of Chinese Academy of Sciences

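AI summary: Proposes HGMem, a hypergraph-based working memory system that represents memory units as hyperedges and progressively forms high-order interactions, strengthening global understanding and complex reasoning in multi-step RAG.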

Comments ICML 2026; Code released at https://github.com/Encyclomen/HGMem

详情
AI中文摘要

多步检索增强生成(RAG)已成为增强大型语言模型(LLMs)在需要全局理解和密集推理任务上的广泛采用策略。尽管许多RAG系统整合了工作记忆来整合信息,但现有设计主要作为孤立事实的被动存储。这种静态特性忽略了原始事实之间的关键高阶相关性,从而限制了模型的多步推理能力,导致在扩展上下文中的碎片化推理和弱全局理解。我们引入了HGMem,一种基于超图的工作记忆系统,将记忆的概念从简单存储扩展到动态、表达性结构,用于复杂推理和全局理解。在我们的方法中,记忆被表示为超图,其中超边对应不同的记忆单元,使得记忆内高阶交互的逐步形成成为可能。该机制连接围绕焦点问题的事实和思考,将记忆演变为一个集成且情境化的知识结构,为更深层次的推理提供强有力的命题。我们在几个具有挑战性的全局理解基准上评估了HGMem。大量实验和深入分析表明,我们的方法持续改进了多步RAG,并在不同数据集上显著优于强基线系统。

英文摘要

Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Although many RAG systems incorporate a working memory to consolidate information, existing designs primarily function as a passive storage for isolated facts. This static nature overlooks crucial high-order correlations among primitive facts, thereby limiting models' capacity for multi-step reasoning and resulting in fragmented reasoning and weak global sense-making within extended contexts. We introduce HGMem, a hypergraph-based working memory system, extending the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph where hyperedges correspond to distinct memory units, enabling the progressive formation of high-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving the memory into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning. We evaluate HGMem on several challenging global sense-making benchmarks. Extensive experiments and in-depth analyses demonstrate that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse datasets.
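
The data structure described above, a hypergraph whose hyperedges are memory units linking several facts, can be sketched minimally as follows. The class, method names, and merge behavior are hypothetical and only illustrate the idea of forming a higher-order unit from existing ones. The released HGMem code may be organized quite differently.

```python
class HypergraphMemory:
    """Toy working memory: facts are nodes, memory units are hyperedges."""

    def __init__(self):
        self.facts = {}   # fact_id -> fact text
        self.units = {}   # unit_id -> (frozenset of fact_ids, note)
        self._next_id = 0

    def add_fact(self, fact_id, text):
        self.facts[fact_id] = text

    def add_unit(self, fact_ids, note):
        """Create a hyperedge over any number of known facts."""
        missing = [f for f in fact_ids if f not in self.facts]
        if missing:
            raise KeyError(f"unknown facts: {missing}")
        unit_id = self._next_id
        self._next_id += 1
        self.units[unit_id] = (frozenset(fact_ids), note)
        return unit_id

    def merge_units(self, unit_a, unit_b, note):
        """Form a higher-order unit spanning the facts of two existing units."""
        facts_a, _ = self.units[unit_a]
        facts_b, _ = self.units[unit_b]
        return self.add_unit(facts_a | facts_b, note)

    def units_touching(self, fact_id):
        """All memory units whose hyperedge contains the given fact."""
        return [uid for uid, (facts, _) in self.units.items() if fact_id in facts]

memory = HypergraphMemory()
for fid, text in [("f1", "A funds B"), ("f2", "B owns C"), ("f3", "C sued A")]:
    memory.add_fact(fid, text)
u1 = memory.add_unit(["f1", "f2"], "A indirectly backs C")
u2 = memory.add_unit(["f2", "f3"], "C is in conflict with its backer's funder")
u3 = memory.merge_units(u1, u2, "A is being sued by a company it indirectly funds")
print(sorted(memory.units[u3][0]), memory.units_touching("f2"))
```

Unlike an ordinary graph edge, a single hyperedge can tie together three or more facts at once, which is what lets a merged unit express a relation that none of the pairwise links states on its own.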

2512.22777 2026-05-28 cs.LG cs.AI 版本更新

Adapting, Fast and Slow: On Few-Shot Transportability of Compositions

适应,快与慢:关于组合的少样本可迁移性

Kasra Jalaldoust, Elias Bareinboim

发表机构 * Causal Artificial Intelligence Lab(因果人工智能实验室)

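AI summary: Studies few-shot settings in which causal mechanisms learned from source domains are composed, via causal transportability theory, into a predictor for the target domain; distinguishes module transportability from circuit transportability and proposes a gradient-based relaxation of circuit search, examining fast and slow adaptation regimes.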

详情
AI中文摘要

跨域泛化需要连接源分布和目标分布的稳定结构。基于因果传输性理论,我们研究了一个序列预测设置,其中目标预测器可以表示为从源数据可学习的因果机制组成的电路。我们引入了两类传输性。模块传输性捕获原子情况,其中目标预测器由可从单个源域学习的机制给出。电路传输性将此思想推广到通过组合从源数据学习的多个模块获得的目标预测器,即使没有源机制直接预测目标标签,也能实现零样本预测。我们在逐渐放松的假设下研究这些电路类别。首先,我们提供了条件,在这些条件下,给定关于源域和目标域的因果知识,可以从源数据单独学习相关电路。然后,我们通过允许来自目标域的有限数据来放松这些结构假设。特别地,我们开发了一种监督域适应方案,该方案无需显式因果结构即可学习电路。由此产生的少样本保证将可实现误差与可从源数据学习的模块组成的最小目标电路的大小联系起来。最后,我们提出了符号电路搜索的基于梯度的松弛,并进行了实证评估,表明它定性地跟踪了预测的快速适应机制——有和没有中间位置的过程监督——以及当没有源机制匹配时的慢速适应。

英文摘要

Generalization across domains requires stable structure that links the source and target distributions. Building on causal transportability theory, we study a sequential prediction setting in which the target predictor can be represented as a circuit composed of causal mechanisms that are learnable from source data. We introduce two classes of transportability. Module transportability captures the atomic case, where the target predictor is given by a mechanism learnable from a single source domain. Circuit transportability generalizes this idea to target predictors obtained by composing several modules learned from source data, enabling zero-shot prediction even when no source mechanism directly predicts the target label. We study these classes of circuits under increasingly relaxed assumptions. First, we provide conditions under which the relevant circuits can be learned from source data alone, given causal knowledge about the source and target domains. We then relax these structural assumptions by allowing limited data from the target domain. In particular, we develop a supervised domain adaptation scheme that learns circuits without requiring explicit causal structure. The resulting few-shot guarantees tie the achievable error to the size of the smallest target circuit composable from modules learned from source data. Finally, we propose a gradient-based relaxation of the symbolic circuit search and evaluate it empirically, showing that it qualitatively tracks the predicted regimes of fast adaptation -- with and without process supervision over intermediate positions -- and slow adaptation when no source mechanism matches.

2501.09934 2026-05-28 cs.LG cs.AI 版本更新

HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning

HEART:实现车辆-边缘-云集成分层联邦学习的多模型及时训练

Xiaohong Yang, Minghui Liwang, Xianbin Wang, Zhipeng Cheng, Seyyedali Hosseinalipour, Huaiyu Dai, Zhenzhen Jiao

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) Department of Control Science and Engineering, Shanghai Institute of Intelligent Science and Technology(上海智能科学研究院控制科学与工程系) State Key Laboratory of Autonomous Intelligent Unmanned Systems(自主智能无人系统国家重点实验室) Shanghai Key Laboratory of Intelligent Autonomous Systems(上海智能自主系统重点实验室) Frontiers Science Center for Intelligent Autonomous Systems, Ministry of Education, Tongji University(教育部智能自主系统前沿科学中心,同济大学) Department of Electrical and Computer Engineering, Western University(西方大学电气与计算机工程系) School of Future Science and Engineering, Soochow University(苏州大学未来科学与工程学院)

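AI summary: To address model obsolescence, inefficient data utilization, and unbalanced resource allocation in multi-model training for vehicle-edge-cloud hierarchical federated learning, proposes the HEART framework, which uses a hybrid synchronous-asynchronous aggregation rule and a two-stage optimization (improved PSO combined with GA, then a greedy algorithm) to minimize global training latency while balancing tasks.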

Comments Accepted by IEEE Transactions on Cloud Computing (22 pages, 7 figures)

详情
AI中文摘要

人工智能赋能的物联网车辆(IoV)的快速发展需要高效的机器学习(ML)解决方案,以处理高车辆移动性和分散数据。这推动了车辆-边缘-云架构上的分层联邦学习(VEC-HFL)的出现。然而,VEC-HFL文献中尚未充分探讨的一个方面是,车辆通常需要同时执行多个ML任务,这种多模型训练环境带来了关键挑战。首先,不恰当的聚合规则可能导致模型过时和训练时间延长。其次,车辆移动性可能阻止车辆将模型返回网络边缘,导致数据利用效率低下。第三,跨不同任务实现平衡的资源分配变得至关重要,因为它极大地影响协作训练的有效性。我们率先提出一个针对动态VEC-HFL中多模型训练的框架,目标是最小化全局训练延迟,同时确保跨各种任务的平衡训练,该问题被证明是NP难的。为了促进及时模型训练,我们引入了一种混合同步-异步聚合规则。在此基础上,我们提出了一种称为混合进化与贪婪分配(HEART)的新方法。该框架分两个阶段运行:首先,通过结合改进的粒子群优化(PSO)和遗传算法(GA)的混合启发式方法实现平衡的任务调度;其次,采用低复杂度的贪心算法确定车辆上分配任务的训练优先级。在真实数据集上的实验证明了HEART相对于现有方法的优越性。

英文摘要

The rapid growth of AI-enabled Internet of Vehicles (IoV) calls for efficient Machine Learning (ML) solutions that can handle high vehicular mobility and decentralized data. This has motivated the emergence of Hierarchical Federated Learning over vehicle-edge-cloud architectures (VEC-HFL). Nevertheless, one aspect that remains underexplored in the VEC-HFL literature is that vehicles often need to execute multiple ML tasks simultaneously, and this multi-model training environment introduces crucial challenges. First, improper aggregation rules can lead to model obsolescence and prolonged training times. Second, vehicular mobility may result in inefficient data utilization by preventing the vehicles from returning their models to the network edge. Third, achieving a balanced resource allocation across diverse tasks becomes paramount, as it strongly affects the effectiveness of collaborative training. We take one of the first steps towards addressing these challenges by proposing a framework for multi-model training in dynamic VEC-HFL with the goal of minimizing global training latency while ensuring balanced training across various tasks, a problem that turns out to be NP-hard. To facilitate timely model training, we introduce a hybrid synchronous-asynchronous aggregation rule. Building on this, we present a novel method called Hybrid Evolutionary And gReedy allocaTion (HEART). The framework operates in two stages: first, it achieves balanced task scheduling through a hybrid heuristic approach that combines improved Particle Swarm Optimization (PSO) and Genetic Algorithms (GA); second, it employs a low-complexity greedy algorithm to determine the training priority of assigned tasks on vehicles. Experiments on real-world datasets demonstrate the superiority of HEART over existing methods.

2501.06491 2026-05-28 cs.SE cs.AI cs.SY eess.SY 版本更新

Improving Requirements Classification with SMOTE-Tomek Preprocessing

使用SMOTE-Tomek预处理改进需求分类

Barak Or

发表机构 * ArtificialGate Ltd.(ArtificialGate有限公司)

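AI summary: Addresses class imbalance in the PROMISE dataset with SMOTE-Tomek preprocessing combined with stratified K-fold cross-validation, notably improving requirements classification accuracy; logistic regression reaches 76.16%.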

Comments 21 pages, 5 figures, Preprint

详情
AI中文摘要

本研究通过应用SMOTE-Tomek预处理技术,结合分层K折交叉验证,解决PROMISE数据集中类别不平衡问题,强调需求工程领域。该数据集包含969个分类需求,分为功能性和非功能性类型。所提出的方法在保持验证折完整性的同时,增强了少数类的表示,从而显著提高了分类准确率。逻辑回归达到了76.16%,大幅超过基线58.31%。这些结果凸显了机器学习模型作为可扩展且可解释解决方案的适用性和效率。

英文摘要

This study contributes to requirements engineering by applying the SMOTE-Tomek preprocessing technique, combined with stratified K-fold cross-validation, to address class imbalance in the PROMISE dataset. This dataset comprises 969 categorized requirements, classified into functional and non-functional types. The proposed approach enhances the representation of minority classes while maintaining the integrity of validation folds, leading to a notable improvement in classification accuracy. Logistic regression achieved 76.16% accuracy, significantly surpassing the baseline of 58.31%. These results highlight the applicability and efficiency of machine learning models as scalable and interpretable solutions.
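
The two steps behind SMOTE-Tomek can be shown with a simplified, standard-library sketch: SMOTE synthesizes minority points by interpolating between a minority sample and a minority neighbor, and Tomek-link cleaning then removes pairs of mutual nearest neighbors that carry different labels. This is not the paper's pipeline. It uses a single nearest neighbor instead of the usual k, removes both points of each link (one common choice), works on binary labels, and uses made-up 2D points. A maintained implementation is available as `SMOTETomek` in the imbalanced-learn package.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, rng):
    """Create n_new points on segments between a minority sample and its
    nearest minority neighbor (needs at least two minority samples)."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbor = min((m for m in minority if m is not x),
                       key=lambda m: dist2(x, m))
        lam = rng.random()
        synthetic.append(tuple(xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)))
    return synthetic

def tomek_links(X, y):
    """Index pairs that are mutual nearest neighbors with different labels."""
    nearest = [min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist2(X[i], X[j])) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nearest)
            if i < j and nearest[j] == i and y[i] != y[j]]

def smote_tomek(X, y, minority_label, rng):
    """Oversample the minority class to parity, then drop Tomek links."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(X) - len(minority)) - len(minority)
    X_res = list(X) + smote(minority, n_new, rng)
    y_res = list(y) + [minority_label] * n_new
    drop = {k for pair in tomek_links(X_res, y_res) for k in pair}
    keep = [k for k in range(len(X_res)) if k not in drop]
    return [X_res[k] for k in keep], [y_res[k] for k in keep]

X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (5, 5), (6, 6)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = smote_tomek(X, y, minority_label=1, rng=random.Random(0))
print(y_res.count(0), y_res.count(1))
```

To keep validation folds intact, as the abstract stresses, resampling of this kind would be applied to the training portion of each stratified fold only, never to the held-out portion.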

2512.18444 2026-05-28 cs.GT cs.AI cs.DC cs.MA 版本更新

Snowveil: A Framework for Decentralised Preference Discovery

Snowveil: 一种去中心化偏好发现的框架

Grammateia Kotsialou

发表机构 * King’s College London(伦敦国王学院)

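AI summary: For decentralised preference discovery, proposes Snowveil, a gossip-based framework that uses random sampling and local belief updates to converge on a social choice parameter with tunable high probability in finite expected time; also introduces the Constrained Hybrid Borda rule to balance broad consensus with plurality support.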

详情
AI中文摘要

在传统社会选择中,聚合主观偏好通常假设存在一个可信的中心权威。相反,本文形式化了去中心化偏好发现(DPD):在部分信息、异步交互、抗审查且无中心协调者的条件下,可靠地识别社会选择参数(例如,应用于全局偏好配置的聚合规则的规范结果)。为了解决DPD,我们提出了Snowveil,一个基于八卦的框架,其中智能体重复采样随机同伴排名并更新局部信念,以收敛到规范结果。利用势函数、亚鞅理论和集中界,我们证明了系统以可调的高概率在有限期望时间内达到该稳定状态。然后可以迭代这一单胜者过程,以构建多胜者场景中的一组获胜候选者。Snowveil对特定聚合规则不可知,仅要求规则满足如正向响应等公理,从而为更广泛的DPD协议提供了形式基础。为了展示Snowveil的模块化,我们引入了约束混合博尔达(CHB),一种旨在平衡广泛共识与多数支持的聚合规则。我们提供了CHB的公理分析,并通过大量模拟展示了实证结果,验证了Snowveil的O(n)可扩展性。总体而言,这项工作为大规模去中心化系统中如何从主观、表达性和多样化的偏好配置中涌现稳定共识奠定了基础。

英文摘要

Aggregating subjective preferences in social choice traditionally assumes a trusted central authority. In contrast, this paper formalises Decentralised Preference Discovery (DPD): the reliable identification of a social choice parameter (e.g. the canonical outcome of an aggregation rule applied to the global preference profile) under conditions of partial information, asynchronous interaction, censorship resistance, and no central coordinator. To address DPD, we propose Snowveil, a gossip-based framework where agents repeatedly sample random peer rankings and update local beliefs to converge on the canonical outcome. Using a potential function, submartingale theory, and concentration bounds, we prove the system reaches this stable state with tunable high probability, in finite expected time. This single-winner process can then be iterated to construct a set of winning candidates for multi-winner scenarios. Snowveil is agnostic to specific aggregation rules, requiring only that the rule satisfies axioms such as Positive Responsiveness, thus offering a formal basis for a wider class of DPD protocols. Demonstrating Snowveil's modularity, we introduce the Constrained Hybrid Borda (CHB), an aggregation rule designed to balance broad consensus with plurality support. We provide an axiomatic analysis of CHB and present empirical results via extensive simulation, validating Snowveil's O(n) scalability. Overall, this work provides a foundation for how a stable consensus emerges from subjective, expressive, and diverse preference profiles in large-scale decentralised systems.
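
The tension between broad consensus and plurality support that motivates the Constrained Hybrid Borda rule can be seen by comparing the two classical rules it draws on. The sketch below implements only standard Borda and plurality scoring on a made-up preference profile; CHB itself is defined in the paper and is not reproduced here.

```python
from collections import Counter

def borda_scores(rankings):
    """Standard Borda count: with m candidates, the candidate in position i
    (0-based) of a ranking receives m - 1 - i points."""
    scores = Counter()
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - position
    return scores

def plurality_scores(rankings):
    """Plurality: one point for each first-place position."""
    return Counter(ranking[0] for ranking in rankings)

# Hypothetical profile of seven voters over candidates A, B, C.
profile = (
    [["A", "B", "C"]] * 3
    + [["B", "C", "A"]] * 2
    + [["C", "B", "A"]] * 2
)

borda = borda_scores(profile)
plurality = plurality_scores(profile)
print(dict(borda), dict(plurality))
print(max(borda, key=borda.get), max(plurality, key=plurality.get))
```

Here A has the most first-place votes, while B is ranked first or second by every voter and therefore wins under Borda. A hybrid rule has to decide how much weight each kind of support receives.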

2307.06240 2026-05-28 cs.LG cs.AI cs.RO cs.SY eess.SY 版本更新

DSSE: a drone swarm search environment

DSSE:无人机群搜索环境

Manuel Castanares, Luis F. S. Carrete, Enrico F. Damiani, Leonardo D. M. de Abreu, José Fernando B. Brancalion, Fabrício J. Barth

发表机构 * Insper Embraer

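AI summary: A PettingZoo-based multi-agent reinforcement learning environment in which drones search for targets using dynamic probability inputs.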

Comments 7 pages

详情
AI中文摘要

无人机群搜索项目是一个基于PettingZoo的环境,用于多智能体(或单智能体)强化学习算法。在该环境中,智能体(无人机)必须找到目标(海难人员)。智能体不知道目标的位置,也不接收与自身到目标距离相关的奖励。然而,智能体会收到目标位于地图某个单元格的概率。该项目旨在辅助研究需要动态概率作为输入的强化学习算法。描述该软件第二版的同行评审论文已发表在JOSS上:https://doi.org/10.21105/joss.06746。

英文摘要

The Drone Swarm Search project is an environment, based on PettingZoo, intended for use with multi-agent (or single-agent) reinforcement learning algorithms. In this environment, the agents (drones) have to find the targets (shipwrecked people). The agents do not know the position of the target(s) and do not receive rewards related to their own distance to the target(s). However, the agents receive the probabilities of the target(s) being in a given cell of the map. The aim of this project is to aid in the study of reinforcement learning algorithms that require dynamic probabilities as inputs. A peer-reviewed paper describing version 2 of this software has been published in JOSS: https://doi.org/10.21105/joss.06746.

2508.13544 2026-05-28 cs.CV cs.AI 版本更新

FLAIR: Frequency- and Locality-Aware Implicit Neural Representations

FLAIR: 频率与位置感知的隐式神经表示

Sukhun Ko, Seokhyun Youn, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh

AI总结 针对隐式神经表示缺乏频率选择性和空间定位导致频谱偏差的问题,提出带限局部激活和小波能量引导编码,提升2D图像表示、3D形状重建和新视角合成性能。

Comments CVPR Findings 2026 (camera ready ver.). Please visit our project page at https://cmlab-korea.github.io/FLAIR/

详情
AI中文摘要

隐式神经表示利用神经网络将坐标映射到对应信号,实现连续且紧凑的表示。该范式推动了各种视觉任务的重大进展。然而,现有的隐式神经表示缺乏频率选择性和空间定位,导致过度依赖冗余信号分量。因此,它们表现出频谱偏差,倾向于早期学习低频分量,而难以捕捉精细的高频细节。为了解决这些问题,我们提出了FLAIR(频率与位置感知的隐式神经表示),它包含两个关键创新。第一个是带限局部激活(BLA),这是一种新颖的激活函数,设计用于在时频不确定性原理(TFUP)约束下进行联合频率选择和空间定位。通过结构化的频率控制和空间局部响应,BLA有效减轻了频谱偏差并增强了训练稳定性。第二个是小波能量引导编码(WEGE),它利用离散小波变换计算能量分数,并显式地将频率信息引导到网络,实现精确的频率选择和自适应频带控制。我们的方法在2D图像表示、3D形状重建和新视角合成方面始终优于现有的隐式神经表示。

英文摘要

Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity and spatial localization, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is Band-Localized Activation (BLA), a novel activation designed for joint frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). Through structured frequency control and spatially localized responses, BLA effectively mitigates spectral bias and enhances training stability. The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform to compute energy scores and explicitly guide frequency information to the network, enabling precise frequency selection and adaptive band control. Our method consistently outperforms existing INRs in 2D image representation, as well as 3D shape reconstruction and novel view synthesis.

2512.06797 2026-05-28 math.OC cs.AI cs.LG stat.ML 版本更新

Optimal and Diffusion Transports in Machine Learning

机器学习中的最优输运与扩散输运

Gabriel Peyré

发表机构 * CNRS and ENS, PSL Université(法国国家科学研究中心与巴黎文理研究大学巴黎高等师范学院)

AI总结 本文综述了机器学习中扩散方法和最优输运两种输运方法,它们通过拉格朗日视角设计概率分布演化,应用于采样、神经网络优化和大语言模型动力学建模。

Comments Proc. 2026 International Congress of Mathematicians

详情
AI中文摘要

机器学习中的若干问题自然地表述为随时间演化的概率分布的设计与分析。这包括通过扩散方法进行采样、优化神经网络的权重,以及分析大语言模型各层中令牌分布的演化。尽管目标应用不同(样本、权重、令牌),它们的数学描述共享一个共同结构。一个关键思想是通过平流粒子的向量场,从密度的欧拉表示转换到其拉格朗日对应。这种双重观点带来了挑战,特别是拉格朗日向量场的非唯一性,但也提供了机会,以构造在正则性、稳定性和计算可行性方面具有有利性质的密度演化和流。本综述概述了这些方法,重点介绍两种互补方法:扩散方法,它依赖于随机插值过程并支撑现代生成式AI;以及最优输运,它通过最小化位移成本来定义插值。我们说明了这两种方法如何出现在从采样、神经网络优化到建模大语言模型Transformer动力学的应用中。

英文摘要

Several problems in machine learning are naturally expressed as the design and analysis of time-evolving probability distributions. This includes sampling via diffusion methods, optimizing the weights of neural networks, and analyzing the evolution of token distributions across layers of large language models. While the targeted applications differ (samples, weights, tokens), their mathematical descriptions share a common structure. A key idea is to switch from the Eulerian representation of densities to their Lagrangian counterpart through vector fields that advect particles. This dual view introduces challenges, notably the non-uniqueness of Lagrangian vector fields, but also opportunities to craft density evolutions and flows with favorable properties in terms of regularity, stability, and computational tractability. This survey presents an overview of these methods, with emphasis on two complementary approaches: diffusion methods, which rely on stochastic interpolation processes and underpin modern generative AI, and optimal transport, which defines interpolation by minimizing displacement cost. We illustrate how both approaches appear in applications ranging from sampling and neural network optimization to modeling the dynamics of transformers for large language models.
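
The Eulerian and Lagrangian views mentioned in this abstract can be written down in standard textbook form. The notation below (density \rho_t, velocity field v_t, perturbation w_t) is assumed for illustration and is not taken from the survey itself.

```latex
% Eulerian view: the density \rho_t evolves under the continuity equation
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0 .

% Lagrangian view: particles are advected by the same vector field
\frac{\mathrm{d}}{\mathrm{d}t} x(t) = v_t\bigl(x(t)\bigr), \qquad x(0) \sim \rho_0 .

% Non-uniqueness: any perturbation w_t with \nabla \cdot (\rho_t w_t) = 0
% yields a field v_t + w_t that produces the same density path \rho_t.
```

If x(0) is drawn from \rho_0 and follows the ODE, then x(t) is distributed according to \rho_t, which is why the two descriptions are interchangeable at the level of densities.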

2506.10138 2026-05-28 cs.LG cs.AI 版本更新

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

路径通道与计划扩展核:Sokoban RNN中规划的机制描述

Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

发表机构 * FAR.AI Berkeley(FAR.AI伯克利)

AI总结 通过逆向工程分析无模型强化学习训练的卷积循环神经网络,发现其通过路径通道存储未来动作计划,并利用卷积核实现双向传播与回溯规划。

Comments Published as a conference paper at ICLR 2026. 34 pages, 26 figures

详情
AI中文摘要

我们部分逆向工程了一个使用无模型强化学习训练来玩推箱子游戏Sokoban的卷积循环神经网络(RNN)。我们发现,RNN将未来动作(计划)存储为隐藏状态特定通道中的激活,我们称之为路径通道。特定位置的高激活意味着,当箱子在该位置时,它将被推向通道指定的方向。我们检查了路径通道之间的卷积核,发现它们编码了每个可能动作导致的位置变化,从而代表了部分学习到的转移模型。RNN通过从箱子和目标开始构建计划。这些核将路径通道中的激活从箱子向前传播,并从目标向后传播。负值被放置在障碍物处的通道中。这导致扩展核反向传播负值,从而修剪最后几步,让替代计划出现;这是一种回溯形式。我们的工作表明,对计划表示的精确理解使我们能够用更熟悉的术语直接理解无模型训练学到的双向规划类算法。

英文摘要

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge, which is a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.

2512.02019 2026-05-28 cs.LG cs.AI stat.ML 版本更新

Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning

扩散增强马尔可夫决策过程用于最大熵强化学习

Sebastian Sanokowski, Kaustubh Patil

发表机构 * Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University Munich(慕尼黑工业大学慕尼黑机器人与机器智能研究所(MIRMI)) Practical Project Student Exchange Program, Technical University Munich(慕尼黑工业大学实践项目学生交换计划) MIT World Peace University(MIT世界和平大学)

AI总结 本文通过将最大熵强化学习扩展到扩散过程,提出扩散增强马尔可夫决策过程(DA-MDPs),以最小化反向KL散度的上界来学习最优策略轨迹分布,并成功将PPO、WPO和REPPO适配为扩散变体,在连续控制和多模态基准上取得与基线相当或更优的性能。

Comments Preprint

详情
AI中文摘要

扩散模型擅长从复杂的非归一化分布中采样。在这项工作中,我们将最大熵强化学习(ME-RL)扩展到扩散过程,从而能够从最优策略轨迹分布中采样。通过最小化扩散策略与最优策略轨迹分布之间的反向KL散度的可处理上界,我们推导出一个修改后的替代目标,并引入了扩散增强马尔可夫决策过程(DA-MDPs)。DA-MDPs允许将扩散策略无缝集成到任何ME-RL方法中,只需最小的修改。我们通过将近端策略优化(PPO)、Wasserstein策略优化(WPO)和相对熵路径策略优化(REPPO)适配为其基于扩散的变体:DA-MDP: PPO、DA-MDP: WPO和DA-MDP: REPPO,证明了其有效性。在标准连续控制基准上的实验结果表明,我们的方法匹配或优于基线方法,而在多模态基准上的实验证实了其建模多模态动作分布的能力。

英文摘要

Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajectory distribution. By minimizing a tractable upper bound on the reverse KL divergence between the diffusion policy and the optimal policy trajectory distributions, we derive a modified surrogate objective and introduce Diffusion-Augmented Markov Decision Processes (DA-MDPs). DA-MDPs allow for seamless integration of diffusion policies into any ME-RL method with minimal modifications. We demonstrate its effectiveness by adapting Proximal Policy Optimization (PPO), Wasserstein Policy Optimization (WPO), and Relative Entropy Pathwise Policy Optimization (REPPO) into their diffusion-based variants: DA-MDP: PPO, DA-MDP: WPO, and DA-MDP: REPPO. Empirical results on standard continuous-control benchmarks show that our approach matches or outperforms baseline methods, while experiments on multimodal benchmarks confirm its ability to model multimodal action distributions.

2512.01970 2026-05-28 cs.AI cs.CL 版本更新

Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies

原子技能是前提:当强化学习合成组合推理时,以及当它仅放大时

Sitao Cheng, Xunjian Yin, Ruiwen Zhou, Yuxuan Li, Xinyi Wang, Liangming Pan, William Yang Wang, Victor Zhong

发表机构 * University of Waterloo(滑铁卢大学) Duke University(杜克大学) National University of Singapore(新加坡国立大学) Princeton University(普林斯顿大学) Peking University(北京大学) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 通过互补推理任务,研究强化学习是合成新技能还是仅放大已有技能,发现强化学习在基础模型通过监督微调掌握独立原子技能后才能合成新组合策略。

Comments Work in Progress. Code and data are available at https://github.com/sitaocheng/from_atomic_to_composite

详情
AI中文摘要

强化学习(RL)仅仅是放大现有技能,还是合成新技能?我们通过互补推理的视角研究这个问题:互补推理是整合内部知识与外部上下文的关键实践能力,是可靠的持续学习和检索增强生成的前提。为了避免预训练污染,我们构建了一个受控的语义合成传记数据集,并将这种能力分解为两个原子技能:参数推理(检索模型权重中编码的事实)和上下文推理(处理新的上下文信息)。我们有两个发现。首先,直接在复合任务上监督训练的模型在已知事实和推理路径上达到高准确率(90%),但在新事实和推理路径上崩溃(18%),表明监督微调(SFT)依赖于死记硬背而非真正的技能整合。其次,RL弥合了这一泛化差距,充当技能合成器而非仅仅是放大器——但只有在严格的前提条件下:只有当基础模型首先通过SFT掌握了独立的原子技能时,它才能合成新的组合策略。这些结果表明,解耦的原子训练后接RL为复杂的新推理提供了一条可扩展的路径。

英文摘要

Does Reinforcement Learning (RL) merely amplify existing skills, or synthesize novel skills? We investigate this question through the lens of Complementary Reasoning: the critical practical capability of integrating internal knowledge with external context, a prerequisite for reliable Continual Learning and Retrieval-Augmented Generation. To avoid pre-training contamination, we construct a controlled semantic synthetic dataset of biographies and decompose this capability into two atomic skills: Parametric Reasoning (retrieving facts encoded in model weights) and Contextual Reasoning (processing novel in-context information). We present two findings. First, models supervised directly on the composite task reach high accuracy on seen facts and reasoning paths (90%) but collapse on novel facts and reasoning paths (18%), indicating that Supervised Fine-Tuning (SFT) relies on rote memorization rather than genuine skill integration. Second, RL bridges this generalization gap, acting as a skill synthesizer rather than a mere amplifier, but only under a strict prerequisite: it synthesizes new composite strategies only when the base model has first mastered the independent atomic skills via SFT. These results suggest that decoupled atomic training followed by RL offers a scalable path to complex novel reasoning.

2511.20934 2026-05-28 cs.AI cs.CV cs.LG 版本更新

Guaranteed Optimal Compositional Explanations for Neurons

神经元的保证最优组合解释

Biagio La Rosa, Leilani H. Gilpin

发表机构 * Computer Science and Engineering Department, University of California, Santa Cruz, US(加州大学圣克鲁兹分校计算机科学与工程系)

AI总结 提出首个框架,通过分解、启发式和算法,在完整状态空间上计算保证最优的组合解释,并证明10-40%的波束搜索解释在概念重叠时非最优。

Comments Accepted at ICML 2026 (Oral), 43 pages, 10 figures

详情
AI中文摘要

组合解释是一类方法,旨在通过逻辑规则描述神经元感受野激活与概念之间的空间对齐,通常通过搜索所有可能的概念组合来计算。由于在整个状态空间上计算空间对齐在计算上不可行,文献中通常采用与组合结构相关的假设和波束搜索来限制状态空间。然而,波束搜索无法提供任何最优性的理论保证,且当前解释与真正最优解的接近程度仍不清楚。在这篇理论性论文中,我们通过引入首个框架来解决这一差距,该框架在采用假设所涵盖的整个状态空间上计算保证最优的组合解释。具体而言,我们提出:(i) 一种识别影响空间对齐因素的分解方法,(ii) 一种在搜索任何阶段估计对齐的启发式方法,以及(iii) 第一个能够在与穷举波束搜索相当的时间内计算最优组合解释的算法。使用该框架,我们证明当涉及重叠概念时,先前通过波束搜索获得的10-40%的解释是次优的。最后,我们评估了一种由我们提出的分解和启发式方法引导的波束搜索变体,表明它在超参数和计算资源方面提供更大灵活性的同时,匹配或改进了先前方法的运行时间。

英文摘要

Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations and concepts through logical rules, typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts assumptions related to the structure of the combinations and beam search to restrict the state space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations over the entire state space spanned by the adopted assumptions. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations in a time comparable to exhaustive beam search. Using this framework, we demonstrate that 10-40% of explanations previously obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.

2511.20439 2026-05-28 cs.CV cs.AI 版本更新

Object-Centric Vision Token Pruning for Vision Language Models

面向视觉语言模型的以对象为中心的视觉令牌剪枝

Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

发表机构 * Aalto University(阿尔托大学) University of Electronic Science and Technology of China(电子科技大学) Delft University of Technology(代尔夫特理工大学)

AI总结 提出OC-VTP方法,通过轻量预训练以对象为中心的视觉令牌剪枝器,直接选择最具代表性的视觉令牌,在保持高精度的同时提升VLM推理效率。

详情
AI中文摘要

在视觉语言模型(VLM)中,与语言令牌相比,视觉令牌数量庞大但信息分散,因此消耗了大量不必要的计算。为了提升VLM推理效率,剪枝冗余视觉令牌的研究一直在进行,但现有方法都采用间接且无保证的方式。我们提出了OC-VTP,一种直接且有保证的方法,用于选择最具代表性的视觉令牌,以实现高效且保持精度的VLM推理。我们的OC-VTP仅需对一个小型的以对象为中心的视觉令牌剪枝器进行轻量预训练,然后即可将其插入现有VLM中,无需在任何数据集上微调任何模型。通过最小化从所选令牌重建原始未剪枝令牌的误差,保证保留最具代表性的视觉令牌。在任何视觉剪枝比例(即推理效率)下,我们的OC-VTP都能一致地帮助主流VLM保持最高的推理精度。我们的剪枝还展示了有趣的可解释性。我们的代码可在 https://github.com/GarryLarry010131/OC-VTP 获取。

英文摘要

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume a large amount of unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs without fine-tuning any models on any datasets. By minimizing the error in reconstructing the original unpruned tokens from the selected ones, it is guaranteed that the most representative vision tokens are kept. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.

2511.11896 2026-05-28 cs.CR cs.AI cs.SE 版本更新

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

VULPO:基于策略优化的上下文感知漏洞检测

Youpeng Li, Fuxun Yu, Weiliang Qi, Xinda Wang

发表机构 * University of Texas at Dallas(德克萨斯大学达拉斯分校) Microsoft(微软)

AI总结 提出VULPO框架,通过构建包含上下文信息和推理轨迹的数据集ContextVul,结合冷启动监督微调和自适应策略优化,显著提升大语言模型在真实仓库中的漏洞检测能力。

详情
AI中文摘要

大语言模型(LLM)最近在漏洞检测(VD)中展现出强大潜力。然而,准确检测真实仓库中的漏洞需要推理复杂的上下文交互。现有的基于LLM的VD方法仍然有限,因为当前数据集缺乏完整的上下文信息和高质量的推理监督,而现有的优化方法主要依赖于粗粒度的结果中心监督信号,无法建模漏洞推理过程。为解决这些限制,我们首先构建了ContextVul,这是一个新数据集,用仓库级上下文信息和精心整理的漏洞推理轨迹增强了高质量函数级漏洞基准。基于ContextVul,我们引入了一个两阶段优化框架,包括轻量级冷启动监督微调,随后是漏洞自适应策略优化(VULPO)。VULPO结合了多维奖励,共同评估漏洞识别、漏洞相关定位和因果推理质量,以及难度自适应奖励缩放,以减轻奖励黑客攻击并提高强化学习效果。大量实验证明了VULPO在上下文感知VD中的优越性。我们的VULPO-4B,第一个专门的漏洞推理LLM,显著优于现有的VD基线,相对于Qwen3-4B将Pairwise Pass@1提高了203%,并实现了与规模大150%的LLM DeepSeek-V3.1相竞争的性能。

英文摘要

Large language models (LLMs) have recently shown strong potential in vulnerability detection (VD). However, accurately detecting vulnerabilities in real-world repositories requires reasoning over complex contextual interactions. Existing LLM-based VD approaches remain limited because current datasets lack complete contextual information and high-quality reasoning supervision, while existing optimization methods primarily rely on coarse outcome-centric supervision signals that fail to model the vulnerability reasoning process. To address these limitations, we first construct ContextVul, a new dataset that augments high-quality function-level vulnerability benchmarks with repository-level contextual information and curated vulnerability reasoning traces. Building upon ContextVul, we introduce a two-stage optimization framework consisting of lightweight cold-start supervised fine-tuning followed by vulnerability-adaptive on-policy optimization (VULPO). VULPO incorporates multidimensional rewards that jointly evaluate vulnerability identification, vulnerability-relevant localization, and causal reasoning quality, along with difficulty-adaptive reward scaling to mitigate reward hacking and improve RL effectiveness. Extensive experiments demonstrate the superiority of VULPO for context-aware VD. Our VULPO-4B, the first specialized vulnerability reasoning LLM, substantially outperforms existing VD baselines, improving Pairwise Pass@1 by 203% relative to Qwen3-4B and achieving competitive performance against a 150% larger-scale LLM, DeepSeek-V3.1.

2511.09572 2026-05-28 cs.AI cs.LG cs.SE 版本更新

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

SynthTools: 用于扩展智能体开发中合成工具的框架

Tommaso Castellani, Naimeng Ye, Daksh Mittal, Thomson Yen, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

AI总结 提出基于LLM的端到端管道SynthTools,通过环境生成、模拟、验证和任务构建,生成大规模多样化工具使用环境,提升智能体工具使用能力。

详情
AI中文摘要

为了使智能体系统能够使用外部工具解决复杂、长期的任务,我们需要大量多样且可控的工具使用环境。我们引入了SynthTools,一个完全基于LLM的管道,涵盖整个生命周期:环境生成、模拟、验证和任务构建。通过端到端地使用LLM,我们的框架补充了其他受限于真实API复杂性的工具使用环境,并通过设计确保可扩展性和可控性。该框架由三个组件组成:自上而下的环境生成,分层构建多样化的、基于领域的工具环境;环境模拟与验证,确保工具能够可靠地模拟并过滤掉无法模拟的工具;以及自下而上的任务与轨迹生成,产生可解决且可验证的任务以及多步轨迹,对难度、长度、轨迹组成和领域焦点进行控制以保证灵活性。作为具体实例,我们发布了包含6800个环境和100个领域中的73883个经过验证的工具、79925个可验证任务的数据集,以及大规模生成轨迹的管道。在这些任务生成的轨迹语料库上训练不同规模的Qwen3模型,在多个工具使用基准测试(包括真实API)上取得了提升,表明在合成数据上训练的工具使用能力可能迁移到某些真实环境。这些结果共同表明,SynthTools可以作为大规模训练工具使用智能体的有用基础设施。

英文摘要

For agentic systems to use external tools to solve complex, long-horizon tasks, we need a large set of diverse and controllable tool-use environments. We introduce SynthTools, a fully LLM-based pipeline spanning the entire lifecycle: environment generation, simulation, validation and task construction. By operating end-to-end through LLMs, our framework complements other tool-use environments bottlenecked by the complexity of real APIs, and ensures scalability and controllability by design. The framework consists of three components: top-down environment generation, which hierarchically constructs diverse, domain-grounded tool environments; environment simulation and validation, which ensures tools can be reliably emulated and filters out those that cannot; and bottom-up task and trajectory generation, which produces solvable and verifiable tasks together with multi-step trajectories, exposing control over difficulty, length, trajectory composition, and domain focus to guarantee flexibility. As a concrete instantiation, we release a dataset comprising 73,883 validated tools across 6,800 environments and 100 fields and 79,925 verifiable tasks, as well as the pipeline to generate trajectories at scale. Training Qwen3 models of various sizes on a corpus of trajectories generated from these tasks yields gains across multiple tool-use benchmarks, including real APIs, indicating that tool-use capabilities trained on synthetic data may transfer to some real environments. Together, these results suggest that SynthTools can serve as a useful infrastructure for large-scale training of tool-use agents.

2510.21890 2026-05-28 cs.LG cs.AI cs.GR 版本更新

The Principles of Diffusion Models

扩散模型的原理

Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon

发表机构 * MIT Press(MIT出版社) Sony AI(索尼人工智能) OpenAI(开放人工智能) Stanford University(斯坦福大学) Sony Group Corporation(索尼集团)

AI总结 本文从变分、基于分数和基于流三种视角统一阐述扩散模型的数学原理,并讨论可控生成、高效求解器和流映射模型等扩展。

Comments Supplementary materials for the book are available at the book website: https://the-principles-of-diffusion-models.github.io/

详情
AI中文摘要

本书介绍了指导扩散模型发展的核心原理,追溯其起源,并展示不同公式如何源于共同的数学思想。扩散建模首先定义一个前向过程,该过程逐渐将数据破坏为噪声,通过一系列中间分布将数据分布与简单先验联系起来。目标是学习一个反向过程,将噪声转换回数据,同时恢复相同的中间分布。我们描述了三种互补的观点。受变分自编码器启发的变分观点将扩散视为逐步学习去噪。基于分数的观点植根于基于能量的建模,学习演化数据分布的梯度,指示如何将样本推向更可能的区域。基于流的观点与归一化流相关,将生成视为遵循一条平滑路径,在学习的速度场下将样本从噪声移动到数据。这些视角共享一个共同的主干:一个时间相关的速度场,其流将简单先验传输到数据。采样相当于求解一个微分方程,该方程沿着连续轨迹将噪声演化为数据。在此基础之上,本书讨论了可控生成的引导、高效数值求解器以及扩散驱动的流映射模型(学习任意时间之间的直接映射)。它为具有基本深度学习知识的读者提供了扩散模型的概念性和数学基础理解。

英文摘要

This book presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the book discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.
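
The statement that sampling amounts to solving a differential equation along a continuous trajectory can be illustrated with a toy integrator. This is a minimal sketch under stated assumptions: `euler_sample`, `toy_velocity` and `DATA_POINT` are hypothetical names, and the velocity field is a closed-form stand-in for a one-point dataset, not a learned network or anything taken from the book.

```python
def euler_sample(x0, velocity, num_steps=100):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so toy_velocity never divides by zero
        x = x + dt * velocity(x, t)
    return x


DATA_POINT = 2.0  # hypothetical "dataset" consisting of a single value


def toy_velocity(x, t):
    """Velocity of the straight path from x to DATA_POINT.

    Assumed stand-in for a learned, time-dependent velocity field.
    """
    return (DATA_POINT - x) / (1.0 - t)


# Start from an arbitrary "noise" value and follow the flow to t=1.
sample = euler_sample(x0=-0.7, velocity=toy_velocity)
print(sample)
```

In a real model the velocity would come from a trained network and the prior sample from a Gaussian; the integration loop is the part that carries over.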

2510.11170 2026-05-28 cs.LG cs.AI cs.CL 版本更新

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

EAGer: 基于熵感知的自适应推理时缩放生成方法

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

发表机构 * University of Groningen(格罗宁根大学) University of Milan - Bicocca(米兰-比科卡大学) Cohere Labs(Cohere实验室)

AI总结 提出一种无需训练的生成方法EAGer,利用逐词熵分布动态分配计算资源,在复杂推理任务中提升性能并减少冗余计算。

详情
AI中文摘要

随着推理语言模型和测试时缩放方法作为提升模型性能范式的兴起,通常需要大量计算来从同一提示生成多个候选序列。这允许探索通向正确答案的不同推理路径,然而,为每个提示分配相同的计算预算。基于不同提示具有不同复杂度因而需要不同计算量的假设,我们提出EAGer,一种无需训练的生成方法,通过逐词熵分布利用模型不确定性来减少冗余计算并同时提升整体性能。EAGer仅在存在高熵词时分支到多个推理路径,并将节省的计算预算重新分配到最需要探索替代路径的实例上。我们在复杂推理基准上对多个开源模型验证了EAGer,特别是在AIME 2025上展示了增益。当目标标签可访问时(如在RLVR训练流程中),EAGer在Pass@k上提升高达37%,且token减少59%;在测试时设置中,与全并行采样相比,仍能在Pass@k上提升12%,且token减少64%。

英文摘要

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution; however, it allocates the same compute budget to every prompt. Grounded in the assumption that different prompts carry different degrees of complexity, and thus have different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through the token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and reallocates the saved compute budget to instances where exploration of alternative paths is most needed. We validate EAGer across multiple open-source models on complex reasoning benchmarks, with gains specifically demonstrated on AIME 2025. When target labels are accessible, as in RLVR training pipelines, EAGer achieves up to +37% in Pass@k and 59% fewer tokens; in test-time settings it still yields +12% in Pass@k and 64% fewer tokens compared to Full Parallel Sampling.
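
The idea of branching only at high-entropy tokens can be sketched as follows. This is an illustrative sketch, not the EAGer implementation: the function names, the example distributions and the threshold value are all assumptions made for the example.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(probs, threshold):
    """Return True when the distribution is uncertain enough to justify branching."""
    return token_entropy(probs) > threshold


# Hypothetical next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
THRESHOLD = 1.0  # assumed value in nats, not taken from the paper

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(token_entropy(dist), 3), should_branch(dist, THRESHOLD))
```

A generation loop built on this check would keep a single path while the model is confident and spawn extra paths only where the check fires, which is how compute could be shifted toward harder prompts.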

2503.01450 2026-05-28 cs.LG cs.AI cs.RO 版本更新

Investigating Memory in Model-Free RL with POPGym Arcade

基于POPGym Arcade的无模型强化学习中的记忆研究

Zekang Wang, Zhe He, Borong Zhang, Edan Toledo, Steven Morad

发表机构 * Faculty of Science and Technology, University of Macau(澳门大学科技学院) Centre for AI, University College London(伦敦大学学院人工智能中心)

AI总结 本文通过引入分析工具和POPGym Arcade环境套件,研究深度强化学习中的记忆机制,发现价值函数会将信用分配到无关历史,并展示分布外场景如何污染记忆。

Comments Appear at ICML 2026 as a Spotlight paper

详情
AI中文摘要

如何分析深度强化学习中的记忆?我们引入了在部分可观测性下分析策略的工具,并揭示智能体如何利用记忆做出决策。为了利用这些工具,我们提出了POPGym Arcade,这是一个受Atari启发的、硬件加速的环境集合,共享单一观测和动作空间。每个环境都提供完全和部分可观测的变体,从而实现对可观测性的反事实研究。我们发现,受控研究对于公平比较是必要的,并识别出一种病理现象,即价值函数将信用过度分配到无关历史。利用这种病理现象,我们展示了分布外场景如何污染记忆,从而在遥远的未来扰动策略。我们的代码可在https://github.com/bolt-research/popgym-arcade获取。

英文摘要

How should we analyze memory in deep RL? We introduce tools for analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atari-inspired, hardware-accelerated environments sharing a single observation and action space. Each environment provides fully and partially observable variants, enabling counterfactual studies on observability. We find that controlled studies are necessary for fair comparisons and identify a pathology where value functions smear credit over irrelevant history. Using this pathology, we demonstrate how out-of-distribution scenarios can contaminate memory, perturbing the policy far into the future. Our code is available at https://github.com/bolt-research/popgym-arcade.

2510.10185 2026-05-28 cs.CL cs.AI cs.MA 版本更新

Auditing medical multi-agent AI reveals risks of false consensus

审计医疗多智能体AI揭示虚假共识风险

Yinghao Zhu, Lei Gu, Zixiang Wang, Haoran Sang, Dehao Sui, Wen Tang, Lan Mi, Yasha Wang, Junyi Gao, Liang Yao, Tianfan Fu, Ewen Harrison, Lequan Yu, Liantao Ma

发表机构 * National Engineering Research Center for Software Engineering, Peking University(北京大学软件工程国家工程研究中心) School of Computing and Data Science, The University of Hong Kong(香港大学计算机与数据科学学院) Department of Nephrology, Peking University Third Hospital(北京大学第三医院肾内科) Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Department of Lymphoma, Peking University Cancer Hospital & Institute(教育部癌症发生与转化研究重点实验室、北京大学肿瘤医院淋巴瘤科) Department of Automation, Tsinghua University(清华大学自动化系) Centre for Medical Informatics, The University of Edinburgh(爱丁堡大学医学信息学中心) Health Data Research UK(英国健康数据研究机构) Lee Kong Chian School of Medicine, Nanyang Technological University(南洋理工大学李科贤医学院) State Key Laboratory for Novel Software Technology, School of Computer Science, Nanjing University(南京大学新型软件技术国家重点实验室、计算机科学学院)

AI总结 本研究提出MedAgentAudit框架,通过专家验证的审计流程诊断医疗多智能体系统中的协作失败模式,发现虚假共识、权威偏差等系统性风险。

Comments Code and Data: https://github.com/MedX-PKU/MedAgentAudit

详情
AI中文摘要

大型语言模型正越来越多地被组装成医疗多智能体系统,通过专家角色、同行评审和共识形成模拟多学科会诊。然而,在临床决策支持中,表面共识并不足够。临床医生还需要知道智能体是否检查了证据、处理了分歧并保持了不确定性可见。当前评估主要关注最终准确性,未测试协作过程的安全性。本文介绍MedAgentAudit,一个基于临床的工作流审计框架,用于诊断和量化医疗多智能体系统中的协作失败模式。从3,600个执行日志中,我们推导出一个经专家验证的十种常见失败分类法,涵盖任务理解、协作讨论以及综合与决策。随后,我们部署一个经专家验证的自动审计器作为非干预探针,覆盖14,400个案例,涉及六种多智能体架构、六个医疗文本和视觉数据集以及每种模态的四个大语言模型设置。跨系统而言,协作带来不均衡的准确性提升和频繁的过程失败。16.63%的案例中存在无依据的观察结果,并向下游传播。在讨论中,智能体在98.42%的案例中重复初始观点而非重新审视证据,并在42.73%的案例中未能激活专家推理。在综合阶段,最终答案常常用权威或多数票替代证据检查,显示出权威偏差(28.76%,从35.30%上升至68.75%)、自我矛盾(18.53%)、矛盾忽视(5.48%)和少数派压制(5.11%)。MedAgentAudit将医疗AI评估从输出评分重新定义为过程级安全与问责,为医学中透明、可审计且由临床医生监督的智能体系统提供了实践基础。

英文摘要

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through specialist roles, peer review and consensus formation. In clinical decision support, however, apparent consensus is not enough. Clinicians also need to know whether agents checked the evidence, addressed disagreement and kept uncertainty visible. Current evaluations largely score final accuracy, leaving the safety of the collaborative process untested. Here we introduce MedAgentAudit, a clinically grounded workflow audit framework for diagnosing and quantifying collaborative failure modes in medical multi-agent systems. From 3,600 execution logs, we derive an expert-validated taxonomy of ten recurrent failures spanning task comprehension, collaborative discussion, and synthesis and decision-making. We then deploy an expert-validated automated auditor as non-interventional probes across 14,400 cases, covering six multi-agent architectures, six medical text and vision datasets, and four large language model settings per modality. Across systems, collaboration yields uneven accuracy gains and frequent process failures. Unsupported observations affect 16.63% of cases and propagate downstream. In discussion, agents repeat initial views in 98.42% of cases rather than re-examining evidence, and fail to activate specialist reasoning in 42.73%. During synthesis, final answers often substitute authority or majority count for evidence checking, showing authority bias in 28.76% (rising from 35.30% to 68.75% across rounds), self-contradiction in 18.53%, contradiction neglect in 5.48% and minority suppression in 5.11%. MedAgentAudit reframes medical AI evaluation from output scoring to process-level safety and accountability, providing a practical foundation for transparent, auditable and clinician-supervised agentic systems in medicine.

2502.17832 2026-05-28 cs.LG cs.AI cs.CR cs.CV 版本更新

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

MM-PoisonRAG:通过局部和全局投毒攻击破坏多模态RAG

Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California Los Angeles(加州大学洛杉矶分校)

AI总结 提出MM-PoisonRAG框架,通过局部投毒攻击(LPA)和全局投毒攻击(GPA)两种策略,系统研究多模态检索增强生成(RAG)在知识投毒下的脆弱性,实验表明攻击成功率高达56%且能绕过现有防御。

Comments Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG

详情
AI中文摘要

检索增强生成(RAG)已成为多模态大语言模型(MLLM)中增强事实基础并减少幻觉的常见做法。然而,其对检索的依赖使MLLM面临知识投毒攻击,攻击者故意将恶意多模态内容注入外部知识库,以引导模型生成不正确甚至有害的响应。我们提出MM-PoisonRAG框架,系统研究多模态RAG在知识投毒下的脆弱性。具体地,我们设计了两种新颖的攻击策略:局部投毒攻击(LPA),植入针对特定查询的多模态错误信息以操纵输出至攻击者控制的响应;以及全局投毒攻击(GPA),使用单一、非定向的对抗性注入广泛破坏推理并降低所有查询的生成质量。在多样化任务、多模态RAG组件和攻击者访问级别上的大量实验揭示了严重的脆弱性:LPA即使在受限访问下也能达到高达56%的攻击成功率,并且无需重新优化对抗样本即可在四种不同的检索器之间有效迁移。GPA仅需一个投毒内容即可完全破坏模型生成,使准确率降至0%。此外,LPA和GPA均能绕过现有防御,突显了多模态RAG的脆弱性,并将MM-PoisonRAG确立为未来保护RAG框架免受多模态知识投毒研究的基础。

英文摘要

Retrieval-augmented generation (RAG) has become a common practice in multimodal large language models (MLLMs) to enhance factual grounding and reduce hallucination. Yet, its reliance on retrieval exposes MLLMs to knowledge poisoning attacks, in which adversaries deliberately inject malicious multimodal content into external knowledge bases to steer models toward generating incorrect or even harmful responses. We present MM-PoisonRAG, a framework to systematically study the vulnerability of multimodal RAG under knowledge poisoning. Specifically, we design two novel attack strategies: Localized Poisoning Attack (LPA), which implants targeted, query-specific multimodal misinformation to manipulate outputs toward attacker-controlled responses, and Globalized Poisoning Attack (GPA), which uses a single, untargeted adversarial injection to broadly corrupt reasoning and collapse generation quality across all queries. Extensive experiments on diverse tasks, multimodal RAG components, and attacker access levels reveal severe vulnerabilities: LPA achieves up to 56% attack success rate even under restricted access, and transfers effectively across four different retrievers without re-optimizing the adversaries. GPA completely disrupts model generation, driving accuracy to 0% with just one piece of poisoned content. Moreover, both LPA and GPA bypass existing defenses, underscoring the fragility of multimodal RAG and establishing MM-PoisonRAG as a foundation for future research on securing RAG frameworks against multimodal knowledge poisoning.

2510.02329 2026-05-28 cs.CL cs.AI 版本更新

SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

SelfJudge: 通过自监督验证器加速推测解码

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

关键词 Efficient AI;Large Language Model(大型语言模型);Speculative Decoding(推测解码)

AI总结 提出SelfJudge方法,利用目标模型的自监督训练验证器,通过评估令牌替换后响应的语义保持性来加速推测解码,实现更优的推理-准确率权衡。

详情
Journal ref
ICML 2026
AI中文摘要

推测解码让较大的目标模型验证草稿模型提出的候选令牌,从而加速LLM推理。近期的评判解码(judge decoding)进一步放宽验证标准,接受与目标模型输出存在微小差异的草稿令牌,以进一步提速;但现有方法依赖人工标注或具有可验证真实结果的任务,限制了其在多样化NLP任务中的泛化能力。我们提出SelfJudge,利用目标模型的自监督信号训练评判验证器。我们的方法通过评估令牌替换后的响应是否保持原始响应的含义来衡量语义保持性,从而能够在多样化NLP任务中自动训练验证器。实验表明,SelfJudge在推理-准确率权衡上优于评判解码基线,为更快的LLM推理提供了广泛适用的解决方案。

英文摘要

Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding speeds this process up further by relaxing the verification criteria and accepting draft tokens that may exhibit minor discrepancies from the target model's output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show that SelfJudge achieves better inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
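
The difference between strict verification and judge-style verification can be illustrated with a minimal Python sketch. This is not the SelfJudge implementation: `target_next_token` and `judge_accepts` are hypothetical stand-ins for the target model's greedy prediction and a learned judge verifier, tokens are plain strings, and the toy sentence and substitute table are invented for illustration. A real system would also score all draft positions in a single target-model forward pass rather than one at a time.

```python
from typing import Callable, List

Token = str


def verify_draft(
    prefix: List[Token],
    draft: List[Token],
    target_next_token: Callable[[List[Token]], Token],
    judge_accepts: Callable[[List[Token], Token], bool],
) -> List[Token]:
    """Return the tokens committed after one draft-and-verify round.

    A draft token is kept if it matches the target's own choice, or if the
    (hypothetical) judge deems it an acceptable substitute. The first
    rejected token is replaced by the target's choice and the round ends.
    """
    committed: List[Token] = []
    for token in draft:
        context = prefix + committed
        expected = target_next_token(context)
        if token == expected or judge_accepts(context, token):
            committed.append(token)
        else:
            committed.append(expected)
            break
    return committed


# Toy stand-ins (assumptions for illustration, not real models).
TARGET_SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]
ACCEPTABLE_SUBSTITUTES = {("sat", "rested")}


def toy_target(context: List[Token]) -> Token:
    # The "target model" always continues the fixed sentence by position.
    return TARGET_SENTENCE[len(context)]


def strict_judge(context: List[Token], token: Token) -> bool:
    # Standard speculative decoding: no relaxation at all.
    return False


def lenient_judge(context: List[Token], token: Token) -> bool:
    # Judge-style relaxation: accept listed meaning-preserving substitutes.
    return (toy_target(context), token) in ACCEPTABLE_SUBSTITUTES


draft = ["the", "cat", "rested", "on"]
strict_result = verify_draft([], draft, toy_target, strict_judge)
lenient_result = verify_draft([], draft, toy_target, lenient_judge)
```

With the strict verifier the round stops at the first mismatch, while the lenient verifier lets the whole draft through. A higher acceptance rate per round is the mechanism by which relaxed verification can speed up decoding.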

2510.01724 2026-05-28 cs.AI 版本更新

MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge Graphs

MetaboT:基于LLM的多智能体框架,用于质谱代谢组学知识图谱的交互式分析

Madina Bekbergenova, Lucas Pradi, Benjamin Navet, Emma Tysinger, Franck Michel, Matthieu Feraud, Yousouf Taghzouti, Yan Zhou Chen, Olivier Kirchhoffer, Florence Mehl, Martin Legrand, Tao Jiang, Marco Pagni, Soha Hassoun, Jean-Luc Wolfender, Wout Bittremieux, Fabien Gandon, Louis-Félix Nothias

发表机构 * Department of Computer Science, University of Antwerp(安特卫普大学计算机科学系) Massachusetts Institute of Technology(麻省理工学院) Department of Computer Science, Tufts University(塔夫茨大学计算机科学系) Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland(瑞士生物信息学研究所(SIB),洛桑,瑞士) Department of Chemical and Biological Engineering, Tufts University(塔夫茨大学化学与生物工程系)

AI总结 提出MetaboT,一个基于大语言模型的多智能体框架,通过模块化架构将自然语言问题转化为SPARQL查询,降低代谢组学知识图谱的使用门槛。

详情
Journal ref
33rd annual international conference on Intelligent Systems for Molecular Biology (ISMB 2025) / 24th Annual Conference of the European Conference on Computational Biology (ECCB 2025), Jul 2025, Liverpool, United Kingdom
AI中文摘要

基于质谱的代谢组学产生复杂的高维数据,这些数据蕴含着巨大的生物学发现潜力,但仍难以整合和解释。知识图谱通过将光谱、注释、分类群、化学类别和生物活性表示为单一可互操作的网络来统一这些异构信息;然而,它们的实际应用受到相应专业表示和查询语言陡峭学习曲线的限制。在此,我们介绍MetaboT,一个开源的多智能体大语言模型框架,它将自然语言问题转化为可执行的SPARQL查询,用于代谢组学知识图谱。MetaboT通过模块化架构减轻了单一模型方法的幻觉和模式合规性限制,其中专门的智能体处理范围验证、针对权威资源的实体解析、模式感知查询生成、迭代细化和结果解释。我们在实验性天然产物知识图谱上验证了MetaboT,使用专家编写的自然语言问题基准及其参考SPARQL查询,并展示了其回答关于植物-代谢物关系和生物活性的复杂问题的能力。MetaboT降低了代谢组学研究者的技术门槛,无需专门编程专业知识即可实现语义数据挖掘。

英文摘要

Mass spectrometry-based metabolomics generates complex, high-dimensional data that holds vast potential for biological discovery but remains difficult to integrate and interpret. Knowledge graphs (KGs) unify this heterogeneous information by representing spectra, annotations, taxa, chemical classes, and biological activities as a single interoperable network; however, their practical use is limited by the steep learning curve of corresponding specialized representation and query languages. Here we introduce MetaboT, an open-source multi-agent Large Language Model (LLM) framework that translates natural-language questions into executable SPARQL queries over metabolomics knowledge graphs. MetaboT mitigates the hallucination and schema-compliance limitations of single-model approaches through a modular architecture in which specialised agents handle scope validation, entity resolution against authoritative resources, schema-aware query generation, iterative refinement, and result interpretation. We validated MetaboT on the Experimental Natural Products Knowledge Graph (ENPKG), using an expert-authored benchmark of natural-language questions paired with reference SPARQL queries, and demonstrate its ability to answer complex questions about plant--metabolite relationships and biological activities. MetaboT lowers the technical barrier for metabolomics researchers and enables semantic data mining without specialised programming expertise.

2503.11906 2026-05-28 cs.CV cs.AI 版本更新

A Survey on SAR ship classification using Deep Learning

基于深度学习的SAR船舶分类综述

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Emanuele Salerno

发表机构 * PhD School in Computer Science, University of Pisa(计算机科学博士学院,比萨大学) Institute of Information Science and Technologies, National Research Council of Italy(意大利国家研究委员会信息科学与技术研究所) National Biodiversity Future Center - NBFC(国家生物多样性未来中心 - NBFC)

AI总结 本文综述了深度学习在SAR船舶分类中的应用,建立了基于模型、手工特征、SAR属性利用和微调影响的分类法,并讨论了未来研究方向。

Comments in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

详情
AI中文摘要

深度学习(DL)已成为合成孔径雷达(SAR)船舶分类的强大工具。本综述全面分析了该领域使用的各种DL技术。我们识别了关键趋势和挑战,强调了整合手工特征、利用公共数据集、数据增强、微调、可解释性技术以及促进跨学科合作以提高DL模型性能的重要性。本综述建立了首个基于DL模型、手工特征使用、SAR属性利用和微调影响的分类法,用于对相关研究进行分类。我们讨论了SAR船舶分类任务中使用的方法论以及不同技术的影响。最后,本综述探讨了未来研究的潜在方向,包括解决数据稀缺问题、探索新型DL架构、融入可解释性技术以及建立标准化性能指标。通过应对这些挑战并利用DL的进步,研究人员可以为开发更准确和高效的船舶分类系统做出贡献,最终增强海上监视及相关应用。

英文摘要

Deep learning (DL) has emerged as a powerful tool for Synthetic Aperture Radar (SAR) ship classification. This survey comprehensively analyzes the diverse DL techniques employed in this domain. We identify critical trends and challenges, highlighting the importance of integrating handcrafted features, using public datasets, applying data augmentation, fine-tuning, and explainability techniques, and fostering interdisciplinary collaborations to improve DL model performance. This survey establishes a first-of-its-kind taxonomy for categorizing relevant research based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning. We discuss the methodologies used in SAR ship classification tasks and the impact of different techniques. Finally, the survey explores potential avenues for future research, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. By addressing these challenges and leveraging advancements in DL, researchers can contribute to developing more accurate and efficient ship classification systems, ultimately enhancing maritime surveillance and related applications.

2509.21128 2026-05-28 cs.AI 版本更新

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

RL 压缩,SFT 扩展:推理型大语言模型的比较研究

Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学)

AI总结 本文通过轨迹级和步骤级分析框架,比较了强化学习(RL)和监督微调(SFT)对数学推理大语言模型推理路径的影响,发现RL压缩错误轨迹并集中推理功能,而SFT扩展正确轨迹并均匀化推理功能。

Comments Accepted at ICLR2026

详情
AI中文摘要

大型语言模型(LLMs)通常通过带有可验证奖励的强化学习(RLVR)和监督微调(SFT)在推理轨迹上进行训练,以提高其推理能力。然而,这些方法如何塑造推理能力仍然难以捉摸。本文超越基于准确性的研究,引入了一个新颖的分析框架,量化推理路径并捕捉每个训练过程(在数学领域使用1.5B、7B和14B参数的模型)下的定性变化。具体来说,我们在两个粒度级别上研究推理过程:轨迹级别,检查完整的推理输出;步骤级别,分析推理图,其节点对应单个推理步骤。值得注意的是,对独特推理轨迹的聚类显示了互补效应:RL压缩了不正确的轨迹,而SFT扩展了正确的轨迹。步骤级别分析表明,RL使推理图中节点访问频率、度和介数中心性分布的衰减率变陡(约2.5倍),而SFT使其变平(减少到约三分之一)。这表明RL将推理功能集中到一小部分步骤中,而SFT则将其均匀化到许多步骤中。此外,通过从多个角度评估推理图拓扑,我们描绘了RL和SFT的共同和独特特征。我们的工作提出了一种新颖的推理路径视角,解释了为什么当前最佳实践的两阶段训练(先SFT后RL)是成功的,并为数据构建和更高效的学习方法提供了实际启示。

英文摘要

Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.

2509.15848 2026-05-28 cs.AI 版本更新

A Comparative Study of Rule-Based and Data-Driven Approaches in Industrial Monitoring

基于规则与数据驱动方法在工业监控中的比较研究

Giovanni De Gasperis, Sante Dino Facchini

发表机构 * Università degli Studi dell’Aquila(拉奎拉大学)

AI总结 本研究比较了工业监控中基于规则与数据驱动两种方法,分析了各自的优缺点,并提出混合方案以结合两者优势。

Comments This chapter has been published in Advancements in AI From Foundations to Cross-Disciplinary Applications, Springer, 2026

详情
AI中文摘要

工业监控系统,尤其是在工业4.0环境中部署时,正经历从传统基于规则的架构向利用机器学习和人工智能的数据驱动方法的范式转变。本研究对这两种方法进行了比较,分析了它们各自的优势、局限性和应用场景,并提出了一个基本框架来评估它们的关键特性。基于规则的系统具有高可解释性、确定性行为以及在稳定环境中易于实现的特点,使其成为受监管行业和安全关键应用的理想选择。然而,它们在复杂或不断变化的环境中面临可扩展性、适应性和性能方面的挑战。相反,数据驱动系统在检测隐藏异常、实现预测性维护和动态适应新条件方面表现出色。尽管这些模型具有高精度,但它们面临数据可用性、可解释性和集成复杂性方面的挑战。本文提出混合解决方案作为一个有前景的方向,结合了基于规则逻辑的透明性与机器学习的分析能力。我们的假设是,工业监控的未来在于智能、协同的系统,这些系统利用专家知识和数据驱动的洞察力。这种双重方法增强了韧性、运营效率和信任,为更智能、更灵活的工业环境铺平了道路。

英文摘要

Industrial monitoring systems, especially when deployed in Industry 4.0 environments, are experiencing a paradigm shift from traditional rule-based architectures to data-driven approaches leveraging machine learning and artificial intelligence. This study presents a comparison between these two methodologies, analyzing their respective strengths, limitations, and application scenarios, and proposes a basic framework to evaluate their key properties. Rule-based systems offer high interpretability, deterministic behavior, and ease of implementation in stable environments, making them ideal for regulated industries and safety-critical applications. However, they face challenges with scalability, adaptability, and performance in complex or evolving contexts. Conversely, data-driven systems excel in detecting hidden anomalies, enabling predictive maintenance and dynamic adaptation to new conditions. Despite their high accuracy, these models face challenges related to data availability, explainability, and integration complexity. The paper suggests hybrid solutions as a promising direction, combining the transparency of rule-based logic with the analytical power of machine learning. Our hypothesis is that the future of industrial monitoring lies in intelligent, synergistic systems that leverage both expert knowledge and data-driven insights. This dual approach enhances resilience, operational efficiency, and trust, paving the way for smarter and more flexible industrial environments.
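
The hybrid idea in this abstract can be sketched in a few lines of Python. This is only an illustrative assumption of what such a hybrid might look like, not the framework from the chapter: a fixed-limit rule stands in for the rule-based layer, and a simple z-score check stands in for the data-driven layer (a real deployment would more likely use a trained model). The function names, sensor readings, and limits are all hypothetical.

```python
from statistics import mean, stdev
from typing import List


def rule_alarm(value: float, low: float, high: float) -> bool:
    """Rule-based layer: a transparent, deterministic hard-limit check."""
    return value < low or value > high


def zscore_alarm(history: List[float], value: float, threshold: float = 3.0) -> bool:
    """Data-driven stand-in: flag values far from the recent distribution."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


def hybrid_alarm(history: List[float], value: float, low: float, high: float) -> str:
    """Check the explainable rule first, then the statistical detector."""
    if rule_alarm(value, low, high):
        return "rule"
    if zscore_alarm(history, value):
        return "statistical"
    return "ok"


# Hypothetical temperature readings (degrees Celsius) and hard limits.
recent = [70.0, 70.5, 69.5, 70.2, 69.8]
limit_low, limit_high = 0.0, 90.0

status_overheat = hybrid_alarm(recent, 95.0, limit_low, limit_high)
status_drift = hybrid_alarm(recent, 75.0, limit_low, limit_high)
status_normal = hybrid_alarm(recent, 70.3, limit_low, limit_high)
```

Checking the rule first keeps safety-critical alarms deterministic and easy to audit. The statistical layer then catches readings that stay inside the hard limits but are unusual for this particular machine.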

2509.04192 2026-05-28 cs.AI cs.LO math.LO 版本更新

Domain size asymptotics for Markov logic networks

马尔可夫逻辑网络的域大小渐近性

Vera Koponen

发表机构 * Department of Mathematics, Uppsala University, Sweden(瑞典乌普萨拉大学数学系)

AI总结 研究马尔可夫逻辑网络在域大小趋于无穷时概率分布的性质,通过一元关系语言的几乎完全刻画,展示了其与均匀分布及提升贝叶斯网络的本质差异。

Comments Version 2 is a major revision of version 1

详情
AI中文摘要

一个马尔可夫逻辑网络(MLN)$\mathbb{M}$ 在域为 $\{1, \ldots, n\}$ 的结构集 $\mathbf{W}_n$(即“可能世界”)上确定了一个概率分布 $\mathbb{P}_n^\mathbb{M}$。我们研究当 $n$ 趋于无穷时这些分布的性质。我们证明,在温和假设下,对于具有一个任意正权重的软约束的 MLN $\mathbb{M}$,对所有足够大的 $n$,分布 $\mathbb{P}_n^\mathbb{M}$ 的行为与 $\mathbf{W}_n$ 上的均匀分布 $\mathbb{P}_n^{uni}$ 截然不同。对于仅有一个一元关系符号 $R$ 的语言,我们给出了当 $n \to \infty$ 时 $\mathbb{P}_n^\mathbb{M}$ 可能渐近行为的几乎完全刻画,其中 $\mathbb{M}$ 可以是该语言的任意 MLN。渐近行为取决于 MLN 的软约束和权重。该刻画用于证明:如果所考虑的语言至少包含一个一元关系符号,则以下结论成立:(a) 存在一个 MLN $\mathbb{M}$,使得对每个提升贝叶斯网络(LBN)$\mathbb{G}$,存在无穷多个 $n$ 使得 $\mathbb{M}$ 和 $\mathbb{G}$ 在 $\mathbf{W}_n$ 上确定不同的分布。(b) 存在一个 LBN $\mathbb{G}$,使得对每个 MLN $\mathbb{M}$,存在无穷多个 $n$ 使得 $\mathbb{G}$ 和 $\mathbb{M}$ 在 $\mathbf{W}_n$ 上确定不同的分布。我们还证明,在极限情况下,权重维度和域大小维度的行为可能完全不同。

英文摘要

A Markov logic network (MLN) $\mathbb{M}$ determines a probability distribution $\mathbb{P}_n^\mathbb{M}$ on the set $\mathbf{W}_n$ of structures, or ``possible worlds'', with domain $\{1, \ldots, n\}$. We study the properties of such distributions as $n$ tends to infinity. We show that, with mild assumptions on an MLN $\mathbb{M}$ with one soft constraint with an arbitrary positive weight, the distribution $\mathbb{P}_n^\mathbb{M}$ will behave quite differently from the uniform distribution $\mathbb{P}_n^{uni}$ on $\mathbf{W}_n$ for all sufficiently large $n$. For a language with only one relation symbol $R$, which has arity 1, we give an almost complete characterization of the possible asymptotic behaviours of $\mathbb{P}_n^\mathbb{M}$ as $n \to \infty$, where $\mathbb{M}$ may be any MLN for this language. The asymptotic behaviour depends on the soft constraints and weights of the MLN. This characterization is used to show that if the language under consideration contains at least one relation symbol of arity 1, then the following holds: (a) There is an MLN $\mathbb{M}$ such that for every lifted Bayesian network (LBN) $\mathbb{G}$ there are infinitely many $n$ such that $\mathbb{M}$ and $\mathbb{G}$ determine different distributions on $\mathbf{W}_n$. (b) There is an LBN $\mathbb{G}$ such that for every MLN $\mathbb{M}$ there are infinitely many $n$ such that $\mathbb{G}$ and $\mathbb{M}$ determine different distributions on $\mathbf{W}_n$. We also show that, in the limit, the weight dimension and the domain size dimension may behave completely differently.
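
A toy computation may help make the contrast with the uniform distribution concrete. The sketch below assumes the standard log-linear MLN semantics, in which a world's probability is proportional to exp(weight × number of true groundings of the soft constraint), and it uses the simplest possible case: one unary relation `R` with the single soft constraint `R(x)`. This is an illustration chosen here, not a result or construction taken from the paper, and all function names are hypothetical.

```python
from itertools import product
from math import comb, exp
from typing import List


def mln_size_distribution(n: int, w: float) -> List[float]:
    """P(|R| = k) for k = 0..n, by brute-force enumeration of all 2**n worlds.

    Each world assigns True/False to R(1), ..., R(n); its unnormalised
    weight is exp(w * number of elements satisfying R).
    """
    weights = [0.0] * (n + 1)
    for world in product([False, True], repeat=n):
        k = sum(world)
        weights[k] += exp(w * k)
    z = sum(weights)  # partition function
    return [x / z for x in weights]


def uniform_size_distribution(n: int) -> List[float]:
    """P(|R| = k) when every world is equally likely."""
    return [comb(n, k) / 2**n for k in range(n + 1)]


def expected_fraction(dist: List[float]) -> float:
    """Expected proportion of domain elements that satisfy R."""
    n = len(dist) - 1
    return sum(k * p for k, p in enumerate(dist)) / n


n, w = 8, 1.0
mln_fraction = expected_fraction(mln_size_distribution(n, w))
uniform_fraction = expected_fraction(uniform_size_distribution(n))
closed_form = exp(w) / (1 + exp(w))  # per-element probability in this toy case
```

In this special case each element satisfies `R` independently with probability e^w / (1 + e^w), so any positive weight moves the expected fraction away from the uniform value 1/2 for every domain size. The paper's results concern far more general soft constraints, for which no such simple closed form is implied here.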

2507.13725 2026-05-28 cs.IR cs.AI 版本更新

Point of Interest Recommendation: Pitfalls and Viable Solutions

兴趣点推荐:陷阱与可行解决方案

Alejandro Bellogín, Linus W. Dietz, Francesco Ricci, Pablo Sánchez

发表机构 * King’s College London(伦敦国王学院) Free University of Bozen-Bolzano(博尔扎诺自由大学)

AI总结 本文批判性评估兴趣点推荐研究现状,指出数据集、算法和评估方法三方面的关键缺陷,并提出包含多利益相关者设计、上下文感知等方向的研究议程。

详情
Journal ref
ACM Transactions on Recommender Systems 2026
AI中文摘要

兴趣点(POI)推荐通过建议上下文相关且匹配偏好的地点和活动(如餐厅、地标、行程和文化景点),在丰富游客体验方面可发挥关键作用。与一些更常见的推荐领域(如音乐和视频)不同,POI推荐本质上具有高风险:用户投入大量时间、金钱和精力来搜索、选择和消费这些建议的POI。尽管该领域已有大量研究工作,但几个基本问题仍未解决,阻碍了所提出方法的实际应用。在本文中,我们讨论了POI推荐问题的当前状态以及我们识别的主要挑战。本文的第一个贡献是对POI推荐研究现状的批判性评估,并识别了三个主要维度(数据集、算法和评估方法)的关键缺陷。我们强调了持续存在的问题,例如缺乏标准化基准数据集、问题定义和模型设计中的有缺陷假设,以及对用户行为和系统性能中偏差的不当处理。第二个贡献是一个结构化的研究议程,从识别的问题出发,引入了与多利益相关者设计、上下文感知、数据收集、可信度、新颖交互和实际评估相关的未来工作的重要方向。

英文摘要

Point of interest (POI) recommendation can play a pivotal role in enriching tourists' experiences by suggesting context-dependent and preference-matching locations and activities, such as restaurants, landmarks, itineraries, and cultural attractions. Unlike some more common recommendation domains (e.g., music and video), POI recommendation is inherently high-stakes: users invest significant time, money, and effort to search, choose, and consume these suggested POIs. Despite the numerous research works in the area, several fundamental issues remain unresolved, hindering the real-world applicability of the proposed approaches. In this paper, we discuss the current status of the POI recommendation problem and the main challenges we have identified. The first contribution of this paper is a critical assessment of the current state of POI recommendation research and the identification of key shortcomings across three main dimensions: datasets, algorithms, and evaluation methodologies. We highlight persistent issues such as the lack of standardized benchmark datasets, flawed assumptions in the problem definition and model design, and inadequate treatment of biases in the user behavior and system performance. The second contribution is a structured research agenda that, starting from the identified issues, introduces important directions for future work related to multistakeholder design, context awareness, data collection, trustworthiness, novel interactions, and real-world evaluation.

2507.08014 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

大规模真实对话分析揭示LLM越狱的复杂性界限

Aldan Creo, Raul Castro Fernandez, Manuel Cebrian

发表机构 * Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain(瓦伦西亚人工智能研究所,瓦伦西亚理工大学,西班牙瓦伦西亚) Department of Computer Science, The University of Chicago, Chicago, USA(计算机科学系,芝加哥大学,美国芝加哥) Center for Automation and Robotics, Spanish National Research Council, Madrid, Spain(自动化与机器人中心,西班牙国家研究委员会,西班牙马德里)

AI总结 通过分析超过200万条真实对话,发现越狱尝试的复杂性并不显著高于正常对话,且攻击复杂性随时间保持稳定,表明LLM安全演化受人类创造力限制。

Comments Code: https://github.com/ACMCMC/risky-conversations Results: https://huggingface.co/risky-conversations Visualizer: https://huggingface.co/spaces/risky-conversations/Visualizer

详情
AI中文摘要

随着大型语言模型(LLM)的日益部署,理解越狱策略的复杂性和演变对于AI安全至关重要。我们对来自不同平台(包括专门的越狱社区和通用聊天机器人)的超过200万条真实对话进行了大规模实证分析,研究了越狱复杂性。使用一系列复杂性指标,涵盖概率度量、词汇多样性、压缩比和认知负荷指标,我们发现越狱尝试并未表现出显著高于正常对话的复杂性。这一模式在专门的越狱社区和普通用户群体中一致成立,表明攻击的复杂性存在实际界限。时间分析显示,虽然用户攻击的毒性和复杂性随时间保持稳定,但助手响应的毒性有所下降,表明安全机制正在改进。复杂性分布中缺乏幂律标度进一步指出了越狱发展的自然限制。我们的发现挑战了攻击者与防御者之间军备竞赛不断升级的主流说法,反而表明LLM安全演化受人类创造力限制,而防御措施持续进步。我们的结果突显了学术越狱披露中的关键信息危害,因为超出当前复杂性基线的复杂攻击可能破坏观察到的平衡,并在防御适应之前造成广泛伤害。

英文摘要

As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
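
Two of the metric families named in this abstract, compression ratios and lexical diversity, are easy to sketch with the Python standard library. The definitions below are common textbook versions assumed here for illustration; the paper's exact metrics, tokenisation, and preprocessing are not given in the abstract and may well differ. The sample strings are invented.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size (lower means more redundant text)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def type_token_ratio(text: str) -> float:
    """Distinct whitespace-separated tokens divided by total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented examples: a highly repetitive prompt and a more varied one.
repetitive = "ignore previous instructions " * 20
varied = (
    "Could you summarise the main argument of this article and then "
    "suggest two follow-up questions a sceptical reader might ask?"
)

repetitive_cr = compression_ratio(repetitive)
varied_cr = compression_ratio(varied)
repetitive_ttr = type_token_ratio(repetitive)
varied_ttr = type_token_ratio(varied)
```

Both scores are length-sensitive (short texts carry fixed compression overhead, and type-token ratio falls as texts grow), so any comparison between conversation groups would need length-matched samples or a normalised variant.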

2506.08311 2026-05-28 cs.SE cs.AI 版本更新

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study

通过可追溯性视角理解自动化程序修复智能体:一项实证研究

Ira Ceka, Hailie Mitchell, Saurabh Pujar, Luca Buratti, Shyam Ramji, Junfeng Yang, Gail Kaiser, Baishakhi Ray

发表机构 * Columbia University(哥伦比亚大学) IBM Research(IBM研究院)

AI总结 本文通过追踪五个最先进的自动化程序修复智能体在500个真实世界修复任务中的决策流程,揭示了它们在逻辑密集型错误修复、测试生成和回归测试选择方面的关键局限性,并提出了改进方向。

Comments Accepted for publication (ISSTA '26)

详情
AI中文摘要

自动化程序修复(APR)智能体利用大型语言模型(LLMs)通过推理、规划和工具使用来自主诊断和修复软件缺陷。尽管在SWE-bench等基准测试上取得了令人印象深刻的排行榜成绩,但人们对这些智能体如何采取行动、在何处失败以及它们的行为与人类开发者相比如何知之甚少。本文首次对五个最先进的APR智能体在500个真实世界修复任务中进行了系统分析,追踪了它们从问题描述到补丁验证的完整决策流程。我们的研究揭示,虽然智能体擅长简单修复,但在逻辑密集型错误上表现挣扎,常常生成冗长或过拟合的补丁,这些补丁仅能满足现有测试。我们发现测试生成和回归测试选择仍然是主要瓶颈,智能体经常无法重现问题或运行相关的回归测试。此外,大多数智能体使用原始工具(如bash脚本),缺乏调试器或程序分析器的访问权限,这限制了它们的推理能力和补丁质量。这些发现突出了当前APR系统的关键局限性,并促使采用左移方法——强调早期高质量的测试生成和验证——以减少虚假修复并提高语义正确性。我们进一步概述了下一代APR设计的具体方向:(1)更丰富且更集成的工具生态系统,(2)结合互补优势的多样化智能体架构,以及(3)优先考虑语义修复质量和测试生成保真度而非表面成功指标的基准测试。

英文摘要

Automated Program Repair (APR) agents leverage Large Language Models (LLMs) to autonomously diagnose and fix software bugs through reasoning, planning, and tool use. Despite impressive leaderboard gains on benchmarks such as SWE-bench, little is understood about how these agents take actions, where they fail, and how their behavior compares to that of human developers. This paper presents the first systematic analysis of five state-of-the-art APR agents across 500 real-world repair tasks, tracing their full decision-making pipelines -- from issue description to patch validation. Our study reveals that while agents excel at simple fixes, they struggle with logic-intensive bugs, often producing verbose or overfitted patches that merely satisfy existing tests. We find that test generation and regression test selection remain major bottlenecks, with agents frequently failing to reproduce issues or run relevant regression tests. Moreover, most agents operate with primitive tooling (e.g., bash scripts) and lack access to debuggers or program analyzers, which constrains their reasoning and patch quality. These findings highlight key limitations in current APR systems and motivate a shift-left approach -- emphasizing early, high-quality test generation and validation -- to reduce spurious fixes and improve semantic correctness. We further outline concrete directions for next-generation APR design: (1) richer and more integrated tool ecosystems, (2) diversified agentic architectures that combine complementary strengths, and (3) benchmarks that prioritize semantic repair quality and test generation fidelity over surface-level success metrics.

2505.21771 2026-05-28 cs.CV cs.AI 版本更新

MMTABREAL: Real-World Benchmark for Multimodal Table Understanding

MMTABREAL:多模态表格理解的真实世界基准

Prasham Titiya, Jainil Trivedi, Chitta Baral, Vivek Gupta

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 针对多模态表格理解,构建了包含500个真实表格和4021个问答对的人工筛选基准MMTABREAL,评估发现现有模型在视觉定位、空间对齐和多步推理上存在20-40%的性能差距。

详情
AI中文摘要

多模态表格,即与图表、地图、图标和颜色编码交织的表格布局,在实际应用中无处不在,但对多模态大语言模型(MLLMs)来说仍然困难。尽管在文本和图像理解方面取得了进展,但对以表格为中心的多模态推理的系统评估仍然有限。我们引入了MMTABREAL,一个多模态表格基准,包含人工筛选的500个真实世界表格及其对应的4021个问答对。MMTABREAL涵盖四种问题类型、五种推理类别和八种结构原型。对最先进模型的评估揭示了显著差距,特别是在视觉定位、空间对齐和多步推理方面,相对于现有基准性能下降了20-40%。这些结果强调了需要更紧密融合视觉与表格结构并支持显式数值/逻辑运算的架构。MMTABREAL仅用于评估,提供了一个严谨、可复现的测试平台,反映了真实世界多模态表格的语言、结构和推理复杂性。

英文摘要

Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet remain difficult for Multimodal Large Language Models (MLLMs). Despite advances in text and image understanding, systematic evaluation of table-centric multimodal reasoning is limited. We introduce MMTABREAL, a MultiModal Table Benchmark: a human-curated suite of 500 real-world tables paired with 4,021 question-answer pairs. MMTABREAL spans four question types, five reasoning categories, and eight structural archetypes. Evaluations of state-of-the-art models reveal substantial gaps, especially in visual grounding, spatial alignment, and multi-step inference, with 20-40% performance drops relative to existing benchmarks. These results highlight the need for architectures that more tightly fuse vision with tabular structure and support explicit numeric/logical operations. MMTABREAL is released for evaluation only, providing a rigorous, reproducible testbed that reflects the linguistic, structural, and reasoning complexity of real-world multimodal tables.

2502.05242 2026-05-28 cs.CL cs.AI cs.CV cs.LG 版本更新

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

超越外部监控:增强大型语言模型的透明度以便于监控

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) ICISEE, Shanghai Jiao Tong University(上海交通大学ICISEE) School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University(上海交通大学数学科学学院) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 提出TELLME方法,通过改进大型语言模型的内部表征透明度,帮助监控者识别不当和敏感行为,并在去毒化任务中验证其有效性。

Comments 28 pages, 8 figures, 15 tables

详情
AI中文摘要

大型语言模型(LLMs)的能力日益增强,但其思维和决策过程的机制仍不清楚。思维链(CoTs)常被用来外化LLMs的思维,但这一策略未能准确反映LLMs的思维过程。基于LLMs隐藏表征的技术提供了内部视角,以改善对其潜在思维的可监控性。然而,以往的方法仅尝试开发外部模块,而非使LLMs本身更易于监控。本文提出了一种新方法TELLME,提高了LLMs的透明度,并帮助监控者识别不合适和敏感的行为。此外,我们在去毒化任务上展示了TELLME的有效性,LLMs在多模态测试集、不同架构和不同参数规模上均取得了一致的改进。我们进一步从最优传输理论和实证角度分析了TELLME对LLMs泛化能力的提升。

英文摘要

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.

2503.02857 2026-05-28 cs.CV cs.AI cs.CY 版本更新

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Deepfake-Eval-2024:2024年传播的深度伪造多模态野外基准

Nuria Alina Chandra, Hannah Lee, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Changyeon Lee, Jongwook Choi, Sejin Paik, Aerin Kim, Oren Etzioni

发表机构 * University of Washington(华盛顿大学) Allen Institute for Artificial Intelligence(艾伦人工智能研究所) University of Maryland(马里兰大学) Chung-Ang University(中央大学) Georgetown University(乔治城大学) Miraflow AI

AI总结 针对现有学术基准过时且不反映真实深度伪造的问题,提出包含2024年社交媒体和用户提交的多模态深度伪造基准Deepfake-Eval-2024,评估发现开源模型性能大幅下降,而商业模型和微调模型表现更优但未达到专家水平。

详情
AI中文摘要

在生成式人工智能日益逼真的时代,稳健的深度伪造检测对于减少欺诈和虚假信息至关重要。尽管许多深度伪造检测器在学术数据集上报告了高准确率,但我们表明这些学术基准已经过时,不能代表现实世界的深度伪造。我们引入了Deepfake-Eval-2024,这是一个新的深度伪造检测基准,由2024年从社交媒体和深度伪造检测平台用户收集的野外深度伪造组成。Deepfake-Eval-2024包含45小时的视频、56.5小时的音频和1,975张图像,涵盖了最新的操纵技术。该基准包含来自52种不同语言、88个不同网站的多样化媒体内容。我们发现,在Deepfake-Eval-2024上评估时,开源最先进的深度伪造检测模型的性能急剧下降,与之前的基准相比,视频模型的AUC下降了50%,音频模型下降了48%,图像模型下降了45%。我们还评估了商业深度伪造检测模型和在Deepfake-Eval-2024上微调的模型,发现它们比现成的开源模型性能更优,但尚未达到深度伪造取证分析师的准确率。数据集可在https://github.com/nuriachandra/Deepfake-Eval-2024获取。

英文摘要

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.

2505.19342 2026-05-28 cs.LG cs.AI 版本更新

ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

ASTRA:面向多设备Transformer推理的通信高效加速

Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan

发表机构 * Manning College of Information and Computer Sciences, University of Massachusetts Amherst, MA, US(信息与计算机科学学院,马萨诸塞大学阿默斯特分校,马萨诸塞州,美国) Amazon, Seattle, WA, US(亚马逊,华盛顿州西雅图,美国) Amazon AWS, Washington, D.C., US(亚马逊AWS,华盛顿特区,美国)

AI总结 提出ASTRA框架,通过序列并行与混合精度注意力机制,在低带宽环境下实现高效多设备Transformer推理,显著加速并保持精度。

详情
AI中文摘要

多设备推理可以通过并行计算降低Transformer延迟。然而,现有方法需要高设备间带宽,使其在带宽受限环境中不实用。我们提出ASTRA,一个通信高效的框架,将序列并行与混合精度注意力相结合,其中非局部token嵌入作为低位向量量化码传输,而局部注意力保持全精度。为了在激进压缩下保持精度,ASTRA引入了噪声增强量化和分布式分类token。在视觉和语言模型(如ViT和GPT2)上,ASTRA在低至10 Mbps的带宽下,相比单设备推理实现了高达2.64倍的加速,相比先前的多设备基线实现了高达15.25倍的加速。即使在非理想网络条件(如丢包和动态网络)下,ASTRA在大模型(如Llama-3-8B)上仍然保持鲁棒性。

英文摘要

Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64$\times$ speedup over single-device inference and up to 15.25$\times$ over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.
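
A back-of-the-envelope calculation shows why sending vector-quantized codes instead of full-precision embeddings matters at 10 Mbps. The Python sketch below is an assumption-laden illustration, not ASTRA's actual configuration: the embedding width (768), token count (196), number of code groups (32), and codebook size (1024) are hypothetical values picked for the arithmetic, and protocol overhead is ignored.

```python
from math import ceil, log2


def full_precision_bits(dim: int, bits_per_value: int = 32) -> int:
    """Bits needed to send one token embedding uncompressed."""
    return dim * bits_per_value


def vq_code_bits(num_groups: int, codebook_size: int) -> int:
    """Bits needed to send one token as vector-quantization indices.

    The embedding is split into `num_groups` sub-vectors, and each is
    replaced by the index of its nearest entry in a shared codebook.
    """
    return num_groups * ceil(log2(codebook_size))


def transfer_ms(num_tokens: int, bits_per_token: int, bandwidth_mbps: float) -> float:
    """Ideal transfer time in milliseconds, ignoring protocol overhead."""
    return num_tokens * bits_per_token / (bandwidth_mbps * 1e6) * 1e3


# Hypothetical settings, chosen only to illustrate the arithmetic.
dim, num_tokens, bandwidth = 768, 196, 10.0
raw_bits = full_precision_bits(dim)
code_bits = vq_code_bits(num_groups=32, codebook_size=1024)

raw_ms = transfer_ms(num_tokens, raw_bits, bandwidth)
code_ms = transfer_ms(num_tokens, code_bits, bandwidth)
compression_factor = raw_bits / code_bits
```

Under these assumed numbers the per-exchange payload shrinks by well over an order of magnitude, which is the kind of reduction that lets communication stop dominating latency on slow links. Whether accuracy survives such compression is a separate question, which the abstract says ASTRA addresses with Noise-Augmented Quantization and Distributed Class Tokens.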

2309.17057 2026-05-28 cs.AI 版本更新

Tell Me a Story! Narrative-Driven XAI with Large Language Models

给我讲个故事!基于大语言模型的叙事驱动可解释人工智能

David Martens, James Hinns, Camille Dams, Mark Vergouwen, Theodoros Evgeniou

发表机构 * University of Antwerp, Department of Engineering Management(安特卫普大学工程管理系)

AI总结 提出XAIstories方法,利用大语言模型将SHAP或反事实解释转化为叙事,提升用户对AI决策的理解和体验,实验表明超90%普通用户认为叙事可信,数据科学家83%愿意使用。

详情
Journal ref
10.1016/j.dss.2025.114402
AI中文摘要

在当今许多AI应用中,黑盒机器学习模型因其通常更高的准确性而占据主导地位,这加剧了对可解释人工智能(XAI)的需求。现有的XAI方法,例如广泛使用的SHAP值或反事实(CF)解释,对于用户来说往往过于技术化,难以理解和采取行动。为了增强对AI决策解释的理解和整体用户体验,我们引入了XAIstories,它利用大语言模型提供关于AI预测如何做出的叙述:SHAPstories基于SHAP解释,而CFstories基于CF解释。我们研究了我们的方法对用户体验和理解AI预测的影响。结果令人瞩目:超过90%的受访普通公众认为SHAPstories生成的叙述令人信服。数据科学家主要看到SHAPstories在向普通公众传达解释方面的价值,83%的数据科学家表示他们可能会为此目的使用SHAPstories。在图像分类设置中,超过75%的参与者认为CFstories至少与他们自己编写的故事同样令人信服。此外,CFstories在创建叙述方面带来了十倍的加速。我们还发现,在我们测试的信用评分设置中,SHAPstories帮助用户更准确地总结和理解AI决策,正确回答理解问题的频率显著高于仅提供SHAP值时。因此,结果表明XAIstories可能显著帮助解释和理解AI预测,最终支持各种应用中的更好决策。

英文摘要

In many AI applications today, the predominance of black-box machine learning models, due to their typically higher accuracy, amplifies the need for Explainable AI (XAI). Existing XAI approaches, such as the widely used SHAP values or counterfactual (CF) explanations, are arguably often too technical for users to understand and act upon. To enhance comprehension of explanations of AI decisions and the overall user experience, we introduce XAIstories, which leverage Large Language Models to provide narratives about how AI predictions are made: SHAPstories do so based on SHAP explanations, while CFstories do so for CF explanations. We study the impact of our approach on users' experience and understanding of AI predictions. Our results are striking: over 90% of the surveyed general audience finds the narratives generated by SHAPstories convincing. Data scientists primarily see the value of SHAPstories in communicating explanations to a general audience, with 83% of data scientists indicating they are likely to use SHAPstories for this purpose. In an image classification setting, more than 75% of the participants consider CFstories at least as convincing as their own crafted stories. CFstories additionally bring a tenfold speed gain in creating a narrative. We also find that, in the credit scoring setting we test, SHAPstories help users summarize and understand AI decisions more accurately: users answer comprehension questions correctly significantly more often than they do when only SHAP values are provided. The results thereby suggest that XAIstories may significantly help in explaining and understanding AI predictions, ultimately supporting better decision-making in various applications.

2505.09861 2026-05-28 cs.LG cs.AI cs.IR stat.ME 版本更新

LiDDA: Data Driven Attribution at LinkedIn

LiDDA:领英的数据驱动归因

John Bencina, Erkut Aykutlug, Yue Chen, Zerui Zhang, Stephanie Sorenson, Shao Tang, Changshuai Wei

发表机构 * LinkedIn Corporation(LinkedIn公司)

AI总结 提出一种基于Transformer的统一归因方法,处理成员级、聚合级数据和外部宏观因素,并在领英大规模实施,显著提升营销效果。

详情
AI中文摘要

数据驱动归因基于从数据中学习到的因果模式,将转化功劳分配给营销互动,是现代营销智能的基础,对任何营销业务和广告平台至关重要。本文介绍了一种统一的基于Transformer的归因方法,能够处理成员级数据、聚合级数据以及外部宏观因素的整合。我们详细描述了该方法在领英的大规模实施,展示了显著的影响。我们还分享了广泛适用于营销和广告技术领域的经验与见解。

英文摘要

Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foundation of modern marketing intelligence and vital to any marketing business and advertising platform. In this paper, we introduce a unified transformer-based attribution approach that can handle member-level data, aggregate-level data, and the integration of external macro factors. We detail the large-scale implementation of the approach at LinkedIn, showcasing significant impact. We also share learnings and insights that are broadly applicable to the marketing and ad tech fields.

2504.04540 2026-05-28 cs.CV cs.AI 版本更新

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

点、视觉与文本:点云能否提升大语言模型的空间推理能力?一项偏差控制研究

Weichen Zhang, Ruiying Peng, Xin Zeng, Jianjie Fang, Ziyou Wang, Kaiyuan Li, Heng Dong, Wei Li, Chen Gao, Xin Wang, Xinlei Chen, Yong Li

发表机构 * Tsinghua University(清华大学) ByteDance Seed(字节跳动Seed)

AI总结 本文通过引入包含文本、视觉和点云模态的3D空间推理基准ScanReQA,评估不同模态下大语言模型的空间推理能力,发现点云和视觉模态的模型表现优于纯文本模型,并揭示了3D大语言模型中的注意力下沉现象。

详情
AI中文摘要

利用点云中空间信息进行3D空间推理的3D大语言模型(LLMs)引起了广泛关注。尽管取得了一些有希望的结果,但点云相对于其他模态的优势仍不明确。此外,现有的3D基准不足以公平评估多模态大语言模型理解空间概念的能力。为了解决这些挑战,我们引入了ScanReQA,一个涵盖文本、视觉和点云模态的3D空间推理基准。然后,我们评估了文本、2D和3D大语言模型在该基准上的性能,以比较不同模态在理解空间概念方面的有效性。此外,我们分析了使用点云的3D大语言模型背后的推理机制。我们的发现表明:1)二元空间推理对当前的3D大语言模型仍然具有挑战性;2)基于点云和视觉模态的多模态大语言模型展现出比大语言模型更强的空间推理能力;3)3D大语言模型表现出类似于2D大语言模型中的注意力下沉现象,这损害了空间推理。我们认为这些结论有助于3D大语言模型的下一步发展,并为其他模态的基础模型提供见解。我们在项目页面发布了数据集和代码:https://github.com/EmbodiedCity/ScanReQA.code。

英文摘要

3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing text, vision, and point cloud modalities. We then evaluate the performance of text, 2D, and 3D LLMs on the benchmark to compare the effectiveness of different modalities in understanding spatial concepts. Furthermore, we analyze the reasoning mechanisms behind 3D LLMs using point clouds. Our findings reveal that: 1) binary spatial reasoning remains challenging for current 3D LLMs, 2) MLLMs based on point cloud and visual modalities demonstrate stronger spatial reasoning capabilities than LLMs, and 3) 3D LLMs exhibit an attention sink phenomenon similar to that in 2D LLMs, which impairs spatial reasoning. We think these conclusions can inform the next steps for 3D LLMs and also offer insights for foundation models in other modalities. We release the datasets and code on the project page: https://github.com/EmbodiedCity/ScanReQA.code.

2503.22655 2026-05-28 cs.AI cs.CV cs.MM 版本更新

Text-Only Data Synthesis for Vision Language Model Training

仅文本数据合成用于视觉语言模型训练

Xiaomin Yu, Wenjie Zhang, Ziyue Qiao, Chengwei Qin, Hui Xiong

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Great Bay University(大湾区大学)

AI总结 提出一个跨集成的三阶段多模态数据合成框架,仅从文本生成高质量多模态训练数据,用于视觉语言模型的预训练和指令微调。

详情
AI中文摘要

训练视觉语言模型(VLM)通常需要大规模、高质量的图像-文本对,但收集或合成此类数据成本高昂。相比之下,文本数据丰富且廉价,这引发了一个问题:能否仅从文本中合成高质量的多模态训练数据?为解决这一问题,我们提出了一个跨集成的三阶段多模态数据合成框架,生成了两个数据集:Unicorn-1.2M和Unicorn-471K-Instruction。在第一阶段:多样化字幕数据合成,我们通过使用大语言模型(LLM)扩展稀疏字幕种子,构建了120万语义多样的高质量字幕。在第二阶段:指令微调数据生成,我们进一步将47.1万个字幕处理为多轮指令微调任务,以支持复杂推理。最后,在第三阶段:模态表示迁移,这些文本字幕表示被转换为视觉表示,从而产生多样化的合成图像表示。这一三阶段过程使我们能够构建用于预训练的Unicorn-1.2M和用于指令微调的Unicorn-471K-Instruction,而无需依赖真实图像。通过消除对真实图像的依赖,同时保持数据质量和多样性,我们的框架为VLM训练提供了一种成本效益高且可扩展的解决方案。

英文摘要

Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data synthesis framework, which generates two datasets: Unicorn-1.2M and Unicorn-471K-Instruction. In Stage 1: Diverse Caption Data Synthesis, we construct 1.2M semantically diverse high-quality captions by expanding sparse caption seeds using large language models (LLMs). In Stage 2: Instruction-Tuning Data Generation, we further process 471K captions into multi-turn instruction-tuning tasks to support complex reasoning. Finally, in Stage 3: Modality Representation Transfer, these textual caption representations are transformed into visual representations, resulting in diverse synthetic image representations. This three-stage process enables us to construct Unicorn-1.2M for pretraining and Unicorn-471K-Instruction for instruction-tuning, without relying on real images. By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training.

2503.11477 2026-05-28 cs.AI 版本更新

Heterogeneous Causal Discovery of Repeated Undesirable Health Outcomes

重复不良健康结果的异质因果发现

Shishir Adhikari, Guido Muscioni, Mark Shapiro, Plamen Petrov, Elena Zheleva

发表机构 * Department of Computer Science, University of Illinois Chicago, Chicago, IL, USA; Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Elevance Health (formerly Anthem Inc.), Chicago, IL, USA

AI总结 提出一个集成因果结构学习算法与异质因果效应估计的端到端框架,用于从观测数据中发现稳健且可解释的因果假设,并在两个大规模医疗应用中验证其有效性。

详情
AI中文摘要

理解触发或预防患者亚群中不良健康结果的因素对于设计针对性干预措施至关重要。虽然随机对照试验和专家主导的患者访谈是识别这些因素的标准方法,但它们可能耗时或不可行。因果发现通过从观测数据中生成因果假设提供了传统方法的替代方案,但其实际效用受到强假设或不可检验假设的限制。本文提出了一种新颖的端到端框架,该框架独特地集成了因果结构学习算法集成与异质因果效应估计。通过聚合多个算法的结果,该框架识别出在不同建模假设下仍然存在的稳健因果关系,同时揭示这些效应如何在特定患者情境中变化。所提出的异质因果发现框架提高了稳健性,并为实践者提供了一组优先排序的、可操作的、临床可解释的假设。我们通过两个大规模医疗应用展示了该框架的有效性:使用保险索赔和电子健康记录数据集,识别糖尿病患者重复急诊就诊和ICU患者再入院的驱动因素和抑制因素。我们在两种情境下的结果均将慢性病管理和护理协调确定为关键干预措施,同时揭示干预效果取决于特定的患者层面修饰因素。我们采用多层验证策略,包括通过模拟恢复真实情况、与临床文献对齐、由临床专家验证以及在现代医疗系统中使用外部数据集进行可移植性验证,以展示该框架的实际效用。

英文摘要

Understanding the factors that trigger or prevent undesirable health outcomes across patient subpopulations is essential for designing targeted interventions. While randomized controlled trials and expert-led patient interviews are standard methods for identifying these factors, they can be time-consuming or infeasible. Causal discovery offers an alternative to conventional approaches by generating cause-and-effect hypotheses from observational data, yet its practical utility is limited by strong or untestable assumptions. This work presents a novel, end-to-end framework that uniquely integrates an ensemble of causal structure learning (CSL) algorithms with heterogeneous causal effect estimation. By aggregating results across multiple algorithms, the framework identifies robust causal relationships that persist under different modeling assumptions while simultaneously revealing how these effects vary across specific patient contexts. The proposed heterogeneous causal discovery framework improves robustness and provides practitioners with a prioritized set of actionable, clinically interpretable hypotheses. We demonstrate the framework's effectiveness through two large-scale healthcare applications: identifying drivers and inhibitors of repeat emergency department visits among diabetic patients and hospital readmissions among ICU patients, using insurance claims and electronic health record datasets. Our results, across both settings, identify chronic disease management and care coordination as key interventions, while revealing that intervention effectiveness depends on specific patient-level modifiers. We employ a multi-layered validation strategy, including ground-truth recovery via simulations, alignment with clinical literature, validation by expert clinicians, and portability in modern healthcare systems using an external dataset, to demonstrate the framework's practical utility.

2503.04863 2026-05-28 cs.CV cs.AI 版本更新

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Manboformer: 通过时空注意力机制学习高斯表示

Ziyue Zhao, Qining Qi, Jianfa Ma

AI总结 针对自动驾驶3D语义占用预测中高斯表示性能不足的问题,提出利用时空自注意力机制优化GaussianFormer,以提升模型性能。

Comments After careful self-check, we found several unnoticed deficiencies and incomplete discussions in this manuscript. To ensure the rigor and accuracy of academic results, we decide to withdraw this preprint. A refined, complete, and rigorous version will be submitted soon

详情
AI中文摘要

与基于体素的网格预测相比,在自动驾驶的3D语义占用预测领域,GaussianFormer提出使用3D高斯来描述场景,基于对象的稀疏3D语义高斯是另一种内存需求更低的方案。每个3D高斯函数表示一个灵活的兴趣区域及其语义特征,通过注意力机制迭代优化。实验中发现,该方法所需的高斯函数数量大于原始密集网格网络的查询分辨率,导致性能受损。因此,我们考虑通过利用未使用的时序信息来优化GaussianFormer。我们从先前的网格占用网络中学习时空自注意力机制,并将其改进应用于GaussianFormer。实验使用NuScenes数据集进行,目前正在进行中。

英文摘要

In 3D semantic occupancy prediction for autonomous driving, GaussianFormer proposed describing scenes with sparse, object-centric 3D semantic Gaussians, an alternative to voxel-based grid prediction with lower memory requirements. Each 3D Gaussian represents a flexible region of interest and its semantic features, which are iteratively refined by an attention mechanism. In our experiments, we find that the number of Gaussians this method requires exceeds the query resolution of the original dense grid network, which impairs performance. We therefore consider optimizing GaussianFormer by exploiting unused temporal information. We adapt the spatial-temporal self-attention mechanism from previous grid-based occupancy networks and apply it to GaussianFormer. The experiments use the NuScenes dataset and are currently underway.

2502.12468 2026-05-28 cs.LG cs.AI 版本更新

MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

MCTS-Judge:LLM作为裁判在代码正确性评估中的测试时扩展

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti

发表机构 * Independent Contributor(独立贡献者)

AI总结 提出MCTS-Judge框架,利用蒙特卡洛树搜索在测试时进行多视角分解评估,显著提升LLM作为裁判在代码正确性评估中的准确性和效率。

详情
AI中文摘要

LLM作为裁判的范式在评估生成内容方面显示出潜力,但在推理密集型场景(如编程)中缺乏可靠性。受近期推理模型进展和扩展定律变化的启发,我们率先将测试时计算引入LLM作为裁判,提出MCTS-Judge,一种资源高效的、系统2思维框架,用于代码正确性评估。MCTS-Judge利用蒙特卡洛树搜索(MCTS)将问题分解为更简单的、多视角的评估。通过结合基于当前轨迹中历史动作的自我评估和基于先前rollout的树的上置信界(UCT)的节点选择策略,MCTS-Judge平衡了全局优化和当前轨迹的细化。我们进一步设计了一种高精度的、单元测试级别的奖励机制,以鼓励大语言模型(LLM)进行逐行分析。在三个基准和五个LLM上的大量实验证明了MCTS-Judge的有效性,它将基础模型的准确率从41%提升到80%,同时比o1系列模型少使用3倍的token。进一步的评估验证了其推理轨迹在逻辑、分析、全面性和整体质量上的优越性,同时揭示了LLM作为裁判范式的测试时扩展定律。

英文摘要

The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the Large Language Model (LLM) to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model's accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.
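
The abstract mentions node selection based on the Upper Confidence Bound for Trees (UCT). As a rough illustration only, the sketch below implements the textbook UCT rule (mean value plus an exploration bonus). It is not the paper's selection strategy, which additionally combines self-assessment over the current trajectory; the tuple layout and the exploration constant `c` are assumptions made for this example.

```python
import math


def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Textbook UCT: average value plus an exploration bonus.

    `c` is a hypothetical exploration constant, not a value from the paper.
    """
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child with the highest UCT score.

    `children` is a list of (name, total_value, visits) tuples, a layout
    assumed for this sketch.
    """
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits))
    return best[0]
```

With equal visit counts the rule reduces to picking the child with the higher mean value, while any unvisited child takes priority over visited ones.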

2411.18502 2026-05-28 stat.ML cs.AI cs.IR cs.LG stat.ME 版本更新

Isometry pursuit

等距追踪

Samson Koelle, Marina Meila

发表机构 * Amazon(亚马逊) Department of Statistics University of Washington(华盛顿大学统计系)

AI总结 提出等距追踪算法,通过新颖的归一化方法和多任务基追踪识别宽矩阵中的正交列子矩阵,用于从可解释字典中发现等距嵌入。

详情
AI中文摘要

等距追踪是一种用于识别宽矩阵中正交列子矩阵的凸算法。它由一种新颖的归一化方法后接多任务基追踪组成。应用于假定坐标函数的雅可比矩阵时,它有助于从可解释字典中识别等距嵌入。我们提供了理论和实验结果来证明该方法的合理性。对于涉及坐标选择和多样化的问题,它提供了贪心搜索和暴力搜索的协同替代方案。

英文摘要

Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.
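
To make the underlying selection problem concrete, the sketch below shows the brute-force search that the abstract names as an alternative, not the convex isometry pursuit algorithm itself. It enumerates column subsets of a small wide matrix and keeps the one whose Gram matrix is closest to the identity. The function names and the squared-Frobenius criterion are assumptions chosen for illustration.

```python
from itertools import combinations


def gram_deviation(cols):
    """Squared Frobenius distance between the Gram matrix of `cols` and I."""
    dev = 0.0
    for i, u in enumerate(cols):
        for j, v in enumerate(cols):
            dot = sum(a * b for a, b in zip(u, v))
            target = 1.0 if i == j else 0.0
            dev += (dot - target) ** 2
    return dev


def most_orthonormal_subset(columns, k):
    """Exhaustively search all size-k column subsets (exponential cost)."""
    return min(
        combinations(range(len(columns)), k),
        key=lambda idx: gram_deviation([columns[i] for i in idx]),
    )


# Four candidate columns in R^2; only columns 0 and 2 form an orthonormal pair.
columns = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0), (2.0, 0.0)]
best = most_orthonormal_subset(columns, 2)
```

The number of subsets grows combinatorially with the dictionary size, which is the cost a convex relaxation is meant to avoid.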

2410.10241 2026-05-28 cs.LG cs.AI stat.ML 版本更新

Revisiting Graph Autoencoders as Implicit Contrastive Learners

重新审视图自编码器作为隐式对比学习器

Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Zulun Zhu, Liang Chen

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University(教育部多媒体可信感知与高效计算重点实验室,厦门大学) Coupang Shanghai China(Coupang上海) Sun Yat-sen University(中山大学) Nanyang Technological University(南洋理工大学)

AI总结 本文通过对比学习视角重新审视图自编码器,揭示其隐式对比学习本质,并强调对比视图设计的关键作用,提出非对称子图视图作为重要设计维度。

Comments KDD 2026 research track. Code available at https://github.com/EdisonLeeeee/lrGAE

详情
AI中文摘要

图自编码器(GAEs)和图对比学习(GCL)是图上自监督表示学习的两种主要范式,但它们通常被孤立研究并被视为根本不同的方法。在这项工作中,我们通过对比学习的视角重新审视GAEs,并表明基于结构和基于特征的GAEs都可以概念化为隐式图对比学习器。这一视角揭示了许多现有GAEs的主要区别在于对比视图的构建方式,而非学习目标或架构。基于这一见解,我们引入了一个统一公式,强调对比视图设计是GAEs中一个核心且先前较少探索的维度。特别是,我们识别出由子图视图不匹配产生的非对称对比视图,作为先前GAE研究中一个重要但未充分探索的设计轴。我们在统一框架内形式化这一见解,并在代表性图学习任务上进行系统实验,以检验其对性能和效率的影响。我们的结果表明,将GAEs解释为隐式对比学习器能更清晰地理解现有模型,并为设计有效且可扩展的图自编码器提供实用指导。

英文摘要

Graph autoencoders (GAEs) and graph contrastive learning (GCL) are two major paradigms for self-supervised representation learning on graphs, yet they are often studied in isolation and treated as fundamentally different approaches. In this work, we revisit GAEs through the lens of contrastive learning and show that both structure-based and feature-based GAEs can be conceptualized as implicit graph contrastive learners. This perspective reveals that many existing GAEs differ primarily in how contrastive views are constructed, rather than in their learning objectives or architectures. Building on this insight, we introduce a unified formulation that highlights contrastive view design as a central and previously less explored dimension in GAEs. In particular, we identify asymmetric contrastive views, arising from mismatches in subgraph views, as an important yet underexplored design axis in prior GAE research. We formalize this insight within a unified framework and conduct systematic experiments on representative graph learning tasks to examine its impact on performance and efficiency. Our results show that interpreting GAEs as implicit contrastive learners offers a clearer understanding of existing models and provides practical guidance for designing effective and scalable graph autoencoders.

2405.09586 2026-05-28 eess.IV cs.AI cs.CV 版本更新

Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation

事实序列化增强:胸部X光报告生成的关键创新

Kang Liu, Zhuoqi Ma, Mengmeng Liu, Zhicheng Jiao, Xiaolu Kang, Qiguang Miao, Kun Xie

发表机构 * School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Xi’an Key Laboratory of Big Data and Intelligent Vision(西安大数据与智能视觉重点实验室) Key Laboratory of Collaborative Intelligence Systems, Ministry of Education(教育部协同智能系统重点实验室) School of Artificial Intelligence, Xidian University(西安电子科技大学人工智能学院) Department of Diagnostic Imaging, Brown University(布朗大学诊断影像科)

AI总结 提出FSE两阶段事实序列化增强方法,通过事实引导对比学习和证据驱动报告生成,提升胸部X光报告生成的临床准确性和自然语言质量。

Comments Code is available at https://github.com/mk-runner/FSE

详情
AI中文摘要

放射学报告包含呈现式词汇(确保清晰和组织)和事实性词汇(基于可观察发现提供准确客观描述)。手动编写这些报告耗时费力,而自动报告生成提供了一种有前景的替代方案。该过程中的关键步骤是将X光片与其对应报告对齐。然而,现有方法通常依赖完整报告进行对齐,忽略了呈现式词汇的影响。为解决此问题,我们提出FSE,一种两阶段事实序列化增强方法。在第一阶段,我们引入事实引导的对比学习用于视觉表示,通过最大化X光片与对应事实描述之间的语义对应关系。在第二阶段,我们提出证据驱动的报告生成,通过整合来自类似历史病例的结构化事实序列化见解,增强诊断准确性。在MIMIC-CXR和IU X-ray数据集上的实验(涵盖特定和一般场景)表明,FSE在自然语言生成和临床效能指标上均优于最先进方法。消融研究进一步强调了第一阶段和第二阶段中事实序列化的积极作用。代码可在https://github.com/mk-runner/FSE获取。

英文摘要

A radiology report comprises presentation-style vocabulary, which ensures clarity and organization, and factual vocabulary, which provides accurate and objective descriptions based on observable findings. While manually writing these reports is time-consuming and labor-intensive, automatic report generation offers a promising alternative. A critical step in this process is to align radiographs with their corresponding reports. However, existing methods often rely on complete reports for alignment, overlooking the impact of presentation-style vocabulary. To address this issue, we propose FSE, a two-stage Factual Serialization Enhancement method. In Stage 1, we introduce factuality-guided contrastive learning for visual representation by maximizing the semantic correspondence between radiographs and corresponding factual descriptions. In Stage 2, we present evidence-driven report generation that enhances diagnostic accuracy by integrating insights from similar historical cases structured as factual serialization. Experiments on MIMIC-CXR and IU X-ray datasets across specific and general scenarios demonstrate that FSE outperforms state-of-the-art approaches in both natural language generation and clinical efficacy metrics. Ablation studies further emphasize the positive effects of factual serialization in Stage 1 and Stage 2. The code is available at https://github.com/mk-runner/FSE.

2407.21075 2026-05-28 cs.AI cs.CL cs.LG 版本更新

Apple Intelligence Foundation Language Models

Apple Intelligence 基础语言模型

Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Daniel Parilla, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday 
Rayala, Victor Cui, Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren

发表机构 * Apple(苹果公司)

AI总结 本文介绍了为 Apple Intelligence 功能开发的基础语言模型,包括一个约30亿参数的设备端高效运行模型和一个用于私有云计算的服务器端大模型,并描述了其架构、训练数据、优化过程和评估结果。

详情
AI中文摘要

我们介绍了为支持 Apple Intelligence 功能而开发的基础语言模型,包括一个约30亿参数的模型,旨在设备上高效运行,以及一个用于私有云计算的大型服务器端语言模型。这些模型旨在高效、准确且负责任地执行广泛的任务。本报告描述了模型架构、用于训练模型的数据、训练过程、模型如何针对推理进行优化以及评估结果。我们强调了对负责任人工智能的关注,以及这些原则如何贯穿于模型开发的整个过程。

英文摘要

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

2405.09689 2026-05-28 cs.LG cs.AI cs.SC 版本更新

Generalized Holographic Reduced Representations

广义全息约简表示

Calvin Yeung, Zhuowen Zou, SungHeon Jeong, Wenjun Huang, Nathaniel D Bastian, Mohsen Imani

发表机构 * University of California, Irvine(加州大学尔湾分校) United States Military Academy(美国军事学院)

AI总结 提出广义全息约简表示(GHRR),通过灵活的非交换绑定操作改进超维计算对复杂组合结构的编码能力,并在语言建模任务中验证其可替代注意力机制并提升性能。

详情
AI中文摘要

超维计算(HDC)是一种计算和数据高效的范式,在连接主义和符号主义人工智能方法之间架起桥梁。然而,HDC的简单性给编码复杂组合结构带来了挑战,尤其是在其绑定操作中。为了解决这个问题,我们提出了广义全息约简表示(GHRR),它是傅里叶全息约简表示(FHRR)的扩展,FHRR是一种特定的HDC实现。GHRR引入了一种灵活的非交换绑定操作,能够改进复杂数据结构的编码,同时保留HDC的鲁棒性和透明性等理想特性。在这项工作中,我们介绍了GHRR框架,证明了其理论性质及其对HDC性质的遵循,探索了其核和绑定特性,并通过实证实验展示了其灵活的非交换性以及对组合结构增强的解码准确性。我们还证明了GHRR中的绑定比其他HDC变体更具表现力;特别地,我们展示了GHRR中的绑定可以实现一种注意力机制。我们通过在Transformer中将其注意力机制替换为GHRR等价物并在语言建模任务上进行测试来验证这一点,结果显示与普通Transformer相比性能有所提升。

英文摘要

Hyperdimensional Computing (HDC) is a computationally and data-efficient paradigm that acts as a bridge between connectionist and symbolic approaches to artificial intelligence (AI). However, HDC's simplicity poses challenges for encoding complex compositional structures, especially in its binding operation. To address this, we propose Generalized Holographic Reduced Representations (GHRR), an extension of Fourier Holographic Reduced Representations (FHRR), a specific HDC implementation. GHRR introduces a flexible, non-commutative binding operation, enabling improved encoding of complex data structures while preserving HDC's desirable properties of robustness and transparency. In this work, we introduce the GHRR framework, prove its theoretical properties and its adherence to HDC properties, explore its kernel and binding characteristics, and perform empirical experiments showcasing its flexible non-commutativity and enhanced decoding accuracy for compositional structures. We also demonstrate that binding in GHRR is more expressive than that in other HDC variants; in particular, we show that binding in GHRR can implement a kind of attention mechanism. We verify this by replacing the attention mechanism in a transformer with its GHRR equivalent and testing it on a language modeling task, showing improved performance compared to a vanilla transformer.
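
For context on the baseline that GHRR extends, the sketch below illustrates binding in plain FHRR, where hypervectors are vectors of unit-modulus complex phasors and binding is element-wise multiplication. It does not implement GHRR. The helper names and the dimension are assumptions for this example. The final check highlights that FHRR binding is commutative, the property GHRR relaxes according to the abstract.

```python
import cmath
import math
import random


def random_hv(dim, rng):
    """Random FHRR hypervector: `dim` unit-modulus complex phasors."""
    return [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(dim)]


def bind(a, b):
    """FHRR binding: element-wise complex multiplication (phases add)."""
    return [x * y for x, y in zip(a, b)]


def unbind(c, a):
    """Recover b from c = bind(a, b) by multiplying with the conjugate of a."""
    return [x * y.conjugate() for x, y in zip(c, a)]


def similarity(a, b):
    """Mean real part of a * conj(b); 1.0 for identical hypervectors."""
    return sum((x * y.conjugate()).real for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
role, filler = random_hv(4096, rng), random_hv(4096, rng)
bound = bind(role, filler)
recovered = unbind(bound, role)
```

Because element-wise multiplication commutes, `bind(role, filler)` and `bind(filler, role)` are indistinguishable here, which limits how directly order or nesting can be encoded with this operation alone.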

2308.07772 2026-05-28 cs.LG cs.AI 版本更新

MOLE: MOdular Learning FramEwork via Mutual Information Maximization

MOLE: 基于互信息最大化的模块化学习框架

Tianchao Li, Yulong Pei

发表机构 * Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, the Netherlands(埃因霍温理工大学数学与计算机科学系,荷兰埃因霍温)

AI总结 提出一种异步局部学习框架MOLE,通过层间模块化与互信息最大化实现梯度隔离的局部优化,适用于向量、网格和图数据,并在图级别和节点级别任务上验证了通用性。

Comments Accepted by ICML LLW

详情
Journal ref ICML Workshop on Localized Learning (LLW) 2023
AI中文摘要

本文介绍了一种名为模块化学习框架(MOLE)的异步局部学习框架,用于神经网络。该框架按层将神经网络模块化,通过互信息为每个模块定义训练目标,并通过互信息最大化依次训练每个模块。MOLE使训练成为跨模块梯度隔离的局部优化,这种方案比BP更具生物合理性。我们在向量、网格和图类型数据上进行了实验。特别地,该框架能够解决图类型数据的图级别和节点级别任务。因此,MOLE已被实验证明可普遍适用于不同类型的数据。

英文摘要

This paper introduces an asynchronous and local learning framework for neural networks, named Modular Learning Framework (MOLE). This framework modularizes neural networks by layers, defines the training objective for each module via mutual information, and sequentially trains each module by mutual information maximization. MOLE turns training into local optimization with gradients isolated across modules, a scheme that is more biologically plausible than backpropagation (BP). We run experiments on vector-, grid- and graph-type data. In particular, this framework is capable of solving both graph- and node-level tasks for graph-type data. Therefore, MOLE has been experimentally shown to be broadly applicable to different types of data.

2304.12986 2026-05-28 cs.CL cs.AI 版本更新

Measuring Massive Multitask Chinese Understanding

测量大规模多任务中文理解

Hui Zeng

发表机构 * Besteasy (Beijing) Language Technology Co., Ltd.(北京最佳语言科技有限公司)

AI总结 针对中文大语言模型缺乏能力评估的问题,提出一个涵盖医学、法律、心理学和教育四大领域共23个子任务的多任务测试,通过零样本准确率评估模型性能,发现最佳模型平均领先最差模型18.6个百分点,且所有模型在法律领域表现最差。

详情
AI中文摘要

大规模中文语言模型的发展蓬勃,但缺乏相应的能力评估。因此,我们提出一个测试来衡量大型中文语言模型的多任务准确性。该测试涵盖四大领域,包括医学、法律、心理学和教育,其中医学有15个子任务,教育有8个子任务。我们发现,在零样本设置中,表现最好的模型平均比表现最差的模型高出近18.6个百分点。在四大领域中,所有模型的最高平均零样本准确率为0.512。在子领域中,只有GPT-3.5-turbo模型在临床医学上达到了0.693的零样本准确率,这是所有模型在所有子任务中的最高准确率。所有模型在法律领域表现不佳,最高零样本准确率仅为0.239。通过全面评估多个学科知识的广度和深度,该测试可以更准确地识别模型的不足之处。

英文摘要

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.

2305.06426 2026-05-28 cs.AI cs.SY eess.SY math.OC 版本更新

Planning a Community Approach to Diabetes Care in Low- and Middle-Income Countries Using Optimization

使用优化规划中低收入国家糖尿病护理的社区方法

Katherine B. Adams, Justin J. Boutilier, Sarang Deo, Yonatan Mintz

发表机构 * Department of Operations and Analytics, University of Texas at San Antonio(运营与分析系,德克萨斯大学圣安东尼奥分校) Telfer School of Management, University of Ottawa(渥太华大学特尔弗管理学院) Max Institute of Healthcare Management, Indian School of Business(印度商学院Max医疗管理研究所) Department of Industrial and Systems Engineering, University of Wisconsin-Madison(工业与系统工程系,威斯康星大学麦迪逊分校)

AI总结 提出一个优化框架,通过个性化社区卫生工作者访视计划,在社区层面最大化血糖控制,并平衡筛查新患者与管理已登记患者的资源分配。

Comments 50 pages, 13 figures

详情
AI中文摘要

糖尿病是全球健康优先事项,尤其是在中低收入国家,超过50%的过早死亡归因于高血糖。社区卫生工作者项目可以提供负担得起且文化适宜的解决方案,用于糖尿病的早期发现和管理。我们引入了一个优化框架,用于确定个性化的社区卫生工作者访视,以在社区层面最大化血糖控制。我们的框架明确建模了筛查新患者与为已登记治疗患者提供管理访视之间的权衡。我们考虑了患者的动机状态,这会影响他们决定加入或退出治疗,从而影响干预的有效性。通过估计患者的健康和动机状态,我们的模型在构建访视计划时考虑了患者在决定加入治疗时的权衡,从而降低了退出率并改善了资源分配。我们应用该方法,使用印度城市贫民窟的运营数据生成社区卫生工作者访视计划。我们发现,与最佳基线方法相比,我们的方法在相同能力下可将空腹血糖降低高达25%。我们的实验还表明,该方法在不完美信息下表现良好。

英文摘要

Diabetes is a global health priority, especially in low- and middle-income countries, where over 50% of premature deaths are attributed to high blood glucose. Community Health Worker (CHW) programs can provide affordable and culturally tailored solutions for early detection and management of diabetes. We introduce an optimization framework to determine personalized CHW visits that maximize glycemic control at a community level. Our framework explicitly models the trade-off between screening new patients and providing management visits to individuals who are enrolled in treatment. We account for patients' motivational states, which affect their decisions to enroll in or drop out of treatment and, therefore, the effectiveness of the intervention. By estimating patients' health and motivational states, our model builds visit plans that account for patients' trade-offs when deciding to enroll in treatment, leading to reduced dropout rates and improved resource allocation. We apply our approach to generate CHW visit plans using operational data from urban slums in India. We find that our approach can reduce fasting blood glucose by up to 25% with the same capacity as the best baseline method. Our experiments also demonstrate that our approach performs well with imperfect information.