arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30152 2026-05-29 cs.CL cs.AI cs.HC

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao

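AI summary Proposes replacing the LLM with a temporal-graph-learning (TGL) model as the trigger for proactive agents, processing user activity as graph updates rather than text to make efficient, low-latency trigger decisions.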

Comments 31 pages, 5 figures, 7 tables

Abstract

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.
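
The abstract's framing of activity as (actor, verb, object, timestamp) tuples applied as graph updates can be illustrated with a minimal Python sketch. The `Event` and `TemporalGraph` names, the bidirectional edge layout, and the `recent_neighbors` helper are hypothetical choices made for illustration only; the abstract does not describe the paper's actual graph schema or its TGL encoder.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """One activity record; field names are illustrative, not the paper's."""
    actor: str
    verb: str
    obj: str
    timestamp: float


class TemporalGraph:
    """Toy temporal graph: each event becomes a timestamped, verb-labelled edge."""

    def __init__(self):
        # node -> list of (neighbor, verb, timestamp), in arrival order
        self.edges = defaultdict(list)

    def apply(self, event):
        """Apply one event as a graph update (edge stored in both directions)."""
        self.edges[event.actor].append((event.obj, event.verb, event.timestamp))
        self.edges[event.obj].append((event.actor, event.verb, event.timestamp))

    def recent_neighbors(self, node, since):
        """Neighbors of `node` reached by edges with timestamp >= since."""
        return [n for n, _, ts in self.edges.get(node, []) if ts >= since]
```

In this sketch an always-on component would call `apply` once per event and hand the updated structure to a small encoder, so no text rendering is needed before the trigger decision.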

2605.30151 2026-05-29 cs.AI

Temporal Stability and Few-Shot Prompting in Math Task Assessment

Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

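AI summary This longitudinal study evaluates the temporal stability of AI tools and the effect of few-shot prompting when classifying the cognitive demand of mathematics tasks, and finds that prompt engineering improves performance more than model version updates do.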

Comments 23 pages, 1 figure

Abstract

As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein & Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category). Results revealed that newer model versions alone produced mixed effects: Gemini's accuracy remained stable at 58%, while Coteach's accuracy decreased from 75% to 50%. However, few-shot prompting improved both models' performance: Gemini increased to 67% and Coteach recovered to 75% accuracy. These findings demonstrate that prompt engineering techniques can have larger and more reliable effects than passive model improvements, and that version updates may not always improve performance on specialized educational tasks. The study has important implications for how educators and researchers should approach AI tool selection, evaluation, and implementation in educational contexts.

2605.30150 2026-05-29 cs.AI

Anchorless Diversification for Parallel LLM Ideation

Fares Nabil Ibrahim, Nafis Saami Azad, Raiyan Abdul Baten

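AI summary Studies anchorless methods, such as semantic direction stratification, for diversifying candidate pools in parallel LLM ideation without relying on seed ideas; they outperform anchored baselines on diversity, quality, and compute efficiency.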

Abstract

LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can be attractive in this setting when it broadens the pool while retaining quality and cost efficiency. We study inference-time controls for candidate-pool diversification, asking whether anchorless methods can rival methods that depend on observed seed ideas. Across three creative task families, we compare independent generation and semantic direction stratification with self-, peer-, and representative-anchor baselines, under neutral and population-referential divergent instructions. Population-referential divergence is a strong low-cost baseline, increasing semantic diversity while preserving quality proxies. Semantic direction stratification is stronger: a single planning call organizes generations across broad semantic directions, yielding the best diversity--quality--compute frontier. Anchored regeneration can be strong in final-pool diversity, but its advantage shrinks under full-pipeline token accounting. These results establish practical anchorless baselines for open-ended LLM ideation.

2605.30148 2026-05-29 cs.LG cs.AI

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu

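AI summary Finds that prior-task forgetting in Evolution Strategies (ES) fine-tuning is in fact recoverable performance drift, and introduces Anchored Weight Decay (AWD), a regularization technique that effectively stabilizes prior-task performance, showing that forgetting is avoidable and making ES a viable approach to continual learning in LLMs.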

Abstract

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.
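
The idea of regularizing toward the initial parameters can be sketched in a few lines of Python. This is only one plausible reading of the abstract, not the paper's formulation: it assumes the penalty shrinks the offset `theta - theta0` with a coefficient `lam`, and it uses a standard antithetic Gaussian ES estimator. The function name, hyperparameters, and update form are all assumptions.

```python
import random


def es_step_with_awd(theta, theta0, fitness, sigma=0.1, lr=0.05, lam=0.1,
                     pop=16, rng=None):
    """One ES update plus a pull toward the anchor theta0 (assumed AWD form)."""
    rng = rng or random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Antithetic pair: evaluate +eps and -eps to reduce estimator variance.
        f_plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
        f_minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
        for i in range(dim):
            grad[i] += (f_plus - f_minus) * eps[i] / (2 * sigma * pop)
    # Anchored decay: shrink (theta - theta0) rather than theta itself,
    # so directions the fitness does not constrain drift back to the anchor.
    return [t + lr * g - lr * lam * (t - t0)
            for t, g, t0 in zip(theta, grad, theta0)]
```

With a fitness that is flat in some direction, the gradient estimate there is pure noise, and the anchored term is the only force acting on it; that is the weakly constrained random-walk setting the abstract describes.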

2605.30144 2026-05-29 cs.AI cs.MA

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Aimin Zhou, Jingjing Qu, Jing Shao, Xiangfeng Wang

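AI summary Proposes AgentSchool, an LLM-driven multi-agent simulator that models learning with growable student agents (with knowledge graphs, thinking workflows, and misconceptions) and adaptive teacher agents (based on the Zone of Proximal Development), and supports multi-scale simulation; experiments show it generates differentiated mastery traces and behavior patterns consistent with classroom social theories.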

Comments 39 pages, 10 figures

Abstract

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

2605.30140 2026-05-29 cs.CV

AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection

Yi Zhang, Jiawen Zhu, Lele Fu, Guansong Pang

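AI summary Proposes AnomalyAgent, a training-free agentic framework built on multimodal large language models that performs zero-/few-shot anomaly detection with a customized toolset and memory module, outperforming existing methods in complex scenarios such as logical/contextual anomalies.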

Abstract

Benefiting from the generalizability of vision-language models (VLMs) such as CLIP, many zero-/few-shot anomaly detection (AD) approaches have achieved impressive detection performance across various datasets. Nevertheless, they require substantial training on large auxiliary datasets to adapt VLMs to anomaly detection, and their inference largely relies on anomaly scores based on visual-text embedding similarity, lacking the reasoning abilities needed to detect complex anomalies that require in-depth contextual understanding. To address this limitation, we propose AnomalyAgent, a novel training-free, agentic framework that leverages the advanced reasoning and generalization capabilities of multimodal large language models (MLLMs) for anomaly detection. The key ingredients include 1) a comprehensive anomaly-centric toolset that enables adaptive MLLM-driven, agentic anomaly reasoning in zero-shot settings, and 2) a customized memory module that grounds anomaly reasoning with few-shot, in-context reference examples. We extend evaluation beyond the detection of simple anomalies (e.g., surface defects like cracks and dents and clear lesions) in widely used benchmarks to more diverse types of anomalies such as logical/contextual anomalies in logistics and manufacturing settings. Extensive experimental results demonstrate that our AnomalyAgent achieves substantially better performance compared to training-free VLM-based AD and generic agentic methods, highlighting its superior generalization capability in both zero-shot and few-shot anomaly detection settings. The code implementation can be found at this address.

2605.30136 2026-05-29 cs.AI

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

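AI summary To address information dilution caused by long conversation histories in LLM multi-agent systems, proposes Agent-Radar, a training-free context management method that dynamically steers attention with a temporal and spatial decay mechanism, gaining up to 7.64 absolute points across five benchmarks.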

Abstract

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increase. Finally, the ablation study shows that the core components of Agent-Radar are crucial to performance and generalize across different settings.

2605.30135 2026-05-29 cs.LG cs.AI

DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning

Hyuck Lee, Taemin Park, Heeyoung Kim

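AI summary Proposes DAMEL, a dual-axis multi-expert learning algorithm that ensembles multiple experts along the representation and time axes to reduce both prediction bias and variance, effectively addressing class-imbalanced learning.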

Abstract

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called the dual-axis multi-expert learning (DAMEL), which reduces both bias and variance of predictions by using multiple experts along both representation and time axes. Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both bias and variance of predictions, highlighting its effectiveness in class-imbalanced learning.
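
The time-axis step, aggregating network weights across training epochs, can be sketched as a running average. The abstract does not say whether DAMEL uses a uniform average, an exponential moving average, or something else, so the uniform rule and the flat-list weight layout below are assumptions for illustration only.

```python
def update_running_average(avg_weights, new_weights, num_seen):
    """Fold one epoch's weights into a uniform running average.

    `num_seen` is how many weight snapshots the average already contains.
    """
    if avg_weights is None:
        return list(new_weights)
    return [a + (w - a) / (num_seen + 1)
            for a, w in zip(avg_weights, new_weights)]


# Hypothetical per-epoch weight snapshots of a two-parameter model.
epoch_weights = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
avg = None
for n, weights in enumerate(epoch_weights):
    avg = update_running_average(avg, weights, n)
# `avg` now holds the element-wise mean of the three snapshots.
```

At test time the averaged weights would be loaded in place of the final-epoch weights, which is what makes this an ensemble over time without extra inference cost.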

2605.30133 2026-05-29 cs.CL

CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution

Milan Straka

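AI summary Presents CorPipe 26, which jointly predicts empty nodes, mentions, and coreference links in a single model; it outperforms all other systems in the CRAC 2026 Shared Task on Multilingual Coreference Resolution, leading the LLM track by 2.8 percentage points and the unconstrained track by 9.5.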

Comments Accepted to CODI-CRAC 2026

Abstract

We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percentage points and all submissions in the unconstrained track by 9.5 percentage points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation. The source code and the trained models are publicly available at https://github.com/ufal/crac2026-corpipe.

2605.30132 2026-05-29 cs.LG stat.ML

Learning to Extrapolate to New Tasks: A Relational Approach to Task Extrapolation

Adam Ousherovitch, Yixin Wang

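AI summary Proposes the Relational Task Extrapolator (RTE), which decomposes a target task into an anchor task and a transformation and learns a relational operator, enabling systematic extrapolation to unseen tasks and substantially outperforming existing approaches in function prediction and sequence prediction.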

Comments ICML 2026

Abstract

Modern learning systems excel at interpolation but struggle to generalize to unseen tasks outside the training distribution's support. This failure occurs even in simple settings, such as handling task parameters beyond the training range, and persists despite advances in foundation models. To this end, we develop the Relational Task Extrapolator (RTE), an algorithm designed to enable systematic extrapolation to novel tasks. The key observation is that extrapolation is inherently relational: extrapolating to unseen tasks requires learning how tasks transform into one another. If a model learns the transformation between tasks A and B during training, it can apply that same transformation to relate known tasks to unseen ones at test time. RTE operationalizes this idea by decomposing each target task into a known anchor task and a transformation linking the anchor and target. It then learns a relational operator, mapping an anchor-transformation pair to predictions for the target task. We instantiate RTE across multiple task extrapolation regimes in function prediction, e.g. where target tasks use out-of-range parameters (parameter extrapolation), have greater compositional depth (length extrapolation), and/or recombine function primitives in unseen ways (compositional extrapolation). We further extend RTE to sequence prediction, integrating it into fine-tuning algorithms for foundation models. Across empirical studies, we find that RTE substantially outperforms existing approaches on extrapolation to novel, unseen tasks.

2605.30131 2026-05-29 cs.CL cs.CV

CCS: Clinical Consensus Selection for Radiology Report Generation

Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

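AI summary Proposes the CCS framework, which samples multiple candidate reports and selects the one with the highest clinical consensus to improve radiology report generation quality at inference time.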

Comments 17 pages, 6 figures

Abstract

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
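
The selection rule, picking the candidate that agrees most with the rest of the sampled pool, can be sketched as follows. Token-level Jaccard overlap is used purely as a stand-in utility; the paper combines text-based utilities with a utility from an image-report-trained embedder, neither of which is reproduced here. The function names and example reports are hypothetical.

```python
def jaccard(a, b):
    """Stand-in agreement utility: token-set overlap between two reports."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def select_by_consensus(candidates, utility=jaccard):
    """Return the candidate with the highest mean utility against the others.

    `candidates` must be a non-empty list of report strings.
    """
    if len(candidates) == 1:
        return candidates[0]

    def score(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=score)
    return candidates[best]


pool = [
    "no acute cardiopulmonary process",
    "large right pleural effusion",
    "no acute cardiopulmonary abnormality",
    "no acute abnormality",
]
chosen = select_by_consensus(pool)
```

Because the rule only needs sampled outputs and a pairwise utility, it leaves the decoder untouched, which matches the "decoder-agnostic" description in the abstract. Swapping `jaccard` for an embedding-based similarity changes the selection axis without changing the procedure.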

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

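AI summary Proposes PARCEL, a visual tokenization architecture that uses pool anchoring and conditioned elastic query resampling to resolve the conflict between spatial and query representations in visual-token compression, improving the performance-efficiency Pareto frontier across 27 benchmarks.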

Comments 33 pages, 4 figures

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

2605.30117 2026-05-29 cs.AI

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang, Jiayu Hu, Haozhe Shan, Han Dong, Jinpeng Lu, Yinda Chen, Yi Zhang, Yong Dai, Xiaozhu Ju

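AI summary Proposes VLA-Trace, a diagnostic framework that analyzes representation evolution, causal control attribution, and behavioral manifestation to reveal how VLA models turn multimodal knowledge into embodied control, and finds differences and limitations across models in finetuning adaptation, multimodal routing, and semantic following.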

Abstract

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $π_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
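
Centered kernel alignment compares two sets of activations recorded on the same inputs. The sketch below implements the common linear variant in plain Python; whether VLA-Trace uses linear or kernel CKA, and which layers or checkpoints it compares, is not stated in the abstract, so treat this as a generic illustration. Rows are examples and columns are features.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram(a, b):
    """Return a^T b for row-major matrices with the same number of rows."""
    return [[sum(x * y for x, y in zip(ca, cb)) for cb in zip(*b)]
            for ca in zip(*a)]


def frob_sq(m):
    """Squared Frobenius norm."""
    return sum(v * v for row in m for v in row)


def linear_cka(x, y):
    """Linear CKA: ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = center_columns(x), center_columns(y)
    num = frob_sq(cross_gram(xc, yc))
    den = math.sqrt(frob_sq(cross_gram(xc, xc)) * frob_sq(cross_gram(yc, yc)))
    return num / den
```

Linear CKA is invariant to isotropic scaling and to orthogonal transforms of the feature axes, so it can compare a layer before and after finetuning even when individual units have been permuted or rescaled.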

2605.30116 2026-05-29 cs.CV cs.LG

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu

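AI summary To address the costly training and conservative motion dynamics of distribution matching distillation for few-step video diffusion, proposes Score Gradient Matching Distillation (SGMD), which directly optimizes the fake score toward the teacher and uses teacher stop-gradient Fisher as a stable objective, achieving roughly 3x faster training and substantially better motion dynamics.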

Comments ICML 2026

Abstract

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately $3\times$ training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.

2605.30115 2026-05-29 cs.CV

Large Depth Completion Model from Sparse Observations

Zhu Yu, Zhengyi Zhao, Runmin Zhang, Lingteng Qiu, Kejie Qiu, Yisheng He, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen

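AI summary Proposes LDCM, which uses monocular foundation models and a Poisson-based depth initialization strategy together with a point map head that regresses 3D coordinates, achieving metric-accurate depth completion from sparse observations.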

Comments ICLR 2026. Project webpage: https://pkqbajng.github.io/ldcm/

Abstract

This work presents the Large Depth Completion Model (LDCM), a simple, effective, and robust framework for single-view metric depth estimation with sparse observations. Without relying on complex architectural designs, LDCM generates metric-accurate dense depth maps using a transformer. It outperforms existing approaches across diverse datasets and sparse observations. We achieve this from two key perspectives: (1) leveraging existing monocular foundation models to improve the quality of sparse depth inputs, and (2) reformulating training objectives to better capture geometric structure and metric consistency. Specifically, a Poisson-based depth initialization strategy is first introduced to generate a uniform coarse dense depth map from diverse sparse observations, providing a strong structural prior for the network. Regarding the training objective, we replace the conventional depth head with a point map head that regresses per-pixel 3D coordinates in camera space, enabling the model to directly learn the underlying 3D scene structure instead of performing pixel-wise depth map restoration. Moreover, this design eliminates the need for camera intrinsic parameters, allowing LDCM to naturally produce metric-scaled 3D point maps. Extensive experiments demonstrate that LDCM consistently outperforms state-of-the-art methods across multiple benchmarks and varying sparsity levels in both depth completion and point map estimation, showcasing its effectiveness and strong generalization to unseen data distributions.

2605.30112 2026-05-29 cs.LG

Striding Across Reynolds Numbers: Representation Geometry in Neural PDE Generalisation

跨越雷诺数:神经PDE泛化中的表示几何

Jianing Shi

AI总结 通过分析神经PDE求解器在跨雷诺数泛化中的表示几何,发现基于卷积自编码器的匹配方法(ConvAE-Relay)在无需目标域数据的情况下达到38.34%误差,揭示了局部多尺度表示对跨雷诺数迁移的关键作用。

Comments 12 pages, 8 figures, 5 tables

详情
AI中文摘要

神经PDE求解器中的跨雷诺数泛化仍然缺乏表征。在标准的强迫二维Navier-Stokes基准上,训练好的傅里叶神经算子在10倍雷诺数偏移下达到46.68%的相对L2误差,而零前向模型检索基线已经改进到41-42%。这表明表示几何是测试方法中的一个主要组织变量。我们通过ConvAE-Relay测试这一假设,该方法在源训练卷积自编码器潜在空间中匹配状态,并从源域数据库借用动力学,仅使用源域数据库且无需目标域拟合、标签或数据库条目,达到38.34+/-0.07%的误差。2x2消融实验将匹配质量隔离为优于更新规则的主导因素。Oracle实验证实,当匹配保持在流形上时,源域动力学方向仍然可迁移(余弦相似度~0.84);自回归漂移是主要瓶颈(约12个百分点)。从学习预测方面,具有多尺度跳跃连接的U-Net达到34.72+/-0.60%的误差,与检索方面的发现一致,即局部多尺度表示组织测试方法中的跨雷诺数迁移。所有结论均限于该基准。

英文摘要

Cross-Reynolds generalisation in neural PDE solvers remains poorly characterised. On the canonical forced 2D Navier-Stokes benchmark, a trained Fourier Neural Operator reaches 46.68% relative L2 error under a 10x Reynolds-number shift, yet zero-forward-model retrieval baselines already improve to 41-42%. This suggests representation geometry as a major organising variable among the tested methods. We test this hypothesis through ConvAE-Relay, which matches states in a source-trained convolutional autoencoder latent space and borrows dynamics from a source-regime database, achieving 38.34+/-0.07% using only a source-regime database and no target-regime fitting, labels, or database entries. A 2x2 ablation isolates matching quality as dominant over the update rule. Oracle experiments confirm that source-regime dynamics directions remain transferable (cosine similarity ~0.84) when matching stays on-manifold; autoregressive drift is the primary bottleneck (~12 percentage points). From the learned-prediction side, a U-Net with multi-scale skip connections achieves 34.72+/-0.60%, consistent with the retrieval-side finding that local, multi-scale representations organise cross-Reynolds transfer among tested methods. All claims are scoped to this benchmark.
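
For readers unfamiliar with the two quantities quoted above, the following is a minimal Python sketch of a relative L2 error and a cosine similarity between flattened fields. The function names are illustrative, and the abstract does not say how the paper normalises its metric (per snapshot, per trajectory, or averaged over rollouts), so this is one common definition and may differ from the benchmark's own.

```python
import math

def relative_l2_error(pred, target):
    """Return ||pred - target||_2 / ||target||_2 for two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    if den == 0.0:
        raise ValueError("target has zero norm")
    return num / den

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Under this definition, a reported error of 46.68% corresponds to `relative_l2_error(...)` returning about 0.4668.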

2605.30111 2026-05-29 cs.CV cs.AI

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

xModel-KD:基于LiDAR的3D场景感知跨模态知识蒸馏

Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

AI总结 提出跨模态知识蒸馏框架xModel-KD,通过对比学习对齐2D图像纹理与3D点云几何特征,在无额外标注下提升LiDAR点云分割性能。

Comments 3 figures and 5 tables

详情
AI中文摘要

点云分割是3D场景理解中的基础任务。其进展受到密集3D标注高成本和高时间的限制,导致标注样本难以获取。除了标注稀缺,不同感知模态面临固有局限性。2D图像提供丰富的纹理和外观线索,但缺乏明确的深度和几何结构。相比之下,3D点云捕捉精确的空间几何,但稀疏且不含纹理信息。因此,依赖单一模态限制了所学表示的丰富性并削弱了泛化能力。尽管最近结合3D点云与2D图像的多模态方法在分类和检索等任务中表现出色,但它们通常依赖大规模标注数据集,且尚未充分用于数据高效的密集预测。为解决这些限制,我们提出一种新颖的跨模态知识蒸馏框架xModel-KD,用于3D点云分割。我们的方法通过跨模态对齐学习统一的逐点表示,利用2D纹理和3D几何的互补优势。具体而言,我们设计了一个跨模态融合编码器,通过对比目标训练,强制多视图下对应的2D和3D表示之间的特征一致性。通过将强大的预训练骨干与有针对性的融合策略相结合,所提框架有效地将图像的外观线索迁移到几何感知的点特征中。实验结果表明,跨模态融合在mIoU上比仅使用LiDAR的基线实现了2%的绝对提升,证明了利用互补多模态信息进行可扩展和标注高效的3D场景理解的优势。

英文摘要

Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.

2605.30107 2026-05-29 cs.CL

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

Dial HEALTHDIAL for Advice: 一个用于知识驱动信息检索的多语言多平行口语对话数据集

Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen

AI总结 本文构建了HEALTHDIAL,一个大规模多语言多平行口语对话数据集,用于开发基于检索增强生成的口语对话系统,并揭示了不同语言间的性能差异。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

创建口语对话数据集在方法论上具有挑战性,当目标是构建大规模多语言多平行数据集时,这些挑战更加突出。本文介绍了HEALTHDIAL,一个用于开发和评估基于检索增强生成(RAG)的口语对话系统的大规模多语言多平行数据集。该数据集包含6,000个信息寻求对话(每种语言1,500个),这些对话基于世界卫生组织(WHO)的可信内容,以及来自四种WHO官方语言(阿拉伯语、中文、英语和西班牙语)的母语者录制的163小时用户语音。每个说话者都标注了人口统计学(如性别、年龄)和社会语言学(如主要语言、原籍地区)变量。我们报告了关键对话任务的基准结果,揭示了不同语言之间(即使是高资源语言)持续存在的性能差异。为支持未来研究,我们发布了该数据集、一个原型系统以及一个用于数据收集和系统评估的工具包。

英文摘要

Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.

2605.30104 2026-05-29 cs.CL

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

SEAL: 饱和基准能否通过LLM作为元裁判得以复兴?

Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li, Yansen Zhang, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

AI总结 提出SEAL协议,通过自适应LLM元裁判从饱和基准中提取潜在排名信号,在代码生成、数学推理等任务上以更少调用实现高排名准确率。

详情
AI中文摘要

广泛使用的语言模型基准日益饱和,前沿系统常获得标准指标无法区分的接近分数。我们不构建更难的替代方案,而是探究是否可以通过改进对相同候选输出的评估来使现有任务重新具有信息量。因此,我们提出了带自适应LLM元裁判的种子淘汰法,这是一种自我改进的评估协议,用于从饱和基准中提取潜在排名信号。SEAL将候选输出种子化为单淘汰赛,并通过任务级原则和自改进检查表标准评估每场比赛。我们在涵盖代码生成、数学推理、知识密集型问答和工具使用智能体任务完成的多个饱和基准上评估SEAL。在这些设置中,SEAL改善了排名准确性与延迟之间的权衡,与完全成对评判相比达到了0.83-1.00的Spearman一致性和4/4的top-1一致性,同时每个任务仅需11.89次调用,而完全成对评估需要28.00次。

英文摘要

Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. We therefore present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge (SEAL), a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single-elimination bracket and evaluates each match with task-level principles plus self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the ranking-accuracy--latency trade-off over competing protocols, attaining 0.83--1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.
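
The call counts in this abstract can be put in context with a small Python sketch that compares the number of matches in a full pairwise comparison with a plain single-elimination bracket. Everything here is an illustrative assumption: the abstract does not state how many candidates each task has (28 happens to equal the number of unordered pairs among 8 candidates), and `run_bracket` is a generic bracket with a caller-supplied `judge`, not SEAL's seeding or meta-judging procedure.

```python
def pairwise_matches(n: int) -> int:
    """Number of unordered pairs among n candidates."""
    return n * (n - 1) // 2

def single_elimination_matches(n: int) -> int:
    """Matches needed to leave one winner: each match removes one candidate."""
    return max(n - 1, 0)

def run_bracket(candidates, judge):
    """Run a single-elimination bracket over candidates in seeded order.

    judge(a, b) returns the winner of one match (hypothetical callback).
    Returns (winner, number_of_judge_calls).
    """
    current = list(candidates)
    if not current:
        raise ValueError("need at least one candidate")
    calls = 0
    while len(current) > 1:
        next_round = []
        # Pair adjacent entries; each pair costs one judge call.
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(current[i], current[i + 1]))
            calls += 1
        # With an odd number of entries, the last one advances on a bye.
        if len(current) % 2 == 1:
            next_round.append(current[-1])
        current = next_round
    return current[0], calls
```

For 8 candidates this gives 28 pairwise matches against 7 bracket matches. The reported 11.89 calls per task is higher than 7, so the protocol evidently spends calls on more than bare bracket matches, but the abstract does not break that figure down.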

2605.30103 2026-05-29 cs.LG

Convergence Theory for Iterative LLM-Based Neural Architecture Search: A Parametric Cross-Entropy Framework with Closed-Form Proxy Reliability

基于迭代式LLM的神经架构搜索的收敛理论:一个具有闭式代理可靠性的参数化交叉熵框架

Santosh Premi Adhikari, Radu Timofte, Dmitry Ignatov

AI总结 将迭代式LLM-NAS建模为参数化交叉熵方法,证明了收敛性、精英集概率几何收敛、增量生成有效性、MinHash-Jaccard去重防止模式崩溃以及代理可靠性闭式公式,并通过实验验证了理论预测。

Comments 14 pages, 2 figures, 2 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作迭代式神经架构搜索(NAS)中的生成器,然而这类算法尚无正式的收敛理论。我们将迭代式LLM-NAS建模为在可执行程序上的参数化交叉熵(CE)方法,并证明了六个结果:(1)在精英架构上的迭代式LLM微调等价于限制在LLM参数族内的CE更新;(2)期望架构质量在循环间单调非减;(3)精英集概率以几何速率C_t >= 1-(1-rho_0)^t收敛到不动点;(4)在一阶马尔可夫令牌误差模型下,基于增量的生成比全代码生成实现严格更高的有效生成率;(5)MinHash-Jaccard新颖性过滤器防止模式崩溃;(6)代理可靠性具有闭式形式rho_S = (6/pi) arcsin(rho_P(SNR)/2),从而得出实际诊断条件sigma^2_arch >> sigma^2_noise作为基于代理的可靠排名的必要条件。在22个循环、三个LLM、六个数据集、3300个生成架构的实验中,定量验证了两个预测,在效应方向层面验证了两个预测,并解释了先前经验观察到但未得到解释的代理可靠性天花板效应。

英文摘要

Large language models (LLMs) are increasingly used as generators in iterative neural architecture search (NAS), yet no formal convergence theory exists for this class of algorithms. We model iterative LLM-NAS as a parametric Cross-Entropy (CE) method over executable programs and prove six results: (1) iterative LLM fine-tuning on elite architectures is equivalent to the CE update restricted to the LLM parametric family; (2) expected architecture quality is monotonically non-decreasing across cycles; (3) elite-set probability converges to a fixed point at a geometric rate C_t >= 1-(1-rho_0)^t; (4) delta-based generation achieves a strictly higher valid-generation rate than full-code generation under a first-order Markov token-error model; (5) the MinHash-Jaccard novelty filter prevents mode collapse; (6) proxy reliability admits the closed-form rho_S = (6/pi) arcsin(rho_P(SNR)/2), yielding the practical diagnostic sigma^2_arch >> sigma^2_noise as a necessary condition for trustworthy proxy-based rankings. Testing against a 22-cycle, three-LLM, six-dataset experiment with 3,300 generated architectures confirms two predictions quantitatively, two at direction-of-effect level, and explains the proxy-reliability ceiling effect previously reported empirically but left unexplained.
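
The two closed-form expressions quoted in the abstract are easy to evaluate numerically. The Python sketch below does only that: the function names are illustrative, and the dependence of rho_P on the signal-to-noise ratio is not given in the abstract, so `rho_p` is taken as an input instead of being derived.

```python
import math

def elite_probability_lower_bound(rho_0: float, t: int) -> float:
    """Lower bound 1 - (1 - rho_0)^t on elite-set probability after t cycles."""
    if not 0.0 <= rho_0 <= 1.0:
        raise ValueError("rho_0 must lie in [0, 1]")
    if t < 0:
        raise ValueError("t must be non-negative")
    return 1.0 - (1.0 - rho_0) ** t

def spearman_from_pearson(rho_p: float) -> float:
    """Evaluate rho_S = (6 / pi) * arcsin(rho_P / 2)."""
    if not -1.0 <= rho_p <= 1.0:
        raise ValueError("rho_p must lie in [-1, 1]")
    return (6.0 / math.pi) * math.asin(rho_p / 2.0)

if __name__ == "__main__":
    for t in (1, 5, 22):
        print(t, round(elite_probability_lower_bound(0.1, t), 4))
    for rho_p in (0.3, 0.6, 0.9):
        print(rho_p, round(spearman_from_pearson(rho_p), 4))
```

The second formula maps rho_P = 0 to 0 and rho_P = 1 to 1, and is monotone in between, so a proxy whose Pearson correlation is capped by noise has its rank correlation capped too.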

2605.30100 2026-05-29 cs.LG

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

Chess-World-Model: 一个用于从国际象棋走棋序列精确状态跟踪的1000万对局基准

Benjamin Walker, Terry Lyons

AI总结 提出一个基于1000万真实国际象棋对局的大规模状态跟踪基准,通过预测合法走棋序列后的棋盘状态,测试模型学习转换规则的能力,并发现循环模型优于Transformer,且随机均匀分布子集能揭示规模掩盖的失败。

Comments 20 pages, 4 figures

详情
AI中文摘要

世界模型需要状态跟踪,即跨动作序列维持正确潜在状态的能力。现有基准通常是合成或基于语言的,限制了它们作为结构化状态更新测试在现实领域中的价值。我们引入了Chess-World-Model,一个基于1000万真实国际象棋对局构建的大规模状态跟踪基准,其中模型预测经过一系列合法走棋后达到的精确棋盘状态。除了一个留出的真实对局子集外,我们还包含一个来自均匀随机合法走棋的分布外子集,用于测试模型是否学习转换规则而非来自常见人类走法的捷径。先前的理论和实证工作表明,Transformer难以进行状态跟踪,而输入依赖的线性RNN需要表达性强的状态转换矩阵才能做到。因此,我们在匹配的接口和训练协议下,对因果Transformer、块对角SLiCE、Mamba-3和具有负特征值的Gated DeltaNet进行了基准测试。在300万和800万参数下,循环模型显著优于Transformer。真实对局性能在1800万参数以上饱和,但随机均匀子集在4000万参数下仍具有区分性,暴露了规模掩盖的失败。此外,消融实验表明,对于所有三种循环模型,表达性较弱的状态转换机制会降低分布外子集的性能。这些结果共同确立了Chess-World-Model作为一个实用的大规模状态跟踪基准,能够暴露模型规模原本会掩盖的失败。

英文摘要

World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state-track, while input-dependent linear RNNs require expressive state-transition matrices to do so. We therefore benchmark a causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models. Together, these results establish Chess-World-Model as a practical large-scale benchmark for state tracking that exposes failures model scale would otherwise conceal.

2605.30099 2026-05-29 cs.CV

Evaluation of Conversational Agents: Understanding Culture, Context and Environment in Emotion Detection

对话代理评估:理解情感检测中的文化、背景与环境

Martha Teiko Teye, Yaw Marfo Missah, Emmanuel Ahene, Twum Frimpong, Auxane Boch

AI总结 针对黑人非洲社会,提出结合语音和图像数据、使用3层CNN和AFME算法的情感预测模型,准确率85%-96%,并识别讽刺,提升对话AI情感识别系统的可信度。

Comments IEEE paper on arxiv

详情
Journal ref
IEEE Access 10 (2022) 24976-24984; Erratum: IEEE Access (2022) 35900-35900
AI中文摘要

现在,有价值决策和高度优先分析依赖于面部生物识别、社交媒体照片标记和人机交互等应用。然而,成功部署这些应用的能力取决于它们在考虑可能边缘情况下的测试用例效率。多年来,已经实施了大量通用解决方案来模仿人类情感,包括讽刺。然而,地理位置或文化差异等因素在其解决伦理问题和改进对话AI(人工智能)的相关性中尚未得到充分探索。在本文中,我们旨在解决在黑人非洲社会中对话AI使用的潜在挑战。我们开发了一个情感预测模型,准确率在85%到96%之间。我们的模型结合了语音和图像数据来检测七种基本情感,并特别关注识别讽刺。它使用了3层卷积神经网络,并结合了一种新的音频帧平均表情(AFME)算法,重点放在模型的预处理和后处理阶段。最后,我们的解决方案有助于维护对话AI中情感识别系统的可信度。

英文摘要

Valuable decisions and high-priority analyses now depend on applications such as facial biometrics, social media photo tagging, and human-robot interaction. However, successfully deploying such applications depends on how well they perform on tested use cases, including possible edge cases. Over the years, many generalized solutions have been implemented to mimic human emotions, including sarcasm. However, factors such as geographical location and cultural difference have not been fully explored, despite their relevance to resolving ethical issues and improving conversational AI (Artificial Intelligence). In this paper, we seek to address the potential challenges in the use of conversational AI within Black African society. We develop an emotion prediction model with accuracies ranging between 85% and 96%. Our model combines speech and image data to detect the seven basic emotions, with an additional focus on identifying sarcasm. It uses a 3-layer convolutional neural network together with a new Audio-Frame Mean Expression (AFME) algorithm and focuses on the model's pre-processing and post-processing stages. Overall, our proposed solution contributes to maintaining the credibility of emotion recognition systems in conversational AI.

2605.30094 2026-05-29 cs.AI cs.GT

PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers

PokerSkill: 无需训练或求解器,大语言模型可达到专家级扑克水平

Boning Li, Baoxiang Wang, Longbo Huang

AI总结 提出PokerSkill框架,通过规则驱动的技能库约束大语言模型动作,无需训练或求解器即可在扑克中达到接近GTO水平的性能。

Comments 45 pages, 3 figures

详情
AI中文摘要

扑克是人工智能的一个标志性挑战。主流方法依赖于基于反事实遗憾最小化的均衡求解器,需要数百万核心小时的训练。大语言模型(LLMs)拥有广泛的扑克知识,但当被要求直接游戏时,其表现远低于基于求解器的智能体。传统的基于规则的扑克智能体是可解释且无需训练的,但其策略上限仍远低于均衡玩法。我们提出了PokerSkill,一个无需训练且无需求解器的框架,通过使用详细的基于规则的扑克技能作为LLMs的结构化动作基础接口来弥合这一差距。一个确定性上下文引擎分析当前状态,并从完全由人类扑克专家设计的分层技能库中仅检索相关片段,将LLM的选择限制在合理动作内。针对最先进的GTO基准GTOWizard,使用PokerSkill的GPT-5.5 XHigh达到-57 ± 21 mbb/hand,Claude Opus 4.6达到-80 ± 29 mbb/hand,Claude Opus 4.7达到-87 ± 64 mbb/hand,相比默认提示基线减少了49-61%的损失,并优于强机器人Slumbot。我们的关键发现是,仅靠基于规则的技能不足以构成强大策略,仅靠LLM也无法良好游戏,但它们的结合产生了一个既不需要训练也不需要求解器访问,却能媲美基于数百万核心小时计算构建的系统的智能体。据我们所知,这是首次证明LLM在复杂不完美信息游戏中无需特定游戏训练或求解器查询即可达到竞争性能。代码可在https://github.com/lbn187/PokerSkill获取。

英文摘要

Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. We introduce PokerSkill, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves -57 ± 21 mbb/hand, Claude Opus 4.6 achieves -80 ± 29 mbb/hand, and Claude Opus 4.7 achieves -87 ± 64 mbb/hand, reducing losses by 49--61% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.

2605.30093 2026-05-29 cs.CV

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

几何至关重要:用于学习语义对应的3D基础先验

Artur Jesslen, Olaf Dünkel, Adam Kortylewski

AI总结 提出一种3D感知的后训练框架,利用3D基础模型(SAM3D)估计物体几何和姿态,生成几何感知特征图,结合DINO和Stable Diffusion特征,通过测地距离过滤候选对应,训练轻量适配器改进语义对应。

Comments 9 pages (main paper), 21 pages (total), 4 figures

详情
AI中文摘要

来自自监督视觉模型和文本到图像扩散模型的基础特征已被证明对语义对应估计有效。然而,由于这些特征主要从2D图像目标学习,它们缺乏明确的3D意识,并且常常混淆对称物体侧面、重复部分以及在3D中不同的视觉相似结构。我们引入了一个3D感知的后训练框架,通过结合3D基础模型的先验,超越了现有的2D基础特征。给定一张图像,我们的方法使用SAM3D估计物体几何和姿态,并通过渲染-比较优化来细化姿态。随后,我们根据估计的物体姿态,将重建几何中的PartField描述符渲染到图像平面。由此产生的几何感知特征图补充了DINO和Stable Diffusion特征,而重建形状上的测地距离能够可靠地过滤候选对应。我们使用过滤后的匹配作为监督,在DINO和Stable Diffusion之上训练一个轻量适配器用于语义对应。与之前需要姿态标注并依赖粗略球形几何的后训练方法相比,我们的方法自动获得实例特定的3D结构,并用它来指导对应学习。实验表明,我们的方法改进了语义对应,同时减少了人工几何监督。代码和模型可在 https://github.com/GenIntel/3D-SC 获取。

英文摘要

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision. Code and model can be found at https://github.com/GenIntel/3D-SC.

2605.30090 2026-05-29 cs.CL cs.CV

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

DirectorBench: 通过个性化多智能体评估诊断长视频生成

Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

AI总结 提出DirectorBench,一种基于多智能体的诊断基准,通过80个结构化元数据、7个用户画像和40个检查点标准,在脚本、视觉、音频、跨模态和稳定性五个维度上评估长视频生成,并定位瓶颈和用户偏好依赖。

详情
AI中文摘要

长视频生成正从短的单场景合成快速转向分钟级、多镜头的创作,具有叙事结构、电影控制、音频和跨模态同步。然而,评估此类视频仍然具有挑战性,因为现有基准主要关注局部视觉质量、短期时间一致性或通用提示对齐,并且对工作流故障和用户依赖偏好的诊断有限。我们引入了DirectorBench,一个用于长视频生成的个性化多智能体诊断基准。DirectorBench根据80个结构化元数据、7个用户画像和40个检查点标准,在脚本、视觉、音频、跨模态和稳定性五个维度上评估生成的视频。DirectorBench不将质量简化为单一聚合分数,而是定位检查点级别的瓶颈并支持画像感知评估。我们评估了4个长视频生成工作流、6个基础LLM和7个用户画像。在不同工作流中,DirectorBench揭示了一个单元间瓶颈:过渡质量平均仅为0.256,最佳工作流达到0.356,而提示级别的用户需求满足度平均为0.71。我们进一步进行了14名标注者的人工评估,以验证DirectorBench与人类判断的一致性。结果表明,DirectorBench捕捉到了人类可感知的质量差异,并揭示了聚合评分所隐藏的工作流和画像依赖的故障模式。这些发现强调了长视频生成中诊断性和画像感知基准的重要性。

英文摘要

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

2605.30089 2026-05-29 cs.LG

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

推理时元素损坏下的分布鲁棒集合表示学习

Yankai Chen, Hanrong Zhang, Bowei He, Philip S. Yu, Xue Liu

AI总结 针对推理时元素损坏问题,提出SW-DRSO分布鲁棒优化框架,通过重心对抗近似最坏情况损失,在四个任务上验证了鲁棒性和性能。

Comments Accepted by ICML'26

详情
AI中文摘要

标准集合表示学习方法通常在精心整理的数据上表现良好,但往往忽略了推理时元素损坏的挑战。这指的是部署模型遇到元素级别的退化(如异常值或缺失组件)时,可能扭曲集合表示并降低性能。我们提出了SW-DRSO,一个专门为集合设计的分布鲁棒优化框架。SW-DRSO不是仅最小化观测训练数据上的损失,而是优化一个关于一系列合理推理时变体的最坏情况期望损失的可处理替代项。我们引入了一个重心对抗,通过可微的训练时优化单纯形权重来近似对损坏集合的难以处理的搜索。在四个任务上的大量实验表明,SW-DRSO在保持高整体性能的同时,有效增强了对损坏的鲁棒性。

英文摘要

Standard Set Representation Learning methods typically excel on curated data but often overlook the challenge of inference-time element corruption. This refers to scenarios where deployed models encounter element-level degradations, such as outliers or missing components, that may distort set representation and degrade performance. We propose SW-DRSO, a distributionally robust optimization framework tailored for sets. Rather than minimizing loss solely on observed training data, SW-DRSO optimizes a tractable surrogate of the worst-case expected loss over a family of plausible inference-time variations. We introduce a barycentric adversary that approximates the intractable search over corrupted sets by a differentiable training-time optimization over simplex weights. Extensive experiments across four tasks demonstrate that SW-DRSO effectively enhances robustness against corruption while maintaining high overall performance.

2605.30087 2026-05-29 cs.AI

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

冲突多源个人记忆上的选择性问答:诊断性测试平台与方法比较

Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

AI总结 针对多源冲突记忆的选择性问答问题,构建了包含34,560个实例的诊断基准,评估了多种方法,发现结构化融合方法在准确性和选择性上优于纯提示LLM。

Comments 55 pages, 5 figures

详情
AI中文摘要

新兴的个人AI代理正朝着持久、多源记忆的方向发展。这带来了一个评估问题:系统必须决定如何使用冲突或不完整的证据;它们不能仅从一个干净的历史中检索事实。现有的基准很少能显示错误是来自提供给方法的证据还是来自方法的冲突解决步骤。我们将此研究为冲突多源个人记忆上的选择性问答:系统基于冲突的、有时不完整的来源进行回答,或者在证据不足时放弃回答。我们开发了一个基准,包含8种推理类型下的18个问题模板、480个角色、4个随机种子和34,560个实例,具有受控的来源扭曲和确定性的真实答案。我们评估了无法访问任何来源、访问单一来源、结构化融合方法以及前沿LLM的基线性能。最佳训练融合解析器达到80.3%的准确率,而最强的纯提示LLM基线达到70.0%。在允许弃权的情况下,同一解析器在78.3%的覆盖率下达到85.3%的选择性准确率,最佳LLM在95.4%的覆盖率下达到71.0%的选择性准确率。不同模型在不同推理类型上具有不同的优势。我们发布了数据、代码、缓存的模型输出以及数据生成过程以供复用。

英文摘要

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
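
The selective-QA numbers above pair an accuracy with a coverage. A minimal Python sketch of those two quantities follows, using the usual definitions (coverage is the fraction of questions answered, selective accuracy is accuracy over answered questions only). The function name and the use of `None` to mark an abstention are illustrative assumptions, and the paper's exact scoring rules may differ.

```python
def selective_metrics(predictions, gold):
    """Return (selective_accuracy, coverage).

    predictions: one entry per question; None means the system abstained.
    gold: the reference answers, same length as predictions.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        # No answers given: accuracy over an empty set is reported as 0.0 here.
        return 0.0, coverage
    correct = sum(1 for p, g in answered if p == g)
    return correct / len(answered), coverage

# The reported instance count matches a full cross of the listed factors.
TEMPLATES, PERSONAS, SEEDS = 18, 480, 4
TOTAL_INSTANCES = TEMPLATES * PERSONAS * SEEDS  # 34,560
```

The product 18 x 480 x 4 = 34,560 equals the stated benchmark size, which is consistent with every template being instantiated for every persona and seed.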

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML

Conformal Certification of Reasoning Trace Prefixes

推理轨迹前缀的保形认证

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

AI总结 提出CROP方法,通过保形校准选择阈值,返回最长无错前缀,并控制错误包含概率,平衡保留有效推理与丢弃误导后缀。

Comments Code available at https://github.com/matthewyccheung/crop

详情
AI中文摘要

语言模型推理轨迹很少是全有或全无;在关键错误发生之前,它们通常包含有效的中间步骤。现有的不确定性量化方法通常认证最终答案或整个响应,未能为顺序轨迹中可安全保留的比例提供统计保证。为了解决这个问题,我们引入了CROP(保形推理输出前缀),一种与验证器无关的校准程序,用于干净前缀认证。给定任何步骤级风险代理,CROP选择一个校准阈值,并返回其步骤风险代理保持低于该阈值的最长连续前缀,将未认证的后缀路由到下游审查或修复。假设可交换性,CROP严格控制了返回前缀包含注释错误的边际概率。在六个过程标记的推理数据集上,我们证明了标准步骤级指标(如AUROC)不能完全捕捉前缀效用,建议验证器应改为通过认证前缀长度进行评估。此外,CROP平衡了过度保留和不足保留,通过保留有效的中间推理同时丢弃误导后缀,提高了下游修复的准确性。最终,这项工作将前缀认证定位为过程监督、弃权和修复之间的严格、实用的桥梁。

英文摘要

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
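
The thresholding idea described above can be illustrated with a short Python sketch. It is a generic split-conformal construction written for this digest, not the authors' released code: the score definition, the function names, and the use of a 0-based `first_error` index (or `None` for an error-free trace) are all assumptions. A trace's score is the largest step risk up to and including its first annotated error, so a returned prefix contains that error exactly when the score is strictly below the threshold.

```python
import math

def clean_prefix_length(risks, tau):
    """Number of leading steps whose risk proxy stays strictly below tau."""
    length = 0
    for r in risks:
        if r >= tau:
            break
        length += 1
    return length

def trace_score(risks, first_error):
    """Max risk over steps 0..first_error; infinity if the trace has no error."""
    if first_error is None:
        return math.inf
    return max(risks[: first_error + 1])

def calibrate_threshold(calibration_traces, alpha):
    """Pick tau as the k-th smallest calibration score, k = floor(alpha * (n + 1)).

    calibration_traces: list of (risks, first_error) pairs.
    Under exchangeability, a new trace's score falls strictly below tau
    with probability at most k / (n + 1) <= alpha.
    """
    scores = sorted(trace_score(r, e) for r, e in calibration_traces)
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little data for this alpha: certify nothing
    return scores[k - 1]
```

For example, with calibration scores 0.4, 0.5, 0.9 and infinity and alpha = 0.5, k is 2 and the threshold is 0.5, so a test trace with step risks [0.1, 0.2, 0.6, 0.1] keeps its first two steps and routes the rest for review.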

2605.30083 2026-05-29 cs.CV

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

未来强制:自回归视频生成中无需训练的未来感知KV缓存策略

Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen

AI总结 提出Future Forcing,一种无需训练的未来感知KV缓存策略,通过利用自回归视频模型中查询分布的平稳性来估计未来查询,从而改进长视频生成的一致性。

详情
AI中文摘要

自回归(AR)视频生成已成为长时域视频合成的一种有前景的范式,其中每一帧的生成基于先前生成的令牌。为了加速推理,使用KV缓存避免跨生成步骤的冗余重计算。然而,随着生成长度的增长,KV缓存会引入越来越多的内存和误差累积,限制了AR模型扩展到更长序列的可扩展性。现有的KV缓存压缩方法通过选择性地保留被认为重要的视频令牌来缓解这一问题。然而,大多数现有方法使用从当前或历史生成上下文中提取的短时域信号来评估令牌重要性,这使得这些方法容易忽略在早期步骤中看似不重要但后来对未来帧至关重要的令牌。在这项工作中,我们识别了训练好的AR视频模型的一个重要性质:尽管RoPE调制的查询在自回归步骤中演变,但底层的规范预RoPE查询分布在视频生成过程中保持显著稳定。这种近似平稳性意味着未来查询分布可以从历史统计中估计,从而无需额外训练即可实现原则性的未来感知缓存决策。基于这一洞察,我们提出了Future Forcing,一种用于AR视频生成的无需训练的未来感知KV缓存策略。具体来说,Future Forcing首先从历史统计中构建未来查询代理,然后根据该代理下的重要性对KV缓存令牌进行评分,最后在未来查询诱导的仿射子空间内合并冗余令牌对。大量实验表明,Future Forcing在有限的KV缓存下改善了长时域一致性,在VBench-Long上针对60秒生成,与现有的AR视频KV缓存策略相比,主体一致性提升了高达1.49。

英文摘要

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation steps. However, the cache grows with generation length, which increases memory use and error accumulation and limits how far AR models scale to longer sequences. Existing KV cache compression methods mitigate this issue by selectively retaining only video tokens deemed important. Most of these methods, however, assess token importance using short-horizon signals derived from the current or historical generation context, making them prone to overlooking tokens that appear unimportant at early steps but later become critical for future frames. In this work, we identify an important property of trained AR video models: although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Building on this insight, we propose Future Forcing, a training-free future-aware KV cache policy for AR video generation. Specifically, Future Forcing first constructs a future query proxy from historical statistics, then scores KV cache tokens by their importance under this proxy, and finally merges redundant token pairs within the affine subspace induced by the future query. Extensive experiments show that Future Forcing improves long-horizon consistency under limited KV caches, achieving up to 1.49 improvement in subject consistency on VBench-Long for 60s generation over existing AR video KV cache policies.

2605.30080 2026-05-29 cs.CL

Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

自适应目标动态分块用于无分词层次模型

Thang Dang, Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata

AI总结 提出自适应目标动态分块(ATDC)机制,通过课程学习动态调整压缩比,以优化无分词层次模型的字节压缩效果,在FineWeb-Edu 100B数据集上实现竞争性的每字节比特数性能,并提升训练稳定性和下游任务表现。

详情
AI中文摘要

无分词层次模型正成为传统大型语言模型(LLM)的有前途替代方案,解决了词汇设计复杂性、词汇外(OOV)错误和语言特定约束等固有预处理问题。然而,这些字节级方法的一个重大挑战是压缩比的优化,这是决定模型通过分块处理字节数据性能的关键因素。在本文中,我们提出自适应目标动态分块(ATDC),一种新颖的字节压缩控制机制,旨在增强层次架构中动态分块的有效性。我们的方法利用课程学习在训练过程中逐步调整压缩比,从低压缩过渡到高压缩以稳定学习过程。我们提供分析,建立了目标压缩比与每最内层分块字节数(BPIC)之间的关系,从而能够在整个训练阶段跟踪分块大小的演变。在FineWeb-Edu 100B数据集上进行的评估表明,配备ATDC的层次模型在每字节比特数(BPB)性能上与在字节和词元级别上运行的常规基线相比具有竞争力。此外,与使用固定压缩比的模型相比,所提出的方法在多种下游任务中表现出更稳定的训练动态和更优的最终性能,同时保持了字节级处理的固有鲁棒性和灵活性。

英文摘要

Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor in how well the model performs when byte data is processed in chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach uses curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), which makes it possible to track chunk-size evolution throughout training. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.