arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.29971 2026-05-29 cs.CL

Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning

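Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank

AI summary: Proposes a method for causal intervention on continuous variables that localizes a low-dimensional direction and edits vectors toward counterfactual target values; applied to verb bias, it shows that the feature is causally represented in language models and examines its relationship to in-context learning.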

Abstract

Causal interventions in language model representations have largely targeted discrete features, like grammatical number. However, language models must also make use of features that are graded. We introduce a method for causal intervention on continuous variables: given activation vectors paired with a graded target variable, we localize a low-dimensional direction for that variable and use this direction to edit vectors toward counterfactual target values. We apply this method to a continuous feature that is well-studied in psycholinguistics, namely verb bias (which reflects which syntactic structures tend to follow a given verb). We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. Verb bias has also previously been linked to in-context learning; in further analyses, we find that steering vectors encode error signals that could drive the error-driven update behavior seen in in-context learning but that these aspects of the steering vectors are not causally used in downstream production. Overall, these results show that causal interventions can be applied to continuous variables, though connecting continuous variables to in-context learning remains a challenge.
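The editing step described in this abstract can be illustrated with a minimal sketch. It assumes a single linear direction estimated from the covariance between activations and the graded variable, which is a simplification chosen here for illustration and not necessarily the authors' localization procedure. The names fit_direction, readout, and edit_toward are hypothetical.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def fit_direction(vectors, targets):
    """Estimate one unit direction for a graded variable (assumed linear)."""
    n, dim = len(vectors), len(vectors[0])
    mean_vec = [sum(v[j] for v in vectors) / n for j in range(dim)]
    mean_y = sum(targets) / n
    # Covariance between each coordinate and the target gives the raw direction.
    raw = [
        sum((v[j] - mean_vec[j]) * (y - mean_y) for v, y in zip(vectors, targets))
        for j in range(dim)
    ]
    norm = sqrt(dot(raw, raw))
    unit = [r / norm for r in raw]
    # Fit target ~ slope * score + intercept on the 1-D projections.
    scores = [dot(unit, [x - m for x, m in zip(v, mean_vec)]) for v in vectors]
    mean_s = sum(scores) / n
    var_s = sum((s - mean_s) ** 2 for s in scores)
    slope = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, targets)) / var_s
    intercept = mean_y - slope * mean_s
    return unit, mean_vec, slope, intercept


def readout(vec, model):
    """Predicted value of the graded variable for one vector."""
    unit, mean_vec, slope, intercept = model
    score = dot(unit, [x - m for x, m in zip(vec, mean_vec)])
    return slope * score + intercept


def edit_toward(vec, target, model):
    """Move vec along the direction until its readout equals target."""
    unit, _, slope, _ = model
    shift = (target - readout(vec, model)) / slope
    return [x + shift * u for x, u in zip(vec, unit)]


# Toy activations paired with a graded variable (made-up numbers).
vectors = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [3.0, 1.5]]
targets = [0.1, 0.3, 0.5, 0.7]
model = fit_direction(vectors, targets)
edited = edit_toward(vectors[0], 0.9, model)
```

Only the component along the fitted direction changes, so in this sketch everything orthogonal to it is left untouched by the edit.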

2605.29966 2026-05-29 cs.AI

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

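Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou, Jing Zhang

AI summary: To address marine lead data being scattered across unstructured papers, proposes Compass, an expert-guided LLM agent framework that decomposes tasks with a Knowledge Tree; it extracts 3,751 lead records from 230,000 papers to build the largest marine lead database, with 92% accuracy.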

Abstract

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

2605.29965 2026-05-29 cs.AI

Meta-Programming for Linear-time Temporal Answer Set Programming

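Susana Hahn, Amade Nems, Javier Romero, Torsten Schaub

AI summary: Proposes a unified meta-programming framework that extends clingo's theory grammar and introduces a transformation pipeline to protect nested modalities, operationalizing the semantics of several linear-time temporal logics (TEL, MEL, DEL), and develops the metasp system.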

Abstract

The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.

2605.29955 2026-05-29 cs.AI

Formalizing Mathematics at Scale

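Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

AI summary: Proposes AutoformBot, a multi-agent system that uses LLMs and formal verification tools to automatically translate informal textbooks into verifiable Lean 4 code, building the Atlas formal library of over 45,000 declarations and 500,000 lines of code.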

Abstract

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

2605.29954 2026-05-29 cs.CV

SwInception -- Local Attention Meets Convolutions

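David Hagerman, Roman Naeem, Jakob Lindqvist, Carl Lindström, Fredrik Kahl, Lennart Svensson

AI summary: Proposes the SwInception architecture, which strengthens inductive bias by introducing Inception blocks into the feed-forward layers of the Swin Transformer and modifies the decoder to capture fine details with fewer parameters, improving segmentation performance on multiple medical datasets.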

Comments International Conference on Pattern Recognition and Artificial Intelligence, 2024

Abstract

Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on small datasets. To mitigate this weakness, we propose a novel architecture that further enhances Swin's inductive bias by introducing Inception blocks in the feed-forward layers. The introduction of these multi-branch convolutions enables more direct reasoning over local, multi-scale features within the transformer block. We have also modified the decoder layers in order to capture finer details using fewer parameters. We demonstrate a performance improvement on eleven different medical datasets through extensive experimentation. We specifically showcase advancements over the previous state-of-the-art backbones on benchmark challenges like the Medical Segmentation Decathlon and Beyond the Cranial Vault. By showing that the existing inductive bias in Swin can be further improved, our work presents a promising avenue for enhancing the capabilities of sparse vision transformers for both medical and natural image segmentation tasks. Code and pre-trained weights can be accessed at https://github.com/Eiphodos/SwInception.

2605.29953 2026-05-29 cs.CV

Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball

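Li Yin, Qin Haobin, Tomohiro Suzuki, Calvin Yeung, Mariko Isogawa, Keisuke Fujii

AI summary: Proposes MAEM, a training-free framework that uses a monocular 3D human mesh recovery model and a two-stage epipolar matching strategy to address occlusion and appearance similarity in multi-view multi-person 3D pose estimation for team sports.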

Abstract

Multi-view multi-person 3D pose estimation in team sports scenarios remains challenging due to player occlusions, appearance similarity caused by team uniforms, and the scarcity of annotated multi-view data, all of which limit the effectiveness and generalization capability of learning-based methods. In contrast, the performance of training-free approaches is inherently constrained by the accuracy of 2D keypoint detection and the robustness of cross-view association. To address these challenges, we propose Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation. Our method employs a monocular 3D human mesh recovery model as the frontend and introduces a two-stage epipolar matching strategy based on the recovered mesh outputs. Specifically, the proposed framework combines disjoint-set-union-based clustering with per-joint triangulation to achieve robust cross-view association and accurate 3D pose reconstruction. Experiments on two public multi-view basketball datasets demonstrate that MAEM consistently outperforms existing training-free association baselines while achieving competitive RGB-only performance in both indoor and outdoor basketball scenarios. MAEM achieves MPJPE/PA-MPJPE scores of 59.8/40.7 mm on SportCenter EPFL and 74.0/51.8 mm on Human-M3 Basketball, highlighting the effectiveness of dense mesh geometry for cross-view association without requiring target-domain training or fine-tuning.
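The disjoint-set-union clustering mentioned in this abstract can be sketched as follows. This is a generic union-find illustration, not the authors' two-stage pipeline: the pairwise costs stand in for an epipolar distance that would be computed elsewhere, and the names cluster_detections, pair_costs, and max_cost are hypothetical.

```python
def find(parent, item):
    """Return the cluster representative of item, compressing the path."""
    while parent[item] != item:
        parent[item] = parent[parent[item]]
        item = parent[item]
    return item


def union(parent, a, b):
    """Merge the clusters that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def cluster_detections(detections, pair_costs, max_cost):
    """Group (view, index) detections whose cross-view cost is small enough."""
    parent = {d: d for d in detections}
    for (a, b), cost in pair_costs.items():
        # Only link detections seen from different cameras.
        if a[0] != b[0] and cost <= max_cost:
            union(parent, a, b)
    groups = {}
    for d in detections:
        groups.setdefault(find(parent, d), []).append(d)
    return sorted(sorted(g) for g in groups.values())


# Toy example: five detections across three cameras (made-up costs).
detections = [("cam0", 0), ("cam0", 1), ("cam1", 0), ("cam1", 1), ("cam2", 0)]
pair_costs = {
    (("cam0", 0), ("cam1", 1)): 0.5,
    (("cam1", 1), ("cam2", 0)): 0.8,
    (("cam0", 1), ("cam1", 0)): 0.4,
    (("cam0", 0), ("cam1", 0)): 9.0,
}
clusters = cluster_detections(detections, pair_costs, max_cost=1.0)
```

Each resulting cluster would then be passed to per-joint triangulation. A plain union-find can transitively merge two detections from the same view, so a real system needs an extra consistency check that this sketch omits.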

2605.29952 2026-05-29 cs.LG

From Short Histories to Long Futures: Horizon-Aware Graph Neural Networks for Long Horizon Forecasting

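Zesheng Liu, Maryam Rahnemoonfar

AI summary: Proposes a multi-horizon graph neural network emulator that jointly optimizes multi-step-ahead predictions through a shared graph backbone and an increment prediction strategy, yielding stable and accurate long-horizon emulation of geophysical systems.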

Comments Accepted for International Conference on Pattern Recognition (ICPR) 2026

Abstract

Accurate long-range prediction of geophysical systems is difficult due to strongly nonlinear dynamics, the high computational cost of full-physics simulations, and the error accumulation that arises when one-step autoregressive surrogates are rolled out over decades. Deep neural networks can serve as efficient emulators, but most are trained only for next-step prediction and often drift or become unstable as the forecast horizon grows. We propose a multi-horizon graph neural network emulator that learns state-to-state transitions from a single current time to multiple future lead times within one unified model. The physical domain is represented as a graph, where nodes correspond to spatial locations with time-varying geophysical attributes and edges encode local spatial interactions. Given the current graph state, the model predicts the future evolution of key fields (ice thickness and ice velocities) at all nodes, using a shared graph backbone with separate output branches for each target variable. To improve stability, the network predicts state increments relative to the current state, which are then added back to reconstruct future states. Training jointly optimizes all lead times with a unified regression objective, and inference uses a coarse-to-fine rollout that advances with larger jumps and selectively refines with shorter jumps to reduce drift and avoid redundant computation. Experiments on multi-decadal Pine Island Glacier simulations show that our approach achieves higher long-range accuracy and improved stability than both (i) an initial-state baseline that predicts each future time directly from the starting state and (ii) a standard single-step autoregressive rollout, producing a more reliable emulator for downstream climate and sea-level studies.
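One possible reading of the increment prediction and the coarse-to-fine rollout is sketched below. The greedy jump schedule and the toy predictor are assumptions made for illustration; the abstract does not specify how jumps are chosen, and plan_jumps, rollout, and predict_increment are hypothetical names.

```python
def plan_jumps(horizon, lead_times):
    """Cover the horizon with the largest available jumps first."""
    jumps = []
    remaining = horizon
    for lead in sorted(lead_times, reverse=True):
        while remaining >= lead:
            jumps.append(lead)
            remaining -= lead
    if remaining != 0:
        raise ValueError("horizon cannot be reached with the given lead times")
    return jumps


def rollout(state, horizon, lead_times, predict_increment):
    """Advance the state by adding predicted increments, jump by jump."""
    for lead in plan_jumps(horizon, lead_times):
        delta = predict_increment(state, lead)
        # The model predicts a change relative to the current state.
        state = [s + d for s, d in zip(state, delta)]
    return state


def toy_increment(state, lead):
    """Stand-in for a trained emulator: thickness drops 0.5 per year."""
    return [-0.5 * lead, 0.0]


schedule = plan_jumps(27, [1, 5, 10])
final_state = rollout([100.0, 2.0], 27, [1, 5, 10], toy_increment)
```

With lead times of 1, 5, and 10 years, a 27-year forecast takes five model calls instead of the 27 a single-step autoregressive rollout would need, which is the source of the reduced error accumulation the abstract describes.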

2605.29951 2026-05-29 cs.AI cs.CL cs.LG cs.MM

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

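Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales, Vera Demberg

AI summary: To address the weakness of vision-language models in reasoning about implicit cross-modal harmful semantics, proposes the MuPHI dataset and the MuPHIRM training framework, which learns joint semantics by optimizing multi-perspective rewards, improving harm detection, reasoning quality, and out-of-distribution robustness.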

Abstract

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

2605.29940 2026-05-29 cs.AI

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

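Zhenlin Hu, Yan Wang, Zhen Bi, Zihao Xue, Bingyu Zhu, Longtao Huang, Xiongtao Zhang, Zeyu Yang, Zhixuan Chu, Jungang Lou

AI summary: Proposes the StreamSynth setting and the SynLearner framework, which let a model accumulate experience over a task stream and use feedback to improve synthetic data generation.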

Abstract

Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most existing studies treat synthesis as a set of isolated tasks and overlook a more fundamental question: whether a model can learn to synthesize by accumulating experience from past tasks and transferring it to future ones. In this work, we introduce StreamSynth, a new setting in which synthesis tasks arrive sequentially and experience from historical tasks provides informative signals for future synthesis. To address this setting, we propose SynLearner, a general framework that enables synthesis models to acquire reusable synthesis experience over a task stream. Instead of generating data independently for each task, SynLearner encourages the model to explore diverse synthesis patterns, learn from feedback, and balance sample quality with set-level diversity as tasks evolve. Extensive experiments across multiple benchmarks show that SynLearner effectively leverages experience from earlier tasks to improve synthesis performance on later ones, exhibiting consistent cross-task transferability. These findings provide evidence for the feasibility of StreamSynth and highlight synthetic data generation as an experience-driven process that can benefit from task streams.

2605.29937 2026-05-29 cs.RO cs.LG

Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control

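Hao Ren, Zetong Bi, Yiming Zeng, Le Zheng, Zhi Li, Zhaoliang Wan, Lu Qi, Hui Cheng

AI summary: Proposes a training-free Fisher-preserving guidance method that computes Fisher-preserving updates via a low-rank Jacobian factorization and uses Truncated Fisher Denoising Sensitivity as an uncertainty signal, achieving reliable and efficient trajectory prediction in visual navigation.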

Comments ICML2026

Abstract

Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test-time guidance can produce unreliable or inefficient trajectories when updates drift off the training manifold. We propose Fisher Preserving Guidance with Outer Product Span Projection, a training-free inference method that avoids large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.

2605.29935 2026-05-29 cs.CV cs.AI

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

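Zezhong Qian, Zhao Yang, Lu Tan, Zhihao Yan, Weiyi Hong, Haizhuang Liu, Yawei Jueluo

AI summary: Proposes CityGen, a diffusion-based generative framework that achieves zero-label city adaptation through HD-map conditioning and city-level visual prompts, improving the robustness of cross-city autonomous driving on perception, segmentation, and planning tasks.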

Abstract

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.

2605.29933 2026-05-29 cs.LG

CLUBench: A Clustering Benchmark

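Feng Xiao, Dazhi Fu, Chris Ding, Jicong Fan

AI summary: Presents CLUBench, a comprehensive clustering benchmark of 24 algorithms on 131 datasets; large-scale experiments analyze hyperparameter tuning, data types, pretrained embeddings, and LLM-based clustering, among other factors, showing that conventional algorithms remain competitive and that combining them with pretrained embeddings improves efficiency.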

Abstract

Clustering is a fundamental problem in data science with a long-standing research history, yielding numerous insightful algorithms. Despite this progress, a systematic and large-scale empirical evaluation that jointly considers conventional algorithms, deep learning-based methods, and recent foundation model-based clustering remains largely absent, leading to limited guidance on algorithm selection and deployment. To address this gap, we introduce CLUBench, a comprehensive clustering benchmark comprising 24 algorithms of diverse principles evaluated on 131 datasets across tabular, text, and image data, involving 178,815 experiments. Importantly, our analyses of (i) the impact of hyperparameter tuning, (ii) the impact of data types and characteristics, (iii) the impact of pretrained embeddings, (iv) large language model-based clustering, (v) the similarity of algorithms, and (vi) the low-rank structures of performance matrices, yield meaningful insights and promising pathways for clustering research. For instance, our study reveals that: 1) None of the evaluated deep clustering methods exhibits a significant advantage over the top-performing conventional clustering algorithms (e.g., KMeans, SpeClu) in terms of average performance; 2) For image and text clustering tasks, combining pretrained embeddings with conventional clustering algorithms (e.g., KMeans, SpeClu) offers effective and efficient clustering; 3) Clustering remains a challenging and nontrivial problem, even in the era of increasingly dominant foundation models. Moreover, we propose to use the low-rank structure in cross-model performance matrices to efficiently approximate the overall performance evaluation in practical applications. We further demonstrate the feasibility of model selection based on the performance matrices across all hyperparameter configurations.
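The idea of exploiting low-rank structure in a performance matrix can be illustrated with a small sketch. It assumes a rank-1 model fitted by alternating least squares over the observed cells, which is one simple choice and not necessarily the approximation used in CLUBench. The function rank1_complete and the toy scores are hypothetical.

```python
def rank1_complete(matrix, iterations=50):
    """Fit score[i][j] ~ u[i] * v[j] on observed cells and fill every cell.

    Rows are algorithms, columns are datasets, None marks a skipped run.
    """
    rows, cols = len(matrix), len(matrix[0])
    u = [1.0] * rows
    v = [1.0] * cols
    for _ in range(iterations):
        # Update each row factor by least squares with column factors fixed.
        for i in range(rows):
            seen = [j for j in range(cols) if matrix[i][j] is not None]
            u[i] = sum(matrix[i][j] * v[j] for j in seen) / sum(v[j] ** 2 for j in seen)
        # Update each column factor by least squares with row factors fixed.
        for j in range(cols):
            seen = [i for i in range(rows) if matrix[i][j] is not None]
            v[j] = sum(matrix[i][j] * u[i] for i in seen) / sum(u[i] ** 2 for i in seen)
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


# Toy scores built as an exact rank-1 table; one run was never executed.
algorithm_strength = [0.5, 0.8, 0.6]
dataset_ease = [1.0, 0.9, 0.5]
scores = [[a * d for d in dataset_ease] for a in algorithm_strength]
scores[1][2] = None  # the true value would be 0.8 * 0.5 = 0.4
completed = rank1_complete(scores)
```

In practice a performance matrix is only approximately low-rank, so the estimate for a skipped run would carry error that this toy example, built to be exactly rank-1, does not show.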

2605.29932 2026-05-29 cs.LG cs.CV

Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression

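Danylo Boiko, Viktoriia Mishkurova

AI summary: Proposes a treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year, significantly outperforming the baseline in clinical fidelity.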

Comments 9 pages, 5 figures, 1 table

Abstract

Forecasting the progression of neurodegenerative diseases, such as Parkinson's disease, is essential for effective long-term planning and personalized therapeutic intervention. Existing systems typically produce scalar clinical scores that ignore the rich structure of longitudinal neuroimaging, while traditional generative approaches suffer from a loss of anatomical detail and blurring of subtle progression patterns. To address this, we introduce a novel treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year. The pipeline uses a Transformer-based encoder to represent non-linear, time-dependent pharmacological dynamics and optimizes generation through a multi-weight region-of-interest mask that focuses on biologically critical areas. Experimental evaluation shows that our framework maintains sharp anatomical boundaries and significantly improves clinical fidelity relative to the baseline, achieving 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM.

2605.29931 2026-05-29 cs.AI eess.AS

It's All About Speed: AI's Impact on Workflow in Music Production

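Finn McClellan, Fabio Morreale

AI summary: Through an ethnographic study, examines how AI and automated tools affect music production workflow, focusing on the experiences and attitudes of recording engineers, mixers, and producers, and analyzes the tensions among speed, controllability, and creative agency and how they can be alleviated.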

Comments Audio Engineering Society Conference Paper - Presented at the AES International Conference on Machine Learning and Artificial Intelligence for Audio 2025 - September 8-10, London, UK

Abstract

In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focusing specifically on professional participants who identified as recording engineers, mixers, and producers, we discuss their usage of common AI and automated software, as well as their sentiments on the proliferation of these tools. We discuss tensions that may be created between users and automated tools in key areas such as the need for speed and efficiency, controllability, and maintaining creative agency, and how these tensions may be alleviated through tool design.

2605.29927 2026-05-29 cs.CL cs.AI cs.LG

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

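Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

AI summary: Proposes the PlanAhead framework and, through automatic difficulty classification and a comparison of four plan representations (sequential subgoals, narrative, pseudocode, checklist), finds that the plan representation and the LLM generating the plan significantly affect web-agent robustness and task success.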

Comments Extended version of paper submitted to EMNLP, waiting for acceptance

Abstract

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. We then systematically evaluate 4 different plan representations (sequential subgoals, narrative, pseudocode, and checklist) on the tasks categorized as hard, across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.

2605.29926 2026-05-29 cs.LG

A Triple-Modal Contrastive Learning Framework with Sequence, Graph, and 3D Features for Drug-Target Interaction Prediction

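Le Xu, Xi Zhang, Dan Luo, Ting Wang, Xuan Lin

AI summary: Proposes the TriMod-DTI framework, which fuses 1D sequences, 2D graphs, and 3D structures of drugs and proteins and uses a triple-modal contrastive learning strategy to align latent-space representations, improving drug-target interaction prediction.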

Comments 12 pages, 5 figures, ISBRA 2026

Abstract

Accurate prediction of drug-target interactions (DTI) is critical for drug discovery. Existing methods often rely on single-modal representations (e.g., sequences or graphs) or combine only two modalities, overlooking 3D structural features. To address this challenge, we propose TriMod-DTI, a triple-modal contrastive learning framework that incorporates 1D sequences, 2D graphs, and 3D structures of drugs and proteins, obtaining universal and complementary feature representations for DTI prediction. We design a Feature Extractor to capture drug and target features across the three modalities, thereby enriching their representations. We further propose a triple-modal contrastive learning strategy to align the different modality representations of the same drug or protein in the latent space. By constructing cross-modal positive and negative sample pairs, this approach enhances the model's discriminative ability. Experiments on three benchmark datasets demonstrate that TriMod-DTI outperforms state-of-the-art methods. The ablation studies validate the contributions of each modality. Moreover, case studies highlight its practical potential for DTI prediction and drug discovery.
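The cross-modal alignment described in this abstract can be sketched with a standard InfoNCE-style loss averaged over the three modality pairs. This is an assumed formulation for illustration: the abstract does not give the exact loss, and info_nce, triple_modal_loss, and the toy embeddings are hypothetical.

```python
from math import exp, log, sqrt


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(p * q for p, q in zip(a, b))
    return num / (sqrt(sum(p * p for p in a)) * sqrt(sum(q * q for q in b)))


def info_nce(anchors, candidates, temperature=0.1):
    """Mean loss where candidates[i] is the positive for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + log(sum(exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def triple_modal_loss(seq, graph, struct, temperature=0.1):
    """Average the pairwise losses over the three modality pairs."""
    pairs = [(seq, graph), (seq, struct), (graph, struct)]
    return sum(info_nce(a, b, temperature) for a, b in pairs) / len(pairs)


# Toy embeddings for two molecules in three modalities (made-up numbers).
seq = [[1.0, 0.0], [0.0, 1.0]]
graph = [[0.9, 0.1], [0.1, 0.9]]
struct = [[1.0, 0.2], [0.2, 1.0]]
aligned = triple_modal_loss(seq, graph, struct)
misaligned = triple_modal_loss(seq, graph[::-1], struct)
```

Embeddings of the same molecule act as positives and embeddings of other molecules in the batch act as negatives, so swapping the graph embeddings between molecules raises the loss.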

2605.29919 2026-05-29 cs.AI cs.MA

On the Geometry of Games and their Solvers

论博弈及其求解器的几何结构

Yaqi Sun, Julian Ma, David Mguni

AI总结 提出一种结构感知的求解器合成框架,通过学习连续求解器对齐的博弈几何表示,实现自适应均衡计算并揭示求解器行为的连续区域。

详情
AI中文摘要

博弈论和生成对抗网络等学习系统中的一个核心挑战是理解哪些算法能够在异质博弈景观中高效计算均衡。均衡计算通常按求解器和博弈类别分别研究,产生了强局部保证但碎片化的求解器行为视图。现有的离散分类法往往无法完整解释算法成功的原因。我们通过一个将博弈与有效求解器动力学联系起来的求解器-博弈映射来研究这一问题。经典理论识别出该映射的孤立区域,但对中间或重叠区域提供的见解有限,表明可解性由定义连续求解器对齐博弈几何的潜在结构属性控制。我们通过结构感知的求解器合成来形式化这一视角。一个学习到的结构识别器将每个博弈映射到低维求解器对齐表示,一个策略将该表示映射到有效的原始机制,从而跨区域调整求解器行为。这揭示了特定求解器动力学有效的区域,以及需要原始机制混合而非单一主导求解器的区域。一个有界残差充当局部校正器和诊断信号,用于不完整的求解器基或表示。该框架同时产生自适应求解器和分析视角:具有相似优化动力学的博弈聚类在一起,揭示了算法有效性的连续区域和重叠的求解器行为。实验表明,固定原始机制表现出系统性的区域不匹配,而学习到的表示将博弈空间组织成与求解器行为对齐的结构化地图。这些结果表明,应将均衡计算视为学习求解器机制和映射可解性几何的联合问题。

英文摘要

A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.

2605.29911 2026-05-29 cs.LG cs.CV

Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation

通过逐像素生成图像插值减少空间推进薄膜冷却分析中的实验测试

Adam T. Müller, Philipp J. Teuffel, Konstantin Manassis, Nicolaj C. Stache

AI总结 提出一种基于轻量级前馈神经网络和位置编码的机器学习方法,从稀疏实验测量中进行图像回归,以减少推进系统薄膜冷却研究中的物理测试需求。

Comments Presented at the 11th European Conference for Aeronautics and Aerospace Sciences (EUCASS), 2025, DOI: 10.13009/EUCASS2025-285

详情
AI中文摘要

我们提出了一种从稀疏实验测量中进行图像回归的机器学习方法。我们展示了该方法在推进系统开发中薄膜冷却研究中的应用,旨在减少对大量物理测试的需求。我们的方法采用带有位置编码的轻量级前馈神经网络,根据输入参数生成图像。在真实和合成数据上的验证表明,该方法在减少30%测量量的同时,实现了高图像相似度(RMSE < 8%,SSIM > 93%)。我们进一步提出了一种知识驱动的扩展,用于生成图像的局部适应性。该方法显著减少了所需测试次数,同时保持了高质量数据,从而能够高效优化冷却剂喷射器配置,其应用范围超越航空航天领域。

英文摘要

We propose a machine learning approach for image regression from sparse experimental measurements. We apply the method to film cooling studies in propulsion system development, aiming to reduce the need for extensive physical testing. Our method employs a lightweight feed-forward neural network with positional encoding to generate images conditioned on input parameters. Validated on real and synthetic data, it achieves high image similarity (RMSE < 8%, SSIM > 93%) while maintaining accuracy with a 30% reduction in measurements. We further propose a knowledge-informed extension for local adaptability of the generated images. This approach significantly reduces the number of required tests while preserving high-quality data, enabling efficient optimization of coolant injector configurations with applications beyond aerospace.
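
The abstract mentions positional encoding and pixelwise generation without giving details. The sketch below shows one common way such an input could be built, assuming a sinusoidal (Fourier-feature) encoding of normalized pixel coordinates concatenated with the test parameters. The encoding form, the number of frequencies, and the names `positional_encoding` and `pixel_input` are assumptions for illustration, not the authors' implementation.

```python
import math

def positional_encoding(x, num_frequencies=4):
    """Map a scalar in [0, 1] to [sin(2^k*pi*x), cos(2^k*pi*x)] for k = 0..num_frequencies-1."""
    features = []
    for k in range(num_frequencies):
        angle = (2 ** k) * math.pi * x
        features.extend([math.sin(angle), math.cos(angle)])
    return features

def pixel_input(row, col, height, width, params, num_frequencies=4):
    """Build one per-pixel network input: encoded (row, col) plus the raw test parameters.

    Assumes height > 1 and width > 1 so the coordinates can be normalized to [0, 1].
    """
    y = row / (height - 1)
    x = col / (width - 1)
    return (
        positional_encoding(y, num_frequencies)
        + positional_encoding(x, num_frequencies)
        + list(params)
    )

# Hypothetical 32x32 image and two made-up operating parameters.
features = pixel_input(row=0, col=31, height=32, width=32, params=[0.5, 1.2])
```

A feed-forward network would then map each such feature vector to one pixel value, so a full image is produced by evaluating the network once per pixel for a given parameter setting.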

2605.29900 2026-05-29 cs.LG cs.IT math.IT

OVA-IB: One vs All Information Bottleneck for Multi-Modal Alignment

OVA-IB:用于多模态对齐的一对多信息瓶颈

Tianchao Li, Shujian Yu, Xinrui Zu, Zhaolong Wei, Jeremy Gummeson, Jack C. P. Cheng, Robert Jenssen

AI总结 提出基于信息瓶颈的一对多对齐框架OVA-IB,通过充分性对比下界和最小性正则化实现任意数量模态的对齐,在分类、回归和跨模态检索任务中表现鲁棒。

详情
AI中文摘要

对比学习对于对齐配对视图或模态是有效的,但超出两个模态的对齐仍然具有挑战性且相对未被充分探索。成对的CLIP风格损失将多模态对齐分解为独立的双向比较,因此没有显式建模多个模态之间的高阶依赖关系。最近的超越成对目标从统计或几何角度处理这个问题,但任意模态对齐仍然缺乏一个原则性的标准来定义每个模态相对于其他模态应该保留和压缩什么。我们通过信息瓶颈原则重新审视任意模态对齐。在多模态学习中,充分性应保留可从其余模态预测的信息,而最小性应压缩不被其余模态支持的模态特定信息。这自然导致一对多视角,其中每个模态相对于其余模态进行表征。我们提出OVA-IB,一个用于任意模态对齐的信息瓶颈框架。OVA-IB优化一个可处理的一对多对比下界用于充分性,该下界与双总相关风格目标相连,使用无参数的几何感知投影分数,并通过用其余模态诱导的表示分布来约束每个表示对其自身输入的依赖,导出一个可处理的上界正则化器用于最小性。在分类、回归、模态无关评估和跨模态检索基准上的实验展示了强大且鲁棒的性能。

英文摘要

Contrastive learning is effective for aligning paired views or modalities, but alignment beyond two modalities remains non-trivial and comparatively underexplored. Pairwise CLIP-style losses decompose multi-modal alignment into independent two-way comparisons and therefore do not explicitly model higher-order dependencies among multiple modalities. Recent beyond-pairwise objectives approach this problem from statistical or geometric perspectives, but arbitrary-modality alignment still lacks a principled criterion for defining what each modality should preserve and compress relative to the others. We revisit arbitrary-modality alignment through the Information Bottleneck principle. In multi-modal learning, sufficiency should preserve information predictable from the remaining modalities, while minimality should compress modality-specific information not supported by them. This naturally leads to a One-vs-All view, where each modality is characterized with respect to the remaining modalities. We propose OVA-IB, an Information Bottleneck framework for arbitrary-modality alignment. OVA-IB optimizes a tractable One-vs-All contrastive lower bound for sufficiency connected to a Dual Total Correlation-style objective, uses a parameter-free geometry-aware projection score, and derives a tractable upper-bound regularizer for minimality by bounding each representation's dependence on its own input with representation distributions induced by the remaining modalities. Experiments on classification, regression, modality-agnostic evaluation, and cross-modal retrieval benchmarks demonstrate strong and robust performance.

2605.29897 2026-05-29 cs.CL

ExCAM: Explainable Cultural Awareness Metrics

ExCAM:可解释的文化意识度量

Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei, Hideki Tanaka, Masao Utiyama, Steffen Eger

AI总结 提出ExCAM,首个可识别、评分并解释指令-输出对中文化错误的专用评估度量,在平衡测试集上达到80%准确率。

Comments preprint

详情
AI中文摘要

评估大型语言模型的文化意识对于确保生成文本的公平性和应用在全球范围内的泛化能力至关重要。最近的基准通过问答或文本生成任务探索食物等文化物品或压力情境下的行为等价值观。然而,创建这些基准需要耗时且昂贵的人工标注。此外,评估自由文本中文化意识的基准很少,且往往依赖过时的评估机制。为弥补这一空白,我们引入了ExCAM,一种可解释的文化意识度量,据我们所知,这是第一个专门用于识别、评分和解释指令-输出对中文化错误的评估度量。为了训练和评估ExCAM,我们引入了ExCAM40k,一个由九个现有基准组成的数据集,我们对其进行了重新格式化并增加了合成错误。与包括GPT-5在内的多个基线相比,ExCAM在平衡测试集上实现了高达80%的最高错误检测准确率。因此,ExCAM为自由文本的细粒度、可解释的文化评估开辟了道路。

英文摘要

Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods like food or values like behavior in stressful situations through the lens of question answering or text generation tasks. However, creating these benchmarks requires time-intensive and costly human annotations. Moreover, benchmarks that evaluate cultural awareness in free text are scarce and often rely on dated evaluation mechanisms. To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates, and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset composed of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set. Therefore, ExCAM opens the pathway towards fine-grained and explainable cultural evaluation of free text.

2605.29894 2026-05-29 cs.CV

Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

训练智能体而非专家:学习利用异构专家进行多轮视觉推理

Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

AI总结 提出VisHarness,一种可训练的视觉智能体,通过解耦高层感知推理与低层任务执行,学习利用异构视觉专家模型,以轻量训练实现多轮交互下的通用视觉任务求解。

详情
AI中文摘要

计算机视觉的最新进展产生了大量用于检测、分割、计数和其他视觉任务的强大专用模型。然而,这些模型通常针对孤立的任务形式进行优化,使得直接支持通用视觉智能变得困难,尤其是当任务需要复杂的语言理解和密集的小物体感知时。在本文中,我们提出了VisHarness,一种可训练的视觉智能体,它将高层感知、推理和决策与低层任务执行解耦。VisHarness不是训练模型来解决特定的视觉任务,而是学习利用一组精心设计的异构视觉专家。这种范式保留了智能体的通用智能,同时充分利用了专用视觉模型在具体视觉任务中的精度优势。仅通过轻量训练,VisHarness就能学习到可泛化的视觉专家利用策略,并通过与视觉专家模型的多轮交互,在各种复杂条件下解决常见的基础视觉任务。为了在实时环境中实现高效的在策略强化学习训练,我们引入了动态视觉记忆归档,这缓解了与视觉专家模型多轮交互导致的快速累积的视觉令牌开销。在涵盖推理分割、广义指代分割、密集小物体检测和指代计数的四个代表性基准上的实验表明,VisHarness显著优于现有的通用模型,并与任务专用模型相比取得了具有竞争力或更优的性能。

英文摘要

Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.

2605.29893 2026-05-29 cs.AI

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

冗余还是必要?检测智能体轨迹中冗余步骤的基准

Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han

AI总结 针对LLM智能体轨迹中的冗余步骤检测问题,提出RedundancyBench基准,包含标注轨迹的数据集,并评估三种方法,发现最佳方法仅达到24.88%的检测分数。

详情
AI中文摘要

基于LLM的智能体通过多步推理和工具使用在解决复杂任务方面表现出强大的能力。然而,现有的评估协议主要关注任务成功,忽略了智能体行为的一个关键方面:执行效率。在实践中,智能体轨迹通常包含冗余步骤,这些步骤消耗大量资源但对任务完成贡献甚微。在这项工作中,我们提出并定义了一个新的研究领域:智能体轨迹的冗余步骤检测。为了支持这一倡议,我们引入了RedundancyBench,这是一个新的基准,包含多样化的任务和精心标注的轨迹,其中每个步骤根据其对任务完成的贡献进行标记。利用RedundancyBench,我们开发并评估了3种代表性方法,以回答轨迹中的步骤是冗余还是必要的问题。我们的结果表明,即使是最优方法在检测冗余步骤方面也仅达到24.88%的分数,而有些方法的表现甚至不如随机猜测。这些结果突显了该任务的复杂性以及在该领域进一步研究的必要性。本文的代码和数据集均可在 https://anonymous.4open.science/r/RedundancyBench 获取。

英文摘要

LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate three representative methods to answer whether a step within a trajectory is redundant or necessary. Our results show that even the best-performing method achieves a score of only 24.88% in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. Code and dataset are both available at https://anonymous.4open.science/r/RedundancyBench.

2605.29891 2026-05-29 cs.CV

DVSM: Decoder-only View Synthesis Model Done Right

DVSM: 正确的仅解码器视图合成模型

Cheng Sun, Jaesung Choe, Min-Hung Chen, Ryo Hachiuma, Yu-Chiang Frank Wang

AI总结 提出仅解码器架构DVSM,通过隐式KV-cache表示场景,在相同渲染复杂度下以更少参数超越编码器-解码器变体,并利用共享权重、基础模型先验和分阶段块大小优化效率与质量,在多个基准上实现新视点合成的最优结果。

Comments Code at https://github.com/NVLabs/dvsm

详情
AI中文摘要

近期的大型视图合成模型(LVSMs)倡导一种编码器-解码器架构,将重建和渲染分离到不同的网络中。我们重新审视了这种设计。通过控制实验,我们表明仅解码器架构(将场景隐式表示为KV-cache)在相同渲染复杂度下使用更少参数,性能优于编码器-解码器变体。进一步分析表明,在颜色输入重建网络和仅相机渲染网络之间共享权重,能更好地对齐同一视点下的特征,从而促进图像合成。基于这一发现,我们的模型DVSM进一步结合了基础模型先验和分阶段块大小调整,以改进效率与质量的权衡。我们的结果在多个基准上为新颖视图合成设立了新的最先进水平,在某些情况下,甚至在密集输入视图下优于每场景优化的3DGS。

英文摘要

Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.

2605.29889 2026-05-29 cs.CL cs.AI

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

内部表示,而非临床知识:明显的大语言模型分诊失败源于何处

David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky

AI总结 本研究通过稀疏自编码器特征分析,发现大语言模型在分诊任务中表现不佳源于输出格式限制,而非临床知识表示缺陷。

Comments 9 pages main text, 27 pages total including appendices; 7 figures, 25 tables

详情
AI中文摘要

患者语音临床分诊基准报告显示,在受限的多选输出中,消费级大语言模型存在较高的分诊不足率,但同样的案例在自由文本中得分不同。我们探究输出格式是否改变了模型的临床表示,还是仅改变了从保留表示到答案的映射。使用Gemma 3 4B/12B IT和Qwen3-8B中的稀疏自编码器(SAE)特征,我们发现相同的医学特征在两种格式下对共享临床叙述激活,但在所有模型的每个案例的多选决策标记处变得沉默。三种独立方法(自然语言自编码器言语化、决策标记logit归因和顶部特征表征)一致认为,驱动决策logit的是支架和格式特征,而非医学特征。行为上,多选惩罚在结构化和自然语言输入下均反转,选项顺序洗牌排除了位置偏差,且差距主要由偏差一个决策(模型选择与黄金答案相邻的敏锐度字母)主导,而非知识失败。因此,失败源于输出格式,而非临床表示。

英文摘要

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs under constrained multiple-choice output, yet the same cases score differently with free-text output. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find that the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in every case for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, option-order shuffling rules out positional bias, and the gap is dominated by off-by-one decisions (the model picks an acuity letter adjacent to the gold answer) rather than knowledge failures. Thus, the failure originates in the output format and not in the clinical representation.

2605.29888 2026-05-29 cs.LG cs.AI

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

LaRA: 面向RL后训练中数据污染的逐层表示分析

Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter, Jaehyung Kim

AI总结 提出LaRA框架,通过逐层表示分析检测强化学习后训练中的污染数据,利用扰动敏感性、方向坍缩和局部表示刚性三个指标,优于现有输出级方法。

Comments Work in Progress

详情
AI中文摘要

强化学习(RL)后训练已被证明能提升大型语言模型(LLMs)的推理能力。然而,关于RL后训练中数据污染问题的探索很少,这可能损害训练过程本身的泛化能力和评估可靠性。现有的检测方法主要依赖于输出级信号,如似然或熵,这对于RL训练的模型变得不可靠,因为RL通过轨迹级奖励而非token似然来塑造行为。我们提出LaRA,一个用于检测RL后训练LLMs中数据污染的逐层表示分析框架。LaRA引入了三个互补指标,测量受控扰动下的扰动敏感性、方向坍缩和局部表示刚性。我们发现污染会在各层产生渐进式的几何偏差,包括放大的扰动敏感性、更强的方向坍缩和增强的局部刚性。基于我们的发现,我们还开发了一个污染检测协议,聚合跨层和跨指标的表示级偏差。在RL训练推理模型上的实验表明,我们的协议在污染检测方面优于现有的输出级基线。

英文摘要

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of data contamination in RL post-training, which potentially undermines the generalization and evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics that measure perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Based on our findings, we also develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.

2605.29886 2026-05-29 cs.CL cs.AI

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

CRITIC-R1: 学习结构化评论用于检索增强生成

Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun, Runhua Xu, Jianxin Li

AI总结 提出CRITIC-R1框架,通过强化学习将RAG评论建模为结构化错误诊断问题,设计保守判断对齐和诊断质量对齐奖励函数,提升检索增强生成的答案质量。

Comments 17 pages, 13 figures

详情
AI中文摘要

检索增强生成(RAG)通过引入外部证据改进了知识密集型问答。然而,现有的RAG方法仍然存在幻觉和细微推理错误。最近的研究引入外部评论来优化RAG输出,但它们通常提供粗粒度且结构薄弱的反馈,表现出过度激进的干预,导致噪声大且不可靠的优化,限制了其纠正效果。为解决这些问题,我们提出了CRITIC-R1,一个结构化评论框架,将RAG评论制定并学习为使用强化学习(RL)的显式错误诊断问题。我们的框架将常见的RAG错误分类为多个诊断维度,包括判定、错误位置、推理分析和修复生成。为了学习这些能力,我们设计了两个奖励函数:保守判断对齐(CJA)首先鼓励校准的高层判断,同时减轻过度激进现象;而诊断质量对齐(DQA)通过门控奖励进一步改进细粒度诊断反馈。我们使用基于GRPO的RL训练评论模型,并从外部LLM教师模型收集过程级监督。在五个QA基准上的实验表明,CRITIC-R1在强RAG基线上持续提高了答案质量。我们的源代码可在 https://anonymous.4open.science/r/critic-r1-FCB0 获取。

英文摘要

Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0

2605.29885 2026-05-29 cs.LG cond-mat.dis-nn math.OC math.RT stat.ML

Open Problem: Separating Geometric and Algorithmic Compression via Cayley-Table Completion

开放问题:通过凯莱表完成分离几何压缩与算法压缩

Dongsung Huh

AI总结 提出凯莱表完成作为测试缺失的算法复杂度最小化归纳偏置的规范问题,并挑战社区将连续平坦性先验推广以自主发现离散算法公理。

Comments 6 pages. Submitted to the Conference on Learning Theory (COLT) 2026 Open Problem track

详情
AI中文摘要

现代统计学习理论和深度学习主要从连续容量控制(如基于范数的正则化、间隔最大化、低秩偏置)的角度来表征泛化。虽然在连续领域非常成功,但深度学习始终无法外推精确的算法或离散代数规则,这反映出缺失了向算法复杂度最小化的归纳偏置。我们提出凯莱表完成作为这一缺失偏置的规范测试平台,作为矩阵完成的离散代数对应物。正如矩阵分解结合权重衰减产生对低线性秩的隐式几何偏置,最近的结果表明,算子值张量分解结合平坦性先验产生对精确离散结合性的隐式算法偏置。我们提出了为凯莱表建立形式化精确恢复界限的开放问题,并挑战社区将连续平坦性先验推广,以自主发现更广泛的离散算法公理,而无需组合搜索。

英文摘要

Modern statistical learning theory and deep learning characterize generalization primarily in terms of continuous capacity control (e.g., norm-based regularization, margin maximization, low-rank bias). While highly successful in continuous domains, deep learning consistently fails to extrapolate exact algorithmic or discrete algebraic rules, reflecting a missing inductive bias toward algorithmic complexity minimization. We propose Cayley-table completion as the canonical testbed for this missing bias, serving as the discrete algebraic counterpart to matrix completion. Just as matrix factorization combined with weight decay yields an implicit geometric bias toward low linear rank, recent results demonstrate that operator-valued tensor factorizations paired with a flatness prior yield an implicit algorithmic bias toward exact discrete associativity. We pose the open problem of establishing formal exact recovery bounds for Cayley-table completion, and challenge the community to generalize continuous flatness priors to autonomously discover broader discrete algorithmic axioms without combinatorial search.
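
To make the task concrete, the sketch below builds a small Cayley-table completion instance and checks associativity by brute force. It assumes the simplest possible setting, a cyclic group Z_n with a random subset of entries hidden. The helper names, the choice of group, and the masking fraction are illustrative assumptions rather than the formulation used in the paper.

```python
import random

def cyclic_cayley_table(n):
    """Cayley table of the cyclic group Z_n: entry [a][b] is (a + b) mod n."""
    return [[(a + b) % n for b in range(n)] for a in range(n)]

def mask_entries(table, fraction, seed=0):
    """Hide a fraction of the entries (set to None) to create a completion instance."""
    rng = random.Random(seed)
    n = len(table)
    cells = [(a, b) for a in range(n) for b in range(n)]
    hidden = set(rng.sample(cells, round(fraction * len(cells))))
    return [
        [None if (a, b) in hidden else table[a][b] for b in range(n)]
        for a in range(n)
    ]

def is_associative(table):
    """Check (a*b)*c == a*(b*c) for every triple in a fully filled table."""
    n = len(table)
    return all(
        table[table[a][b]][c] == table[a][table[b][c]]
        for a in range(n) for b in range(n) for c in range(n)
    )

full = cyclic_cayley_table(5)
partial = mask_entries(full, fraction=0.4)
num_hidden = sum(cell is None for row in partial for cell in row)
```

A learner sees only `partial` and must fill in the hidden cells. Exact recovery means reproducing `full`, and `is_associative` is the kind of discrete axiom that a purely continuous inductive bias is not guaranteed to respect.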

2605.29881 2026-05-29 cs.CV cs.AI

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

通过屏障调控自适应闭式引导缓解视觉语言模型中的幻觉

Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

AI总结 提出BRACS框架,通过监测视觉注意力并仅在接地退化时进行闭式修正,无需训练即可有效减少LVLM中的物体幻觉。

详情
AI中文摘要

大型视觉语言模型(LVLMs)经常幻觉出输入图像中不存在的物体,这主要是因为随着解码进行,视觉接地减弱。现有的推理时缓解方法在生成过程中修改logits或隐藏状态,但它们存在三个关键限制:缺乏明确的接地目标,即使在模型已经良好接地时也进行干预,以及使用固定的修正强度,无法适应接地失败的严重程度。我们提出BRACS(屏障调控自适应闭式引导),一种无需训练的引导框架,通过屏障调控自适应闭式引导解决这些问题。BRACS监测模型自身的注意力以衡量视觉接地,并仅在接地恶化时对隐藏状态进行修正。修正更新以闭式解析计算,无需训练辅助网络或重新训练模型。在LLaVA-1.5-7B和Qwen-VL-Chat上的实验表明,BRACS在幻觉基准上持续优于先前方法,将CHAIR$_s$降低9.4个点,将POPE F1提高2.7个点,同时在四个通用多模态基准上匹配或提升性能。BRACS还保持高效,运行速度为贪心解码吞吐量的80%,平均速度比基线快1.3倍。

英文摘要

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.

2605.29873 2026-05-29 cs.AI

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Moment-KV: 基于动量的解码时KV缓存压缩用于长文本生成

Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

AI总结 提出Moment-KV方法,利用动量驱动的时序注意力聚合在解码阶段压缩KV缓存,以提升长文本生成质量并保持解码延迟。

详情
AI中文摘要

键值(KV)缓存仍然是大型语言模型(LLM)在长文本生成任务中部署的主要瓶颈。先前的工作通常对预填充和解码缓存应用均匀压缩,但压缩预填充缓存会破坏关键上下文从而降低性能。虽然保留预填充缓存至关重要,但解码阶段的压缩仍未被充分探索,现有方法依赖于固定的近期窗口或瞬时注意力。我们对注意力动态的分析揭示了强时间模式:关键标记在长时间范围内获得持续注意力,而局部推理涉及短暂的爆发。静态启发式方法无法捕捉这种行为,导致重要标记被过早驱逐或陈旧标记被保留。我们提出Moment-KV,一种基于动量驱动的时序注意力聚合的解码时KV缓存压缩方法。我们的方法将标记重要性建模为连续演化的状态,其中注意力通过衰减进行聚合,捕捉长期影响和近期相关性。实验表明,Moment-KV在长文本生成任务中显著提高了生成保真度(2.3-3.2%),同时保持了解码延迟。

英文摘要

The key-value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, with existing methods relying on rigid recency windows or instantaneous attention. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation. Our method models token importance as a continuously evolving state, where attention is aggregated with decay, capturing both long-term influence and recent relevance. Experiments show that Moment-KV significantly improves generation fidelity in long-generation tasks (2.3-3.2%) while maintaining decoding latency.
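
The abstract describes token importance as attention aggregated with decay but gives no update rule. The sketch below assumes the simplest such rule, an exponentially decayed running sum, and an eviction step that always keeps the prefill tokens and retains only the top-scoring decode tokens. The decay value, the budget, the function names, and the toy attention trace are hypothetical. The cache size is also held fixed here for brevity, whereas a real decoder appends one token per step.

```python
def update_scores(scores, attention, decay=0.9):
    """Exponentially decayed running sum of the attention each cached token receives."""
    return [decay * s + a for s, a in zip(scores, attention)]

def evict(scores, num_prefill, budget):
    """Return indices to keep: every prefill token plus the `budget` best decode tokens."""
    decode_indices = list(range(num_prefill, len(scores)))
    ranked = sorted(decode_indices, key=lambda i: scores[i], reverse=True)
    kept_decode = sorted(ranked[:budget])
    return list(range(num_prefill)) + kept_decode

# Toy trace: 2 prefill tokens followed by 3 decode tokens, over three decoding steps.
scores = [0.0] * 5
trace = [
    [0.1, 0.1, 0.6, 0.1, 0.1],  # token 2 receives sustained attention
    [0.1, 0.1, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.3, 0.1, 0.4],  # token 4 receives a recent burst
]
for attention in trace:
    scores = update_scores(scores, attention)

kept = evict(scores, num_prefill=2, budget=2)
```

With these numbers the decayed scores rank token 2 (sustained attention) above token 4 (recent burst) above token 3, so `kept` is `[0, 1, 2, 4]`: both prefill tokens survive and token 3 is evicted.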

2605.29864 2026-05-29 cs.RO

LLM-Guided Future Hypotheses for Horizon-Aware Exploration in Multi-Step Robot Manipulation

LLM引导的未来假设用于多步机器人操作中的视野感知探索

Mohammad Khoshnazar, Andrew Melnik, Michael Beetz

AI总结 提出未来经验条件化(FEC)框架,利用LLM生成短期未来视频作为结构化先验,结合行为克隆和强化学习微调,提升多步机器人操作中的探索和策略适应能力。

详情
AI中文摘要

多步机器人操作需要在场景如何演化的不确定性下行动,这使得探索和策略适应具有挑战性。我们研究了短期、任务一致的未来视频能否为控制和强化学习微调提供有用的结构化先验。我们通过未来经验条件化(FEC)形式化这一思想,这是一种简单的接口,将闭环策略条件化于短期未来视频的潜在表示上。在我们的模拟设置中,未来片段通过三个阶段生成:一个基于当前场景状态初始化的任务本体上运行的LLM推理器,一个无机器人的数字孪生展开预期物体运动,以及一个无需推理时分割的掩码自由视频扩散模型,用于合成机器人一致的未来片段。我们主要使用BC和BC+RL实例化这一未来条件化接口,并在RoboCasa和CALVIN上与无未来、GT未来、生成未来和错误未来条件下的未来条件化流式流策略(SFP)基线进行比较。生成的未来比无未来条件化提高了性能,而不匹配的未来则降低了性能,我们的BC+RL实例化实现了最强的整体结果。对CALVIN的8个任务的平均BC+RL学习曲线分析进一步表明,GT未来改进最快,生成未来比无未来更早且更高水平地改进,而错误未来在整个训练过程中保持为零。这些结果表明,短期未来视频可以在不完美的未来预测下作为探索和策略适应的有用结构化先验。https://enact2026.github.io/

英文摘要

Multi-step robot manipulation requires acting under uncertainty about how the scene will evolve, making exploration and policy adaptation challenging. We study whether short-horizon, task-consistent future videos can provide useful structured priors for control and reinforcement-learning fine-tuning. We formalize this idea through Future-Experience Conditioning (FEC), a simple interface that conditions closed-loop policies on a latent representation of a short future video. In our simulation setup, future clips are generated in three stages: an LLM reasoner operating over a task ontology initialized from the current scene state, a robot-free digital-twin rollout of the intended object motion, and a mask-free video diffusion model that synthesizes a robot-consistent future clip without requiring segmentation at inference. We instantiate this future-conditioning interface primarily with BC and BC+RL, and compare against a future-conditioned Streaming Flow Policy (SFP) baseline on RoboCasa and CALVIN under four conditions: NoFuture, GTFuture, GenFuture, and WrongFuture. Generated futures improve performance over no-future conditioning, while mismatched futures degrade it, and our BC+RL instantiation achieves the strongest overall results. An average BC+RL learning-curve analysis across 8 CALVIN tasks further shows that GTFuture improves fastest, GenFuture improves earlier and to a higher level than NoFuture, and WrongFuture remains at zero throughout training. These results suggest that short-horizon future videos can serve as useful structured priors for exploration and policy adaptation under imperfect future predictions. https://enact2026.github.io/