arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.03644 2026-06-03 cs.AI

AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse


Jie Ou, Jinyu Guo, Shiyao Guo, Yuang Li, Ruiqi Wu, Zhaokun Wang, Wenyi Li, Wenhong Tian

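Affiliations: School of Information and Software Engineering, University of Electronic Science and Technology of China

AI summary: Proposes AdapShot, which uses an entropy-based probing mechanism to choose the number of shots dynamically and combines it with a semantics-aware KV cache reuse strategy for efficient many-shot in-context learning, reporting an average performance gain of about 10% and a 4.64x speedup.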


Abstract

Many-Shot In-Context Learning (ICL) has emerged as a promising paradigm, leveraging extensive examples to unlock the reasoning potential of Large Language Models (LLMs). However, existing methods typically rely on a predetermined, fixed number of shots. This static approach often fails to adapt to the varying difficulty of different queries, leading to either insufficient context or interference from noise. Furthermore, the prohibitive computational and memory costs of long contexts severely limit Many-Shot's feasibility. To address the above limitations, we propose AdapShot, which dynamically optimizes shot counts and leverages KV cache reuse for efficient inference. Specifically, we design a probe-based evaluation mechanism that utilizes output entropy to determine the optimal number of shots. To bypass the redundant prefilling computation during both the probing and inference phases, we incorporate a semantics-aware KV cache reuse strategy. Within this reuse strategy, to address positional encoding incompatibilities, we introduce a decoupling and re-encoding method that enables the flexible reordering of cached key-value pairs. Extensive experiments demonstrate that AdapShot achieves an average performance gain of around 10% and a 4.64x speedup compared to state-of-the-art DBSA.
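The entropy-based probe described above can be illustrated with a minimal sketch. It assumes that each candidate shot count has already been probed and that the caller holds the resulting output distribution; the selection rule (lowest entropy, ties broken toward fewer shots), the function names, and the numbers are illustrative assumptions, not the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_shot_count(probe_outputs):
    """Return the candidate shot count whose probe output has the lowest entropy.

    probe_outputs maps a candidate shot count to the output distribution
    observed for that prompt (hypothetical values supplied by the caller).
    Ties are broken in favour of fewer shots.
    """
    return min(probe_outputs, key=lambda k: (entropy(probe_outputs[k]), k))


# Hypothetical probe results for a single query (not data from the paper).
probes = {
    4: [0.40, 0.35, 0.25],
    16: [0.85, 0.10, 0.05],
    64: [0.60, 0.30, 0.10],
}
best = pick_shot_count(probes)
```

In this toy case the 16-shot probe is the most confident, so it is selected; a real system would also have to weigh the cost of each probe, which is where the KV cache reuse in the abstract comes in.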

2605.02488 2026-06-03 cs.AI cs.DB cs.LO

Efficient Temporal Datalog Materialisation for Composite Event Recognition


Periklis Mantenoglou

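Affiliations: Örebro University, Sweden

AI summary: To support the detection of critical situations over high-velocity event streams, maps practical fragments of prominent event specification languages into Temporal Datalog->- and proposes Streaming Trigger Graphs, an extension of a Datalog materialisation technique, yielding a uniform composite event recognition mechanism.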


Abstract

Several applications demand the timely detection of critical situations, such as threats to safety and transparency, over high-velocity streams of symbolic events. This demand has motivated the development of (i) event specification languages, which define composite events via temporal patterns over simpler events, and (ii) stream reasoning frameworks, evaluating patterns expressed in these languages. However, event specification languages are typically studied in isolation, complicating their comparison in terms of expressivity and obscuring the scope of their associated stream reasoners. To mitigate this issue, we map practical fragments of prominent event specification languages into Temporal Datalog->-, a temporal Datalog with stratified negation and no future dependencies. To support efficient stream reasoning over Temporal Datalog->-, we propose Streaming Trigger Graphs, an extension of a state-of-the-art technique for Datalog materialisation. Our approach yields a uniform composite event recognition mechanism that has the potential to generalise across a wide range of practical event specification languages.

2605.03299 2026-06-03 cs.CL

LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models


Minh Chu Xuan, Tien-Phat Nguyen, Linh Ngo Van, Dinh Viet Sang, Nguyen Thi Ngoc Diep, Trung Le

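Affiliations: Hanoi University of Science and Technology; VNU University of Engineering and Technology; Monash University

AI summary: Proposes LLM-XTM, a framework that combines LLM-guided topic refinement with self-consistency uncertainty quantification to improve the coherence and alignment of cross-lingual topic models in a black-box, stable manner while reducing reliance on bilingual resources.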

Comments ACL 2026


Abstract

Cross-lingual topic modeling aims to discover shared semantic structures across languages, yet existing models depend on sparse bilingual resources and often yield incoherent or weakly aligned topics. Recent LLM-based refinements improve interpretability but are costly, document-level, and prone to hallucination, with prior white-box approaches requiring inaccessible token probabilities. We propose LLM-XTM, a framework that integrates LLM-guided topic refinement with self-consistency uncertainty quantification, enabling black-box, stable, and scalable enhancement of cross-lingual topic models. Experiments on multilingual corpora show that LLM-XTM achieves superior topic coherence and alignment while reducing reliance on bilingual dictionaries and expensive LLM calls.
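Self-consistency uncertainty quantification needs only sampled outputs, not token probabilities, which is what makes it usable with black-box LLMs. The sketch below is a generic illustration under stated assumptions: the sampled refinements are hypothetical strings, and the acceptance threshold and function names are invented for the example rather than taken from LLM-XTM.

```python
from collections import Counter


def self_consistency(samples):
    """Return (majority answer, agreement rate) over repeated black-box samples."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def refine_topic_word(original_word, samples, threshold=0.6):
    """Accept the LLM's suggestion only when the samples agree often enough.

    `threshold` is a hypothetical hyperparameter: below it, the suggestion is
    treated as too uncertain and the original topic word is kept.
    """
    answer, agreement = self_consistency(samples)
    return answer if agreement >= threshold else original_word


# Hypothetical LLM suggestions for replacing one topic word.
confident = ["economy", "economy", "finance", "economy", "economy"]
uncertain = ["market", "trade", "bank", "market"]
kept = refine_topic_word("money", uncertain)
replaced = refine_topic_word("money", confident)
```

Falling back to the original word when agreement is low is one simple way to limit the effect of hallucinated refinements.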

2605.01712 2026-06-03 cs.LG

CoAction: Cross-task Correlation-aware Pareto Set Learning


Xinyue Chen, Yingxuan Liang, Yiqin Huang, Chikai Shang, Hai-Lin Liu, Fangqing Gu

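Affiliations: Guangdong University of Technology; Xiamen University

AI summary: Proposes the CoAction framework, which uses a task-aware Transformer to handle multiple multi-objective optimization problems simultaneously and captures inter-task correlations through self-attention, reporting competitive performance in Hypervolume, Range, and Sparsity.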

Comments Accepted by ICIC 2026 (Oral)


Abstract

Pareto set learning (PSL) is an emerging paradigm in multi-objective optimization that trains neural networks to map preference vectors to Pareto optimal solutions. However, existing PSL methods primarily focus on solving a single multi-objective optimization problem at a time. This limitation not only increases computational costs in multi-objective multitask optimization scenarios by requiring a separate model for each task, but also fails to exploit inter-task correlations. To address this, we propose a Cross-tAsk correlation-aware Pareto Set Learning (CoAction) framework, which leverages a task-aware Transformer to handle multiple tasks simultaneously. Specifically, by assigning task-specific embedding vectors to individual tasks, the model effectively distinguishes between tasks while facilitating knowledge sharing among them. We utilize a Transformer encoder as the backbone architecture to leverage its self-attention mechanism for capturing complex task dependencies. The proposed approach is evaluated on comprehensive multitask test suites covering both benchmark problems and real-world applications, demonstrating effectiveness and competitive performance in Hypervolume, Range, and Sparsity.
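Hypervolume, one of the metrics named above, measures the region dominated by a solution set relative to a reference point. The sketch below computes it for the two-objective minimisation case only; the function name, the sample points, and the reference point are assumptions made for illustration and are unrelated to the paper's test suites.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`, for two minimised objectives."""
    # Keep only points that strictly improve on the reference in both objectives.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:  # visited in ascending order of the first objective
        if f2 < best_f2:  # not dominated by any point seen so far
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Hypothetical Pareto front approximation plus one dominated point.
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, ref=(4.0, 4.0))
hv_with_dominated = hypervolume_2d(front + [(3.0, 3.0)], ref=(4.0, 4.0))
```

A dominated point adds no area, so both calls return the same value; a larger hypervolume indicates a front that is closer to the ideal and better spread.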

2606.03994 2026-06-03 cs.CV cs.RO

SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image


Inhee Lee, Sangwon Baik, Sungjoo Kim, Hyeonwoo Kim, Hyunsoo Cha, Hanbyul Joo

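Affiliations: Seoul National University

AI summary: Proposes SimuScene, a compositional 3D reconstruction pipeline that puts physical simulation in the loop of shape and layout estimation, using a physics engine to diagnose reconstruction errors and drive corrections, producing stable, simulation-ready scenes.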

Comments Project Page: https://snuvclab.github.io/SimuScene/


Abstract

Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters recover plausible per-object shapes, composing them yields scenes that collapse under physical simulation due to interpenetrating, hovering, or sinking objects. Existing physics-aware methods address this strictly as a post-hoc layout correction, leaving the underlying geometric errors unresolved. To address this, we introduce SimuScene, a compositional 3D reconstruction pipeline that puts physics in the loop of shape and layout estimation. Rather than using physics merely for layout cleanup, we utilize the physics engine as a diagnostic measurement tool during the generative process itself. By diagnostically simulating reconstructed objects under gravity, we convert penetration and support failures into quantitative correction signals that drive gravity-axis stretching and amodal shape resampling. This physics-informed feedback loop mitigates accumulated reconstruction errors and produces a stable, simulation-ready compositional 3D scene. Extensive experiments demonstrate state-of-the-art performance on physical stability and geometric alignment benchmarks. We further highlight SimuScene's utility by deploying reconstructed environments in humanoid control and robot-arm manipulation tasks.

2606.03990 2026-06-03 cs.LG cs.CL cs.CV

Neuron Populations Exhibit Divergent Selectivity with Scale


Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

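Affiliations: UC Berkeley; TTIC

AI summary: Analyzes how Rosetta Neurons are distributed and behave across model scales, finding that their number follows a sublinear power law and their selectivity increases with scale while non-Rosetta neurons remain less selective, and proposes an analytical model balancing feature utility against neuron capacity to explain this polarization.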

Comments Project page and code: https://avdravid.github.io/rosetta-neuron-scaling/


Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
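A sublinear power law means a count of the form c * size**alpha with 0 < alpha < 1: the count keeps growing, yet its share of the total shrinks. The sketch below fits such a law in log-log space. The data are synthetic (generated from c = 2, alpha = 0.7, values chosen arbitrarily), and it makes the simplifying assumption that the total neuron count is proportional to the size measure; none of the numbers come from the paper.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares fit of log(count) = log(c) + alpha * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(n) for n in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    alpha = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha


# Synthetic data: count = 2 * size**0.7 (illustrative, not measured values).
sizes = [1e6, 1e7, 1e8, 1e9]
counts = [2.0 * s**0.7 for s in sizes]
c, alpha = fit_power_law(sizes, counts)

# Share of the total, under the assumption that the total scales with size.
fractions = [n / s for n, s in zip(counts, sizes)]
```

Because alpha is below 1, `counts` increases with size while `fractions` decreases, which is the pattern the abstract describes for Rosetta Neurons.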

2606.03989 2026-06-03 cs.CV

PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation


Shinjeong Kim, Ignacio Alzugaray, Callum Rhodes, Paul H. J. Kelly, Andrew J. Davison

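Affiliations: Department of Computing, Imperial College London

AI summary: Proposes a pixel-level distributed visual odometry and depth estimation method based on Gaussian Belief Propagation, with a keyframe-like anchoring mechanism, aimed at parallel on-sensor computation.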


Abstract

Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks. We propose a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed odometry and depth estimation with keyframe anchoring on-sensor. Project Page: https://www.shinjeongkim.com/pixvod/

2606.03986 2026-06-03 cs.CV

NewtPhys: Do Foundation Models Understand Newtonian Physics?


Sebastian Cavada, Soumava Paul, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette

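Affiliations: Inria; Valeo.ai; MBZUAI

AI summary: Presents NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations, for systematically evaluating low-level Newtonian physics reasoning in foundation models, and reveals limitations of existing models.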


Abstract

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps -- including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry -- bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs (54 open-weight models and 2 closed-source frontier models) as well as 10 VFMs, and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.

2606.03985 2026-06-03 cs.RO cs.AI cs.CV

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking


Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

Affiliations: Tsinghua University; Galbot Inc.; Shanghai Jiao Tong University; Peking University; Shanghai Qi Zhi Institute

AI summary: Proposes Humanoid-GPT, a GPT-style causal Transformer pre-trained on a billion-scale motion corpus for whole-body control, which reaches zero-shot generalization to unseen motions and tasks by scaling data and model capacity.

Comments Accepted at CVPR 2026


Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
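Causal attention, the property that makes a Transformer "GPT-style", lets each position attend only to itself and earlier positions. The sketch below shows the masking step on a plain score matrix in pure Python; it is a generic illustration of the mechanism with hypothetical inputs, not code from Humanoid-GPT.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j] with future positions (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # position i may only see positions 0..i
        peak = max(visible)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# With uniform scores, each frame spreads its attention evenly over the past.
uniform = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(uniform)
```

For a motion tracker this means the action at frame t depends only on frames up to t, so the same model can be run step by step at control time.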

2606.03980 2026-06-03 cs.LG cs.CL

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill


Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

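Affiliations: Qwen Large Model Application Team, Alibaba; Sun Yat-sen University; The Chinese University of Hong Kong; Peking University; ETH Zürich; University of Zurich

AI summary: Proposes the Skill-RM framework, which reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill, unifying heterogeneous evaluation criteria by dynamically selecting and aggregating evidence, and outperforms traditional judge baselines on reward benchmarks and downstream tasks.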


Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.

2606.03979 2026-06-03 cs.LG cs.AI

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories


Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni

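Affiliations: Google; Cornell University

AI summary: Inspired by human learning, proposes a "Sleep" paradigm with two stages, Memory Consolidation (Knowledge Seeding) and Dreaming (self-improvement), so that models can learn continually, turn short-term memories into long-term knowledge, and improve themselves.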

Comments A version of this work has been publicly available from September 2025 on OpenReview


Abstract

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves through a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

2606.03971 2026-06-03 cs.CV

Video-Mirai: Autoregressive Video Diffusion Models Need Foresight


Yonghao Yu, Lang Huang, Runyi Li, Zerun Wang, Toshihiko Yamasaki

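Affiliations: The University of Tokyo; National Institute of Informatics; Peking University

AI summary: Proposes Video-Mirai, a training-only method in which a frozen foresight encoder extracts future information from the completed rollout and distills it into causal states, closing the representation-level planning gap without changing inference and improving consistency in long video generation.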


Abstract

Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present. This creates what we call a representation-level planning gap: states that fit the current segment may discard identity, layout, and motion information needed for a consistent future. We introduce Video-Mirai, a training-only method that closes this gap without changing causal inference: the generator rolls out causally, a frozen foresight encoder reads the completed rollout non-causally, and a lightweight predictor distills the resulting stopped-gradient targets into causal states. Future frames supervise representations, never generator inputs. At inference, the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. Video-Mirai improves a strong Causal-Forcing baseline on 5-second VBench from 83.8 to 84.6 in terms of Total Score. On 30-second rollouts beyond the training horizon, subject consistency improves from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. Causality should constrain inference, not representation supervision. Our study highlights that visual autoregressive models need foresight. Project page: https://y0uroy.github.io/Video-Mirai.

2606.03969 2026-06-03 cs.CL cs.AI

Quantifying Faithful Confidence Expression in Large Reasoning Models


Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

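Affiliations: Yale University

AI summary: Addresses the difficulty large reasoning models (LRMs) have in faithfully expressing intrinsic confidence in long chain-of-thought outputs, proposing a framework based on token probabilities, hidden states, and sampled response consistency to systematically quantify the alignment between linguistic decisiveness and internal uncertainty.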

Comments Code: https://github.com/yale-nlp/faithful_lrm


Abstract

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.

2606.03968 2026-06-03 cs.CL cs.AI

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards


Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

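Affiliations: Amazon; Georgia Institute of Technology

AI summary: Targets the rubric-quality bottleneck caused by a fixed query distribution in rubric-based RL, proposing QUBRIC, which co-designs queries and rubrics using teacher-derived key points, contrastive rubric generation, and learnability filtering; it gains +5.5 points on ArenaHard and transfers to legal, moral, and narrative reasoning benchmarks.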


Abstract

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
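The remark that training "receives no reward signal" when all responses fail follows from how GRPO normalizes rewards within a group of responses to the same query: if every reward is identical, every advantage is zero. The sketch below shows that effect together with a simple filter; the filter criterion (rewards must differ within the group) and the function names are assumptions for illustration and may not match QUBRIC's learnability filtering.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardised within one response group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_learnable(rewards):
    """Hypothetical filter: keep a query-rubric pair only if its rewards vary."""
    return max(rewards) > min(rewards)


# Rubric scores for four sampled responses to each query (made-up values).
all_fail = [0.0, 0.0, 0.0, 0.0]
mixed = [0.0, 1.0, 0.0, 1.0]

dead_signal = group_advantages(all_fail)
live_signal = group_advantages(mixed)
kept_pairs = [r for r in (all_fail, mixed) if is_learnable(r)]
```

The all-fail group yields zero advantages and is dropped, while the mixed group yields advantages near -1 and +1 and is kept for training.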

2606.03967 2026-06-03 cs.CL cs.AI

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task


Quentin Fuxa, Dominik Macháček

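Affiliations: Charles University, MFF, ÚFAL; University of Edinburgh

AI summary: Presents AlignAtt4LLM, which, to the authors' knowledge, is the first application of the AlignAtt policy to a decoder-only LLM, using an explicit source span in the prompt, offline selection of translation-specific alignment heads, selective qk-fast replay, and runtime query/key capture; it outperforms the supplied baselines for English-to-German and English-to-Italian simultaneous translation.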

Comments Accepted to IWSLT 2026


Abstract

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
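An AlignAtt-style policy decides how many draft tokens are safe to emit by looking at where each one attends in the source: emission stops at the first token whose most-attended source position lies in the most recent part of the source, since that part may still change. The sketch below is a simplified illustration that assumes a single, already-aggregated draft-to-source attention matrix and a hypothetical `frame_window` parameter; the head selection and qk-fast replay described in the abstract are not modelled.

```python
def alignatt_emit(attention_rows, num_source_tokens, frame_window):
    """Return how many leading draft tokens may be emitted.

    attention_rows[i][j] is the attention of draft token i on source token j.
    Emission stops at the first draft token whose most-attended source
    position falls within the last `frame_window` source tokens.
    """
    cutoff = num_source_tokens - frame_window
    emitted = 0
    for row in attention_rows:
        aligned = max(range(num_source_tokens), key=lambda j: row[j])
        if aligned >= cutoff:
            break  # this token relies on unstable, recent source content
        emitted += 1
    return emitted


# Hypothetical attention of three draft tokens over five source tokens.
attention = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.00, 0.00, 0.10, 0.20, 0.70],
]
n_emit = alignatt_emit(attention, num_source_tokens=5, frame_window=2)
```

Here the first two draft tokens align to early source tokens and are emitted, while the third aligns to the newest source token and is held back until more source arrives. A larger `frame_window` makes the policy more conservative and raises latency.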

2606.03962 2026-06-03 cs.LG cs.AI

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

利用奖励不确定性在强化学习中诱导多样化行为

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland

发表机构 * New York University(纽约大学) Google DeepMind(谷歌DeepMind)

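AI Summary: To address the lack of diversity in classical reinforcement learning, replaces the reward function with a distribution over reward functions and applies a non-linear objective over sets of actions, so that controllable behavioural diversity emerges naturally; derives a gradient estimator and reports experiments showing the approach is robust and theoretically grounded.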

Comments Core contributors: Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, André Barreto, Mark Rowland

AI中文摘要

经典强化学习通常寻求最大化标量奖励期望和的确定性策略。然而,现代应用如语言模型微调或科学发现需要多样性。现有的补救措施如熵正则化或多样性奖励通常需要脆弱的权衡,以性能换取随机性,或依赖可能使策略排名错位的启发式指标。我们认为,多样性更自然地理解为对奖励不确定性的理性响应。当奖励函数不完全已知时——例如模糊偏好或不完美的奖励模型——承诺单一行动可能是次优的。基于此,我们提出对强化学习目标进行根本性重新表述,将标量奖励替换为奖励函数上的分布,并对行动集合应用非线性目标。结果是一个框架,其中校准的行为多样性自然出现,通过奖励函数分布保持可控,且无需牺牲期望奖励即可获得。聚焦于上下文赌博机设置,我们为该目标推导出原则性的梯度估计器,并证明我们的公式自然泛化了原始策略梯度以及最近发展的行动集方法。我们的实证结果表明,该框架为传统问题表述无法诱导所需行为广度的复杂强化学习任务提供了鲁棒且理论基础的替代方案。

英文摘要

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
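A tiny numeric sketch can make the claim concrete that diversity is a rational response to reward uncertainty. The abstract does not specify the non-linear set objective, so the example below assumes the maximum over the action set; the function name `expected_best_of_set` and the two reward functions are invented for illustration.

```python
def expected_best_of_set(action_set, reward_fns, probs):
    """Expected value of the best action in the set under reward uncertainty.

    Uses max over the set as the non-linear set objective (an assumption here).
    """
    return sum(p * max(r[a] for a in action_set) for r, p in zip(reward_fns, probs))


# Two equally likely reward functions that disagree about which action is good.
reward_fns = [{"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}]
probs = [0.5, 0.5]

repeated = expected_best_of_set(["a", "a"], reward_fns, probs)  # commit to one action
diverse = expected_best_of_set(["a", "b"], reward_fns, probs)   # hedge across both
```

Under this assumed objective, committing to a single action scores 0.5 while the diverse set scores 1.0, and each action still has the same expected reward on its own.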

2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS

Efficient ASR Training with Conversations that Never Happened

利用从未发生的对话进行高效的ASR训练

Máté Gedeon, Péter Mihajlik

发表机构 * Dept. of Telecommunications and Artificial Intelligence, Budapest University of Technology and Economics(电信与人工智能系,布达佩斯技术与经济大学) SpeechTex Ltd.(SpeechTex公司) ELTE Research Centre for Linguistics(ELTE语言研究所)

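AI Summary: For lower-resource languages and niche domains, proposes an augmentation pipeline that uses LLMs to generate dialogue scenarios, maps speaker attributes to TTS voice profiles, and assembles the synthesized utterances into conversations; experiments show that synthetic conversations improve ASR performance, and on a Hungarian benchmark a model trained on only 67 hours of real conversations plus 636 hours of simulated data outperforms a zero-shot model trained on 2700 hours.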

AI中文摘要

低资源语言和特定领域的对话式ASR受到领域匹配的多说话人训练数据稀缺的限制。我们提出了一种增强流水线,该流水线生成带有参与者元数据的场景级对话,将说话人属性映射到TTS语音配置文件,并将合成的话语组装成感知说话人的模拟对话。我们在相同的FastConformer-Large训练方案下,评估了五种LLM家族,分别采用单生成器、固定预算混合和扩展设置。我们在匈牙利语BEA-Dialogue基准语料库上进行了全面评估,该方法本身适用于任何语言,只要各组件有相应资源。结果表明,合成对话持续改善语音识别性能,但生成器选择和组成数据强烈影响增益。我们最大的训练配置仅使用67小时真实对话和636小时模拟数据,在评估基准上实现了比在2700小时匈牙利语语音上训练的零样本模型更好的性能。这些发现表明,通过TTS合成的LLM生成的对话数据是真实对话语料库在语音模型训练中的实用补充。

英文摘要

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2606.03954 2026-06-03 cs.CV cs.LG cs.RO

VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

VLESA: 用于人类活动监测的视觉语言具身安全智能体

Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Mitsubishi Electric Research Laboratories(三菱电机研究实验室) Harvard University(哈佛大学)

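AI Summary: Proposes the VLESA framework, which monitors human activity from egocentric video and uses a GRPO-trained goal-conditioned safety Q-filter for real-time safety intervention, achieving higher intervention accuracy on the ASIMOV-2.0 benchmark.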

Comments 18 pages, 5 tables, 5 figures

AI中文摘要

随着AI系统越来越多地协助人类完成物理任务,确保安全变得至关重要:物理动作会带来即时且不可逆转的后果,而数字错误则不会。我们引入了视觉语言具身安全智能体(VLESA),这是一个从自我中心视频监测人类活动,并在预测到危险动作时触发实时安全干预的框架。VLESA处理意图依赖的安全问题,其中相同的动作可能根据上下文而安全或危险。我们引入了一个将自我中心帧与目标条件安全注释配对的数据集,使得能够通过GRPO训练一个目标条件安全Q过滤器,该过滤器在不重新训练的情况下根据推断的意图评估动作。在此基础上,提出了一个意图-动作预测智能体,用于从视频中联合推断目标并预测未来动作。在ASIMOV-2.0基准上,VLESA在精确的真值帧上实现了比基线更高的干预准确率,而通过目标条件约束解码,GRPO训练的Q过滤器将动作安全性提高了超过41个百分点。代码可在 https://github.com/HanjiangHu/VLESA 获取。

英文摘要

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.

2606.03951 2026-06-03 cs.CV

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Demo2Tutorial:从人类经验到多模态软件教程

Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore(新加坡国立大学Show实验室)

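AI Summary: Proposes the Demo2Tutorial framework, which parses human experience captured in screen recordings and interaction logs into structured multimodal tutorials for human learning and GUI-agent training; experiments show that the generated tutorials surpass human-authored ones in quality and improve task efficiency.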

Comments Accepted by CVPR 2026

AI中文摘要

数字环境中的人类经验提供了大量未被充分探索的真实、未修剪的交互资源,其中包含丰富的程序性知识。我们提出了Demo2Tutorial,一个将屏幕录制和交互日志捕获的人类经验转化为结构化多模态软件教程的框架,用于同时教授人类和智能体。Demo2Tutorial首先通过专用记录器收集人类经验,然后使用多模态动作解析器解析原始经验,以重建感知、动作和意图。接着,步骤规划器将这些步骤抽象为表示目标和步骤的分层任务图。最后,教程合成器将解析后的经验转化为结构化的、可复用的图文指令。我们在一个基于官方软件文档的新基准上评估了教程生成质量。我们进一步证明,这种蒸馏表示有利于(i)人类学习,通过自动生成多模态教程,以及(ii)智能体学习,通过改进下游GUI智能体规划和泛化。实验表明,Demo2Tutorial生成的高质量教程超越了人工编写的教程,并显著优于基线方法,同时实现了更快的人类任务完成和更好的GUI智能体规划,证明从人类经验中蒸馏的结构化教程可以作为有效知识表示,促进人类学习和智能体能力。代码和数据将在 https://github.com/showlab/Demo2Tutorial 提供。

英文摘要

Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at https://github.com/showlab/Demo2Tutorial.

2606.03949 2026-06-03 cs.RO

Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

偏好校准的人机协同强化学习用于机器人操作

Zeyi Liu, Guangyao Liu, Yinuo Qu, Yuquan Xue, Bofang Jia, Chunhua Yang, Weihua Gui, Keke Huang, Ziwei Wang

发表机构 * Central South University(中南大学) Nanyang Technological University(南洋理工大学) Zhejiang University(浙江大学)

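AI Summary: Proposes the PACT framework, which uses the implicit preference signals induced by human intervention for credit reassignment and policy alignment, improving the sample efficiency and performance of human-in-the-loop reinforcement learning.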

Comments Submitted to CoRL2026

AI中文摘要

人机协同强化学习(HIL-RL)通过在线人类干预提高了真实机器人操作中的样本效率。然而,成功的轨迹可能包含偏离期望任务执行路径并迫使人类干预的次优动作。现有的HIL-RL方法通常对所有转换应用一致的信用分配原则,通过次优段均匀传播折扣终端奖励,忽略了每个转换对任务成功的实际贡献。这高估了评论家学习的Q值,并间接误导演员更新朝向次优行为模式。为此,我们提出了PACT,一种偏好校准的演员-评论家训练框架,利用干预引起的隐式偏好信号对识别出的次优段进行信用重分配,同时直接指导策略训练以实现无偏的评论家-演员学习。具体来说,我们首先设计了一个从人类演示中学习并识别次优段进行信用校正的进度模型。然后,从干预状态下的人类动作和重采样策略动作中,我们构建偏好对来定义一个反事实优势,惩罚识别出的次优段的贝尔曼目标,实现方向性信用校准。此外,我们在有界均值空间中直接将策略与人类纠正动作对齐,提供了评论家引导更新之外的额外信号。在五个真实机器人操作任务中,PACT将平均成功率提高了24.5%,收敛速度加快至1.3倍,从而提高了强化学习的样本效率和性能。代码可在 https://anonymous.4open.science/r/HILRL-A1X-BC05 获取。

英文摘要

Human-in-the-loop reinforcement learning (HIL-RL) improves sample efficiency in real-robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task-execution path and force human intervention. Existing HIL-RL methods typically apply the consistent credit assignment principle to all transitions, uniformly propagating discounted terminal rewards through suboptimal segments, ignoring the actual contribution of each transition to task success. This overestimates Q-values for critic learning and indirectly misguides actor updates toward suboptimal behavior patterns. To this end, we propose PACT, a Preference-calibrated Actor-Critic Training framework that leverages the implicit preference signals induced by intervention to perform credit reassignment on identified suboptimal segments while directly guiding policy training for unbiased critic-actor learning. Specifically, we first design a progress model that learns from human demonstration and identifies suboptimal segments for credit correction. Then, from the human action and resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes Bellman targets of the identified suboptimal segment, enabling directional credit calibration. Moreover, we directly align the policy with human corrective actions in the bounded mean space, providing an additional signal beyond critic-guided updates. Across five real-robot manipulation tasks, PACT improves the average success rate by 24.5% and achieves 1.3 times faster convergence, thereby improving both RL sample efficiency and performance. Code is available at https://anonymous.4open.science/r/HILRL-A1X-BC05.

2606.03948 2026-06-03 cs.CL

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

CUNI 提交至 IWSLT 2026 的用于同声传译的袖珍离线模型

Aziz Sharipov Ortega, Dominik Macháček

发表机构 * Charles University, MFF, ÚFAL(查理大学,数学与物理学院,形式与应用语言学研究所)

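AI Summary: Adds simultaneous translation capability to the offline direct speech-to-text translation model Canary by combining it with the state-of-the-art AlignAtt policy, and submits Czech-to-English, English-to-German, and English-to-Italian systems to the IWSLT 2026 Simultaneous Speech Translation shared task, showing high translation quality, low computational requirements, and multilingual support.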

Comments IWSLT 2026

AI中文摘要

我们使用最先进的策略 AlignAtt,为离线直接语音到文本翻译模型 Canary 实现了同声传译能力,并将其提交至 IWSLT 2026 同声传译共享任务,涵盖捷克语到英语以及英语到德语和意大利语的翻译。我们系统的优势在于:(1) 高翻译质量,在计算无关的模拟中,无论是在低延迟还是高延迟场景下,均优于类似规模的基线系统;(2) 低计算需求,模型仅有 10 亿参数;(3) 多语言能力——支持 25 种源语言和 25 种目标语言。

英文摘要

We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to the IWSLT 2026 Simultaneous Speech Translation shared task for Czech to English and English to German and Italian. The strengths of our system are: (1) high translation quality, outperforming similarly sized baselines in both low- and high-latency regimes in computationally unaware simulations; (2) low computational requirements, as the model has only 1B parameters; (3) multilinguality, with support for 25 source and 25 target languages.

2606.03939 2026-06-03 cs.LG cs.AI cs.PF

FlashbackCL: Mitigating Temporal Forgetting in Federated Learning

FlashbackCL:缓解联邦学习中的时间遗忘

Mubarak A. Ojewale, Adriana E. Chis, Jorge M. Cortes-Mendoza, Bernardo Pulido-Gaytan, Horacio Gonzalez-Velez

发表机构 * Cloud Competency Centre, National College of Ireland, Dublin, Ireland(云竞争力中心,爱尔兰国家学院,都柏林,爱尔兰)

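AI Summary: To address temporal forgetting caused by client data distributions drifting over time in federated learning, proposes FlashbackCL, which combines temporally decayed label counts, class-balanced reservoir sampling replay, and server-side active coreset curation; on CIFAR-10 it achieves a 6.9% to 10.0% relative improvement over Flashback and reduces temporal forgetting by up to 68%.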

AI中文摘要

基础模型和边缘模型的联邦学习(FL)越来越多地部署在客户端数据分布随时间漂移的场景中,然而现有的遗忘缓解方法假设每个客户端的分布是平稳的。Flashback是近期最强的针对跨客户端(空间)遗忘的FL方法,它使用单调累积的每类标签计数作为知识代理;该代理在时间分布漂移下会失准,并将全局模型锚定在过时的类别平衡上。我们通过一个与协议级波动隔离的每阶段指标形式化定义了FL中的时间遗忘,并提出了Flashback Continual Learning(FlashbackCL),它是Flashback的即插即用扩展,包含:(i) 时间衰减的标签计数;(ii) 具有类别平衡水库采样(CBRS)的设备感知重放缓冲区;(iii) 在公共蒸馏集上的服务器端主动核心集筛选。结果表明,在具有50个客户端和三种受控时间漂移模式的CIFAR-10上,FlashbackCL相对于Flashback实现了6.9%至10.0%的相对改进,同时将时间遗忘减少了高达68%。一项5变体消融实验表明CBRS重放是关键组件。FlashbackCL在平稳CIFAR-100上也比Flashback提高了3.5个百分点,表明类别平衡重放同样正则化了空间异质性和时间漂移。

英文摘要

Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client's distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. The results show that, on CIFAR-10 with 50 clients and three controlled temporal shift modes, FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves on Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.
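The contrast between monotonically accumulated and temporally decayed label counts can be sketched in a few lines. The abstract does not state the decay rule, so the exponential decay, the factor `gamma`, and the function name `decayed_label_counts` below are assumptions made for illustration only.

```python
def decayed_label_counts(round_counts, gamma=0.5):
    """Exponentially decay per-class label counts across rounds.

    round_counts: list of dicts {class_id: samples seen this round}, oldest first
    gamma: assumed decay factor in (0, 1]; gamma=1 recovers monotone accumulation
    """
    totals = {}
    for counts in round_counts:
        # Age all previously accumulated evidence by one round.
        totals = {c: gamma * v for c, v in totals.items()}
        for c, n in counts.items():
            totals[c] = totals.get(c, 0.0) + n
    return totals


# A client that saw class 0 early and class 1 recently.
history = [{0: 100}, {0: 100}, {1: 50}, {1: 50}]
cumulative = decayed_label_counts(history, gamma=1.0)
decayed = decayed_label_counts(history, gamma=0.5)
```

With `gamma=1.0` the stale class 0 still dominates the proxy (200 versus 100), whereas with `gamma=0.5` the recently observed class 1 dominates (75 versus 37.5), which is the kind of recalibration the abstract motivates.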

2606.03936 2026-06-03 cs.LG physics.geo-ph

Correcting Neural Operator Spectral Bias via Diffusion Posterior Sampling with Sparse Observations

通过稀疏观测的扩散后验采样校正神经算子谱偏差

Niccolò Perrone, Fanny Lehmann, Stefania Fresca, Filippo Gatti

发表机构 * Université Paris-Saclay, CentraleSupélec, CNRS, ENS Paris-Saclay(巴黎-萨克雷大学,中央理工-高等电力学院,国家科学研究中心,巴黎-萨克雷高等师范学院) Laboratoire de Mécanique Paris-Saclay UMR 9026(巴黎-萨克雷力学实验室 UMR 9026) Politecnico di Milano(米兰理工大学) ETH AI Center(苏黎世联邦理工学院人工智能中心) Department of Mechanical Engineering University of Washington(华盛顿大学机械工程系)

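AI Summary: Proposes FreqNO-DPS, which uses diffusion posterior sampling with a spectrally shaped guidance score to correct the high-frequency attenuation (spectral bias) of neural operators given sparse observations, reaching near-zero spectral bias.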

AI中文摘要

神经算子代理(NO)比数值求解器快数个数量级地近似PDE解,但受谱偏差影响:高频内容被系统性地衰减,限制了在细尺度结构重要时的可靠性。通常也可获得场的稀疏传感器测量,提供点精度而无谱失真,但仅覆盖域的一小部分。我们通过将NO预测视为扩散后验采样框架中的辅助观测来解决这一问题。我们的方法FreqNO-DPS(https://github.com/niccoloperrone/FreqNO-DPS)将无条件的基于分数的扩散先验(在高保真模拟上训练)与扩散后验采样(DPS)相结合,以稀疏观测为条件并由冻结的神经算子引导。朴素集成会重新引入代理的谱偏差;我们通过一个闭式、谱形状的引导分数来解决这一问题,该分数根据代理的频率相关精度加权,且无需去噪器反向传播。一个无分布分析在频率-扩散-时间平面上界定了近似误差,并表明引导的频率依赖性无论分布假设如何都得以保持。在3D弹性波场预测中,传感器覆盖率为5%和2%时,该方法在所有频带上达到近零谱偏差,而代理和仅传感器DPS均显示出系统性的高频衰减。各向同性引导(自然基线)提高了点精度,但几乎完整地将偏差带入后验,证实了频率依赖性校准是必要的,而不仅仅是有益的。该框架仅需配对的代理/参考数据,且除了残差的近似谱对角性外,不利用任何问题特定结构,可通过我们提供的相干性诊断对新代理进行验证。

英文摘要

Neural operator surrogates (NO) approximate PDE solutions orders of magnitude faster than numerical solvers, but suffer from spectral bias: high-frequency content is systematically attenuated, limiting reliability where fine-scale structure matters. Sparse sensor measurements of the field are often available too, offering pointwise accuracy without spectral distortion but covering only a small fraction of the domain. We address this by treating NO predictions as auxiliary observations in a diffusion posterior sampling framework. Our method, FreqNO-DPS (https://github.com/niccoloperrone/FreqNO-DPS), combines an unconditional score-based diffusion prior, trained on high-fidelity simulations, with diffusion posterior sampling (DPS) conditioned on sparse observations and guided by a frozen neural operator. Naive integration reintroduces the surrogate's spectral bias; we resolve this with a closed-form, spectrally shaped guidance score that weights the surrogate by its frequency-dependent accuracy and needs no denoiser backpropagation. A distribution-free analysis bounds the approximation error across the frequency-diffusion-time plane and shows the guidance's frequency dependence is preserved regardless of distributional assumptions. On 3D elastic wavefield prediction at 5% and 2% sensor coverage, the method reaches near-zero spectral bias across all bands, where both the surrogate and sensor-only DPS show systematic high-frequency attenuation. Isotropic guidance, the natural baseline, improves pointwise accuracy but carries the bias into the posterior nearly intact, confirming that frequency-dependent calibration is essential, not merely beneficial. The framework needs only paired surrogate/reference data and exploits no problem-specific structure beyond the residual's approximate spectral diagonality, verifiable for new surrogates via the coherence diagnostic we provide.

2606.03931 2026-06-03 cs.RO cs.SY eess.SY

Multi-Robot Bearing-only Pose Estimation via Angle Rigidity

基于角度刚性的多机器人仅方位姿态估计

J. Francisco Presenza, Leonardo J. Colombo, Ignacio Mas, Juan I. Giribet

发表机构 * Institute of Engineering Technology and Sciences "Hilario Fernández Long" (CONICET-UBA)(希拉里·费尔南德斯·隆工程技术与科学研究所(CONICET-UBA)) Centre for Automation and Robotics (CSIC-UPM)(自动化与机器人中心(CSIC-UPM)) Artificial Intelligence and Robotics Laboratory, Universidad de San Andrés and CONICET(人工智能与机器人实验室,圣安德烈斯大学及CONICET)

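AI Summary: Proposes a distributed bearing-only pose estimator that computes positions from angles between body-frame bearings and then recovers orientations; it requires only angle rigidity of the sensing topology and achieves local uniform exponential stability.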

AI中文摘要

本文提出了一种新颖的分布式基于方位的姿态估计器,用于时变多机器人系统。该方法利用从体坐标系方位计算出的角度来估计机器人在 $\mathbb{R}^3$ 中的位置,而无需知道其方向。方向在 $\mathrm{SO}(3)$ 中从估计的位置、方位和方位导数中恢复。所提出的观测器仅要求(有向)感知拓扑是角度刚性的,这是一个比常用条件(如方位刚性)更弱的条件。在假设部分机器人持续激励运动的情况下,建立了所提出观测器的局部一致指数稳定性。通过仿真评估了该方案的有效性和实用性。

英文摘要

This letter proposes a novel distributed bearing-based pose estimator for time-varying multi-robot systems. The method uses angles computed from body-frame bearings to estimate the robots' positions in $\mathbb{R}^3$ without knowledge of their orientations. The orientations in $\mathrm{SO}(3)$ are recovered from the estimated positions, the bearings, and the bearing derivatives. The proposed observer only requires the (directed) sensing topology to be angle-rigid, a weaker condition than commonly used ones such as bearing rigidity. Local uniform exponential stability of the proposed observer is established under the assumption of persistently exciting motions for a subset of robots. Simulations are presented and discussed to evaluate the scheme's effectiveness and practicality.
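The reason angles between bearings can be used without orientation knowledge is that rotations preserve inner products, so the angle between two bearings is the same in the world frame and in any body frame. The sketch below checks this numerically for a yaw-only rotation; the positions, the yaw value, and the helper names are made up for illustration and are not taken from the letter.

```python
import math


def unit(v):
    """Normalize a 3D vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def angle_between(b1, b2):
    """Angle (radians) between two unit bearing vectors."""
    dot = sum(x * y for x, y in zip(b1, b2))
    return math.acos(max(-1.0, min(1.0, dot)))


def rotate_z(v, yaw):
    """Rotate a vector about the z-axis by yaw radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]


# Robot i at p_i observes neighbours j and k (positions are made-up values).
p_i, p_j, p_k = [0.0, 0.0, 0.0], [2.0, 1.0, 0.5], [-1.0, 3.0, 1.0]
b_ij = unit([a - b for a, b in zip(p_j, p_i)])
b_ik = unit([a - b for a, b in zip(p_k, p_i)])

# The same bearings expressed in a body frame with an unknown yaw offset.
yaw = 1.234
angle_world = angle_between(b_ij, b_ik)
angle_body = angle_between(rotate_z(b_ij, -yaw), rotate_z(b_ik, -yaw))
```

The two angles agree up to floating-point error. The same holds for any rotation in $\mathrm{SO}(3)$, not only yaw, which is why the position estimate can be decoupled from the unknown orientation.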

2606.03928 2026-06-03 cs.LG cs.CL

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

面向推理模型的价值感知随机KV缓存淘汰

Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia

发表机构 * University of Southern California(南加州大学) University of Chicago(芝加哥大学)

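AI Summary: To address the KV cache bottleneck caused by the long outputs of reasoning models, proposes VaSE, a value-aware stochastic eviction method that protects large-magnitude value states and introduces stochasticity; under 4x compression it improves accuracy by more than 4% over the strongest eviction method.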

Comments Codes: https://github.com/terarachang/VaSE

AI中文摘要

推理模型通过扩展思维链提高了准确性,但其长输出造成了内存和计算瓶颈。KV缓存淘汰方法通过从缓存中淘汰不重要的键值对来降低这一成本,但它们的准确性往往不如基于选择的稀疏注意力替代方案,后者保留了完整的KV缓存。我们识别出对KV缓存淘汰准确性至关重要的关键因素。首先,一小部分值状态具有异常大的幅度,淘汰它们会导致灾难性失败,模型进入重复推理循环。其次,在淘汰过程中引入随机性通过增加缓存多样性提高了准确性。基于这些发现,我们提出了价值感知随机KV缓存淘汰(VaSE),这是一种无需训练的方法,保护大幅度值状态并促进多样化的淘汰决策。在六个推理任务上,使用VaSE进行4倍KV缓存压缩的Qwen3模型在相同稀疏度下比最先进的选择方法获得了更高的平均准确率,同时比最强的淘汰方法高出超过4%。总体而言,VaSE弥合了效率与准确性之间的差距,支持FlashAttention2,并为推理模型实现了静态内存占用。

英文摘要

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than the SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
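The two ingredients named in the abstract, protecting large-magnitude value states and making eviction stochastic, can be sketched as follows. This is a toy procedure under stated assumptions, not the VaSE algorithm: the abstract does not say how scores are computed or how randomness is injected, so score-proportional sampling, the parameter names, and the function `stochastic_evict` are all hypothetical.

```python
import random


def stochastic_evict(scores, value_norms, budget, n_protect, seed=0):
    """Pick which cache positions to keep (a toy, assumed procedure).

    scores: per-position importance scores (e.g. accumulated attention), all > 0
    value_norms: per-position magnitude of the value state
    budget: total number of positions to keep
    n_protect: positions with the largest value norms that are never evicted
    """
    rng = random.Random(seed)
    n = len(scores)
    n_protect = min(n_protect, budget)
    # 1) Always keep the positions whose value states have the largest norms.
    by_norm = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    keep = set(by_norm[:n_protect])
    # 2) Fill the remaining budget by sampling in proportion to the scores,
    #    instead of deterministically taking the top-scoring positions.
    candidates = [i for i in range(n) if i not in keep]
    while len(keep) < min(budget, n) and candidates:
        pick = rng.choices(candidates, weights=[scores[i] for i in candidates])[0]
        keep.add(pick)
        candidates.remove(pick)
    return sorted(keep)


scores = [0.9, 0.1, 0.4, 0.2, 0.8, 0.3, 0.05, 0.6]
value_norms = [1.0, 9.5, 1.2, 0.8, 1.1, 0.9, 8.7, 1.0]
kept = stochastic_evict(scores, value_norms, budget=4, n_protect=2)
```

Positions 1 and 6 have low importance scores but very large value norms, so they are always retained; the other two slots vary with the seed, which is one simple way to diversify what the cache holds.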

2606.03927 2026-06-03 cs.LG cs.AI

FFR: Forward-Forward Learning for Regression

FFR:前向-前向学习用于回归

Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li, Zhiqiang Que, Jiayang Li, Guosheng Hu

发表机构 * University of Bristol(布里斯托大学) University College London(伦敦大学学院) University of Cambridge(剑桥大学)

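AI Summary: Proposes the FFR framework, which extends the Forward-Forward algorithm to regression through an ordinal competitive goodness function, a stratified ladder architecture, and hierarchical prediction; across multiple datasets it recovers 98.6% of BP's accuracy while substantially reducing memory and time costs.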

详情
AI中文摘要

前向-前向(FF)算法通过纯局部、逐层优化训练神经网络,提供了反向传播(BP)的计算高效且生物合理的替代方案。然而,FF本质上是为通过对比正负样本对进行分类而设计的,将其扩展到回归面临根本性挑战:连续目标空间缺乏用于对比学习的自然“对立面”,且标准 goodness 函数不携带关于目标幅度或顺序的信息。我们提出FFR(前向-前向回归),据我们所知,这是第一个将FF扩展到现实世界回归并展示在多样化真实数据集上具有竞争力的性能的框架。FFR引入了三项关键创新:(1)序数竞争 goodness 函数,通过距离感知序数监督下分区神经元组之间的竞争学习取代对比对;(2)分层阶梯架构,其中浅层学习粗序数判别,深层细化到细粒度回归,并通过多尺度特征聚合实现层间协作;(3)带不确定性估计的层次化预测,其中多尺度预测器联合提供鲁棒预测和预测置信度作为免费午餐。大量实验结果表明,FFR在五个真实世界回归基准上平均恢复了BP 98.6%的精度,同时将峰值训练内存降低到深度8时BP的27%和深度32时BP的8%,每次迭代时间约为BP的72%,并且显著优于所有无BP的竞争对手。

英文摘要

The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by training neural networks through purely local, layer-wise optimization. However, FF is inherently designed for classification via contrastive positive-negative sample pairs, and extending it to regression poses fundamental challenges: continuous target spaces lack natural "opposites" for contrastive learning, and the standard goodness function carries no information about target magnitude or ordering. We propose FFR (Forward-Forward for Regression), to our knowledge the first framework to extend FF to real-world regression and demonstrate competitive performance across diverse real-world datasets. FFR introduces three key innovations: (1) an ordinal competitive goodness function that replaces contrastive pairs with competitive learning between partitioned neuron groups under distance-aware ordinal supervision; (2) a stratified ladder architecture where shallow layers learn coarse ordinal discrimination and deeper layers refine into fine-grained regression, with multi-scale feature aggregation for inter-layer collaboration; and (3) hierarchical prediction with uncertainty estimation, where multi-scale predictors jointly provide robust predictions and prediction confidence as a free lunch. Extensive experimental results show FFR recovers on average 98.6% of BP's accuracy across five real-world regression benchmarks while reducing peak training memory to only 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's, and substantially outperforms all BP-free competitors.

2606.03925 2026-06-03 cs.CV

Adaptive Causal Alignment for High-Confidence Adversarial Training

自适应因果对齐用于高置信度对抗训练

Zhiming Luo, Kejia Zhang, Yingxin Lai, Junwei Wu, Juanjuan Weng, Shaozi Li

发表机构 * Department of Artificial Intelligence, Xiamen University(厦门大学人工智能系) Department of Computer Science, Emory University(埃默里大学计算机科学系) College of Information Science and Technology, Jinan University(暨南大学信息科学技术学院)

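AI Summary: To address over-reliance on non-causal background correlations in high-confidence adversarial training, proposes the HICAT framework, which achieves causal alignment through a learnable background-bias estimator and an adaptive debiasing mechanism, improving robust generalization.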

AI中文摘要

逆对抗训练利用高置信度预测来稳定鲁棒学习,然而我们发现了一个关键悖论:高置信度往往源于对非因果背景相关性的过拟合,而非内在对象语义。我们的研究表明,视觉上下文作为双重信号,既可以是必要的支持先验,也可以是混杂的虚假相关。这一洞察使得现有的盲目抑制策略存在缺陷,因为它们不可避免地导致严重的特征损失。为解决此问题,我们提出高置信度因果对齐训练(HICAT),一个建立语义均衡的统一框架。HICAT遵循“测量-去偏-对齐”流程,集成了可学习背景偏差估计器(LBBE)以自适应诊断上下文效用。在该诊断指导下,自适应去偏机制执行精细的logit校正,并辅以有几何依据的前景logit正交增强(FLOE)损失以强制执行特征解耦。在CIFAR-10、CIFAR-100和ImageNet-1K上的大量实验表明,HICAT在不同架构(CNN和ViT)上均持续优于匹配基线,同时显著缩小了鲁棒泛化差距。

英文摘要

Inverse adversarial training leverages high-confidence predictions to stabilize robust learning, yet we uncover a critical paradox: high confidence often stems from overfitting to non-causal background correlations rather than intrinsic object semantics. Our investigation reveals that visual context functions as a dual-natured signal, serving as either a necessary supportive prior or a spurious confounder. This insight renders existing blind suppression strategies flawed, as they inevitably lead to severe Feature Loss. To resolve this, we propose High-Confidence Causally Aligned Training (HICAT), a unified framework that establishes a Semantic Equilibrium. Operating on a "Measure-Debias-Align" pipeline, HICAT integrates a Learnable Background-Bias Estimator (LBBE) to adaptively diagnose context utility. Guided by this diagnosis, an Adaptive Debiasing mechanism performs surgical logit rectification, complemented by a geometrically grounded Foreground Logit Orthogonal Enhancement (FLOE) loss to enforce rigorous feature disentanglement. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that HICAT consistently improves over matched baselines across diverse architectures (CNNs and ViTs) while significantly reducing the robust generalization gap.

2606.03924 2026-06-03 cs.CL

Knowledge Editing in Masked Diffusion Language Models

掩码扩散语言模型中的知识编辑

Haewon Park, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院)

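AI Summary: Transfers locate-then-edit methods from autoregressive models to masked diffusion models, finds that the edit location transfers but multi-token edit performance degrades, and proposes a simple correction that optimizes the edit for intermediate states.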

AI中文摘要

知识编辑旨在更新或纠正语言模型中的事实知识。一种广泛使用的方法是定位-编辑,它分两步进行:首先在模型中定位事实,然后在那里编辑权重。迄今为止,此类方法仅在自回归模型(ARMs)上开发。它们的基本假设是否适用于掩码扩散模型(MDMs)——后者双向建模文本并通过迭代去噪而非下一个词预测生成——仍是一个开放问题。我们通过将定位-编辑迁移到MDMs,并在匹配规模下比较两个MDMs(LLaDA, Dream)与两个ARMs(LLaMA, Qwen)来解决这个问题。我们的核心发现分为两部分。首先,编辑应用的位置跨范式迁移:因果追踪在两种模型中均突出显示最后一个主体词处的相同早期到中间层MLP,并且编辑在那里最有效。其次,这个共享位置并不能保证共享结果。单词编辑在两种模型中均成功,但随着目标变长,编辑在MDMs中系统性退化,而在ARMs中则不然。失败源于编辑事实的生成方式:生成多词目标需要经过部分未掩码的中间状态,而编辑从未针对这些状态进行优化。在此诊断指导下,我们引入了一个简单的修正方法,针对这些状态优化编辑,从而显著恢复了多词性能。

英文摘要

Knowledge editing aims to update or correct factual knowledge in a language model. A widely used approach, locate-then-edit, does this in two steps: it first localizes a fact within the model, then edits the weights there. To date, such methods have been developed exclusively on autoregressive models (ARMs). Whether their underlying assumptions hold for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction, remains an open question. We address it by transferring locate-then-edit to MDMs and comparing two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. Our central finding has two parts. First, where an edit is applied transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. Second, this shared location does not guarantee a shared outcome. Single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs but not the ARMs. The failure stems from how the edited fact is generated: producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, we introduce a simple correction that optimizes the edit for these states, substantially restoring multi-token performance.

2606.03923 2026-06-03 cs.LG

Contrastive Neural Algorithmic Reasoning for Graph Coloring

对比神经算法推理用于图着色

Thien Le, Tianyu Zhao, Melanie Weber

发表机构 * Harvard University SEAS(哈佛大学SEAS) Harvard University T.H. Chan School of Public Health(哈佛大学T.H. Chan公共卫生学院)

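AI Summary: Proposes a contrastive learning framework that learns transferable coloring geometry; a graph neural network encoder produces low-conflict colorings and generalizes to graphs of different sizes.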

Comments 52 pages, 5 figures, 45 tables

AI中文摘要

图着色旨在用尽可能少的颜色为图的节点分配颜色,使得相邻节点颜色不同。这里,我们研究近似$k$-着色,目标是用最多$k$种颜色同时最小化单色边的数量。该问题是图论的核心问题,并在调度和资源分配等领域有应用。最近的无监督GNN方法直接优化每个实例,阻碍了跨图大小和分布的泛化。我们转而提出一个对比学习框架,学习可迁移的着色几何结构,其中同色节点的嵌入对齐,而相邻节点的表示被推向不同方向。我们分析了有界大小图上的总体目标。对于单位范数嵌入,我们证明其最优解具有线原型结构:同色节点的表示坍缩到共享的一维子空间,边连接正交子空间。该几何结构在有监督设置中产生平稳条件,并在平衡着色假设下通过投影次梯度动力学保持。在非归一化变体中,梯度下降具有由商图硬间隔问题控制的最大间隔偏差。在合成和真实世界图上的实验表明,对比GNN编码器有效泛化并产生低冲突着色,与贪心方法匹配甚至有时改进。

英文摘要

Graph coloring seeks to assign colors to a graph's nodes so that adjacent nodes receive different colors, using as few colors as possible. Here, we study approximate $k$-coloring, where the goal is to use at most $k$ colors while minimizing the number of monochromatic edges. This problem is central to graph theory and has applications in areas such as scheduling and resource allocation. Recent unsupervised GNN approaches optimize each instance directly, precluding generalization across graph sizes and distributions. We instead propose a contrastive learning framework that learns transferable coloring geometry where the embeddings of same-color nodes align, while adjacent nodes' representations are pushed toward distinct directions. We analyze the resulting population objective over bounded-size graphs. For unit-norm embeddings, we show that its optima have a line-prototype structure: representations of nodes of the same color collapse to a shared one-dimensional subspace, and edges connect orthogonal subspaces. This geometry yields stationarity conditions in the supervised setting and is preserved by projected subgradient dynamics under a balanced-coloring assumption. In an unnormalized variant, gradient descent has a max-margin bias governed by a quotient-graph hard-margin problem. Experiments on synthetic and real-world graphs show that contrastive GNN encoders generalize effectively and produce low-conflict colorings, matching and sometimes improving on greedy approaches.
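The approximate $k$-coloring objective, the number of monochromatic edges, is easy to state in code, together with a simple greedy baseline. The abstract does not specify which greedy method is used for comparison, so the node order and tie-breaking in `greedy_k_coloring` are assumptions, and both function names are hypothetical.

```python
def monochromatic_edges(edges, coloring):
    """Count edges whose two endpoints share a color."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])


def greedy_k_coloring(n, edges, k):
    """Greedy baseline: give each node the color least used by its neighbours."""
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    coloring = {}
    for u in range(n):
        used = [0] * k
        for w in adj[u]:
            if w in coloring:
                used[coloring[w]] += 1
        # Ties are broken toward the lowest color index.
        coloring[u] = min(range(k), key=lambda c: used[c])
    return coloring


# A 5-cycle is not 2-colorable, so any 2-coloring has at least one conflict.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
two = greedy_k_coloring(5, cycle, k=2)
three = greedy_k_coloring(5, cycle, k=3)
```

On the 5-cycle the greedy pass leaves one monochromatic edge with two colors and none with three, which matches the lower bound for an odd cycle.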

2606.03921 2026-06-03 cs.CV

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

GARDEN: 从RGB图像中重力对齐的解耦环境重建

Jiahao Sun, Dingkun Wei, Zehong Shen, Hongyu Zhou, Yujun Shen, Liang Li

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

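AI Summary: Proposes the GARDEN framework, which uses a gravity prior to reconstruct multi-view RGB images into a structured hybrid scene representation with explicit rigid bodies and a decoupled background, supporting direct physics simulation.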

AI中文摘要

将多视图RGB观测转换为可用于模拟的3D环境仍然具有挑战性,因为当前的重建流程会产生没有显式物理结构的整体场景表示。它们通常定义到任意全局旋转,并将刚性前景物体与背景几何纠缠在一起,这阻碍了稳定的物理交互。现有的解决方案通常通过用检索到的CAD资产替换重建的物体来恢复交互性,但这引入了缓慢的检索和替换阶段,并削弱了场景特定的几何保真度。我们提出GARDEN,一个仅使用RGB的框架,将重建重新表述为基于物理的场景分解,并输出结构化的混合场景表示。关键思想是使用重力作为通用物理先验:我们首先将重建对齐到统一的重力视角坐标系以解决规范模糊性,然后恢复具有准确6自由度放置的物体中心刚性网格,最后通过条件3D点分类从背景中移除重复的物体几何。得到的表示结合了显式刚体和解耦背景,能够在保持视觉真实感的同时实现直接物理模拟。在模拟和真实多视图场景上的实验表明,与基于检索的基线相比,GARDEN提高了物体放置可靠性、解耦质量和渲染模拟效率。

英文摘要

Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.