arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12604 2026-06-12 cs.RO New submission

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Yangcen Liu, Shuo Cheng, Xinchen Yin, Woo Chul Shin, Alfred Cueva, Yiran Yang, Zhenyang Chen, Chuye Zhang, Danfei Xu

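Affiliations: Georgia Institute of Technology; Tsinghua University

AI summary: Proposes EgoEngine, a framework that bridges the visual and action gaps to convert egocentric human videos into high-fidelity robot data, and reports what the authors describe as the first zero-shot dexterous policy learning from such videos.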

Abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable actions. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video that replaces the human with a robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.

2606.12603 2026-06-12 cs.RO cs.AI New submission

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

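Affiliations: University of California, Los Angeles

AI summary: Proposes FlowPilot, a mapless navigation policy that uses only a monocular RGB camera, is pre-trained with anchored flow matching, and is aligned through human-preference learning to improve robustness and social compliance in long-horizon sidewalk navigation.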

Abstract

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

2606.12601 2026-06-12 cs.CV New submission

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

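Affiliations: University of Arkansas

AI summary: Proposes Dual-State Slot Attention (DSSA), which splits each slot into a local state (appearance) and an identity state (stable identity) and uses competition-modulated aggregation to reduce interference from weakly matching slots, improving video object segmentation quality and temporal consistency.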

Abstract

Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.
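
To make the renormalization issue concrete, here is a minimal sketch in plain Python of the two normalizations used in standard Slot Attention: a softmax over slots for each token (already applied in the toy numbers below), followed by a per-slot renormalization over tokens. The attention values are made up for illustration and the function name `renormalize_over_tokens` is hypothetical. The sketch only illustrates the failure mode the abstract describes; it is not DSSA or CMA.

```python
from typing import List


def renormalize_over_tokens(attn: List[List[float]]) -> List[List[float]]:
    """Turn attn[slot][token] into per-slot weights that sum to 1 over tokens.

    This mirrors the weighted-mean step of standard Slot Attention.
    """
    eps = 1e-8  # guards against division by zero for an all-zero slot
    return [[a / (sum(row) + eps) for a in row] for row in attn]


# Toy attention after the softmax over slots: each column (token) sums to 1.
# Slot 0 clearly owns all four tokens; slot 1 attends only weakly.
attn = [
    [0.99, 0.99, 0.99, 0.90],  # slot 0
    [0.01, 0.01, 0.01, 0.10],  # slot 1 (weakly attending)
]

slot_mass = [sum(row) for row in attn]   # total attention each slot received
weights = renormalize_over_tokens(attn)  # per-slot update weights

for i, (mass, w) in enumerate(zip(slot_mass, weights)):
    print(f"slot {i}: mass={mass:.2f}, weights={[round(x, 3) for x in w]}")
```

Slot 1 receives only a small share of the total attention, yet after renormalization its weights still sum to 1 and most of that weight falls on the last token, which slot 0 owns. This is the amplification effect that, per the abstract, CMA is designed to down-weight.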

2606.12595 2026-06-12 cs.LG cs.AI cs.CV New submission

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

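Affiliations: Oak Ridge National Laboratory

AI summary: Systematically compares geospatial foundation model architectures under a unified setup, evaluating their flexibility and performance to provide design guidance for multimodal reasoning.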

Abstract

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

2606.12594 2026-06-12 cs.AI New submission

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

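Affiliations: Imperial College London; University of Edinburgh; Nanyang Technological University; MBZUAI

AI summary: Introduces the Pythagoras-Prover family, covering autoregressive and diffusion models, which uses curriculum SFT, dynamic filtering, and Augmented Lean Formalisation (ALF) to extend verified data, and surpasses DeepSeek-Prover-V2 on MiniF2F-Test with fewer parameters.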

Comments Pythagoras-Prover: Technical Report

Abstract

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.
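
For readers unfamiliar with the pass@32 metric, the sketch below shows the widely used unbiased pass@k estimator and checks the parameter ratio quoted in the abstract. It is an illustration under assumptions: the abstract does not say how pass@32 was computed, the function names are hypothetical, and the per-problem counts in `toy_results` are invented.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n sampled proofs, c of them verified."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one verified proof
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n_samples, n_verified)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Invented toy data: 4 problems, 32 samples each, with 0, 1, 5, and 32 verified.
toy_results = [(32, 0), (32, 1), (32, 5), (32, 32)]
score = benchmark_pass_at_k(toy_results, k=32)

# Parameter ratio behind the abstract's "~167x fewer parameters" (671B vs 4B).
param_ratio = 671 / 4

print(f"toy pass@32 = {score:.2f}, parameter ratio = {param_ratio:.2f}")
```

When exactly 32 proofs are sampled per problem, pass@32 reduces to the fraction of problems with at least one Lean-verified proof, which is why three of the four toy problems count as solved.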

2606.12590 2026-06-12 cs.CV cs.AI New submission

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

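Affiliations: York University; University of British Columbia; Vector Institute; Queen’s University

AI summary: To address shortcomings of medical large vision-language models in factual consistency, visual grounding, and clinical alignment, proposes a fine-grained on-policy preference optimization framework that combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective. Preference pairs are built by minimally editing model outputs so that only clinically erroneous spans are corrected, which the summary reports improves diagnostic accuracy.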

Abstract

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

2606.12587 2026-06-12 cs.AI cs.HC New submission

Strategic Decision Support for AI Agents

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani

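Affiliations: University of Pennsylvania

AI summary: For settings where AI agents are the primary decision makers, proposes a strategic decision-support framework that minimizes support usage through an optimization problem while controlling a counterfactual missed-support error, and develops an online algorithm that adaptively thresholds a support score.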

Abstract

Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost–value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human–AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.
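
The threshold rule described above can be illustrated with a small offline sketch. It assumes a fully labeled toy dataset in which each instance has a support score and a flag saying whether support would have helped, and it picks the largest threshold whose empirical missed-support rate stays within a budget `alpha`. All names and numbers here are hypothetical, and this is not the paper's online algorithm, which per the abstract relies on randomized exploration because such counterfactual labels are not observed for free.

```python
import math
from typing import List, Sequence


def missed_support_rate(scores: Sequence[float], helped: Sequence[bool],
                        threshold: float) -> float:
    """Fraction of instances where the agent acts alone (score < threshold)
    although support would have helped."""
    missed = sum(1 for s, h in zip(scores, helped) if s < threshold and h)
    return missed / len(scores)


def support_usage(scores: Sequence[float], threshold: float) -> float:
    """Fraction of instances on which the agent requests support."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


def pick_threshold(scores: Sequence[float], helped: Sequence[bool],
                   alpha: float) -> float:
    """Largest candidate threshold with missed-support rate <= alpha.

    The rate is non-decreasing in the threshold, so the last candidate that
    satisfies the constraint also minimizes support usage.
    """
    candidates: List[float] = sorted(set(scores)) + [math.inf]
    best = candidates[0]  # at the minimum score the agent always asks
    for t in candidates:
        if missed_support_rate(scores, helped, t) <= alpha:
            best = t
    return best


# Invented toy data: higher scores mean support is more likely to help.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
helped = [False, False, True, False, False, True, True, True, True, True]

threshold = pick_threshold(scores, helped, alpha=0.15)
print(threshold, support_usage(scores, threshold),
      missed_support_rate(scores, helped, threshold))
```

On this toy data the selected threshold is 0.6: the agent asks for support on half of the instances and misses one helpful case out of ten, which stays within the 0.15 budget.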

2606.12579 2026-06-12 cs.RO New submission

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

Tanmay Bishnoi, Riddhiman Laha, Tobias Löw, Jose Alex Chandy, Luis F. C. Figueredo, Sami Haddadin

Affiliations: Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University; Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich (TUM); Institute for Experiential Robotics, Northeastern University; Idiap Research Institute; EPFL; CHART Group at the School of Computer Science, University of Nottingham; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

AI summary: Proposes a GPU-accelerated framework that uses parallel state exploration and a tightly coupled perception-action loop for real-time reactive motion generation in unstructured environments, achieving up to a 5x speedup over the CPU version and successful obstacle avoidance on a 7-DoF robot.

Comments The implementation is available at: https://github.com/chart-research/g-mapp

Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7516-7523, June 2026

Abstract

Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision-free motion generation, existing methods either generate global trajectories for static scenarios, or employ models that make conservative assumptions about the environment. This paper identifies the primary bottleneck as the runtime performance demand of planning on high-fidelity environments, and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance and world representations for perception and planning by accelerating world modeling and vector-field based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi-global trajectory planning, and tighter coupling of the perception-action loop in real-time for dynamic cluttered environments with off-the-shelf depth sensors. We quantitatively evaluate the computation-time and success rate differences for the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real-world experiments on a 7-DoF Franka Emika robot. Experimental results demonstrate that our GPU-based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.

2606.12578 2026-06-12 cs.CL New submission

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi

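Affiliations: University of Guelph; York University; Vector Institute

AI summary: Proposes the MARD-7B model, which combines mirror-augmented reasoning distillation, a single-token KL divergence, PRM-weighted DPO, and a mechanism-aware retrieval channel. On mechanism-level DDI prediction it exceeds GPT-4o by 6.7 percentage points in accuracy at about 1% of frontier API cost.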

Comments 29 pages, 9 figures. Preprint

Abstract

Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence, not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces a 7B reasoning model, MARD (Mirror-Augmented Reasoning Distillation), combining three training innovations: a single-token KL divergence on the direction tag that ties the model's prediction, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against DrugBank-structured fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature where accuracy improves on rarely seen drugs, suggesting that the gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release the corpus, DDI-PRM, retrieval index, and training code.

2606.12575 2026-06-12 cs.CV New submission

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

Dongyang Liu, Ruoyi Du, David Liu, Dengyang Jiang, Liangchen Li, Qilong Wu, Zhen Li, Steven C. H. Hoi, Hongsheng Li, Peng Gao

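Affiliations: Z-Image Team, Alibaba Group; The Chinese University of Hong Kong

AI summary: Proposes Z-Image Turbo++, which distills an 8-step teacher into a 2-step generation model using distribution-aligned adversarial learning, step-decoupled parameterization, and end-to-end training with iterative regularization, substantially narrowing the quality gap.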

Abstract

Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.

2606.12569 2026-06-12 cs.CL cs.AI New submission

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

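Affiliations: Fondazione Bruno Kessler; Istituto di Ricerche Farmacologiche Mario Negri IRCCS; University of Padua

AI summary: Introduces EDEN, a large-scale corpus of Italian emergency department clinical notes with about 4 million anonymized notes and about 6,000 expert-annotated notes, intended to support the use of large language models in healthcare, and proposes CRF-filling as a new structured information extraction benchmark.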

Abstract

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments, dyspnea and loss of consciousness. Items may take numerical values (e.g., blood saturation), categorical values (e.g., level of consciousness), binary values (e.g., presence of traumas), and mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines from Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes for the Italian language.

2606.12563 2026-06-12 cs.AI New submission

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum

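Affiliations: AMD

AI summary: Proposes Arbor, a multi-agent framework that uses structured tree search as a cognition layer for autonomous optimization in large, stateful action spaces, achieving up to a 193% throughput-latency Pareto improvement in LLM inference optimization.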

Abstract

Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signals that reshape subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation, forming a checks-and-balances architecture in which neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platforms, and run-to-run variance is within 2 percentage points, demonstrating that the method is hardware-agnostic and reproducible.

2606.12562 2026-06-12 cs.CV cs.GR New submission

HairPort: In-context 3D-aware Hair Import and Transfer for Images

Alireza Heidari, Amirhossein Alimohammadi, Wallace Michel Pinto Lira, Adi Bar-Lev, Ali Mahdavi-Amiri

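Affiliations: Simon Fraser University; Huawei Canada

AI summary: Proposes HairPort, a framework that explicitly separates hair removal from transfer and uses a 3D-aware pipeline to handle large pose differences. It combines a LoRA-adapted Bald Converter with a conditional flow-matching generator for high-quality, identity-preserving hairstyle transfer.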

Comments Accepted to SIGGRAPH 2026 (Conference Papers Track). 23 pages, 15 figures, 10 tables, including supplementary material as appendices. Project page: https://deepmancer.github.io/HairPort/

Abstract

Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

2606.12555 2026-06-12 cs.SD cs.CV cs.MM 新提交

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

AudioX-Turbo:高效任意到音频生成的统一框架

Zeyue Tian, Lei Ke, Zhaoyang Liu, Ruibin Yuan, Liumeng Xue, Yujiu Yang, Weijia Chen, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) Noiz AI Independent Researcher(独立研究员)

AI总结 提出AudioX-Turbo,基于教师-学生范式的统一高效框架,通过多模态扩散Transformer和分布匹配蒸馏实现文本、视频、音频到音频的生成,仅需4步采样,NFE减少约25倍。

详情
AI中文摘要

基于灵活的多模态控制信号生成音频和音乐是一个广泛适用的课题,面临以下关键挑战:1) 统一的多模态建模框架,2) 大规模、高质量的训练数据,3) 多步扩散采样的高昂推理成本。为此,我们提出AudioX-Turbo,一个统一且高效的任意到音频生成框架,集成了多种多模态条件(即文本、视频和音频信号)。AudioX-Turbo遵循教师-学生范式。教师模型AudioX-Base基于多模态扩散Transformer,并带有多模态自适应融合模块,用于对齐多样化的多模态输入以实现高保真合成,然后通过适用于流匹配的分布匹配蒸馏将其蒸馏为少步学生模型AudioX-Turbo,并辅以基于扩散的判别器以实现高质量的少步生成。为支持AudioX-Turbo的训练,我们构建了一个大规模、高质量的数据集IF-caps-Pro,包含约920万个样本,通过两阶段数据收集和标注流程整理而成。我们在广泛的任务上对AudioX-Turbo进行基准测试,发现我们的模型实现了优越的性能,尤其是在文本到音频和文本到音乐生成方面,同时仅需4个采样步骤,所需的函数评估次数(NFE)比多步基线减少约25倍。这些结果表明,我们的方法能够在灵活的多模态控制下进行音频生成,展现出高效且强大的指令跟随能力。代码和数据集将在 https://zeyuet.github.io/AudioX-Turbo/ 上提供。

英文摘要

Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. As such, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals) in this work. AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.

2606.12552 2026-06-12 cs.LG 新提交

Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

跨越验证危机:交叉验证出人意料地有效降低基准测试方差

Célestin Eve, Gaël Varoquaux, Thomas Moreau

发表机构 * MIND Team, Université Paris-Saclay, Inria, CEA, Palaiseau, France(MIND团队,巴黎-萨克雷大学,法国国家信息与自动化研究所,法国原子能委员会,帕莱索,法国) SODA Team, Inria, Palaiseau, France(SODA团队,法国国家信息与自动化研究所,帕莱索,法国) Probabl

AI总结 本文提出交叉验证通过样本增益概念量化虚拟数据增强,显著提升算法性能评估的置信度与稳定性,并引入动态早停机制减少计算开销。

Comments 34 pages, 11 figures

详情
AI中文摘要

现代机器学习通过实证工作推进,对新方法进行基准测试以评估相对性能。然而,评估固有的统计变异性——由于许多算法的随机性而加剧——常常因有限的测试样本而使性能估计不可靠,导致验证危机,其中真正的进步难以辨别。在这项工作中,我们展示了交叉验证在评估和比较学习算法性能时显著提高了置信度。我们引入了样本增益的概念,它量化了通过使用多个交叉验证分割来减少基准测试方差所实现的虚拟数据增强。在合成和真实世界数据集(组织病理学扫描和NLP微调)上的实验表明,多个分割可以显著提高性能估计的可靠性和稳定性,且收益递减往往比预期来得更晚。我们还引入了一种动态早停交叉验证的程序,通过从最初几个折叠估计后续折叠是否会带来大的样本增益。我们的发现强调了在可用样本上推行交叉验证以实现稳健可靠基准测试的价值。

英文摘要

Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation, exacerbated by the stochastic nature of many algorithms, often makes performance estimation unreliable due to the limited test samples available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation markedly improves confidence when evaluating and comparing the performance of learning algorithms. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds whether subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.

2606.12550 2026-06-12 cs.RO cs.AI 新提交

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Foresight: 关于导航关键线索的迭代推理

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas

发表机构 * UT Austin(德克萨斯大学奥斯汀分校) FieldAI

AI总结 提出Foresight框架,利用微调VLM交替提出和批评图像空间运动计划,通过人类反馈学习奖励模型进行强化学习后训练,实现无地图导航中稀疏语言指令下的迭代运动优化,任务成功率提升37%。

Comments 22 pages, 10 figures, 3 tables

详情
AI中文摘要

从稀疏语言指令进行开放世界无地图导航需要解决未明确指定的目标,并推断哪些环境线索与到达目标相关。例如,到达一个视野外的目的地可能需要解释坡道、标志或绕行路线,这些揭示了去哪里或走哪条路线。先前的工作受限于对已知导航因素和封闭集因素类别的依赖,或者在运动规划之前识别线索而遗漏了依赖于计划的线索。我们认为预训练的视觉语言模型(VLM)可以发现新的指令相关线索,但需要适应以关注哪些线索重要以及它们应如何影响运动规划。我们在Foresight中实现了这些想法,这是一个测试时框架,其中微调的VLM交替提出图像空间运动计划并使用语言目标和视觉上下文对其进行批评。后续计划基于先前的批评,使得在执行前能够进行迭代运动优化。为了将计划批评和优化与开放集行为偏好对齐,我们从人类反馈中学习一个奖励模型,并使用它在计划-批评循环中通过强化学习对VLM进行后训练。在离线评估和6个真实世界环境中,相对于最先进的测试时推理和基础模型基线,Foresight将平均任务成功率提高了37%,并将每次任务的干预次数减少了52%,同时在Jetson AGX Orin上实时运行。我们将发布代码、数据和训练细节,以支持未来关于机器人运动优化的测试时推理工作。更多视频请见:https://amrl.cs.utexas.edu/foresight

英文摘要

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight

2606.12507 2026-06-12 cs.LG 新提交

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

基于评分标准的自蒸馏:无需评分标准验证器的后训练

MohammadHossein Rezaei, Anas Mahmoud, Zihao Wang, Utkarsh Tyagi, Advait Gosai, Razvan-Gabriel Dumitru, Aakash Sabharwal, Bing Liu, Yunzhong He

发表机构 * Scale AI

AI总结 提出RGSD方法,通过将评分标准作为条件蒸馏到学生模型,无需验证器即可实现密集逐令牌学习,在医学和科学领域达到与基于评判的GRPO相当的评分标准满足率。

详情
AI中文摘要

在开放领域(单一标准答案不可用)中,评分标准已成为RLVR的替代方案。现有的基于评分标准的训练方法依赖LLM验证器对每次生成根据评分标准进行评分。这引入了大量的训练时间开销,使优化暴露于验证器特定偏差,并将评分标准反馈简化为稀疏的轨迹末端信号。我们提出无验证器的训练方法——基于评分标准的自蒸馏(RGSD),其中基础策略以评分标准为条件,作为无条件学生的教师。RGSD将基于评分标准的教师分布逐令牌蒸馏到学生,用密集的逐令牌学习信号替代稀疏的轨迹级奖励,并完全从训练循环中移除LLM评判。在Qwen-2.5(3B、7B)和Qwen3-Thinking(4B、8B)模型上,针对医学和科学领域,RGSD在每次提示仅使用一次在线生成且无需训练时验证器调用的情况下,实现了与基于评判的GRPO相当的评分标准满足率。消融实验表明,原始评分标准比自生成参考响应提供更强的教师增强信号,而更强的GRPO评判在某些设置下可能优于RGSD,使RGSD成为验证器成本或可靠性成为瓶颈时的互补性无验证器替代方案。

英文摘要

Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
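
The token-by-token distillation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the choice of KL(teacher || student) as the per-token divergence, the toy logits, and the function names are all assumptions made for illustration. The only idea taken from the abstract is that a rubric-conditioned teacher distribution supplies a dense per-token signal for a student that does not see the rubric.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_distill_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over one rollout.

    teacher_logits: logits from the policy when the rubric is in the prompt.
    student_logits: logits from the same policy without the rubric.
    The KL direction is an assumption of this sketch.
    """
    per_token = [
        kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy rollout with two tokens and a three-word vocabulary (hypothetical values).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 1.0, 0.0], [0.1, 1.5, 0.3]]
loss = token_distill_loss(teacher, student)
```

Every token position contributes a term to the loss, which is what distinguishes this kind of signal from a single end-of-trajectory score.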

2606.12505 2026-06-12 cs.LG cs.AI 新提交

Boosting Direct Preference Optimization with Penalization

通过惩罚增强直接偏好优化

Pengwei Sun

发表机构 * Pengwei Sun(Sun Pengwei)

AI总结 提出DPOP,在DPO损失上增加对参考模型贪婪响应的门控惩罚,仅当当前策略对偏好响应概率低于拒绝响应时激活,在AlpacaEval 2.0上显著提升胜率。

Comments Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

详情
AI中文摘要

离线偏好优化已成为从人类反馈中进行强化学习的实用替代方案,但诸如直接偏好优化(DPO)及其变体等成对目标仅使用存储在静态数据集中的选择和拒绝响应。这留下了一个有用的信号未被利用:参考模型本身为同一提示生成的响应。我们提出了带惩罚的直接偏好优化(DPOP),这是DPO的一个简单扩展,它在基础偏好损失上增加了一个对参考贪婪响应的门控惩罚。DPOP仅在当前策略对偏好响应的似然仍低于对拒绝响应的似然时激活此惩罚。在AlpacaEval 2.0上,DPOP在Llama-3-8b-it和Gemma-2-9b-it上均提高了长度控制的胜率,相对于DPO、SimPO和AlphaDPO,在两个模型上分别实现了5.3%和4.4%的相对增益。消融实验进一步表明,在此设置下,SimNPO风格的长度归一化惩罚比NPO和token级非似然惩罚更强。

英文摘要

Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.

2606.12503 2026-06-12 cs.LG cs.SD 新提交

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Dolph2Vec: 海豚发声的自监督表示

Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre

发表机构 * École Normale Supérieure, Paris, France(巴黎高等师范学院) Not Diamond, San Francisco, USA(Not Diamond公司) Institut du Cerveau, Paris, France(巴黎脑研究所) Champalimaud Foundation, Lisbon, Portugal(尚帕利莫基金会)

AI总结 提出Dolph2Vec,首个基于五年纵向海豚录音数据训练的自监督模型,在签名哨声分类和检测任务上显著优于通用基线,并发现可解释的声学单元。

详情
AI中文摘要

自监督学习(SSL)通过无需昂贵人工标注即可对动物发声进行可扩展建模,为生物声学开辟了新机遇。然而,当前该领域的SSL模型优先考虑跨物种的广泛泛化,并未针对揭示个体通信系统的细粒度结构进行优化。在这项工作中,我们收集并发布了一个新颖的数据集,包含来自半自然海洋环境中五只已知海豚的超过五年的纵向录音,这是研究海豚通信的前所未有的资源。我们将Wav2Vec2.0 Baevski等人(2020)的架构适应于此领域,并引入Dolph2Vec,这是第一个仅在此数据上训练的大规模、物种特异性SSL模型。我们在两个生物学相关任务上对模型进行基准测试:签名哨声分类和哨声检测。Dolph2Vec在这两个任务上均显著优于通用基线。除了性能,我们还展示了学习到的嵌入和码本结构捕获了与海豚哨声类别以及可能的子哨声结构对齐的可解释声学单元,从而能够对通信模式进行细粒度分析。我们的发现证明了SSL如何作为模型和科学工具来探索动物通信研究中的假设。

英文摘要

Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 architecture (Baevski et al., 2020) to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

2606.12501 2026-06-12 cs.LG 新提交

Policy-driven Conformal Prediction for Trustworthy QoT Estimation

策略驱动的可信QoT估计的保形预测

Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino

发表机构 * Chalmers University of Technology(查尔姆斯理工大学) University of Applied Sciences and Arts of Southern Switzerland(瑞士南方应用科学与艺术大学)

AI总结 提出Conformal QoT框架,结合统计保证的QoT估计与操作决策策略,实现域偏移下可靠的光路可行性预测,在开放数据集上将准确率从92%提升至99.6%。

详情
Journal ref
Proc. Optical Fiber Communication Conference (OFC) 2026
AI中文摘要

我们提出Conformal QoT,一个策略驱动的框架,将具有统计保证的QoT估计与操作决策策略相结合,能够在域偏移下实现可靠的光路可行性预测,并在开放数据集上将准确率从92%提升至99.6%。

英文摘要

We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility predictions under domain shift and improving accuracy from 92% to 99.6% on open datasets.
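
The abstract does not describe the framework's internals, so the following is only a generic sketch of how split conformal prediction can be paired with a decision policy. The residual values, the predicted margin in dB, the 0 dB feasibility threshold, and the function names are all hypothetical and are not taken from the paper.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals.

    residuals: absolute errors |y - y_hat| on a held-out calibration set.
    Returns infinity when the calibration set is too small for the
    requested coverage level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(residuals)[k - 1]

def decide(predicted_margin_db, q, threshold_db=0.0):
    """Hypothetical conservative policy: accept a lightpath only if the
    lower end of the conformal interval still clears the threshold."""
    lower = predicted_margin_db - q
    return "accept" if lower >= threshold_db else "reject"

# Hypothetical calibration residuals, in dB.
calibration = [0.2, 0.5, 0.1, 0.4, 0.3, 0.6, 0.7]
q = conformal_quantile(calibration, alpha=0.25)

decision_safe = decide(1.0, q)      # interval lower end is 0.4 dB
decision_marginal = decide(0.5, q)  # interval lower end is below 0 dB
```

Under exchangeability between calibration and test data, the interval covers the true value with probability at least 1 - alpha. How the paper handles domain shift is not stated in the abstract and is not modeled here.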

2606.12499 2026-06-12 cs.RO 新提交

Action-Effect Memory Pretraining for Robot Manipulation

动作-效应记忆预训练用于机器人操作

Yijing Zhou, Qiwei Liang, Sitong Zhuang, Jiaxi Li, Xianpeng Wang, Boyang Cai, Yunyang Mo, Renjing Xu

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Shenzhen University(深圳大学)

AI总结 提出AEM框架,通过视觉-动作历史掩码建模学习紧凑时间表征,提升机器人操作在部分可观测环境下的性能,优于单帧预训练和帧堆叠方法。

详情
AI中文摘要

我们提出了AEM,一个用于机器人操作的动作-效应记忆预训练框架,从视觉-动作历史中学习紧凑的时间表征。与先前主要关注单帧视觉编码的机器人表征预训练方法不同,AEM针对操作的时间特性,在部分可观测性下,仅凭当前观测往往不足。AEM通过交错视觉和动作特征将操作建模为动作驱动的交互过程,并应用掩码建模从不完整历史中恢复缺失内容,从而学习动作条件化的状态演化。最终视觉令牌的Mamba编码输出用作紧凑的历史表征,作为解码和下游控制的全局上下文。该设计在保持推理高效的同时,保留了单向量时间瓶颈。我们使用扩散策略和流策略评估AEM。AEM在仿真和真实环境中一致提升了操作性能,在干净场景、杂乱和随机场景以及非马尔可夫任务中均优于基线。消融研究进一步表明,历史感知预训练超越了单帧预训练和直接帧堆叠,同时降低了推理延迟和计算成本。

英文摘要

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.

2606.12497 2026-06-12 cs.LG cs.RO 新提交

$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

$μ$VLA:部分可观测操作中VLA模型的循环记忆研究

Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev

发表机构 * CogAI Lab, Moscow, Russia(CogAI实验室,莫斯科,俄罗斯) MIRAI, Moscow, Russia(MIRAI,莫斯科,俄罗斯)

AI总结 针对VLA模型在部分可观测场景中的记忆缺失问题,提出仅通过可学习记忆令牌和截断反向传播时间实现最小化循环记忆增强,在MIKASA-Robo上将训练任务成功率从0.42提升至0.84,并在LIBERO上保持全可观测性能。

Comments 34 pages, 20 figures, 9 tables

详情
AI中文摘要

视觉-语言-动作(VLA)模型从当前观测预测未来动作块,这一假设在部分可观测性下失效,因为决策依赖于不再可见的信息。现有的记忆增强VLA同时引入了循环、检索、压缩模块、辅助目标、层次化记忆或特定任务架构变化,因此循环本身的贡献与周围机制纠缠不清。我们提出了一个在强预训练VLA骨干网络中的受控隔离研究。我们的方案通过一小部分可学习的记忆令牌增强Transformer,这些令牌跨时间步传递并通过自注意力更新,使用截断反向传播时间进行端到端训练,没有辅助损失和架构变化。我们将其实例化为$μ$VLA,一组由记忆宽度m、TBPTT长度K和记忆更新规则(跨步梯度或分离的EMA)参数化的OpenVLA-OFT变体,使得循环是唯一变化的因素。在MIKASA-Robo上,$μ$VLA在最强设置下将五个训练任务的平均成功率从0.42提高到0.84,并在具有相同记忆结构的保留任务上达到0.23,而无记忆基线为0.07。在需要不同记忆结构的任务上,性能接近基线。在LIBERO上,最强的循环变体达到96.2%的平均成功率,表明在全可观测性下没有性能下降。我们将这些结果解释为对最小化骨干网络循环能力范围的校准,识别了其足够的情况以及需要额外记忆结构的情况。演示和视频可在以下链接找到:https://avanturist322.github.io/mu-vla/。

英文摘要

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $μ$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $μ$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in https://avanturist322.github.io/mu-vla/.

2606.12495 2026-06-12 cs.SD 新提交

Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

缺失令牌提示的可靠性感知融合用于鲁棒多语种说话人识别

Peng Jia, Li Dai, Jia Li, Zhenzhen Hu, Ye Zhao, Richang Hong

发表机构 * Hefei University of Technology(合肥工业大学) Intelligent Interconnected Systems Laboratory of Anhui Province(安徽省智能互联系统实验室)

AI总结 提出MRAF框架,通过可学习的缺失令牌和可靠性感知交叉注意力融合,解决多语种场景下跨语言泛化和人脸缺失时的鲁棒性问题,在POLY-SIM 2026测试集上取得高准确率。

Comments 8 pages, 3 figures, 4 tables

详情
AI中文摘要

准确且鲁棒的多模态说话人识别对于多媒体理解和生物特征认证至关重要。然而,现实中的多语种场景带来了两个关键挑战:说话人判别性表示应跨语言泛化,并且当人脸信息不可用时模型应保持可靠。为了解决这些挑战,我们提出了MRAF,一个缺失令牌提示的可靠性感知融合框架,用于跨完整模态、缺失人脸和跨语言场景的多语种说话人识别。MRAF用可学习的缺失令牌代替固定的零值特征来表示不可用的人脸输入,提供了缺失视觉状态的可训练表示。这种设计减少了由缺失输入引起的分布差距,并允许后续的可靠性估计和跨模态融合在统一的令牌空间内操作。为了自适应地集成具有不同可靠性的模态,MRAF进一步引入了可靠性感知的交叉注意力融合模块,该模块估计人脸和音频的可靠性分数,将其归一化为模态权重,并在双向交叉注意力之前将这些权重应用于令牌表示。这样,模型可以强调可靠的模态线索,同时抑制不可靠的。在训练过程中,MRAF联合优化多分支分类损失、仅音频知识蒸馏和中心损失,以提高说话人判别性和缺失模态鲁棒性。在官方POLY-SIM 2026测试集上的实验证明了所提出框架的有效性。在最终评估中,MRAF在P3和P5上达到了100%的准确率,并在更具挑战性的缺失人脸设置P4和P6上获得了有竞争力的结果。源代码将在 https://github.com/MSA-LMC/MRAF 发布。

英文摘要

Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at https://github.com/MSA-LMC/MRAF.

2606.12494 2026-06-12 cs.LG 新提交

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

Net-Ev$^2$:网络事件演化的生成式模拟器

Guangyu Wang, Zhaonan Wang

发表机构 * NYU Shanghai(上海纽约大学)

AI总结 提出Net-Ev$^2$,一种结合事件线索与网络拓扑的生成式模拟器,通过结构引导掩码预训练和拓扑感知扩散过程模拟网络事件演化,在多个道路网络数据集上达到最优性能。

Comments Accepted by KDD 2026 Research Track

详情
Journal ref
In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

减少现实世界的试错一直是决策的核心目标,生成式模拟器通过建模未来状态的演化推进了这一目标。一个更具挑战性且更有意义的任务是模拟扰动事件(如事故)如何通过网络传播其影响。现有方法在模拟网络事件演化时,未能同时建模事件的结构化属性和非结构化语义,也未能捕捉拓扑结构。因此,我们提出Net-Ev$^2$($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution),一种新颖的生成式模拟器,在模拟中联合利用事件线索并保留网络拓扑。具体而言,该框架包含两个阶段:结构引导的掩码预训练和拓扑感知扩散过程,后者通过类似U-Net的图下采样和上采样实现去噪。在推理时,Net-Ev$^2$仅需自然语言事件输入即可生成模拟,具有更大的实际使用灵活性。此外,我们引入了Net-Ev$^2$-6.5M,一个跨四个大规模道路网络的对齐事件和网络流量数据的多模态基准,以及一个新的拓扑感知指标JL-MMD,用于评估生成网络动态的拓扑保真度。大量实验证明了Net-Ev$^2$的最优性能和强泛化能力。代码已开源。

英文摘要

Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. The existing approaches fall short of modeling both structured attributes and unstructured semantics of events, and capturing topological structures in simulating network event evolution. Therefore, we are motivated to propose Net-Ev$^2$ ($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages, namely structure-guided masked pre-training and topology-aware diffusion process, which is achieved by U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev$^2$ can generate simulations using natural-language event input only, with greater flexibility for practical usage. Furthermore, we introduce Net-Ev$^2$-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, namely JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev$^2$. Code is made available at https://github.com/Guangyu4/Net-Ev-2.

2606.12490 2026-06-12 cs.LG 新提交

Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

基于抽象精化的循环神经网络鲁棒性验证

Li-Jen Lin, Chih-Duo Hong

发表机构 * National Science and Technology Council (NSTC), Taiwan(台湾国家科学与技术委员会)

AI总结 提出抽象精化框架,通过分割预激活区间消除非线性松弛误差,并利用SHAP引导的时间步选择策略降低组合成本,显著提升RNN鲁棒性验证成功率。

详情
AI中文摘要

循环神经网络(RNN)的认证局部鲁棒性验证具有挑战性,因为非线性松弛引入的近似误差会通过循环连接传播并随时间累积。因此,可扩展的线性边界传播方法往往过于保守,无法认证实际上鲁棒的输入,尤其是当许多预激活区间跨越零点时。我们提出了一种用于RNN验证的抽象精化框架,该框架划分此类区间以消除主要的松弛误差:在每个精化分支上,ReLU变得精确,而tanh和sigmoid等平滑激活函数则允许更紧的线性包络。为了控制在长序列中分裂的组合成本,我们引入了一种SHAP引导的时间步选择策略,该策略根据隐藏状态对验证目标的贡献进行排序,并按时间顺序仅精化最关键的时间步。在CIFAR10和MNIST笔画基准上的实验表明,与仅使用抽象的基线相比,验证成功率和鲁棒性边界紧度持续提升,同时揭示了ReLU和tanh模型之间清晰的运行时权衡。

英文摘要

Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.
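
To make the refinement step concrete, here is a minimal sketch of why splitting a pre-activation interval that crosses zero removes the ReLU relaxation error. It uses the standard triangle relaxation, which is an assumption: the abstract does not say which linear relaxation the authors use, and the function names are hypothetical.

```python
def triangle_upper(x, l, u):
    """Upper bound of the triangle relaxation of ReLU on [l, u], with l < 0 < u.

    This is the line through (l, 0) and (u, u).
    """
    return u * (x - l) / (u - l)

def max_relaxation_gap(l, u):
    """Largest gap between the triangle upper bound and ReLU on [l, u]."""
    if l >= 0 or u <= 0:
        # Stable neuron: ReLU is linear on the interval, so it is exact.
        return 0.0
    # For an interval crossing zero, the gap peaks at x = 0.
    return -l * u / (u - l)

def split_at_zero(l, u):
    """Refine an interval that crosses zero into its two stable halves."""
    return [(l, 0.0), (0.0, u)]

gap_before = max_relaxation_gap(-1.0, 3.0)
gaps_after = [max_relaxation_gap(a, b) for a, b in split_at_zero(-1.0, 3.0)]
```

Each split doubles the number of branches to verify, which is the combinatorial cost that the abstract's timestep selection strategy is meant to control.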

2606.12488 2026-06-12 cs.LG 新提交

A Stationary (and Therefore Compatible) Representation is All You Need

静态(因此兼容)表示即所需

Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo

发表机构 * Media Integration and Communication Center (MICC), Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze(佛罗伦萨大学信息工程系媒体集成与通信中心(MICC))

AI总结 本文证明d-Simplex固定分类器学习的静态表示满足兼容性定义,并通过交叉熵与对比损失的凸组合捕获高阶依赖,实现模型更新时无需重处理的检索服务。

Comments Accepted to TPAMI2026. Extension of the CVPR2024 version (arXiv:2405.02581)

详情
AI中文摘要

学习兼容表示旨在当模型更新时,特征表示可以互换使用。本文证明,由d-Simplex固定分类器学习的静态表示隐含了其正式定义中的兼容性。这一结果为未来工作奠定了基础,并可直接应用于实际学习场景。我们解决了在模型顺序微调时使用d-Simplex固定分类器学习兼容性的挑战。使用交叉熵损失的d-Simplex固定分类器学习对齐一阶统计量的特征分布,因此可能无法完全捕捉模型更新之间表示的高阶依赖。为解决此问题,我们证明通过交叉熵损失和对比损失的凸组合使用d-Simplex固定分类器训练模型,不仅能捕捉高阶依赖,而且等价于在兼容性约束下使用交叉熵学习。我们通过大量实验证实了我们的发现,并考虑了一个新场景:预训练模型被顺序微调,偶尔被改进模型替换。我们表明,静态表示能够实现不间断的检索服务(无需重新处理图库图像),同时在模型更新和替换期间提升性能,达到最先进水平。代码见 https://github.com/miccunifi/iamcl2r。

英文摘要

Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by d-Simplex fixed classifiers imply compatibility as in its formal definition. This result establishes a foundation for future works and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using d-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a d-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a d-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments, also considering a new scenario where a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art results. Code at https://github.com/miccunifi/iamcl2r.
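
For readers unfamiliar with the term, a fixed simplex classifier can be sketched as follows. The classifier weights are set once to the vertices of a regular simplex and are never trained, so class prototypes stay in the same place across model updates. The construction below (centering the standard basis vectors) is one common way to obtain such vertices and is an assumption of this sketch, not necessarily the authors' construction. It places K prototypes in K coordinates that span a (K-1)-dimensional subspace.

```python
import math

def simplex_classifier(num_classes):
    """Return unit-norm class prototypes at the vertices of a regular simplex.

    Each row is a standard basis vector minus the centroid, then normalized.
    The rows are meant to be used as frozen (non-trainable) classifier weights.
    """
    k = num_classes
    rows = []
    for i in range(k):
        v = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        rows.append([x / norm for x in v])
    return rows

def cosine(a, b):
    """Cosine similarity; the rows are already unit norm, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))

weights = simplex_classifier(4)
pairwise = [
    cosine(weights[i], weights[j])
    for i in range(4)
    for j in range(i + 1, 4)
]
```

All prototypes are equally separated, with pairwise cosine -1/(K-1), so no class is geometrically favored and the target configuration does not move when the backbone is fine-tuned.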

2606.12487 2026-06-12 cs.LG 新提交

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

DynamicPTQ: 通过残差流动态缓解激活量化崩溃

Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao

发表机构 * City University of Hong Kong(香港城市大学) Zhejiang University of Technology(浙江工业大学)

AI总结 提出DynamicPTQ,通过分析残差流中激活的相位式动态变化,识别量化敏感层并分配8位精度,在W4A4KV4量化下提升LLaMA-2/3的困惑度和零样本QA性能,吞吐量提升1.05-1.07倍。

详情
AI中文摘要

训练后量化(PTQ)对于高效的大语言模型推理至关重要,但当权重、激活和KV缓存全部量化到4位精度时,可靠地量化激活仍然具有挑战性。一个关键困难在于大规模激活,其极端值主导激活范围并放大量化误差。最先进的方法主要通过基于变换的平滑(如正交旋转和仿射缩放)来缓解大规模激活,但忽略了残差流的跨层动态。在本文中,我们展示了大规模激活在网络深度上以相位模式出现和消失,触发大的残差变化。这些变化导致新注入的逐层更新主导4位量化尺度,并削弱历史残差信息。为了表征这种行为,我们引入了跳跃比和历史特征信噪比。这表明基于静态变换的平滑无法完全解决由跨层残差变化引起的动态量化不稳定性。基于这一分析,我们提出了DynamicPTQ,一种用于相位感知混合精度激活量化的动态训练后量化策略。DynamicPTQ从残差流动态中识别量化敏感层,并仅对这些层分配8位激活精度,同时保持权重、KV缓存和其他激活为4位精度。它可以直接集成到强大的PTQ基线中,如QuaRot、SpinQuant和FlatQuant。在LLaMA-2和LLaMA-3上的实验表明,DynamicPTQ在W4A4KV4量化下一致地提高了困惑度和零样本QA性能,同时实现了1.05到1.07倍的吞吐量提升,且内存开销适中。这些结果展示了实现鲁棒低位LLM推理的实用路径。

英文摘要

Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
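
The effect of a massive activation on low-bit quantization can be seen in a small sketch. This does not implement DynamicPTQ, Jump Ratio, or Historical Feature SNR, whose definitions are not given in the abstract. It only illustrates, under the assumption of symmetric per-tensor absmax quantization and with hypothetical activation values, why one extreme value hurts 4-bit precision far more than 8-bit precision.

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization with absmax scaling.

    Returns the dequantized values so the error can be measured directly.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Mean absolute error introduced by quantizing to the given bit width."""
    deq = quantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

# Five ordinary activations and one massive activation (hypothetical values).
acts = [0.3, -0.5, 0.8, -0.2, 0.6, 40.0]

err4 = mean_abs_error(acts, 4)
err8 = mean_abs_error(acts, 8)
small_at_4bit = quantize(acts, 4)[:5]
```

At 4 bits the outlier sets the scale so coarsely that every ordinary activation rounds to zero, while at 8 bits they are still resolved. This is the motivation for spending 8-bit activation precision only on the layers where such values appear.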

2606.12485 2026-06-12 cs.LG cs.AI New submission

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

面向质量多样性的Web智能体模仿的推测性回滚修正

Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

Affiliations * Beihang University(北京航空航天大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) The Hong Kong University of Science and Technology(香港科技大学) Northwestern Polytechnical University(西北工业大学) Tsinghua University(清华大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Peking University(北京大学)

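AI summary Proposes the Speculative Rollback Correction (SRC) framework, which uses fixed-horizon branch review and a rollback mechanism to reduce teacher queries while preserving trajectory diversity; on WebArena-Infinity it collects 977 verifier-passing trajectories and 9,183 next-action examples.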

Details
AI Chinese abstract

通过从专家轨迹进行模仿学习来训练交互式Web智能体已成为一种高效的方法。然而,在此背景下,确定专家干预的最佳时机是一个关键挑战。延迟干预往往导致早期错误的累积,将页面状态推入不可恢复的区域。相反,过早或过度干预会使智能体过度依赖专家策略,将模型困在以单一刚性轨迹为特征的局部最优中。我们提出推测性回滚修正(SRC),一种针对可重置智能体环境的分支级模仿框架。SRC不是在每个访问状态请求教师标签,也不是仅在完成轨迹后修正,而是采用固定视野分支审查:学生先执行一个短的推测性片段,然后由教师审查,仅当局部进展中断时,教师才定位第一个有害偏差。回滚保留有用的前缀,而成功的展开由硬验证器过滤并保留在轻量级质量多样性档案中。所得数据支持对局部修正和通过验证器的轨迹进行下一步动作监督微调。在WebArena-Infinity上,SRC收集了977条通过验证器的轨迹和9183个下一步动作示例;固定视野审查在保留通过验证器的解决方案变体的同时,改善了恢复与查询的权衡。代码可在 https://github.com/LongkunHao/SRC_gui_agent 获取。

English abstract

Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.
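
Illustrative sketch (not from the paper or its repository): a toy version of the fixed-horizon branch review loop described in the abstract. The "environment" is just a list of actions that can be replayed from any prefix, and `TARGET`, `MISTAKES`, `student`, `first_bad_step`, and `teacher_action` are hypothetical stand-ins. The hard verifier and the quality-diversity archive are reduced to a simple `is_done` check.

```python
def collect_branch_reviewed(student, first_bad_step, teacher_action, is_done,
                            horizon=3, max_reviews=20):
    """Run short speculative segments, review each once, and roll back to the last good step."""
    kept = []          # accepted action prefix; a resettable env can be replayed to this point
    corrections = []   # (state prefix, teacher action) pairs for next-action fine-tuning
    reviews = 0
    while not is_done(kept) and reviews < max_reviews:
        # Student acts on its own for up to `horizon` steps.
        segment = []
        while len(segment) < horizon and not is_done(kept + segment):
            segment.append(student(kept + segment))
        reviews += 1  # one teacher review per segment, not per step
        bad = first_bad_step(kept, segment)
        if bad is None:
            kept += segment                  # whole segment accepted
        else:
            kept += segment[:bad]            # rollback keeps the useful prefix
            fix = teacher_action(kept)       # teacher corrects the first harmful step
            corrections.append((list(kept), fix))
            kept.append(fix)
    return kept, corrections, reviews

# Toy task (assumption): reproduce TARGET; the student errs at steps 2 and 4.
TARGET = list("abcdef")
MISTAKES = {2: "x", 4: "y"}

def student(state):
    step = len(state)
    if step in MISTAKES:
        return MISTAKES[step]
    return TARGET[step] if step < len(TARGET) else "noop"

def first_bad_step(prefix, segment):
    for i, action in enumerate(segment):
        idx = len(prefix) + i
        if idx >= len(TARGET) or action != TARGET[idx]:
            return i
    return None

def teacher_action(prefix):
    return TARGET[len(prefix)]

def is_done(state):
    return state == TARGET

final_actions, corrections, reviews = collect_branch_reviewed(
    student, first_bad_step, teacher_action, is_done)
print(final_actions, corrections, reviews)
```

On this toy task the loop finishes after three segment reviews, where reviewing every step would take six, and it records two localized corrections.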

2606.12481 2026-06-12 cs.LG cs.AI New submission

Representing Time Series as Structured Programs for LLM Reasoning

将时间序列表示为结构化程序以进行LLM推理

Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee

Affiliations * Korea University(高丽大学) Mila, University of Montreal(蒙特利尔大学米拉研究所)

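AI summary Proposes T2SP, which decomposes a time series into trends, periods, and salient events and represents it as a structured symbolic program, letting LLMs reason about it efficiently without fine-tuning; it outperforms raw-sequence representations on editing, captioning, and question answering.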

Comments Preprint

Details
AI Chinese abstract

大型语言模型(LLM)展示了强大的推理和指令遵循能力,使其成为时间序列分析的潜在强大工具。然而,时间序列超出了其原生文本模态,引发了一个基本问题:应该如何表示时间序列,以便LLM能够有效地推理它们?现有工作通常序列化原始数值序列或在时间序列数据上微调预训练的LLM。这些方法将提取时间结构的负担直接放在LLM上,造成了模态不匹配,常常降低长序列的性能并引入大量计算开销。在这项工作中,我们引入了时间序列到结构化程序表示(T2SP),一种确定性的、无需训练的方法,将时间序列表示为结构化的符号程序。T2SP将时间序列分解为趋势、周期和显著事件,并以与LLM原生训练的文本和代码类模态对齐的程序友好格式表达它们。通过将时间结构提取从模型转移到表示本身,T2SP使现成的LLM能够利用其现有的推理能力进行时间序列理解。我们在三个推理任务上评估T2SP——编辑、描述和问答——与原始字符串表示相比,它持续提高了性能,减少了推理时间,并降低了失败率。我们的结果表明,T2SP提供了时间序列和LLM之间的有效接口。

English abstract

Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks (editing, captioning, and question answering), where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.
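
Illustrative sketch (not the T2SP implementation): one simple way to turn a series into a trend, a period, and salient events, then print them as a program-like string. The least-squares trend, the raw-autocorrelation period pick, the 3-sigma event rule, and the `trend(...)`, `periodic(...)`, `event(...)` output format are all assumptions chosen for the example; the abstract does not specify any of them.

```python
import math

def linear_trend(y):
    """Least-squares line through (0, y[0]), (1, y[1]), ..."""
    n = len(y)
    mean_x, mean_y = (n - 1) / 2, sum(y) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def dominant_period(residual):
    """Lag (>= 2) with the largest raw autocorrelation of the detrended series."""
    def autocorr(lag):
        return sum(residual[i] * residual[i - lag] for i in range(lag, len(residual)))
    return max(range(2, len(residual) // 2 + 1), key=autocorr)

def salient_events(residual, k=3.0):
    """Indices whose detrended value lies more than k standard deviations from the mean."""
    mean = sum(residual) / len(residual)
    std = math.sqrt(sum((r - mean) ** 2 for r in residual) / len(residual))
    return [i for i, r in enumerate(residual) if abs(r - mean) > k * std]

def to_program(y):
    """Render the decomposition as a short, hypothetical program-like description."""
    slope, intercept = linear_trend(y)
    residual = [v - (slope * i + intercept) for i, v in enumerate(y)]
    lines = [
        f"trend(slope={slope:.2f}, intercept={intercept:.2f})",
        f"periodic(period={dominant_period(residual)})",
    ]
    lines += [f"event(t={i}, deviation={residual[i]:+.1f})" for i in salient_events(residual)]
    return "\n".join(lines)

# Toy series: upward trend + period-8 sine + one spike at t=20.
series = [0.5 * t + math.sin(2 * math.pi * t / 8) for t in range(48)]
series[20] += 5.0
program = to_program(series)
print(program)
```

The resulting text is a few lines long regardless of series length, which is the property the abstract relies on: the LLM reads a compact structural description instead of 48 raw numbers.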

2606.12479 2026-06-12 cs.LG cs.AI New submission

ReCal: Reward Calibration for RL-based LLM Routing

ReCal: 基于强化学习的LLM路由的奖励校准

Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu

Affiliations * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Shanghai AI Laboratory(上海人工智能实验室)

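AI summary Proposes the ReCal framework, which calibrates reward signals through hierarchical reward decomposition and distribution-aware optimization, addressing multi-objective conflicts and optimization bias across heterogeneous tasks to improve LLM routing performance and stability.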

Details
AI Chinese abstract

大型语言模型(LLM)路由已成为一种有效范式,通过动态模型和推理策略选择来利用多个LLM的互补优势。最近的基于强化学习(RL)的路由方法通过从交互反馈中优化路由策略,进一步提高了路由质量。然而,在难度不同的异质性任务下,它们仍然难以提供信息丰富且可比较的学习信号。在实践中,多个目标(如正确性、格式行为)被聚合为单个标量奖励,导致模糊的信用分配和冲突的优化信号。此外,奖励信号在不同实例间表现出显著变异性,其中一些实例产生更高或更可变的奖励,引入了偏向于平凡样本而非信息性样本的优化偏差。为了解决这些问题,我们提出了ReCal,一个用于基于RL的LLM路由的奖励校准(Reward Calibration)框架。我们首先引入了一种具有分量式优势估计的分层奖励分解机制。我们进一步提出了一种分布感知的优化策略,通过方差感知重加权和每数据集归一化来校准优化变异性。在七个数据集上的实验表明,ReCal在路由性能和训练稳定性上持续优于基线方法。代码可在 https://anonymous.4open.science/r/ReCal 获取。

English abstract

Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at https://anonymous.4open.science/r/ReCal.
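
Illustrative sketch (not the ReCal implementation): the abstract gives no formulas, so this shows only one plausible reading of "component-wise advantage estimation" with "per-dataset normalization", namely standardizing each reward component separately within each dataset. The function name, the sample layout, and the component names are hypothetical, and the variance-aware reweighting step is left out because the abstract does not say how it is computed.

```python
from collections import defaultdict
from statistics import mean, pstdev

def componentwise_advantages(samples, eps=1e-8):
    """Standardize every reward component within its own dataset (assumed scheme).

    samples: list of {"dataset": str, "rewards": {component: float}}
    returns: list of {component: advantage}, aligned with `samples`
    """
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[sample["dataset"]].append(idx)

    advantages = [dict() for _ in samples]
    for indices in groups.values():
        for component in samples[indices[0]]["rewards"]:
            values = [samples[i]["rewards"][component] for i in indices]
            mu, sigma = mean(values), pstdev(values)
            for i in indices:
                # eps keeps the division defined when a component is constant in a dataset
                advantages[i][component] = (samples[i]["rewards"][component] - mu) / (sigma + eps)
    return advantages

# Toy batch: the two datasets use very different correctness scales.
samples = [
    {"dataset": "math", "rewards": {"correctness": 0.0, "format": 1.0}},
    {"dataset": "math", "rewards": {"correctness": 1.0, "format": 1.0}},
    {"dataset": "chat", "rewards": {"correctness": 10.0, "format": 0.0}},
    {"dataset": "chat", "rewards": {"correctness": 20.0, "format": 1.0}},
]
advantages = componentwise_advantages(samples)
print(advantages)
```

After normalization the correctness advantages are about -1 and +1 in both datasets despite the tenfold difference in raw scale, and a component that is constant within a dataset (format in "math") contributes zero signal instead of being mixed into a single scalar.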