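arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.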

2605.19506 2026-05-20 cs.CV

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

Pengtao Ma, Ziliang Zhou, Ciyu Ruan, Haoyang Wang, Kaiyuan Li, Zihang Gong, Wenhua Ding, Chen Gao, Jingao Xu, Xinlei Chen

Affiliations: Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology; Tsinghua University; The University of Hong Kong

AI summary: The paper proposes Event Cascade Pruning (ECP), a training-free framework that uses high-frequency motion cues from event cameras as a continuous event-guided motion prior for token selection, enabling efficient token pruning for first-person dynamic spatial reasoning with faster inference and less computation.

Abstract

First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.
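
To make the last two ECP stages concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each visual token comes with a per-patch event count and an attention score. The function names, the `min_events` threshold, and the rank-sum fusion rule are illustrative assumptions, since the abstract does not specify them.

```python
def rank(scores):
    """Return the rank of each entry (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def prune_tokens(event_activity, attention, keep_ratio=0.2, min_events=1):
    """Select the token indices to keep (hypothetical helper).

    Saliency filter: drop tokens whose patch saw fewer than `min_events`
    events. Rank fusion: order the survivors by the sum of their event
    rank and attention rank, then keep the best `keep_ratio` share.
    """
    n = len(event_activity)
    budget = max(1, round(keep_ratio * n))
    active = [i for i in range(n) if event_activity[i] >= min_events]
    if not active:  # no motion anywhere: fall back to all tokens
        active = list(range(n))
    ev_rank = rank([event_activity[i] for i in active])
    at_rank = rank([attention[i] for i in active])
    fused = sorted(range(len(active)), key=lambda k: ev_rank[k] + at_rank[k])
    return sorted(active[k] for k in fused[:budget])


# Toy input: 10 tokens, 80% of them are pruned.
events = [0, 5, 0, 9, 1, 0, 0, 3, 0, 0]
attention = [0.9, 0.2, 0.1, 0.8, 0.05, 0.3, 0.1, 0.4, 0.2, 0.1]
kept = prune_tokens(events, attention)
```

In this toy input, token 0 is discarded despite having the highest attention score because its patch recorded no events, which is the behavior an event-guided saliency filter is meant to produce.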


2605.19501 2026-05-20 cs.RO cs.AI

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

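Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li, David Hsu

Affiliations: School of Computing; Smart Systems Institute

AI summary: The paper presents CANINE, a system that uses personalized, adaptive verbal feedback to help visually impaired users learn interactive navigation with a robot guide dog. It decomposes the complex coordination task into sub-skills and coaches at two levels to improve learning efficiency and final navigation performance.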

Comments Accepted to RSS 2026

Abstract

Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/
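
The high-level loop described above (track proficiency per sub-skill, then train the weakest one) can be sketched with Bayesian Knowledge Tracing. This is an assumption for illustration: the abstract says only "knowledge tracing", and the sub-skill names and the `slip`, `guess`, and `learn` parameters below are hypothetical.

```python
def bkt_update(p_mastery, correct, slip=0.1, guess=0.2, learn=0.15):
    """One Bayesian Knowledge Tracing step for a single sub-skill."""
    if correct:
        evidence = p_mastery * (1 - slip)
        posterior = evidence / (evidence + (1 - p_mastery) * guess)
    else:
        evidence = p_mastery * slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - guess))
    # Account for the chance of learning the skill during this episode.
    return posterior + (1 - posterior) * learn


def next_skill(mastery):
    """Pick the sub-skill with the lowest estimated mastery."""
    return min(mastery, key=mastery.get)


# Hypothetical sub-skills, both starting at 50% estimated mastery.
mastery = {"turning": 0.5, "stopping": 0.5}
mastery["turning"] = bkt_update(mastery["turning"], correct=True)
mastery["stopping"] = bkt_update(mastery["stopping"], correct=False)
focus = next_skill(mastery)
```

After one successful turning episode and one failed stopping episode, the estimate for turning rises, the estimate for stopping falls, and the scheduler selects stopping for the next practice round.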


2605.19490 2026-05-20 cs.RO cs.CV

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

Kanglong Quan, Zhebing Xia, Linfeng Jiang, Hao Yu, Ziheng Qiao, Dapeng Dong, Dongyao Jia

Affiliations: National Natural Science Foundation of China; Suzhou Science and Technology Development Planning Programme

AI summary: The paper proposes a closed-loop hybrid digital twin platform that tightly couples a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle for efficient validation of connected and automated vehicles.

Abstract

Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV's kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform's practicality for multi-scenario CAV verification.


2605.19485 2026-05-20 cs.AI

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

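Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao

Affiliations: Xidian University; Xi'an Jiaotong University

AI summary: The paper studies jailbreak attacks on large reasoning models, finds that attack success rate is closely correlated with the model's attention patterns, and proposes a reinforcement-learning method that incorporates attention signals into the reward design and adds diverse persuasion strategies to raise the attack success rate.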

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.


2605.19484 2026-05-20 cs.CV cs.AI cs.GR cs.HC

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao

Affiliations: MIPG, Communication University of China; National University of Singapore; USEIT AI

AI summary: The study introduces CutVerse, a benchmark for evaluating autonomous GUI agents in realistic media post-production environments, and exposes the limitations of existing agents on complex, long-horizon post-production workflows.

Abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.


2605.19483 2026-05-20 cs.LG

A dynamical systems view of training generative models and the memorization phenomenon

Siva Athreya, Chiranjib Bhattacharya, Vivek S. Borkar

Affiliations: International Institute for Theoretical Sciences; Department of Computer Science and Automation; Indian Institute of Science; Department of Electrical Engineering; Indian Institute of Technology Bombay

AI summary: The paper analyzes memorization in generative model training from a dynamical systems viewpoint, using the time-scale separation in SGD and the collapse phenomenon to explain how a generative model comes to produce the same or similar outputs during training.

Comments 12 pages

Abstract

Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. This relies purely on the dynamic aspects of the training phase. Specifically, we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD that is commonly used in machine learning. This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization wherein a generative model that is concurrently being tuned yields the same or similar outputs for significant stretches of time. This gives a novel perspective on the aforementioned phenomena reported in machine learning literature and their interrelationships, using a dynamical systems viewpoint.


2605.19470 2026-05-20 cs.CL cs.LG

Drifting Objectives for Refining Discrete Diffusion Language Models

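Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

Affiliations: Institute of Science Tokyo; AIST; NII LLMC

AI summary: The paper studies how to apply drifting methods to discrete diffusion language models. Its TokenDrift objective lifts categorical predictions to soft-token features and applies anti-symmetric drifting in a frozen semantic space to improve generation quality.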

Comments Project page: https://daioba.github.io/tokendrift/

Abstract

Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.
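
One common way to obtain a continuous, differentiable quantity from categorical predictions is to take the expected embedding under the predicted token distribution. The sketch below illustrates that idea only. It is an assumption about what "lifting to soft-token features" could look like, not TokenDrift's actual construction, and the function names and toy embedding table are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token_feature(logits, embedding_table):
    """Expected embedding under the predicted token distribution.

    `embedding_table[v]` is the (frozen) embedding of vocabulary item v.
    Unlike a sampled hard token, the result varies smoothly with the logits.
    """
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_feature = soft_token_feature([0.0, 0.0, 0.0], table)
peaked_feature = soft_token_feature([100.0, 0.0, 0.0], table)
```

A uniform prediction yields the average embedding, while a confident prediction yields a feature that is practically the embedding of the predicted token.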


2605.19469 2026-05-20 cs.LG cs.AI cs.RO

Sampling-Based Safe Reinforcement Learning

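Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Andreas Krause, Yarden As

Affiliations: ETH Zurich

AI summary: The paper proposes a sampling-based safe reinforcement learning method that maintains safety during learning by enforcing constraints jointly across a finite set of dynamics samples, gives practical safety guarantees in continuous domains, and explores efficiently by constraining epistemic uncertainty.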

Abstract

Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.
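
The idea of enforcing a constraint jointly across a finite set of dynamics samples can be illustrated on a toy problem. The sketch below is not SBSRL itself: it assumes a one-dimensional linear system x' = a*x + u with an uncertain gain `a`, a state bound `limit`, and a stand-in objective (the sum of actions). All names are hypothetical.

```python
def rollout(a, x0, actions):
    """Simulate x' = a * x + u for one sampled dynamics parameter `a`."""
    states, x = [], x0
    for u in actions:
        x = a * x + u
        states.append(x)
    return states


def is_jointly_safe(dynamics_samples, x0, actions, limit):
    """A plan is accepted only if |x| <= limit under every sampled model."""
    return all(
        abs(x) <= limit
        for a in dynamics_samples
        for x in rollout(a, x0, actions)
    )


def pick_safe_plan(dynamics_samples, x0, candidate_plans, limit):
    """Among jointly safe plans, return the one with the largest toy objective."""
    safe = [
        plan for plan in candidate_plans
        if is_jointly_safe(dynamics_samples, x0, plan, limit)
    ]
    return max(safe, key=sum) if safe else None


samples = [0.9, 1.0, 1.1]  # e.g. drawn from a learned model's uncertainty
plans = [[0.5, 0.5], [0.2, 0.2], [0.0, 0.0]]
best = pick_safe_plan(samples, x0=1.0, candidate_plans=plans, limit=2.0)
```

The aggressive plan `[0.5, 0.5]` stays within the bound under the nominal model `a = 1.0` but violates it when `a = 1.1`, so the joint check rejects it and the more cautious `[0.2, 0.2]` is chosen.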


2605.19462 2026-05-20 cs.LG cs.AI

Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models

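Noam Major, Kathy Razmadze, Yoli Shavit

Affiliations: Faculty of Engineering, Bar-Ilan University

AI summary: The paper studies self-supervised learning for time series by comparing generative paradigms with latent alignment architectures. It finds that the pre-training dividend is large for anomaly detection and classification but marginal for forecasting, and that representation quality is largely independent of data origin and saturates at moderate architectural depth.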

Abstract

The success of self-supervised learning (SSL) in vision and NLP has motivated its rapid adoption for time series. However, research has focused primarily on Generative paradigms and forecasting tasks, leaving the broader utility of learned representations unquantified. We establish a controlled framework to evaluate the "pre-training dividend": the value added by SSL across diverse temporal tasks. We systematically compare Generative paradigms against Latent Alignment architectures, introducing adaptations of LeJEPA and DINO for time series. These adaptations utilize Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Our analysis reveals that the pre-training dividend is highly asymmetric: SSL yields gains of up to 375% for anomaly detection and classification, yet remains marginal for forecasting. We demonstrate that representational utility is non-universal, governed by a precision-invariance trade-off where the specific signal resolution required by the task must align with the objective. Finally, we show that representation quality is largely independent of data origin and saturates at moderate architectural depths, suggesting a path to scaling via massive synthetic generation. Our code is available at: https://github.com/noammajor/Models
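
As a rough illustration of a DWT-based augmentation, the sketch below builds a second "view" of a series by attenuating the detail coefficients of a single-level Haar transform, which suppresses local fluctuations while keeping the coarse shape. The wavelet family, the number of levels, and the way coefficients are perturbed are assumptions here, since the abstract does not state them.

```python
import math


def haar_dwt(signal):
    """Single-level Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of `haar_dwt`."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def smooth_view(signal, detail_scale=0.0):
    """Augmented view: scale the detail (local fluctuation) coefficients."""
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, [detail_scale * d for d in detail])


series = [1.0, 3.0, 2.0, 6.0]
view = smooth_view(series)            # local fluctuations removed
same = smooth_view(series, 1.0)       # detail kept: reconstructs the input
```

Training an encoder to map `series` and `view` to nearby representations is one way to encourage invariance to local fluctuations; in practice a library such as PyWavelets would replace the hand-written transform.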


2605.19461 2026-05-20 cs.AI

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen

Affiliations: Tongji University; Independent; Shanghai AI Laboratory; Zhejiang University; Fudan University; The Chinese University of Hong Kong

AI summary: The paper proposes DMPO, which prevents mode collapse in on-policy reinforcement learning through a principled approximation of forward KL minimization, and shows improvements on NP-hard combinatorial optimization along with more diverse reasoning.

Abstract

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
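
The group-level objective can be sketched in a few lines. This is a simplified reading of the abstract, not the authors' code: it assumes non-negative rewards, renormalises the policy over the sampled group (an assumption), and uses hypothetical function names. The last helper re-derives the relative improvements quoted above.

```python
import math


def group_target(rewards):
    """Target distribution over a sampled group, proportional to reward."""
    total = sum(rewards)
    if total <= 0:  # no signal: fall back to a uniform target
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


def group_policy(logprobs):
    """Policy probabilities renormalised over the sampled trajectories."""
    peak = max(logprobs)
    weights = [math.exp(lp - peak) for lp in logprobs]
    z = sum(weights)
    return [w / z for w in weights]


def forward_kl(target, policy):
    """KL(target || policy); large when the policy ignores a rewarded mode."""
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0)


def relative_gain(new, old):
    return (new - old) / old


target = group_target([1.0, 1.0, 0.0, 2.0])
spread_loss = forward_kl(target, [0.25, 0.25, 0.25, 0.25])
collapsed_loss = forward_kl(target, [0.97, 0.01, 0.01, 0.01])
```

A policy that has collapsed onto the first trajectory pays a much larger forward KL penalty than one that still covers all rewarded trajectories, which is the mode-covering pressure the abstract describes. The reported scores are also consistent with the stated relative gains: `relative_gain(43.9, 40.1)` is about 0.095 and `relative_gain(43.1, 38.4)` is about 0.122.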


2605.19458 2026-05-20 cs.LG

Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning

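Tom Jacobs, Guido Montufar

Affiliations: CISPA Helmholtz Center; UCLA; MPI MiS

AI summary: The paper studies how the implicit bias of mirror flow shapes sparse and dense feature learning in homogeneous neural networks. A new balance equation and supporting experiments show how mirror maps affect optimization dynamics and classifier geometry.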

Comments 36 pages, 14 figures

Abstract

We study the max-margin solutions reached by mirror flow in deep neural networks with homogeneous activation functions. Extending classical results on gradient flow, we derive a novel balance equation for mirror flow from convex duality, enabling a characterization of the horizon function governing the induced margin. We further establish max-margin characterizations together with convergence rates and norm growth estimates. Finally, we support our theory through experiments on synthetic datasets and standard vision tasks. Concretely, we show that: (1) distinct non-homogeneous mirror maps can induce the same max-margin solution; (2) convergence can be extremely slow, including exponentially slow regimes; and (3) although all considered mirror maps exhibit feature learning, they can produce markedly different representations, ranging from sparse to dense neuron activations. Together, these results provide a unified perspective on sparse and dense feature learning in homogeneous neural networks, highlighting how mirror maps shape both optimization dynamics and the geometry of the learned classifiers.
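
For readers unfamiliar with the term, mirror flow is usually defined as follows. This is the standard textbook form rather than the paper's own notation, and the symbols $R$, $L$, and $\theta_t$ are chosen here for illustration.

```latex
\frac{d}{dt}\,\nabla R(\theta_t) = -\nabla L(\theta_t)
```

Here $R$ is a strictly convex potential whose gradient $\nabla R$ is the mirror map, and $L$ is the training loss. Choosing $R(\theta) = \tfrac{1}{2}\lVert\theta\rVert_2^2$ gives $\nabla R(\theta) = \theta$ and recovers ordinary gradient flow, $\dot{\theta}_t = -\nabla L(\theta_t)$, so different choices of $R$ change the geometry in which the same loss is descended.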


2605.19457 2026-05-20 cs.AI

Generative Auto-Bidding with Unified Modeling and Exploration

Mingming Zhang, Feiqing Zhuang, Na Li, Shengjie Sun, Xiaowei Chen, Junxiong Zhu, Fei Xiao, Keping Yang, Lixin Zou, Chenliang Li

Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China; Taobao & Tmall Group of Alibaba, Hangzhou, China

AI summary: The paper proposes GUIDE, a framework that combines directed exploration with a safe fallback mechanism to balance exploration and safety when generative models are used for auto-bidding.

Comments 11 pages, SIGIR 2026

DOI: 10.1145/3805712.3809661

Abstract

Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.
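
The final "select" step of the pipeline can be sketched as a comparison of two candidate bids under a learned critic. This is an illustrative assumption, not GUIDE's implementation: the `margin` parameter, the function names, and the toy critic below are all hypothetical.

```python
def select_action(q_value, state, explore_action, fallback_action, margin=0.0):
    """Choose between an exploratory action and a safe fallback.

    The exploratory action is used only when the critic scores it above
    the fallback by more than `margin`; otherwise the fallback is kept.
    """
    explore_score = q_value(state, explore_action)
    fallback_score = q_value(state, fallback_action)
    if explore_score > fallback_score + margin:
        return explore_action
    return fallback_action


def toy_q(state, bid):
    """Hypothetical critic: bids closer to the state's ideal bid score higher."""
    return -abs(bid - state["ideal_bid"])


state = {"ideal_bid": 1.0}
mild = select_action(toy_q, state, explore_action=1.2, fallback_action=0.7)
risky = select_action(toy_q, state, explore_action=2.5, fallback_action=0.7)
```

A mildly exploratory bid that the critic rates above the fallback is accepted, while a bid far from what the critic considers good is replaced by the fallback. Raising `margin` would make the selector more conservative.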


2605.19447 2026-05-20 cs.AI

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen

Affiliations: Tongji University; Shanghai AI Laboratory; Fudan University; The Chinese University of Hong Kong

AI summary: The paper studies how to use hindsight distillation selectively in multi-turn agents and proposes SERL, a reinforcement learning framework that combines task rewards with environment feedback and reaches high success rates on ALFWorld and WebShop.

Abstract

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
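
The division of labor described above, where the task reward sets the direction of the update and environment feedback sets where and how strongly it applies, can be sketched as a per-step weighting. The scoring of feedback, the `baseline` threshold, and the function name are assumptions for illustration. The abstract does not give SERL's actual formula.

```python
def step_weights(task_reward, feedback_scores, baseline=0.5):
    """Per-step update weights for one trajectory (hypothetical helper).

    The sign comes from the trajectory-level task reward. The magnitude
    of each step's weight comes from how informative the environment
    feedback at that step was (non-negative scores, normalised to sum to 1).
    """
    direction = 1.0 if task_reward > baseline else -1.0
    total = sum(feedback_scores)
    if total == 0:  # no informative feedback: spread credit evenly
        magnitudes = [1.0 / len(feedback_scores)] * len(feedback_scores)
    else:
        magnitudes = [score / total for score in feedback_scores]
    return [direction * m for m in magnitudes]


# Three-step episode; step 2 triggered an informative error message.
success_weights = step_weights(1.0, [0.0, 3.0, 1.0])
failure_weights = step_weights(0.0, [0.0, 3.0, 1.0])
```

The same feedback profile concentrates credit on the second step when the episode succeeds and concentrates blame there when it fails, instead of spreading one success-or-failure signal evenly across all actions.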


2605.19446 2026-05-20 cs.CV cs.AI

Targeted Downstream-Agnostic Attack

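Zhuxin Lei, Ziyuan Yang, Yi Zhang

Affiliations: College of Computer Science, Sichuan University

AI summary: The paper proposes a targeted downstream-agnostic attack (TDAA) under a stricter threat model that requires attacks to be both targeted and downstream-agnostic, addressing the difficulty that the downstream task is unknown and the encoder produces no predictions directly. A threat image serves as a feature-level anchor, forming a task-agnostic bridge that exposes the victim encoder's vulnerabilities.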

Abstract

Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, where an attack is successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model requiring the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the 'threat image', pre-selected by the attacker as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Unlike previous DAA methods that generate a single shared perturbation for all samples, which often fails due to image diversity, our method adopts an example-specific paradigm. This generates tailored perturbations for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.


2605.19436 2026-05-20 cs.LG cs.CL cs.CV

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan

Affiliations: MBZUAI; Linköping University; Australian National University

AI summary: The paper proposes CEPO, which uses contrastive evidence policy optimization to address the problems of self-distillation in RLVR, improving model performance by separating decisive reasoning steps from filler.

Comments 9 pages

Abstract

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model's baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just "does the correct answer favor this token?" but "does the correct answer favor it while the wrong answer disfavors it?" A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at https://github.com/ahmedheakl/CEPO.
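
The contrastive question in this abstract can be made concrete with a toy per-token score. The sketch below is an illustration only, not the estimator from the paper: it assumes three hypothetical per-token log-probability lists (under the plain policy, under a teacher conditioned on the correct answer, and under a teacher conditioned on a wrong answer), and the function name `contrastive_credit` and its scoring rule are made up for this example. Credit is assigned only when the correct-answer teacher raises a token's probability while the wrong-answer teacher lowers it.

```python
def contrastive_credit(lp_base, lp_correct, lp_wrong):
    """Toy per-token credit (hypothetical rule, not the CEPO objective).

    Each argument is a list of per-token log-probabilities for the same
    rollout: under the plain policy, under a teacher conditioned on the
    correct answer, and under a teacher conditioned on a wrong answer.
    """
    credits = []
    for base, correct, wrong in zip(lp_base, lp_correct, lp_wrong):
        favored = correct - base      # > 0: the correct answer favors the token
        disfavored = base - wrong     # > 0: the wrong answer disfavors the token
        if favored > 0 and disfavored > 0:
            credits.append(favored + disfavored)  # behaves like a decisive step
        else:
            credits.append(0.0)                   # treated as filler
    return credits


# Token 0 is unaffected by either teacher; token 1 is favored by the
# correct-answer teacher and disfavored by the wrong-answer teacher.
lp_base = [-1.0, -2.0]
lp_correct = [-1.0, -0.5]
lp_wrong = [-1.0, -4.0]
print(contrastive_credit(lp_base, lp_correct, lp_wrong))  # [0.0, 3.5]
```

In this toy rule a token that either teacher leaves unchanged receives zero credit, which mirrors the abstract's statement that the improvement vanishes at filler positions.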

2605.19435 2026-05-20 cs.CV cs.AI

KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

Maya Yanko, Yoli Shavit

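Affiliations: Faculty of Engineering, Bar-Ilan University

AI summary: This paper proposes KappaPlace, a framework for learning uncertainty-aware visual place recognition representations. Its prototype-anchored supervision strategy uses latent class representatives as probabilistic targets to address poorly calibrated uncertainty estimates in visual place recognition, improving the reliability of navigation systems.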

Abstract

Visual Place Recognition (VPR) is critical for autonomous navigation, yet state-of-the-art methods lack well-calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety-critical robotics. We propose KappaPlace, a principled framework for learning uncertainty-aware VPR representations. Our core contribution is a Prototype-Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises-Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query-centric view, we derive a novel match-level formulation to quantify the reliability of specific query-reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint-training variant and a post-training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well-calibrated signal that enables reliable decision-making within the VPR pipeline. Our code is available at: https://github.com/mayayank95/UncertaintyAwareVPR
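
To show why a von Mises-Fisher concentration parameter can act as an uncertainty signal, here is a minimal sketch that evaluates the vMF log-density in three dimensions, where the normalizer has the closed form kappa / (4*pi*sinh(kappa)). This is an assumption made for brevity: real VPR descriptors are high-dimensional and need a Bessel-function normalizer, and the function name `vmf_log_density_3d` is hypothetical rather than taken from the KappaPlace code.

```python
import math


def vmf_log_density_3d(x, mu, kappa):
    """Log-density of a von Mises-Fisher distribution on the unit sphere in 3D.

    Uses f(x) = kappa / (4*pi*sinh(kappa)) * exp(kappa * <mu, x>), which is
    valid for unit vectors x and mu and for kappa > 0.
    """
    dot = sum(a * b for a, b in zip(x, mu))
    # log(sinh(k)) = k + log(1 - exp(-2k)) - log(2), stable for large k.
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    log_norm = math.log(kappa) - math.log(4.0 * math.pi) - log_sinh
    return log_norm + kappa * dot


mu = (0.0, 0.0, 1.0)
aligned = (0.0, 0.0, 1.0)
opposite = (0.0, 0.0, -1.0)
for kappa in (1.0, 10.0):
    print(kappa, vmf_log_density_3d(aligned, mu, kappa),
          vmf_log_density_3d(opposite, mu, kappa))
```

A larger kappa concentrates mass around the mean direction, so the density rises for aligned descriptors and falls for distant ones; a small kappa approaches the uniform distribution on the sphere, which is the sense in which low concentration signals high uncertainty.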

2605.19433 2026-05-20 cs.CL cs.AI

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

Bing Wang, Shaotian Yan, Chen Shen, Kaiyuan Liu, Sinan Fan, Ximing Li, Rui Miao, Xiaosong Yuan, Zhanming Shen, Jieping Ye

Affiliations: College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering, MoE, Jilin University; Tongyi Lab, Alibaba Group; College of Computer Science and Technology, Zhejiang University; School of Artificial Intelligence, Jilin University

AI summary: This paper proposes MOTAB, a new LLM reasoning distillation method that dynamically monitors the student model's generation and backtracks when it strays beyond a safety boundary. This mitigates the dual exposure bias that conventional distillation incurs from the mismatch between training distributions and inference contexts, improving reasoning performance.

Comments 26 pages, 8 figures

Abstract

Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories and suffers from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure bias problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student's on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.

2605.19431 2026-05-20 cs.RO

Self-assembling Modular Aerial Robot for Versatile Aerial Tasks

Junichiro Sugihara, Masaki Kitagawa, Jinjie Li, Yunong Li, Takuzumi Nishio, Kei Okada, Moju Zhao

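Affiliations: Department of Mechanical Engineering, The University of Tokyo; Department of Mechano Informatics, The University of Tokyo

AI summary: This paper presents LEGION, a self-assembling modular aerial robot that achieves cooperative manipulation through in-flight self-assembly. By combining nimble maneuverability with reconfigurability, it lets aerial robots shift from passive observers to active participants, broadening the scope of aerial physical interaction.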

Abstract

Multirotor aerial robots excel at maneuvering in three-dimensional space, and recent advances enable nimble navigation in cluttered and confined environments, especially for small airframes. By contrast, platforms built for high-altitude work tend to be larger to deliver high thrust for stable physical interaction with the environment. However, these conflicting design requirements create a long-standing trade-off between nimble navigation and robust aerial manipulation. Here, we present LEGION units, which are reconfigurable modular aerial robots capable of in-flight self-assembly for cooperative manipulation, drawing inspiration from the self-organized collectives formed by ants. Each unit retains nimble maneuverability while joint-equipped docking interfaces at both ends enable end-to-end self-assembly into a flying manipulator. We show that multiple units autonomously dock in flight; once latched, they maintain a zero-clearance interlock by controlling the contact force and torque, enabling reliable aggregation and articulated motion even outdoors. We further show that self-reconfigurability enables morphological switching between nimble individual flight and collective articulated manipulation, while realizing core in-flight manipulation primitives including pushing, pulling, rotating, grasping, and carrying. LEGION's self-organization enables aerial robots, especially in swarms, to shift from passive observers to active participants in their environment, broadening the scope of aerial physical interaction.

2605.19425 2026-05-20 cs.LG cs.AI

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

Yuchun Miao, Sen Zhang, Yuqi Zhang, Yaorui Shi, Qi Gu, Xunliang Cai, Lefei Zhang

Affiliations: National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University; Meituan Longcat Team; The University of Sydney; University of Science and Technology of China

AI summary: This paper proposes Dynamic Gradient Gating (DGG), which monitors the lm_head gradient norm in real time to detect and block harmful gradient propagation, improving sample efficiency and training speed.

Comments 23 pages, 10 figures

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for advanced reasoning in Large Language Models (LLMs), but rollout samples are expensive to obtain, making sample efficiency a critical bottleneck. A natural remedy is to reuse each rollout batch for multiple gradient updates, a standard practice in classical RL. Yet in RLVR, this amplifies policy shift, leading to severe performance degradation. Detecting the onset of degradation early enough to stop reuse remains an open and challenging problem. We close this gap by identifying the Disproportionate Weight Divergence (DWD) phenomenon: performance degradation is synchronized with a sharp surge in the lm_head weight change, while intermediate layers remain stable. Empirically, we verify that DWD emerges consistently across diverse LLMs and tasks. Theoretically, we prove that (i) harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated, and (ii) the lm_head gradient norm lower-bounds the policy divergence. These results establish the lm_head gradient norm as a principled, real-time signal of catastrophic policy shift. Guided by this insight, we propose Dynamic Gradient Gating (DGG), a lightweight intervention that monitors the lm_head gradient norm in real time and intercepts harmful gradients before they corrupt the optimizer. DGG consistently matches or exceeds the standard single-use baseline, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks.
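
The idea of gating updates on a gradient-norm signal can be sketched in a few lines. The rule below is an assumption made for illustration and is not the DGG criterion from the paper: it blocks a step when the monitored norm exceeds a fixed multiple of a running average of previously accepted norms. The function name `gate_updates` and the `ratio` and `momentum` parameters are hypothetical.

```python
def gate_updates(grad_norms, ratio=3.0, momentum=0.9):
    """Return one boolean per step: True = apply the update, False = block it.

    Hypothetical rule: a step is blocked when its gradient norm exceeds
    `ratio` times a running average of previously accepted norms.
    """
    decisions = []
    running = None
    for norm in grad_norms:
        if running is not None and norm > ratio * running:
            decisions.append(False)  # spike: keep it out of the optimizer
            continue                 # do not let the spike move the average
        decisions.append(True)
        if running is None:
            running = norm
        else:
            running = momentum * running + (1 - momentum) * norm
    return decisions


# A sudden surge at step 3 is blocked; ordinary steps pass through.
norms = [1.0, 1.1, 0.9, 5.0, 1.0]
print(gate_updates(norms))  # [True, True, True, False, True]
```

In a real training loop the monitored quantity would be the norm of the lm_head gradient for the current reuse of a rollout batch, and a blocked step would typically also end reuse of that batch; both details are left out of this sketch.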

2605.19420 2026-05-20 cs.RO

Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

Kaijie Yun, Yue Chen

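Affiliations: Harbin Institute of Technology; JD AI Research

AI summary: This paper proposes a unified vision-language framework that replaces single-point regression with a dual-heatmap representation to bridge the gap between semantic instructions and physical reachability, improving the robustness and performance of cross-embodiment semantic navigation.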

Abstract

Grounding open-ended semantic instructions into physically executable local goals is a fundamental challenge in human-robot interaction. While existing navigation frameworks often regress deterministic waypoints, this rigid formulation collapses spatial uncertainty and frequently targets non-traversable object centers, leading to severe execution failures. In this work, we focus on the practical setting of in-FOV semantic navigation, where a robot receives concise, interleaved multimodal (text and image) prompts. To bridge the gap between abstract semantic intent and physical reachability, we propose a unified Vision-Language framework that abandons single-point regression in favor of a Dual-Heatmap representation. Our framework predicts a navigation affordance heatmap that captures continuous reachable regions, coupled with a facing heatmap for orientation constraints. These dense outputs inherently function as a differentiable semantic potential field, integrating seamlessly with downstream local planners. To support this paradigm, we build a fully automated, foundation-model-assisted synthetic data pipeline and establish a comprehensive simulation benchmark. Extensive experiments demonstrate that our framework achieves state-of-the-art performance among comparable 8B baselines. Crucially, a feature-fusion study and simulation studies across diverse robot embodiments (Jetbot, H1, Aliengo) reveal that explicit heatmap prediction drastically improves the Affordance Rate (AR). By placing targets reliably in executable free space, our framework effectively mitigates the brittleness of point regression, offering a transferable path toward safe cross-embodiment semantic navigation.

2605.19418 2026-05-20 cs.AI

Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

Longgang He, Longzhu He, Daojing He, Chaozhuo Li

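Affiliations: Harbin Institute of Technology (Shenzhen); Beijing University of Posts and Telecommunications

AI summary: This paper proposes the SIGMA framework, which uses signed graph modeling to explicitly capture trust, conflict, and neutral relations among agents, improving the reasoning ability and conflict resilience of multi-agent systems.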

Abstract

LLM-based multi-agent systems (MAS) have demonstrated strong reasoning and decision-making capabilities that consistently surpass those of single LLM agents. However, their performance often suffers from naive aggregation mechanisms that assume uniformly cooperative interactions. Upon close inspection, we observe that existing graph-based MAS frameworks (1) propagate errors unchecked when conflicting signals arise, and (2) lack explicit modeling of conflicting inter-agent relations as well as structural awareness, failing to identify reliable interaction patterns. To bridge this gap, we introduce SIGMA, a novel SIgned Graph-informed Multi-Agent reasoning framework that explicitly captures trust, conflict, and neutral relations among agents via a signed relational graph. Specifically, given a query, SIGMA first selects a set of relevant and diverse agents, then constructs a structured signed interaction graph with confidence-weighted edges. Reasoning proceeds through conflict-aware signed message passing, which reinforces information from trustworthy agents while suppressing conflicting signals, and terminates with a structure- and conflict-aware weighted aggregation to yield globally consistent and conflict-resilient predictions. Extensive experiments on six benchmark datasets, across multiple LLM backbones and diverse multi-agent configurations, demonstrate that SIGMA consistently outperforms state-of-the-art baselines, achieving notable gains in both accuracy and conflict-resilient performance.
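
A toy version of message passing on a signed graph may help make the trust and conflict roles concrete. The sketch below is an assumption-laden illustration, not SIGMA's update rule: each agent holds a scalar score for one candidate answer, positive edge weights stand for trust, negative weights for conflict, and the weight magnitude plays the role of edge confidence. The name `signed_message_passing` and the mixing parameter `step` are hypothetical.

```python
def signed_message_passing(scores, edges, rounds=1, step=0.5):
    """Toy signed propagation over per-agent scores in [-1, 1].

    scores: dict agent -> initial score for one candidate answer.
    edges:  dict (src, dst) -> signed weight; > 0 trust, < 0 conflict.
    """
    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for agent, own in current.items():
            incoming = [(w, current[src])
                        for (src, dst), w in edges.items() if dst == agent]
            total = sum(abs(w) for w, _ in incoming)
            if total == 0:
                updated[agent] = own  # no informative neighbors
                continue
            # Trusted neighbors pull toward their score; a negative weight
            # flips the sign of a conflicting neighbor's signal.
            message = sum(w * s for w, s in incoming) / total
            updated[agent] = (1 - step) * own + step * message
        current = updated
    return current


scores = {"a": 1.0, "b": 0.0, "c": 0.0}
edges = {("a", "b"): 1.0, ("a", "c"): -1.0}
print(signed_message_passing(scores, edges))
# {'a': 1.0, 'b': 0.5, 'c': -0.5}
```

Here agent b, which trusts a, moves toward a's score, while agent c, which conflicts with a, moves away from it. Flipping the sign is only one way to treat conflict edges; damping them toward zero would be an equally plausible reading of "suppressing conflicting signals".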

2605.19410 2026-05-20 cs.CV

Vision Harnessing Agent for Open Ad-hoc Segmentation

Zilin Wang, Stella X. Yu

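Affiliations: University of Michigan

AI summary: This paper proposes VASA, a vision-guided ad-hoc segmentation agent that combines a vision-language model, a segmentation foundation model, and a visually grounded workflow to perform ad-hoc segmentation without training. It performs strongly on benchmarks including PARS and RefCOCOm.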

Comments 23 pages, 11 figures

Abstract

Segmentation has become easy when the concept is known, requiring only retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.

2605.19407 2026-05-20 cs.LG cs.AI

A Bitter Lesson for Data Filtering

Christopher Mohri, John Duchi, Tatsunori Hashimoto

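Affiliations: Department of Computer Science; Departments of Statistics and Electrical Engineering; Stanford University

AI summary: This paper studies data filtering for large-scale model pretraining and finds that, even with ample compute, filtering data is not the best choice, because sufficiently trained large models can tolerate low-quality data and even benefit from it.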

2605.19403 2026-05-20 cs.LG

TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics

Alexander Kyuroson, Denis Kleyko, Marcus Liwicki

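Affiliations: Luleå University of Technology; Örebro University; RISE Research Institutes of Sweden

AI summary: This paper proposes the TIDE architecture, which stabilizes temporal dynamics with an asymmetric excitatory-inhibitory network. It combines Wilson-Cowan dynamics with lateral inhibition to improve biological realism and learning performance, and experiments show that it outperforms CTM in both training time and accuracy.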

Abstract

The recent Continuous Thought Machine (CTM) architecture decouples internal computation from external inputs via neural dynamics, but relies on multi-layer perceptrons without stability guarantees. We propose to model neural dynamics using asymmetric Excitatory-Inhibitory (E-I) networks, which can be stabilized via principles from network theory and can be expressed as energy-based systems optimized through a game-theoretic loss. Building on this perspective, we introduce the Temporal Inhibitory-Excitatory Dynamic Engine (TIDE), a neuro-inspired architecture that computes internal representations through neural dynamics stabilized by incorporating Wilson-Cowan dynamics and lateral inhibition. TIDE balances biological realism, for instance by using Hierarchical Receptive Fields and enforcing Dale's principle to ensure a realistic 80:20 E-I balance ratio, with an end-to-end trainable architecture. The aim of this paper is to introduce a new architecture that brings neuro-inspired learning to the forefront. We present proofs of convergence, stability, and complexity bounds, along with empirical ablation studies. Overall, TIDE surpasses CTM with under 50% of the training time and improves top-1 accuracy by an average of +1.65% on ImageNet under various perturbations.
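
For readers unfamiliar with Wilson-Cowan dynamics, the following is a minimal sketch of a simplified two-population model (one excitatory rate, one inhibitory rate) integrated with forward Euler. It omits the refractory terms of the original equations, and every weight, time constant, and drive value is an arbitrary assumption chosen for illustration; none of it is taken from the TIDE implementation. Keeping the weight magnitudes nonnegative with fixed signs in the equations mimics Dale's principle, under which a unit is either excitatory or inhibitory.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def wilson_cowan_step(e, i, drive_e, drive_i, dt=0.1,
                      w_ee=1.5, w_ei=1.0, w_ie=1.0, w_ii=0.5,
                      tau_e=1.0, tau_i=2.0):
    """One forward-Euler step of a simplified Wilson-Cowan E-I pair.

    tau_e * dE/dt = -E + S(w_ee*E - w_ei*I + drive_e)
    tau_i * dI/dt = -I + S(w_ie*E - w_ii*I + drive_i)
    """
    de = (-e + sigmoid(w_ee * e - w_ei * i + drive_e)) / tau_e
    di = (-i + sigmoid(w_ie * e - w_ii * i + drive_i)) / tau_i
    return e + dt * de, i + dt * di


e, i = 0.1, 0.1
for _ in range(200):
    e, i = wilson_cowan_step(e, i, drive_e=0.5, drive_i=0.0)
print(e, i)
```

Because each Euler step with dt smaller than both time constants is a convex combination of the current rate and a sigmoid output, the rates stay inside (0, 1), a simple boundedness property and a much weaker one than the stability guarantees the abstract refers to.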

2605.19394 2026-05-20 cs.CL cs.AI

EmbGen: Teaching with Reassembled Corpora

Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva

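Affiliations: Commonwealth Bank of Australia

AI summary: This paper proposes EmbGen, a pipeline that generates synthetic data by reassembling a corpus, aiming to improve instruction-tuned models on datasets of varying semantic heterogeneity. It decomposes the corpus into entity-description pairs, reassembles them by embedding similarity, and generates question-answer pairs through cluster-based sampling, raising Binary Accuracy under a fixed token budget.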

Comments 8 pages, 4 images (32 pages with appendix)

Abstract

Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which are expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). For evaluation we use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composite metric combining Factual Accuracy and Completeness. Relative to the strongest baseline, EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at the 5M-token budget and by 88.9% at the 20M-token budget, while remaining competitive on the other datasets with lower heterogeneity.

2605.19393 2026-05-20 cs.CV cs.LG

Neuron Incidence Redistribution for Fairness in Medical Image Classification

Abin Shoby, Lyle John Palmer, Nikhil Cherian Kurian

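AI summary: This paper proposes Neuron Incidence Redistribution (NIR), a lightweight regularization method that improves fairness in medical image classification by reducing the variance of predicted-probability-weighted mean activations. Experiments show substantially lower TPR and FPR disparities across demographic groups such as age, gender, and race.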

Comments 4 Pages, 1 Figure

Abstract

Deep learning models for medical image classification are susceptible to subgroup performance disparities across demographic attributes such as age, gender, and race. We identify a latent representational mechanism underlying these disparities: in transfer-learned models, the dominant penultimate-layer activation channel under positive predictions is co-activated by both disease-positive samples and privileged demographic groups (male, older patients), producing over-diagnosis; conversely, the dominant channel under negative predictions is co-activated by disadvantaged groups (female, younger patients), producing systematic under-diagnosis. To address this, we propose Neuron Incidence Redistribution (NIR), a lightweight regularization method that penalizes the variance of predicted-probability-weighted mean activations across penultimate-layer neurons, requiring no demographic labels at training time. On HAM10000, TPR disparity drops from 10.81% to 0.93% across age groups and from 12.04% to 0.74% across gender, with a marginal AUC improvement of 0.51 points. On Harvard OCT-RNFL, NIR reduces FPR disparity for race (from 15.68% to 10.66%) and age (from 12.69% to 1.80%), demonstrating that distributing latent disease evidence across the full penultimate layer is a principled and effective strategy for improving demographic fairness in medical AI.
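
The regularizer described in this abstract, a penalty on the variance across penultimate-layer neurons of predicted-probability-weighted mean activations, can be written down directly. The sketch below is one plausible reading and not the authors' code: the normalization by the sum of probabilities, the use of a single positive-class probability per sample, and the function name `nir_penalty` are all assumptions. It uses plain Python lists so it runs without dependencies; in training the same expression would be computed on tensors in an autodiff framework so it can be added to the loss.

```python
def nir_penalty(activations, probs):
    """Variance across neurons of probability-weighted mean activations.

    activations: list of per-sample lists (penultimate-layer activations).
    probs:       predicted positive-class probability for each sample.
    """
    total_p = sum(probs)
    n_neurons = len(activations[0])
    weighted_means = [
        sum(p * sample[d] for p, sample in zip(probs, activations)) / total_p
        for d in range(n_neurons)
    ]
    mean = sum(weighted_means) / n_neurons
    return sum((m - mean) ** 2 for m in weighted_means) / n_neurons


probs = [0.5, 0.5]
concentrated = [[4.0, 0.0], [4.0, 0.0]]  # evidence sits in one neuron
spread = [[2.0, 2.0], [2.0, 2.0]]        # same total, spread evenly
print(nir_penalty(concentrated, probs))  # 4.0
print(nir_penalty(spread, probs))        # 0.0
```

The penalty is large when one neuron carries most of the prediction-weighted activation and zero when the activation is spread evenly, which matches the abstract's goal of distributing disease evidence across the full penultimate layer. Note that no demographic labels appear anywhere in the computation.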

2605.19392 2026-05-20 cs.LG

Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach

Yi Feng, Weiming Ou, Xiao Wang

Affiliations: Aarhus University, Aarhus, Denmark; MoE Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics, Shanghai, China; Shanghai University of Finance and Economics, Shanghai, China

AI summary: This paper studies the dynamics of Adam-DA in zero-sum games through an ODE approach, showing that the momentum parameters play the opposite role in zero-sum games to the one they play in minimization problems, and validates the finding with GAN experiments.

Abstract

The remarkable success of Adam in training neural networks has naturally led to the widespread use of its descent-ascent counterpart, Adam-DA, for solving zero-sum games. Despite its popularity in practice, a rigorous theoretical understanding of Adam-DA still lags behind. In this paper, we derive ordinary differential equations (ODEs) that serve as continuous-time limits of Adam-DA. These ODEs closely approximate the discrete-time dynamics of Adam-DA, providing a tractable analytical framework for understanding its behavior in zero-sum games. Using this ODE approach, we investigate two fundamental aspects of Adam-DA: local convergence and implicit gradient regularization. Our analysis reveals that the roles of the first- and second-order momentum parameters in zero-sum games are exactly the opposite of their well-documented effects in minimization problems. We validate these predictions through GAN experiments across multiple architectures and datasets, demonstrating the practical implications of this reversed momentum effect.
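
As a concrete reference for the discrete-time algorithm being analyzed, here is a minimal sketch of Adam descent-ascent on the bilinear game f(x, y) = x * y. It assumes simultaneous updates and separate moment estimates for each player, which is one common reading of "Adam-DA" and may differ from the exact variant studied in the paper. The function name `adam_da` and the default hyperparameters are illustrative.

```python
import math


def adam_da(x, y, steps, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Simultaneous Adam descent-ascent on the bilinear game f(x, y) = x * y.

    x takes Adam descent steps on f; y takes Adam ascent steps on f.
    """
    mx = vx = my = vy = 0.0
    for t in range(1, steps + 1):
        gx, gy = y, x  # df/dx = y and df/dy = x at the current iterate
        mx = beta1 * mx + (1 - beta1) * gx
        vx = beta2 * vx + (1 - beta2) * gx * gx
        my = beta1 * my + (1 - beta1) * gy
        vy = beta2 * vy + (1 - beta2) * gy * gy
        # Bias-corrected first- and second-moment estimates.
        mx_hat, vx_hat = mx / (1 - beta1 ** t), vx / (1 - beta2 ** t)
        my_hat, vy_hat = my / (1 - beta1 ** t), vy / (1 - beta2 ** t)
        x -= lr * mx_hat / (math.sqrt(vx_hat) + eps)  # descent for the min player
        y += lr * my_hat / (math.sqrt(vy_hat) + eps)  # ascent for the max player
    return x, y


print(adam_da(1.0, 1.0, steps=1))    # roughly (0.99, 1.01)
print(adam_da(1.0, 1.0, steps=500))
```

After one step from (1, 1) each player moves by approximately `lr`, since the bias-corrected ratio reduces to the sign of the gradient. Varying `beta1` and `beta2` in this loop gives a cheap way to observe the momentum effects the abstract describes, although the sketch itself makes no claim about which settings converge.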

2605.19390 2026-05-20 cs.CV

LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

Chaoyue Li, Yongxue Xu, Jie Feng, Jiayu Ding

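Affiliations: Huazhong University of Science and Technology; Sun Yat-sen University; Beihang University; Peking University

AI summary: This paper proposes LMM-Track4D for trajectory-grounded multi-turn spatiotemporal dialogue. It combines RTGE, the TRK state token, and the OSK-RA decoder to improve LMM performance on 4D dynamic reasoning, and experiments suggest that explicit dynamic state modeling is an effective design principle.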

Abstract

Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray--Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at https://github.com/mikubaka88/LMM-Track4D.

2605.19386 2026-05-20 cs.CV

MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

Yang Yang, Yiyan Wang, Zheming Liu, Naoya Iwamoto

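Affiliations: The University of Osaka; The University of Tokyo; Huawei Technologies Japan K.K.

AI summary: This paper proposes MatPhys, which predicts spring-mass parameters from single-view video and addresses the limitations of existing methods regarding material assumptions and cross-scene consistency, improving the accuracy and generalization of deformable object simulation.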

Comments Submitted to SIGGRAPH Asia 2026

详情

AI中文摘要

从视频中重建可变形物体的模拟准备版本对于视觉、图形学和机器人学至关重要。现有的物理驱动方法可以从视频中恢复物理数字双胞胎，但它们有两个根本性的局限性：它们通常假设物体整体具有均匀的材料属性，且其场景特定的逆向优化与单目观测的固有模糊性相结合，导致相同材料在不同场景或交互中参数不一致。我们提出了MatPhys，一种材料感知的前馈框架，通过单视角视频预测弹簧-质量参数，通过两个耦合的设计解决这两个问题。为了放松均匀材料假设，我们使用DINO特征将物体分解为具有语义意义的部分，并查询部分级材料先验，为每个部分分配其自身的物理行为。为了强制跨场景一致性，我们引入了一个学习的材料代码本，其中包含共享的材料嵌入，作为外观和物理之间的桥梁，并进一步使用部分级先验作为参考分布，约束解码器，使得相同材料在不同场景和交互中产生一致的参数。这些设计将一个欠约束的单目问题转化为基于共享、可重用材料概念的前馈推断。实验表明，我们的方法在重建和未来预测方面与每场景优化基线相匹配，同时在未见过的交互和物体上实现了更强的泛化能力，具有更一致的物理参数。

English abstract

Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.
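
The codebook idea in the abstract above can be illustrated with a toy lookup. This is only a sketch under stated assumptions, not the authors' implementation: the hard nearest-neighbor assignment by cosine similarity, the names `assign_materials` and `stiffness_per_code`, and all numeric values are hypothetical. It shows why parts that map to the same shared material embedding end up with the same spring parameter regardless of which scene they come from.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_materials(part_features, codebook):
    """Return, for each part, the index of its most similar codebook embedding."""
    return [
        max(range(len(codebook)), key=lambda k: cosine(feature, codebook[k]))
        for feature in part_features
    ]

# Hypothetical shared material embeddings and one spring stiffness per code.
codebook = [[1.0, 0.0], [0.0, 1.0]]
stiffness_per_code = [500.0, 20.0]

# Hypothetical part-level appearance features (stand-ins for pooled DINO features).
part_features = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

codes = assign_materials(part_features, codebook)
stiffness = [stiffness_per_code[k] for k in codes]
print(codes, stiffness)
```

Parts 0 and 2 select the same code and therefore receive identical stiffness, which is the consistency property the abstract describes; a learned decoder constrained by a part-level prior would take the place of the fixed table used here.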


2605.19382 2026-05-20 cs.AI

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning


Qiran Zhang, Yuheng Wang, Runde Yang, Lin Wu, Jingru Fan, Shu Yao, Jie Zhang, Tianle Zhou, Huatao Li, Ruijie Shi, Yihan Li, Chen Qian

Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University

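AI summary: This paper introduces the PRISM benchmark, which uses 10,372 human-calibrated instruction-code pairs (20 times larger than prior benchmarks) to evaluate whether language models can generate spatially correct animated outputs, and reveals a significant gap between execution success rate and spatial pass rate.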



English abstract

Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.
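
The Execution-Spatial Gap mentioned in the abstract above can be made concrete with a small calculation. The sketch below assumes the gap is the per-model difference, in percentage points, between execution success rate and spatial pass rate, averaged across models; the abstract does not spell out the exact formula. The function name, model names, and rates are hypothetical placeholders, not figures from the paper.

```python
from statistics import mean

def execution_spatial_gap(results):
    """Average of (execution success rate - spatial pass rate) across models."""
    return mean(r["exec_rate"] - r["spatial_rate"] for r in results.values())

# Hypothetical placeholder rates in percent, not results from PRISM.
toy_results = {
    "model_a": {"exec_rate": 90.0, "spatial_rate": 50.0},
    "model_b": {"exec_rate": 80.0, "spatial_rate": 50.0},
}

gap = execution_spatial_gap(toy_results)
print(f"{gap:.1f}")
```

A large value means many programs run without error yet still fail the spatial check, which is why the abstract argues that evaluation should go beyond executability.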



EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

A dynamical systems view of training generative models and the memorization phenomenon

Drifting Objectives for Refining Discrete Diffusion Language Models

Sampling-Based Safe Reinforcement Learning

Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning

Generative Auto-Bidding with Unified Modeling and Exploration

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Targeted Downstream-Agnostic Attack

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

Self-assembling Modular Aerial Robot for Versatile Aerial Tasks

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation

Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

Vision Harnessing Agent for Open Ad-hoc Segmentation

A Bitter Lesson for Data Filtering

TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics

EmbGen: Teaching with Reassembled Corpora

Neuron Incidence Redistribution for Fairness in Medical Image Classification

Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach

LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning