arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30912 2026-06-01 cs.CV cs.CL

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang

Affiliations Harbin Institute of Technology; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Nankai University; Shanghai Jiao Tong University; Zhejiang University

AI summary Proposes EASE, which converts annotated evidence regions into a smoothed visual-token target that guides response-to-image attention during multimodal RL training, improving vision-language models on perception, hallucination, visual math, and multimodal reasoning benchmarks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.
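
The attention supervision described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian falloff around the box, the KL form of the loss, the `reward_threshold` gate, and all function names are assumptions chosen for illustration. The sketch turns one annotated box into a smoothed distribution over visual tokens and penalizes attention that deviates from it, but only on high-reward trajectories.

```python
import numpy as np

def evidence_target(box, grid_h, grid_w, sigma=1.0, eps=1e-8):
    """Turn a box (x0, y0, x1, y1) in normalized [0, 1] image coordinates
    into a smoothed probability distribution over grid_h * grid_w visual tokens.
    Hypothetical helper; the smoothing used in the paper may differ."""
    ys = (np.arange(grid_h) + 0.5) / grid_h   # token-center rows
    xs = (np.arange(grid_w) + 0.5) / grid_w   # token-center columns
    x0, y0, x1, y1 = box
    # Distance from each token center to the box (zero for tokens inside it).
    dx = np.maximum(np.maximum(x0 - xs, xs - x1), 0.0)
    dy = np.maximum(np.maximum(y0 - ys, ys - y1), 0.0)
    dist2 = dy[:, None] ** 2 + dx[None, :] ** 2
    # Gaussian falloff; sigma is expressed in units of token widths.
    scale = sigma / max(grid_h, grid_w)
    target = np.exp(-dist2 / (2.0 * scale ** 2)).ravel()
    return target / (target.sum() + eps)

def attention_loss(attn, target, reward, reward_threshold=1.0, eps=1e-8):
    """KL(target || attn) over visual tokens, applied only when the
    trajectory reward reaches reward_threshold (an assumed gating rule)."""
    if reward < reward_threshold:
        return 0.0
    attn = attn / (attn.sum() + eps)
    return float(np.sum(target * (np.log(target + eps) - np.log(attn + eps))))

# Usage: a 4x4 token grid with an evidence box in the upper-left quadrant.
target = evidence_target((0.25, 0.25, 0.5, 0.5), grid_h=4, grid_w=4)
uniform_attention = np.full(16, 1.0 / 16)
loss = attention_loss(uniform_attention, target, reward=1.0)
```

Because the target is only needed to compute the training loss, a setup like this leaves inference unchanged, which matches the abstract's statement that annotations serve as privileged training labels.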

2605.30911 2026-06-01 cs.CV cs.AI

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin, Jun Luo, Jiancheng Lv

Affiliations School of Computer Science, Engineering Research Center of Machine Learning and Industry Intelligence, Sichuan University; School of Computer and Information Engineering, Xiamen University of Technology; Department of Electrical and Computer Engineering, University of Hong Kong; College of Computing and Data Science, Nanyang Technological University

AI summary Decomposes architecture design into three dimensions (linguistic foundation, visual representation, semantic alignment) and introduces the CoSimUE benchmark to study systematically how architectural factors affect LVLM hallucination robustness; finds that scaling model parameters has limited effect, whereas stronger language foundations, visual encoders, and semantic alignment each reduce a different type of hallucination.

Abstract

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations. 5) Furthermore, cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.

2605.30910 2026-06-01 cs.LG

PINNs Failure Modes are Overfitting

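Nigel T. Andersen, Takashi Matsubara

Affiliations Graduate School of Information Science and Technology; RIKEN Center for Advanced Intelligence Project (AIP)

AI summary Shows by visualizing residuals that the failure modes of physics-informed neural networks stem from overfitting, and proposes regularization and double-backpropagation-based methods that eliminate them, achieving state-of-the-art performance on standard equations with fewer collocation points.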

Abstract

Physics-Informed Neural Networks (PINNs) are a common class of machine learning-based partial differential equation (PDE) solvers which train a network to represent a solution by minimizing a residual loss that encodes the PDE. Despite their successes, they are known to fail on certain simple equations, converging to an incorrect solution despite low loss. These failure modes have garnered significant attention in the literature over the past several years, motivating both architectural and optimization based solutions. By directly visualizing the residual, we show that failure modes are the result of overfitting: the loss is minimized on the collocation points, but not elsewhere. Applying regularization causes the failure modes to vanish. Finally, we extend double backpropagation over the full set of residuals, and use it to achieve state-of-the-art performance on four standard failure mode equations with up to $23\times$ fewer collocation points and a vanilla architecture.

2605.30906 2026-06-01 cs.RO cs.SY eess.SY

Trajectory Planning for Non-Communicating Mobile Robots using Inverse Optimal Control

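Nina Majer, Yannick Epple, Xin Ye, Stefan Schwab, Sören Hohmann

Affiliations FZI Research Center for Information Technology; Institute of Control Systems, Karlsruhe Institute of Technology

AI summary For efficient interaction among non-communicating mobile robots in collision-avoidance scenarios, proposes a combined trajectory planning and prediction algorithm that uses inverse optimal control to estimate unknown goal states and solves a joint prediction problem, so that all robots reach their goals sooner.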

Abstract

To enable efficient interaction of non-communicating mobile robots in collision avoidance scenarios, we present a novel combined trajectory planning and prediction algorithm. Inverse optimal control is used to estimate unknown goal states of all robots based on observed past trajectories. Each robot also takes the perspective of other robots in considering self-prediction and solves a joint prediction problem using the estimated goal states. The resulting predictions are then considered for planning. Simulation results of scenarios with 2-8 robots show that the median time until all vehicles reach their goals is 9.8% shorter than when planning with goal states estimated under a constant-acceleration assumption. Moreover, the proposed approach never leads to the solver being unable to find a solution to the planning or prediction problem.

2605.30904 2026-06-01 cs.CV

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

Luyuan Zhang, Siyuan Li, Zedong Wang, Qingsong Xie, Cheng Tan, Anna Wang, Yanhao Zhang, Chen Chen, Haonan Lu, Haoqian Wang

Affiliations Tsinghua University; Westlake University; Zhejiang University; Hong Kong University of Science and Technology; OPPO Shanghai AI Lab

AI summary Proposes MergeTok, a unified tokenizer that uses token merging to jointly optimize continuous VAE and discrete VQ tokenizers, combining high-fidelity reconstruction with semantically controllable discrete representations.

Comments 11 pages (main text), 7 figures. Preprint. Under review at NeurIPS 2026

Abstract

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.

2605.30903 2026-06-01 cs.LG cs.AI

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

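Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

Affiliations MIT LIDS; University of Massachusetts, Amherst; Adobe Research; University of Wisconsin-Madison

AI summary For data from multiple suboptimal demonstrators, proposes a feasible-reward-set framework that encodes each demonstrator's suboptimality as a linear constraint, shows that the joint feasible set shrinks monotonically, and provides recovery guarantees and an offline algorithm for high-dimensional environments.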

Abstract

Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.
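
A toy sketch can make the feasible-reward-set idea concrete. The abstract does not give the exact constraint, so the form below is an assumption: a reward vector `r` is feasible for a demonstrator with occupancy `mu_i` and declared suboptimality `eps_i` if no candidate occupancy beats `mu_i` by more than `eps_i` in expected reward. For a fixed finite candidate set this is a set of linear inequalities in `r`, and the joint feasible set is the intersection over demonstrators. The name `is_feasible` and the single-state example are hypothetical.

```python
import numpy as np

def is_feasible(r, demos, candidates, tol=1e-9):
    """Check whether reward vector r lies in the joint feasible set.

    demos: list of (mu_i, eps_i), an occupancy vector and its declared
        suboptimality level.
    candidates: array of shape (n, d); each row is the occupancy of a
        candidate policy (assumed finite here for illustration).
    Each demonstrator contributes the constraints
        candidates[j] @ r - mu_i @ r <= eps_i   for all j,
    which are linear in r.
    """
    best_value = np.max(candidates @ r)
    return all(best_value - mu @ r <= eps + tol for mu, eps in demos)

# Usage: a single-state problem with three actions, so an occupancy is just
# a distribution over actions and the candidates are the deterministic policies.
candidates = np.eye(3)
demo_a = (np.array([0.8, 0.1, 0.1]), 0.3)
demo_b = (np.array([0.1, 0.8, 0.1]), 0.5)

r = np.array([1.0, 0.0, 0.0])
feasible_one_demo = is_feasible(r, [demo_a], candidates)           # r satisfies demo_a alone
feasible_two_demos = is_feasible(r, [demo_a, demo_b], candidates)  # demo_b rules r out
```

Adding a demonstrator can only add constraints, so any reward that is feasible for the larger set of demonstrators is also feasible for the smaller one. This is the monotone shrinkage the abstract refers to, shown here only for this assumed constraint form.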

2605.30901 2026-06-01 cs.LG

Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity

Jun Tan, Qing Guo, Zicheng Xu, Jinglin Li, Qi Fang, Ning Gui

Affiliations School of Computer Science and Engineering, Central South University, Changsha, China

AI summary Proposes DensityFlow, a generative framework that uses a Neural ODE and a density score to construct robust counterfactual explanations that avoid low-density regions and remain valid under model multiplicity.

Comments 26 pages, 11 figures, accepted by ICML 2026

Abstract

Counterfactual explanations (CEs) are essential for actionable recourse, yet their reliability is often compromised in low-density regions, where classifiers exhibit high variance. Unlike existing methods that rely on expensive ensemble intersections to define stability, we propose DensityFlow, a generative framework that constructs robust CEs by adhering to the high-confidence data manifold. Specifically, we model the counterfactual generation as continuous-time dynamics parameterized by Neural ODE, guided by a differentiable density score to actively avoid uncertain, low-density areas. This density score is learned via Noise Contrastive Estimation, effectively leveraging a $(K+1)$-way discriminator to estimate density ratios. For black-box settings, we introduce a local proxy distillation mechanism that aligns a lightweight surrogate with the target model strictly within the trajectory of CE generation, enabling efficient gradient-based optimization with minimal queries. Experiments demonstrate that DensityFlow achieves superior validity under model multiplicity while significantly reducing query costs compared to ensemble-based baselines. Our implementation is available at https://github.com/G-AILab/DensityFlow.

2605.30900 2026-06-01 cs.AI physics.app-ph

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

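Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao, Chengliang Xu, Zeyu Wang, Zichao Chen, Bing Zhao, Hu Wei

Affiliations Alibaba Group

AI summary Proposes BilliardPhys-Bench, a benchmark that uses synthetic billiards environments to evaluate the physical reasoning of multimodal LLMs (collisions, wall bounces, final-position prediction); finds that models exhibit a "stasis bias" and that performance degrades as simulation time and scene complexity increase.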

Abstract

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call "stasis bias": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.

2605.30898 2026-06-01 cs.AI cs.CL

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

Affiliations School of Computer Science and Technology, Tongji University; Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen; School of Data Science, The Chinese University of Hong Kong, Shenzhen; College of Information Science and Electronic Engineering, Zhejiang University; Department of Computer Science and Engineering, Chinese University of Hong Kong

AI summary Proposes UniScale, a framework that unifies model routing and test-time scaling as a contextual multi-armed bandit problem and learns inference policies online via LinUCB, achieving a fine-grained and better quality-cost trade-off.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.
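
The bandit formulation in this abstract can be sketched with the standard disjoint LinUCB algorithm over a joint action space. The arm set (model name paired with a sample count), the context features, and the reward definition are hypothetical placeholders; the abstract does not specify them, nor the efficiency-aware learning and cost modeling that UniScale adds on top.

```python
import itertools
import numpy as np

class LinUCB:
    """Standard disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, arms, dim, alpha=1.0):
        self.arms = list(arms)
        self.alpha = alpha                              # exploration weight
        self.A = [np.eye(dim) for _ in self.arms]       # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in self.arms]     # per-arm reward sums

    def select(self, x):
        """Return the index of the arm with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                           # estimated reward weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x) # uncertainty bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm_idx, x, reward):
        """Fold one observed (context, reward) pair into the chosen arm."""
        self.A[arm_idx] += np.outer(x, x)
        self.b[arm_idx] += reward * x

# Hypothetical joint action space: which model to call and how many
# test-time samples to draw from it.
MODELS = ["small", "large"]
SAMPLES = [1, 4]
ARMS = list(itertools.product(MODELS, SAMPLES))

bandit = LinUCB(ARMS, dim=3)
context = np.array([1.0, 0.2, 0.7])     # placeholder request features
choice = bandit.select(context)
# In a real system the reward would combine answer quality and cost,
# for example quality - lam * cost; 0.5 is a stand-in value.
bandit.update(choice, context, reward=0.5)
```

Treating each (model, compute budget) pair as one arm is what places routing and test-time scaling in a single optimization space; the price is that the number of arms grows with the product of the two choices.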

2605.30894 2026-06-01 cs.CV

SteerFace: Debiasing Synthetic Face Generation via Adaptive Residue Perturbation

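Yuxi Mi, Qiuyang Yuan, Jianqing Xu, Yichun Zhou, Xuan Zhao, Jun Wang, Rizen Guo, Shuigeng Zhou

Affiliations Fudan University; Youtu Lab, Tencent; WeChat Pay Lab33, Tencent

AI summary To address the visual-tendency gap between synthetic and real face data distributions, proposes SteerFace, a framework that perturbs identity embeddings toward random orthogonal directions as a regularizer, suppressing the generator's reliance on non-identity visual cues and narrowing the synthetic-real gap.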

Abstract

The shortage of legally compliant data for face recognition training has sparked growing interest in using synthetic data as an alternative. While recent diffusion-based methods enable the generation of photorealistic face images with strong identity adherence and data diversity, their downstream recognition performance still exhibits a significant synthetic-real gap. This paper identifies visual tendency as a previously underexplored limitation, whereby synthetic data exhibit an unrealistic prevalence of visual attributes and thus deviate from the real-data distribution. Visual tendency can be attributed to the generator's conditioning on identity embeddings, through which co-occurring residual visual cues are unintentionally absorbed into learned identity semantics. To discourage the generator from exploiting such visual cues, this paper proposes SteerFace, a simple and efficient training framework that perturbs identity embeddings by steering them toward random orthogonal directions on the embedding hypersphere. The perturbation serves as an identity-preserving regularizer that penalizes the generator's reliance on non-identity components, as supported by theoretical analysis. This paper further introduces an adaptive strategy that learns perturbation strengths with both sample-wise preference and favorable overall statistics. Extensive experiments show that SteerFace effectively mitigates visual tendency, outperforms prior methods in downstream face recognition, and generalizes well across different training datasets and generation pipelines.
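
One way to read "steering toward a random orthogonal direction on the embedding hypersphere" is as a rotation of the unit embedding by a small angle. The sketch below implements that reading and is an assumption, not the paper's exact perturbation: the function name, the use of an angle as the perturbation strength, and the Gaussian sampling of the direction are all illustrative choices.

```python
import numpy as np

def steer_embedding(e, strength, rng):
    """Rotate the unit-normalized embedding e by `strength` radians toward a
    random unit direction orthogonal to e, staying on the unit hypersphere."""
    e = e / np.linalg.norm(e)
    g = rng.standard_normal(e.shape)
    u = g - (g @ e) * e            # remove the component along e
    u /= np.linalg.norm(u)         # random unit vector orthogonal to e
    return np.cos(strength) * e + np.sin(strength) * u

# Usage: perturb a 512-dimensional identity embedding by 0.2 radians.
rng = np.random.default_rng(0)
identity = rng.standard_normal(512)
perturbed = steer_embedding(identity, strength=0.2, rng=rng)
```

Under this construction the cosine similarity between the original and perturbed embeddings equals cos(strength), so the strength directly controls how much of the identity direction is preserved. The adaptive strategy mentioned in the abstract would learn this strength per sample rather than fix it.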

2605.30893 2026-06-01 cs.CV

Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation

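Qi Chen, Shuhan Ding, Yu Gu, Nan Liu, Jiang Bian, Alan Yuille, Zongwei Zhou, Jingjing Fu

Affiliations Department of Computer Science, Johns Hopkins University; Duke-NUS Medical School; Microsoft Research

AI summary Finds that a Foundation VAE pretrained on natural images and videos can be used directly for CT reconstruction, augmentation, and generation without training or fine-tuning; with the encoder and decoder frozen, it preserves anatomy and suppresses noise, and yields gains on segmentation and generation tasks.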

Comments ICML 2026 Accepted

Abstract

Variational autoencoders (VAEs) compress high resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and often degrades under heterogeneous scanners, protocols, and diseases. This paper makes a progressive stride toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic tumor and lung tumor. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at https://github.com/qic999/Foundation-VAE.

2605.30892 2026-06-01 cs.LG

Bandwidth Allocation with Device Partitioning for Federated Learning over Industrial IoT networks

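Kangmin Kim, Jaeyoung Song

Affiliations School of Electrical and Electronics Engineering, Pusan National University

AI summary To address the communication bottleneck of federated learning in industrial IoT, proposes a bandwidth allocation policy that partitions devices by computing capability and grants each subset the full bandwidth in sequence to minimize training time; proves that it outperforms schemes without partitioning and shows that it also reduces uplink energy consumption.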

Abstract

We consider a federated learning (FL) system in which Industrial Internet-of-Things (IIoT) devices collaboratively train a global model over wireless channels without sharing local data. In such systems, communication time is a primary bottleneck that constrains overall training efficiency. Unlike conventional networks that prioritize individual quality-of-service requirements, FL systems collectively aim to converge to an optimal global model as efficiently as possible, which calls for a fundamentally different approach to bandwidth allocation. In this paper, we propose a novel bandwidth allocation policy that exploits the heterogeneity of device computing capabilities to minimize total training time. Rather than distributing bandwidth among all selected devices simultaneously, the proposed policy partitions the participating devices into ordered subsets and sequentially grants each subset exclusive access to the full bandwidth. We formally prove that this partitioning-based policy achieves a strictly lower training time than any bandwidth allocation scheme without partitioning, irrespective of the underlying scheduling algorithm. Furthermore, by reducing per-device transmission duration, the proposed policy also minimizes uplink energy consumption, which is particularly beneficial for battery-constrained IIoT devices. Extensive experiments on real-world datasets - including GC10-Det, an industrial surface defect benchmark, and CIFAR-10, a standard image classification benchmark - demonstrate that the proposed policy consistently reduces training time and energy consumption compared to existing bandwidth allocation schemes, approaching the theoretical lower bound on round time.

2605.30888 2026-06-01 cs.CL

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng

Affiliations State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; State Key Laboratory of General Artificial Intelligence, BIGAI

AI summary Proposes SAVE, a framework that uses the value function to generate on-policy feedback and updates the reward model with a contrastive objective, outperforming existing methods on six benchmarks.

Abstract

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. The problem becomes dramatically worse as the policy evolves beyond what the static RM was trained on. We therefore propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework for on-policy RM training that uses the value function to grade on-policy responses as feedback. SAVE converts the reward-graded on-policy responses into supervision, using a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. Empirical evaluation across six diverse benchmarks validates the effectiveness of SAVE for RM training: it outperforms existing methods on all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.

2605.30884 2026-06-01 cs.CV

GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

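Junlong Li, Chao Hao, Lap-Pui Chau, Yi Wang

Affiliations The Hong Kong Polytechnic University

AI summary Proposes GUI-C$^2$, a framework that combines difficulty-aware data selection with a coarse-to-fine reinforcement learning mechanism to address uneven training-sample difficulty and the crop-size trade-off in GUI grounding, achieving state-of-the-art performance.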

Abstract

Existing agentic reinforcement learning methods for GUI grounding have limitations at two levels. At the data level, current approaches typically treat all training samples equally, although their training value to the baseline model varies with difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance the trade-off between cropping larger regions for sufficient context and smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies the training-worthy samples by proper testing and assigns difficulty scores to guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively reserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. Meanwhile, we simplify the decision-making process to greatly reduce additional inference time. Finally, extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.

2605.30876 2026-06-01 cs.CL

dMoE: dLLMs with Learnable Block Experts

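Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

Affiliations National University of Singapore

AI summary To address the memory-bound inference that arises when diffusion LLMs are combined with Mixture-of-Experts architectures, where block-parallel decoding is mismatched with token-level expert selection, proposes dMoE, which aggregates token-level expert distributions within each block into a unified block-level distribution to reduce the number of activated experts, substantially lowering memory usage and latency while preserving performance.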

Comments Work in progress. Code is available at: https://github.com/fscdc/dMoE

Abstract

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14$\times$ to 1.66$\times$ end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE
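The block-level aggregation idea described above can be illustrated with a small sketch. This is not the authors' implementation: the mean aggregation, the single shared top-k per block, and the names `token_level_experts` and `block_level_experts` are all assumptions made for illustration, and the router probabilities are random stand-ins.

```python
import math
import random

def top_k(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_experts(router_probs, k):
    """Conventional routing: every token picks its own top-k experts."""
    active = set()
    for probs in router_probs:
        active.update(top_k(probs, k))
    return active

def block_level_experts(router_probs, k):
    """Hypothetical block routing: average the token distributions, pick top-k once."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    block_probs = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return set(top_k(block_probs, k))

# Random stand-in for one block: 32 tokens routed over 64 experts (assumed sizes).
random.seed(0)
block = [softmax([random.gauss(0, 1) for _ in range(64)]) for _ in range(32)]
token_active = token_level_experts(block, k=8)
block_active = block_level_experts(block, k=8)
print(len(token_active), len(block_active))
```

With independent per-token routing, the set of uniquely activated experts grows with block size, while the shared block-level choice keeps it at k. How dMoE preserves quality under this constraint is the subject of the paper and is not captured here.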

2605.30873 2026-06-01 cs.LG cs.AI cs.DC

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

联邦变分偏好对齐与Gumbel-Softmax先验用于个性化用户偏好

Jabin Koo, Hoyoung Kim, Minwoo Jang, Jungseul Ok

发表机构 * Graduate School of AI, POSTECH, Pohang, Republic of Korea(POSTECH人工智能研究生院) Department of CSE, POSTECH, Pohang, Republic of Korea(POSTECH计算机科学与工程系) National AI Research Lab, Seoul, Republic of Korea(首尔国家人工智能研究实验室)

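AI summary Proposes FedVPA-GP, a framework that uses a Federated Mixture Prior and an Orthogonal Loss to handle conflicting user preferences and personalization in federated learning, outperforming monolithic baselines on the HH-RLHF dataset.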

Comments 21 pages, 4 figures. Accepted to ICML 2026

详情
AI中文摘要

联邦学习(FL)为对齐大型语言模型(LLMs)提供了一条保护隐私的途径;然而,现有框架通常强制使用单一奖励模型,不可避免地平均了本质上相互冲突的用户偏好(例如,有用性与无害性)。虽然变分偏好学习(VPL)提供了一条个性化的途径,但将其适应于去中心化设置面临一个基本挑战:由严重的局部数据稀缺性和异质性驱动的后验坍塌。在本文中,我们提出了具有Gumbel-Softmax先验的联邦变分偏好对齐(FedVPA-GP),这是一个旨在在不牺牲隐私的情况下解耦多样偏好的框架。为了稳定变分推断,我们引入了一个联邦混合先验,使客户端能够利用聚合的总体分布作为动态先验。此外,我们加入了一个正交损失,明确强制在潜在空间中分离偏好原型。在HH-RLHF数据集上的实验表明,FedVPA-GP显著优于单一基线,成功解耦了冲突的用户意图,并实现了动态偏好切换。

英文摘要

Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. In this paper, we propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework designed to disentangle diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that enables clients to leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on the HH-RLHF dataset demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.
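The abstract does not spell out how the Gumbel-Softmax prior enters the objective. For readers unfamiliar with the relaxation the name refers to, here is a minimal sketch of standard Gumbel-Softmax sampling; the function name, logits, and temperatures are illustrative assumptions and say nothing about FedVPA-GP's actual configuration.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [1.0, 0.5, -1.0]  # hypothetical scores for three latent preference types
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.1, rng=rng)
```

Lower temperatures push samples toward one-hot vectors, which is what makes the relaxation a differentiable stand-in for a discrete latent variable.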

2605.30865 2026-06-01 cs.LG

GlucoFM: A Dual-Stream Foundation Model for Continuous Glucose Monitoring

GlucoFM: 一种用于连续血糖监测的双流基础模型

Zechen Li, Keerthana Natarajan, Weizhi Zhang, Menglian Zhou, Simon A. Lee, Yuwei Zhang, Maxwell A. Xu, Zeinab Esmaeilpour, Flora D. Salim, Mark Malhotra, Lindsey Sunden, Shwetak Patel, Yuzhe Yang, Ahmed A. Metwally

发表机构 * Google Research(谷歌研究) University of New South Wales(新南威尔士大学)

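AI summary Proposes GlucoFM, a lightweight CGM foundation model that decomposes glucose dynamics into slow physiological state and transient event streams, improving average PR-AUC across seven clinical prediction tasks by 4.1 points over the best CGM-specific foundation model.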

详情
AI中文摘要

连续血糖监测(CGM)提供了日常代谢生理的密集视图,然而现有的通用时间序列和CGM专用基础模型通常将血糖轨迹编码为纠缠的单流序列,使得血糖动态的独特时间结构仅被隐式建模。我们提出GlucoFM,一种轻量级CGM基础模型,它将不规则记录对齐到24小时时间网格,保留观测掩码,并将血糖动态分解为慢生理状态和瞬态事件流,捕捉低频血糖基线和可能反映急性生理反应或传感器伪影的短期偏差。GlucoFM在来自477名受试者的109,066小时未标记CGM记录上进行了预训练,具有两个互补目标:融合每日表示上的掩码上下文潜在预测以及状态和事件流上的时间动态预测。在四个不同队列和七个临床预测任务中,GlucoFM在评估基线中实现了最强的受试者分离线性探测性能,比最佳CGM专用基础模型平均PR-AUC提高4.1点。其收益在核心代谢结果上最为显著,在所有糖尿病风险和β细胞功能障碍任务以及4个胰岛素抵抗任务中的3个上领先PR-AUC。GlucoFM还在评估方法中实现了最佳的整体跨数据集迁移性能和强大的少样本适应能力,并且在聚合多天进行受试者级别预测时获得一致收益,突出了生理感知分解作为可迁移CGM表示学习的有效归纳偏置。

英文摘要

Continuous glucose monitoring (CGM) provides a dense view of daily metabolic physiology, yet existing generic time-series and CGM-specific foundation models often encode glucose traces as entangled single-stream sequences, leaving the distinct temporal structure of glycemic dynamics only implicitly modeled. We present GlucoFM, a lightweight CGM foundation model that aligns irregular recordings to a 24-hour chronological grid, preserves observation masks, and decomposes glucose dynamics into slow physiological state and transient event streams, capturing low-frequency glycemic baselines and short-term deviations that may reflect acute physiological responses or sensor artifacts. GlucoFM is pretrained on 109,066 hours of unlabeled CGM recordings from 477 subjects with two complementary objectives: masked contextual latent prediction over fused daily representations and temporal dynamics prediction over state and event streams. Across four diverse cohorts and seven clinical prediction tasks, GlucoFM achieves the strongest subject-disjoint linear-probing performance among evaluated baselines, improving average PR-AUC by 4.1 points over the best CGM-specific foundation model. Its gains are most pronounced on core metabolic outcomes, leading PR-AUC on all diabetes-risk and $β$-cell dysfunction tasks and on 3 of 4 insulin-resistance tasks. GlucoFM also achieves the best overall cross-dataset transfer performance and strong few-shot adaptation among evaluated methods, and consistent gains when aggregating multiple days for subject-level prediction, highlighting physiology-aware decomposition as an effective inductive bias for transferable CGM representation learning.

2605.30863 2026-06-01 cs.CV cs.GR

DSD-GS: Dynamic-Static Decomposition of Gaussian Splatting for Efficient and High-Fidelity Dynamic Scene Reconstruction

DSD-GS: 面向高效高保真动态场景重建的高斯泼溅动态-静态分解

Youngtae Han, Sung-hwan Han, Youngmin Yi

发表机构 * Department of Artificial Intelligence Engineering, Sogang University(西江大学人工智能工程系)

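AI summary Proposes a dynamic-static decomposition framework built on a feed-forward Gaussian Splatting encoder and an optical flow model; by eliminating redundant computation on static regions, it outperforms existing baselines in rendering quality, training and rendering speed, and storage efficiency.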

Comments 23 pages, 9 figures, 7 tables

详情
AI中文摘要

动态场景重建和新视角合成是虚拟现实、机器人、数字孪生等下一代视觉智能应用的基础。然而,从任意视角对复杂时变场景进行高保真重建仍是一个重大挑战。现有的动态3DGS方法由于将所有高斯体建模为动态组件,存在计算效率低下的问题。虽然近期基于分解的方法试图解决这一问题,但仍面临重建质量下降和训练时间延长的问题。为缓解这些局限,我们提出一种新颖的动态重建框架,基于高效的静态-动态分解策略,使用前馈高斯泼溅编码器和光流模型。通过消除静态区域的冗余计算,我们的方法实现了最先进的性能,在渲染质量、训练和渲染速度以及存储效率上均优于现有基线。值得注意的是,在Neural 3D数据集上,我们的框架仅需10分钟训练,并在单张NVIDIA RTX 5090 GPU上以1352x1014分辨率实现了超过700 FPS的渲染速度。此外,我们的分解策略消除了COLMAP预处理的需求,并实现了确定性初始化,从而提高了效率和可重复性。

英文摘要

Dynamic scene reconstruction and novel view synthesis are fundamental to next-generation visual intelligence applications such as virtual reality, robotics, and digital twins. However, high-fidelity reconstruction of complex, time-varying scenes from arbitrary viewpoints remains a significant challenge. Existing dynamic 3DGS methods suffer from computational inefficiency, since they model all Gaussians as dynamic components. While recent decomposition-based approaches address this issue, they still struggle with degraded reconstruction quality and prolonged training time. To mitigate these limitations, we propose a novel dynamic reconstruction framework built upon an efficient static-dynamic decomposition strategy using a Feed-Forward Gaussian Splatting encoder and an optical flow model. By eliminating redundant computations on static regions, our method achieves state-of-the-art performance, outperforming existing baselines across rendering quality, training and rendering speed, and storage efficiency. Notably, on the Neural 3D dataset, our framework requires only 10 minutes for training and achieves a rendering speed of over 700 FPS on a single NVIDIA RTX 5090 GPU at a resolution of 1352x1014. Furthermore, our decomposition strategy eliminates the need for COLMAP preprocessing and enables deterministic initialization, thereby enhancing both efficiency and reproducibility.

2605.30861 2026-06-01 cs.AI

Distilling LLM Feedback for Lean Theorem Proving

蒸馏LLM反馈用于Lean定理证明

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

发表机构 * FAIR at Meta(Meta FAIR) Inria(法国国家信息与自动化研究所) Sorbonne Université(索邦大学) Institut universitaire de France(法国大学研究院) CERMICS École des Ponts ParisTech(国立路桥学校 CERMICS 实验室) ENS, PSL Research University(巴黎高等师范学院,巴黎文理研究大学)

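AI summary Proposes Feedback Distillation, which trains the model to match, at the token level, its own distribution conditioned on privileged feedback from a language model, targeting the sparse rewards and mode collapse of GRPO in reasoning post-training; on Lean4 theorem proving it maintains greater trajectory diversity and better pass@k scaling than GRPO.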

详情
AI中文摘要

推理模型的后训练通常结合监督微调和基于可验证奖励的强化学习(最常见的是GRPO)。然而,该算法存在奖励稀疏、探索受限和模式崩溃的问题。基于最近关于自蒸馏的工作,我们提出了反馈蒸馏,这是一种训练方法,其中模型在token级别被训练以匹配自身分布,该分布以语言模型产生的特权反馈为条件。反馈蒸馏提供token级别的监督,并能注入外部知识。在Lean4定理证明中评估我们的方法,我们发现反馈蒸馏比GRPO在生成轨迹上保持更大的多样性,从而产生更高的策略熵和更好的pass@k缩放。这两种方法是互补的:从反馈蒸馏检查点初始化GRPO优于单独使用任何一种方法。总之,我们的结果为提高复杂推理的后训练提供了一条有前景的途径。

英文摘要

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
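Since the abstract reports results in terms of pass@k scaling, a short reminder of how pass@k is commonly estimated may help. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) from n samples with c successes; whether the paper uses exactly this estimator is an assumption, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled proofs, of which c are verified correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k samples contains a correct one.
        return 1.0
    # Probability that k samples drawn without replacement are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A policy that keeps its samples diverse tends to gain more from larger k, which is why higher policy entropy and better pass@k scaling are reported together.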

2605.30859 2026-06-01 cs.LG cs.AI

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

DARTS: 分布感知的主动展开轨迹塑造以加速LLM强化学习

Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu, Bin Cui

发表机构 * School of Computer Science & Beijing Key Laboratory of Software Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Institute of Computational Social Science, Peking University (Qingdao), Qingdao, China

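AI summary To address the rollout efficiency bottleneck caused by long-tail response length distributions in reinforcement learning, proposes distribution-aware active trajectory shaping, which identifies intra-prompt long tails at a finer granularity and trims ineffective verbosity, achieving up to 1.77x speedup without compromising model performance.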

Comments 16 pages, 14 figures, 5 tables. Accepted to ICML 2026

详情
AI中文摘要

强化学习已成为提升模型能力的关键技术,但由于响应长度的长尾分布,其展开效率受到瓶颈制约。现有工作通过提示级尾部调度缓解长尾影响,但我们关注低效率的根本来源:分布本身。具体而言,我们以更细粒度刻画长尾分布,识别提示内长尾,并揭示它们通常包含无效冗余。为解决此问题,我们提出一种主动分布塑造的新范式,将展开分布向简洁性和确定性方向塑造,从而从根本上解决尾部带来的开销。我们通过一种分布感知的轨迹采样机制实现这一点,该机制为每个提示从冗余探索空间中选择轨迹,并采用自适应冗余分配方案以最大化塑造效果和系统效率。实验表明,与最先进系统相比,在不影响模型性能的情况下,实现了高达1.77倍的显著加速。

英文摘要

Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long tails via prompt-level tail scheduling, we focus on the root source of inefficiency: the distribution itself. Specifically, we characterize the long-tail distribution at a finer granularity, identifying intra-prompt long tails, and revealing that they frequently consist of ineffective verbosity. To address this, we propose a novel paradigm of active distribution shaping to shape the rollout distribution towards conciseness and certainty, thereby fundamentally resolving tail-induced overheads. We achieve this through a distribution-aware trajectory sampling mechanism, which selects trajectories from a redundant exploration space for each prompt, and an adaptive redundancy allocation scheme to maximize both shaping effectiveness and system efficiency. Experiments demonstrate significant acceleration over state-of-the-art systems by up to 1.77x without compromising model performance.

2605.30858 2026-06-01 cs.LG

ForecastCompass: Guiding Agentic Forecasting with Adaptive Factor Memory

ForecastCompass: 自适应因子记忆引导的智能预测

Yurui Chang, Yongkang Du, Yuanpu Cao, Jinghui Chen, Lu Lin

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

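AI summary Proposes ForecastCompass, a framework that combines a hierarchical forecasting-task taxonomy with a two-component memory (factor memory and reasoning memory), revised iteratively from retrospective analyses, to improve the probabilistic accuracy and calibration of agent forecasts in dynamic environments.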

详情
AI中文摘要

智能预测对于动态环境中的决策至关重要,但由于智能体必须从不完整、时间有限的证据中进行推理,并在结果确定之前产生校准的概率,因此仍然具有挑战性。记忆提供了一种自然机制,将经验从已解决的预测转移到未来的预测任务。然而,现有的智能体记忆方法并非为预测量身定制,因为它们通常存储过去的交互、反思或事实关联,而没有明确表示可重用的预测因子或校准知识。我们提出了ForecastCompass (FoCo),一种用于智能预测的自适应因子记忆框架。FoCo通过分层预测任务分类来组织预测经验,从而能够检索与任务相关的预测知识。它维护两个互补的记忆组件:因子记忆(捕获可重用的预测维度)和推理记忆(编码概率更新、不确定性处理和校准原则)。利用回顾分析作为学习信号,FoCo通过口头记忆修正程序迭代修正记忆,使智能体能够随时间积累可迁移的预测知识。在Prophet Arena和FutureX上使用GPT-5-mini和Gemini-2.5-Flash进行的实验表明,FoCo提高了概率准确性和校准性。

英文摘要

Agentic forecasting is important for decision-making in dynamic environments, but it remains challenging because agents must reason from incomplete, time-limited evidence and produce calibrated probabilities before outcomes are resolved. Memory provides a natural mechanism for transferring experience from resolved forecasts to future prediction tasks. However, existing agent-memory methods are not tailored to forecasting, as they typically store past interactions, reflections, or factual associations without explicitly representing reusable predictive factors or calibration knowledge. We propose ForecastCompass (FoCo), an adaptive factor-based memory framework for agentic forecasting. FoCo organizes forecasting experience with a hierarchical forecasting-task taxonomy, enabling retrieval of task-relevant forecasting knowledge. It maintains two complementary memory components: factor memory, which captures reusable predictive dimensions, and reasoning memory, which encodes probability updating, uncertainty handling, and calibration principles. Using retrospective analyses as learning signals, FoCo iteratively revises memory through a verbalized memory-revision procedure, enabling the agent to accumulate transferable forecasting knowledge over time. Experiments on Prophet Arena and FutureX with GPT-5-mini and Gemini-2.5-Flash show that FoCo improves both probabilistic accuracy and calibration.

2605.30857 2026-06-01 cs.CL

MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning

MADS: 面向指令微调的模型感知多样化核心集选择

Yi Bai, Wenhao Zhang, Yao Chen, Jiao Xue, Zhumin Chen, Pengjie Ren

发表机构 * Shandong University(山东大学) Inspurcloud

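AI summary Proposes a diverse core set selection method that distinguishes data features by the neural activation states during LLM inference, improving LLM performance on multiple downstream tasks while reducing the amount of training data.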

详情
AI中文摘要

指令微调用于增强大语言模型(LLMs)的指令遵循能力。随着指令微调数据量的增加,选择最优核心集变得尤为重要。然而,确保核心集的多样性仍然是一个重大挑战。现有方法主要基于文本特征本身来区分不同的训练数据,与LLMs自身对数据的理解和表示相分离。为解决这一问题,我们提出了一种模型感知的多样化核心集选择方法,该方法基于LLM推理过程中的神经激活状态来区分数据特征。该方法利用模型内在的激活特征,实现了基于覆盖的选择的高效实例化,以确保核心集的多样性。我们在涵盖五个不同任务的六个基准上广泛评估了我们的方法。在我们的方法中,由3B参数LLM选择的核心集在用于微调7B、8B和13B参数的更大模型时表现有效。在包含52K指令-响应对的Alpaca-GPT4数据集上的实验结果表明,由Llama-3.2-3B-Instruct选择的、大小为原始数据集15%的核心集,在微调四个更大的基础模型时,与使用完整数据集训练相比,平均提升了2.5%。实验结果表明,我们的方法在减少数据需求的同时,提升了模型在多个下游任务上的性能。

英文摘要

Instruction fine-tuning is employed to enhance the instruction-following ability of large language models (LLMs). As the amount of instruction fine-tuning data increases, selecting the optimal core set becomes particularly important. However, ensuring the diversity of the core set remains a significant challenge. Existing methods predominantly distinguish different training data based on the text features themselves, decoupled from LLMs' own understanding and representation of the data. To address this issue, we propose a Model-Aware Diverse Core Set Selection method, which distinguishes data features based on the neural activation states during LLM inference. This approach serves as an efficient instantiation of coverage-based selection using model-intrinsic activation features to ensure diversity in the core set. We extensively evaluate our method on six benchmarks that cover five distinct tasks. In our method, the core set selected by a 3B-parameter LLM performs effectively when used to fine-tune larger models with 7B, 8B, and 13B parameters. Experimental results on the Alpaca-GPT4 dataset, which comprises 52K instruction-response pairs, show that the core set, sized at 15% of the original dataset and selected by Llama-3.2-3B-Instruct, achieves an average improvement of 2.5% when fine-tuning four larger base models compared with training on the full dataset. The experimental results demonstrate that our method enhances model performance on multiple downstream tasks while reducing data requirements.

2605.30852 2026-06-01 cs.CL

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

推测性流水线解码:通过流水线并行实现更高准确度和零气泡推测

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei

发表机构 * Oregon State University(俄勒冈州立大学) DeepSolution(深思解决方案)

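AI summary Proposes Speculative Pipeline Decoding (SPD), which partitions the target LLM into n pipeline stages that process n tokens in parallel; a speculation module aggregates intermediate features to predict the next token, giving bounded prediction difficulty, higher acceptance rates, and zero latency bubbles, and a significantly higher theoretical speedup.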

详情
AI中文摘要

推测性解码(SD)通过草稿-验证范式加速低并发LLM推理。然而,主流方法通常依赖多token预测,这引入了逐渐增加的预测难度和串行草稿延迟。为了解决这些问题,我们提出了推测性流水线解码(SPD),这是一个突破性的框架,释放了流水线并行的真正潜力。通过将目标LLM划分为$n$个流水线阶段,SPD允许LLM并行处理$n$个token以加速解码。为了在单序列解码中持续填充流水线,推测模块聚合不同流水线深度的中间特征来预测下一个token,与目标模型的流水线步骤严格并行执行,从而实现有限的难度、更高的接受率和零延迟气泡。我们的实验表明,与主流基线相比,SPD实现了显著更高的理论加速比,为LLM解码加速提供了高度可扩展的解决方案。我们的代码可在https://github.com/yuyijiong/speculative_pipeline_decoding获取。

英文摘要

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into $n$ pipeline stages, SPD allows the LLM to process $n$ tokens in parallel to accelerate decoding. To continuously fill the pipeline in single-sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token, executing strictly in parallel with the target model's pipeline step, which yields bounded difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup compared to mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding

2605.30846 2026-06-01 cs.CV

Count Anything

Count Anything

Mengqi Lei, Shuokun Cheng, Wei Bao, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao

发表机构 * Tsinghua University(清华大学) China University of Geosciences, Wuhan(中国地质大学(武汉)) State Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) National Engineering Research Center for Visual Information and Applications(视觉信息与应用国家工程研究中心) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究院)

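AI summary Proposes Count Anything, a cross-domain text-guided object counting model that uses dual-granularity instance enumeration and Complementary Count Fusion to generalize across domains on the unified CLOC benchmark.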

详情
AI中文摘要

尽管通用视觉模型取得了快速进展,目标计数仍然分散在特定领域的数据集和任务公式中。现有的计数模型通常针对人群、车辆、细胞、农作物或遥感目标等场景定制,因此难以跨类别、视觉域、目标尺度和密度分布进行泛化。在本文中,我们研究了跨域的文本引导目标计数,其中模型以图像和自然语言查询为输入,并返回一组基于实例的目标点,其基数给出计数。这种公式将类别条件计数与可解释的空间定位统一起来。为了支持这一设置,我们构建了CLOC,一个跨域大规模目标计数数据集,将多样化的公共数据源重组为统一的基准。CLOC涵盖六个视觉域:通用场景、遥感、组织病理学、细胞显微镜、农业和微生物学,包含约22万张图像、619个类别和1500万个目标实例。基于CLOC,我们提出了Count Anything,一个用于文本引导目标计数的通用模型。与主导计数模型的密度图方法不同,Count Anything采用离散实例点并执行双粒度实例枚举。区域级稀疏计数器为大而稀疏的目标提供目标级锚点,而像素级密集计数器通过密集点预测处理小、拥挤和弱边界目标。点中心监督策略能够从异构标注中学习,互补计数融合以无参数方式结合两个计数器。大量实验表明,Count Anything实现了强准确性和多域泛化,优于现有的开放世界计数方法。代码可在:https://github.com/Mengqi-Lei/count-anything 获取。

英文摘要

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.

2605.30844 2026-06-01 cs.CL cs.AI stat.ML

Fine-Tuning Improves Information Conveyance in Language Models

微调提升语言模型中的信息传递

Yuwei Cheng, Weiyi Tian, Haifeng Xu

发表机构 * Department of Statistics(统计学系) University of Chicago(芝加哥大学) Department of Data Science(数据科学系) Department of Computer Science(计算机科学系)

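AI summary Proposes Canopy Entropy, a measure that quantifies the effective size of the generation space from a tree perspective, and finds that fine-tuned models show a stronger positive length-entropy-rate correlation even when total entropy decreases, converting token uncertainty into semantic diversity more efficiently.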

详情
AI中文摘要

微调通常被认为会降低大型语言模型的不确定性和多样性,但现有分析忽略了输出长度这一关键混杂因素,因此未能捕捉不确定性在整个生成展开中的分布。为解决这一问题,我们提出冠层熵($\mathrm{CE}^\star$),一种从树视角看待语言生成的度量,其中“冠层”代表所有可能展开的空间,使得$\mathrm{CE}^\star$自然地量化生成空间的有效大小。$\mathrm{CE}^\star$共同捕捉输出长度$N$和生成序列$Y_{1:N}$中的不确定性——实际上,我们证明它等于总香农熵$H(N, Y_{1:N}\mid X)$,其中$X$表示提示。该公式产生了可解释的度量,包括长度-熵率相关项$\rho(N, r_N)$,其中$r_N$是熵率,通过指示较长输出是否每个标记信息量更多或更少来量化信息传递效率。实验上,跨任务和模型家族,我们发现微调模型一致地表现出更强的正相关$\rho(N, r_N)$,即使总熵降低。此外,在控制模型家族、任务、提示和输出长度效应后,我们发现微调几乎使熵率与语义多样性之间的相关强度增加了两倍,表明对齐模型更有效地将标记不确定性转化为语义多样性。总体而言,这些结果表明微调并非简单地降低不确定性,而是从根本上将其重组为更具信息性和语义意义的生成。我们的代码可在https://github.com/WeiyiTian/canopy-entropy获取。

英文摘要

Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy ($\mathrm{CE}^\star$), a measure that views language generation from a tree perspective, where "canopy" represents the space of all possible rollouts, so that $\mathrm{CE}^\star$ naturally quantifies the effective size of the generation space. $\mathrm{CE}^\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$ -- indeed, we show that it equals the total Shannon entropy $H(N, Y_{1:N}\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $\rho(N, r_N)$, where $r_N$ is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit stronger positive correlation $\rho(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy-entropy.
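One way to see how a length-entropy-rate correlation term can fall out of the total entropy is the chain rule for Shannon entropy. The sketch below treats the prompt $X$ as fixed and assumes the per-length entropy rate is defined as $r_n = \tfrac{1}{n} H(Y_{1:n}\mid N=n, X)$; that normalization and the symbols $\sigma_N$, $\sigma_{r_N}$ (standard deviations of $N$ and $r_N$) are assumptions, not taken from the abstract.

```latex
\begin{aligned}
H(N, Y_{1:N}\mid X)
  &= H(N\mid X) + H(Y_{1:N}\mid N, X) \\
  &= H(N\mid X) + \sum_{n} P(N=n\mid X)\, H(Y_{1:n}\mid N=n, X) \\
  &= H(N\mid X) + \mathbb{E}\left[N\, r_N \mid X\right] \\
\mathbb{E}\left[N\, r_N \mid X\right]
  &= \mathbb{E}[N\mid X]\,\mathbb{E}[r_N\mid X] + \rho(N, r_N)\,\sigma_N\,\sigma_{r_N}
\end{aligned}
```

Under these assumptions, a positive $\rho(N, r_N)$ means longer outputs carry more entropy per token, so total entropy can fall while the correlation term grows. The paper's exact decomposition may differ.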

2605.30843 2026-06-01 cs.LG econ.EM

A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models

离线强化学习与逆强化学习讲义,第二部分:逆强化学习与动态离散选择模型的基础

Enoch Hyunwook Kang

发表机构 * University of Washington, Foster School of Business(华盛顿大学,福斯特商学院)

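AI summary This lecture note proves the equivalence of inverse reinforcement learning (IRL) and dynamic discrete choice (DDC) models, reviews the classical identification results and computational paradigms, and covers modern machine learning methods and what they identify.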

详情
AI中文摘要

在前向强化学习问题中,奖励是固定且已知的;学习者被要求找到一个好的策略或价值函数。这里我们反过来提问:给定由专家生成的离线数据,我们能否恢复专家所优化的奖励?这就是逆强化学习问题,值得注意的是,两个社区——研究动态离散选择(DDC)的结构计量经济学家和研究熵正则化IRL的机器学习者——一直在以不同的名称研究完全相同的概率模型。我们首先证明它们的等价性。然后,我们发展Magnac和Thesmar的经典识别结果以及由此产生的经典计算范式:Rust的嵌套不动点算法、Hotz和Miller的条件选择概率方法,以及Adusumilli和Eckardt的两种时间差分方法:线性半梯度TD和近似价值迭代。每种方法都有其局限性:维度、转移核估计、致命三元组或投影不动点偏差。接着,我们回顾现代ML/IRL分支:对抗性IRL、占用匹配、IQ-Learn和离线ML-IRL,推导每种方法的实际目标,并精确说明它识别了什么和没有识别什么。最后,我们介绍Kang等人的经验风险最小化框架,该框架为离线IRL/DDC提供了基于梯度的估计器。

英文摘要

In the forward reinforcement-learning problem, the reward is fixed and known; the learner is asked to find a good policy or value function. Here we turn the question around. Given offline data generated by an expert, can we recover the reward the expert was optimizing? This is the inverse reinforcement learning problem, and remarkably, two communities, structural econometricians studying dynamic discrete choice (DDC) and machine learners studying entropy-regularized IRL, have been working on exactly the same probabilistic model under different names. We begin by proving their equivalence. We then develop the classical identification result of Magnac and Thesmar and the classical computational paradigms that grew out of it: Rust's nested fixed-point algorithm, the conditional-choice-probability approach of Hotz and Miller, and the two temporal-difference approaches of Adusumilli and Eckardt: linear semi-gradient TD and approximate value iteration. Each route has its limits: dimensionality, transition-kernel estimation, the deadly triad, or projected fixed-point bias. We then walk through the modern ML/IRL strand: adversarial IRL, occupancy matching, IQ-Learn, and offline ML-IRL, deriving each method's actual objective and stating precisely what it does and does not identify. We close with the empirical-risk-minimization framework of Kang et al., which yields a gradient-based estimator for offline IRL/DDC.

2605.30842 2026-06-01 cs.LG

CoMem: Context Management with A Decoupled Long-Context Model

CoMem: 基于解耦长上下文模型的上下文管理

Yuwei Zhang, Chengyu Dong, Shuowei Jin, Changlong Yu, Hejie Cui, Hongye Jin, Xinyang Zhang, Hamed Bonab, Colin Lockard, Jianshu Chen, Zhenyu Shi, Jingbo Shang, Xian Li, Bing Yin

发表机构 * Halıcıoğlu Data Science Institute, University of California, San Diego(哈里卡卢斯数据科学研究所,加州大学圣地亚哥分校) Amazon(亚马逊)

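AI summary Proposes CoMem, a framework that decouples memory management from the agent workflow using a k-step-off asynchronous pipeline and a reward-driven training strategy, achieving a 1.4x latency improvement on SWE-Bench-Verified while preserving most of the performance.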

Comments Work in progress

详情
AI中文摘要

上下文管理使智能体模型能够通过对先前交互历史的迭代总结来解决长时任务。然而,这一过程通常会因额外的总结标记而产生大量解码开销,显著影响部署时的端到端响应延迟。在本文中,我们介绍CoMem,一种新颖的框架,它将记忆管理与主要智能体工作流解耦,使这些过程能够并行执行。我们提出了一种k步偏移异步流水线,将记忆模型的总结与智能体的推理重叠,有效掩盖了上下文处理的延迟。为了确保在这种异步设置下的鲁棒性,我们引入了一种奖励驱动的训练策略,使记忆模型对齐以捕获足够统计信息供智能体决策。理论分析证实,与耦合架构相比,CoMem提供了更优的效率-效果权衡。我们在SWE-Bench-Verified上的广泛实验结果表明,CoMem在保留大部分性能的同时,相比普通长上下文解决方案提供了1.4倍的延迟改进。此外,我们证明这些延迟增益随系统吞吐量增加而有利地扩展,为智能体推理和记忆压缩的独立优化提供了一条模块化路径。

英文摘要

Context management enables agentic models to solve long-horizon tasks through iterative summarization of previous interaction histories. However, this process typically incurs substantial decoding overhead for the extra summarization tokens, which significantly affects the end-to-end response latency at deployment. In this paper, we introduce CoMem, a novel framework that decouples memory management from the primary agent workflow, enabling these processes to execute in parallel. We propose a $k$-step-off asynchronous pipeline that overlaps the memory model's summarization with the agent's inference, effectively masking the latency of context processing. To ensure robustness under this asynchronous setting, we introduce a reward-driven training strategy that aligns the memory model to capture sufficient statistics for the agent's decision-making. Theoretical analysis confirms that CoMem offers a superior efficiency-effectiveness trade-off compared to coupled architectures. Our extensive experimental results on SWE-Bench-Verified show that CoMem provides a 1.4x latency improvement over vanilla long-context solutions while preserving most of the performance. Furthermore, we demonstrate that these latency gains scale favorably with increased system throughput, offering a modular path forward for the independent optimization of agent reasoning and memory compression.

2605.30838 2026-06-01 cs.AI

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

COMPASS: 认知MCTS引导的过程对齐用于安全搜索代理

Wenkai Shen, Pengyang Zhou, Jiahe Xu, Jiaming Qian, Haozhe He, Zhihao Huang, Chaochao Chen, Xiaolin Zheng

发表机构 * Zhejiang University(浙江大学)

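AI summary Proposes COMPASS, a framework that uses cognitive tree exploration and introspective step-wise alignment to achieve robust safety alignment across the search-agent workflow while preserving general utility.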

详情
AI中文摘要

基于LLM的搜索代理能够进行多步推理和使用工具。然而,这些能力引入了检索诱导的安全退化,因为有害意图可能分解为看似无害的子查询,导致不安全的结果。现有的对齐方法难以捕捉稀疏的安全信号,并且无法监督多步交互中的各种违规行为。我们提出COMPASS,一种认知MCTS引导的过程对齐框架,旨在在保持通用效用的同时,实现代理工作流中的鲁棒安全对齐。COMPASS集成了认知树探索(CTE)以高效合成隐蔽攻击轨迹,以及自省步骤对齐(ISA)以隔离有风险的中间动作进行细粒度过程监督。实验结果表明,COMPASS在实现良好的安全-效用权衡的同时,所需训练数据大幅减少。

英文摘要

LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may decompose into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.

2605.30834 2026-06-01 cs.RO cs.AI

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

轨迹中的捉迷藏:发现VLA运行时监控的失败信号

Seongheon Park, Wendi Li, Changdae Oh, Samuel Yeh, Zsolt Kira, Michael Hagenow, Sharon Li

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Georgia Institute of Technology(佐治亚理工学院)

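AI summary Proposes Hide-and-Seek, a framework that uses inter-trajectory and intra-trajectory contrastive learning to localize failure-indicative actions from trajectory-level supervision, enabling runtime failure detection for VLA models without step-level annotation.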

详情
AI中文摘要

视觉-语言-动作(VLA)模型使机器人能够遵循自然语言指令并在不同任务中泛化,但在实际部署中仍易受执行失败影响,损害可靠性。因此,在执行过程中检测此类失败对于具身系统的稳健部署至关重要。现有的失败检测方法要么依赖昂贵的动作重采样或外部模型,要么将轨迹级标签均匀传播到每个时间步,掩盖了局部失败信号。在本文中,我们提出Hide-and-Seek框架,将VLA失败检测形式化为粗监督学习问题。通过结合轨迹间和轨迹内对比目标,Hide-and-Seek能够定位指示失败的动作,并仅从轨迹级监督中诱导出具有时间结构的失败信号,无需任何步骤级标注。我们在LIBERO、VLABench和真实机器人平台上,针对三种代表性VLA策略(OpenVLA、$\pi_0$和$\pi_{0.5}$)评估了Hide-and-Seek。我们的方法在共形预测下实现了最先进的多任务失败检测性能,具有实用的准确度-及时性权衡,并且对已见和未见任务均具有良好的泛化能力。

英文摘要

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, or propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $π_0$, and $π_{0.5}$. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy-timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.

2605.30833 2026-06-01 cs.CL cs.AI

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

你的老师在这里帮不了你:对抗在线策略蒸馏中的监督保真度衰减

Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Chinese Information Processing Laboratory(中文信息处理实验室) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences, Beijing, China(中国科学院大学,北京,中国)

AI总结 针对在线策略蒸馏中监督保真度衰减问题,提出前瞻组奖励方法,通过评估学生候选词在后续步骤中诱导的教师置信度并分配组归一化奖励,结合熵触发树注意力机制,显著提升长链推理性能。

详情
AI中文摘要

在线策略蒸馏通过使用来自教师的 token 级反馈,在学生模型自身生成的轨迹上训练学生模型来传递推理能力。然而,我们识别出一个关键瓶颈,即监督保真度衰减(SFD):随着学生生成的前缀变长,教师的下一个 token 分布变得不那么自信,区分性也更弱。因此,反向 KL 蒸馏中依赖教师的纠正信号减弱,导致学生漂移在长推理链中累积。为了缓解 SFD,我们引入了前瞻组奖励(Lookahead Group Reward)。基于下一步教师置信度反映了未来反向 KL 监督的区分强度这一见解,前瞻组奖励通过学生的 top-K 候选 token 在后续步骤中诱导的教师置信度来评估这些候选 token,并分配组归一化奖励。为了保持计算效率,我们进一步设计了一种熵触发的树注意力机制。在六个数学和代码基准测试中,前瞻组奖励在 7B 学生模型上的 mean@8 比 OPD 高 2.57 个点,且生成越长增益越大,在 AIME-26(39k token)上达到 +4.92 个点。

英文摘要

On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce Lookahead Group Reward. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, Lookahead Group Reward evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, Lookahead Group Reward improves mean@8 by 2.57 points over OPD for a 7B student, with gains increasing on longer generations and reaching +4.92 points on AIME-26 at 39k tokens.