arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.14142 2026-06-16 cs.CL cs.AI New submission

Implicit Reasoning for Large Language Model-based Generative Recommendation

Yinhan He, Liam Collins, Bhuvesh Kumar, Jundong Li, Neil Shah, Donald Loveland

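Affiliations: University of Virginia; Snap Inc.

AI summary: To address three limitations of explicit reasoning when LLMs are used for generative recommendation (weakened world-knowledge verbalization, misalignment between the Semantic ID and natural-language embedding spaces, and sensitivity to rationale quality), the paper proposes PauseRec, a lightweight implicit reasoning paradigm that beats explicit methods on accuracy, training cost, and inference speed.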

Abstract

Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs' natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.

2606.14095 2026-06-16 cs.LG math.OC math.PR stat.ML New submission

Lyapunov-Based Sample Complexity Analysis for Weakly-Coupled MDPs

Tianhao Wu, Matthew Zurek, Weina Wang, Qiaomin Xie

Affiliations: Department of Industrial and Systems Engineering, University of Wisconsin-Madison; Department of Computer Sciences, University of Wisconsin-Madison; Computer Science Department, Carnegie Mellon University

AI summary: For average-reward weakly-coupled MDPs and Restless Bandits, the paper proposes a Lyapunov-based analysis framework that yields sample and computational complexity bounds polynomial in the number of arms $N$, and gives the first finite-sample PAC guarantee for fully heterogeneous WCMDPs.

Comments Accepted for presentation at the Conference on Learning Theory (COLT) 2026

Abstract

We study the sample complexity of learning in average-reward weakly-coupled Markov decision processes (WCMDPs) and Restless Bandits (RBs) under a generative model. Naive reduction to a tabular MDP leads to high complexity bounds as the state-action space is exponentially large in the number of arms $N$. By exploiting the weakly coupled structure, we show that near-optimal policies can be learned with sample and computational complexities that are polynomial in $N$. Specifically, we analyze the plug-in approach, which applies an efficient planning algorithm to an empirical model estimated from data. For fully heterogeneous WCMDPs, we establish the first finite-sample PAC guarantee with polynomial complexity and an $O(1/\sqrt{N})$ optimality gap. For homogeneous RBs, we further prove that a smaller optimality gap is achievable under mild structural assumptions. A primary technical contribution of our work is a novel Lyapunov-based analysis framework. Unlike classical approaches that rely on the difficult-to-control bias function, our framework uses an explicitly constructed Lyapunov function along with a drift transfer technique between the true and empirical models. A key step of independent interest in our framework is a fine-grained perturbation analysis for the underlying linear programming (LP) relaxation, which provides a general tool for analyzing LP-based policies and weakly-coupled systems.
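
As a rough illustration of the plug-in idea described above, the sketch below estimates an empirical transition model for a single arm by querying a generative model and counting outcomes; a planner would then be run on the estimate. The function name `estimate_arm_model`, the sampler interface, and the toy two-state arm are assumptions made for this example and are not taken from the paper.

```python
import random
from collections import Counter


def estimate_arm_model(sample_next_state, states, actions, n_samples, rng):
    """Count-based estimate of P(s' | s, a) for one arm.

    sample_next_state(s, a, rng) stands in for a generative-model query
    (hypothetical interface). Each (s, a) pair is sampled n_samples times.
    """
    model = {}
    for s in states:
        for a in actions:
            counts = Counter(sample_next_state(s, a, rng) for _ in range(n_samples))
            model[(s, a)] = {s2: counts[s2] / n_samples for s2 in states}
    return model


def toy_arm(s, a, rng):
    """Toy two-state arm (made up): action 1 flips the state with probability 0.8."""
    if a == 1 and rng.random() < 0.8:
        return 1 - s
    return s


rng = random.Random(0)
p_hat = estimate_arm_model(toy_arm, states=[0, 1], actions=[0, 1], n_samples=2000, rng=rng)
```

In this sketch each arm costs |S| x |A| x n_samples queries, so the total number of queries grows linearly with the number of arms rather than with the size of the joint state-action space.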

2606.13782 2026-06-16 cs.AI New submission

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang

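Affiliations: ModelBest Inc.; Tsinghua University

AI summary: Introduces MA-ProofBench, the first formal theorem-proving benchmark for mathematical analysis, with 200 theorems covering 6 core topics and 27 subcategories at two difficulty levels (undergraduate and Ph.D. qualifying). Current models perform poorly: GPT-5.5 reaches only 16% Pass@8 on Level I.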

Comments 19 pages, 4 figures, 4 tables

Abstract

Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.
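
The abstract reports Pass@8 without stating how it is computed. As a hedged illustration, the sketch below uses the widely used unbiased pass@k estimator; whether the benchmark uses this estimator or simply draws exactly 8 samples is an assumption here. `n` is the number of sampled proofs per problem and `c` the number that verify; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n attempts of which c are correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With exactly 8 samples per problem (`n == k == 8`), the estimate reduces to 1 if any sample is correct and 0 otherwise, so the benchmark score is the fraction of problems solved at least once.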

2606.13769 2026-06-16 cs.RO cs.CV cs.LG New submission

$\mu_0$: A Scalable 3D Interaction-Trace World Model

Seungjae Lee, Yoonkyo Jung, Jusuk Lee, Jonghun Shin, Amir Hossein Shahidzadeh, Yao-Chih Lee, H. Jin Kim, Jia-Bin Huang, Furong Huang

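Affiliations: University of Maryland, College Park; Seoul National University

AI summary: Proposes $\mu_0$, a scalable world model based on 3D traces that forecasts trajectories of interaction points, enabling cross-embodiment robot learning without action labels and performing competitively with models pretrained with action supervision.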

Abstract

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
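
To make the B-spline trace representation mentioned above concrete, the sketch below turns a handful of 3D control points into a smooth trajectory with a uniform cubic B-spline. The spline degree, the uniform knots, and the function names are assumptions for illustration; the abstract does not specify which B-spline variant the trace expert uses.

```python
def cubic_bspline_point(p0, p1, p2, p3, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1)."""
    w0 = (1 - u) ** 3 / 6
    w1 = (3 * u**3 - 6 * u**2 + 4) / 6
    w2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6
    w3 = u**3 / 6
    # The four weights are non-negative and sum to 1, so the point is a
    # convex combination of the control points.
    return tuple(w0 * a + w1 * b + w2 * c + w3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))


def bspline_trace(control_points, samples_per_segment=8):
    """Sample a trajectory from K >= 4 control points, each an (x, y, z) tuple."""
    if len(control_points) < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    trace = []
    for i in range(len(control_points) - 3):
        window = control_points[i:i + 4]
        for s in range(samples_per_segment):
            trace.append(cubic_bspline_point(*window, s / samples_per_segment))
    return trace


# Hypothetical control points along the x-axis.
trace = bspline_trace([(float(i), 0.0, 0.0) for i in range(6)])
```

A few control points compactly describe a smooth path of any sampling density, which is one plausible reason to predict control points rather than dense waypoints.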

2606.13751 2026-06-16 cs.CL New submission

Which Models Perform Better in Inheritance Reasoning?

Mohammed Amine Mouhoub, Chahinez Bouchekif

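Affiliations: Paris Dauphine University; University of Abou Bekr Belkaïd

AI summary: Compares commercial and open-source large language models on Islamic inheritance reasoning and finds that commercial models are better at identifying heirs, applying exclusion rules, and keeping reasoning consistent, with Gemini 2.5 Flash performing best.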

Abstract

This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi-step reasoning, and precise numerical computation. We compare commercial and open-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task-specific adaptation. Our results show a clear gap in reliability between the two model families. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps. In contrast, open-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments. The best performance is achieved by Gemini 2.5 Flash, with an MRE of $0.989$.

2606.13710 2026-06-16 cs.AI cs.LG New submission

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

Hongming Piao, Chi Liu, Mengzhuo Chen, Yan Shu, Xidong Wang, Derek Li, Ying Wei, Bryan Dai

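Affiliations: IQuest Research; Zhejiang University

AI summary: Proposes the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which co-evolves a proposer, solver, and judge through hybrid-mode reinforcement learning, so that an 8B model surpasses static open 8-32B models and state-of-the-art training methods on deep research tasks.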

Abstract

Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.

2606.13674 2026-06-16 cs.CV New submission

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu, Yu-Gang Jiang, Yinghao Xu

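Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Robbyant, Ant Group; Hong Kong University of Science and Technology

AI summary: Proposes RepWAM, a world action model built on representation visual-action tokenizers that jointly models future visual states and latent actions and performs strongly on real-world and simulated robot manipulation tasks.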

Abstract

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

2606.13655 2026-06-16 cs.CV cs.GR New submission

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

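Affiliations: University of Washington; World Labs

AI summary: Proposes Flex4DHuman, a multi-view video diffusion model conditioned on relative camera poses that converts monocular or sparse multi-view video into dense multi-view video without explicit geometry priors, for use in 4D Gaussian splat reconstruction.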

Comments Project Page: https://andy-cheng.github.io/Flex4DHuman/

Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
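
The conditioning signal described above is the relative SE(3) pose between a reference and a target camera. The sketch below shows the standard computation of that quantity from two camera-to-world poses, each given as a 3x3 rotation (nested lists) and a translation. The camera-to-world convention and all function names are assumptions for this example; how the paper turns the result into a positional encoding is not shown.

```python
def invert_rigid(R, t):
    """Invert a rigid transform x -> R x + t; the inverse is x -> R^T x - R^T t."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return Rt, t_inv


def compose(Ra, ta, Rb, tb):
    """Compose two rigid transforms: apply (Rb, tb) first, then (Ra, ta)."""
    R = [[sum(Ra[i][k] * Rb[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(Ra[i][k] * tb[k] for k in range(3)) + ta[i] for i in range(3)]
    return R, t


def relative_pose(R_ref, t_ref, R_tgt, t_tgt):
    """Pose of the target camera expressed in the reference camera frame,
    assuming both inputs are camera-to-world transforms."""
    R_inv, t_inv = invert_rigid(R_ref, t_ref)
    return compose(R_inv, t_inv, R_tgt, t_tgt)


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degrees about the z-axis

# Reference camera at the origin rotated about z; target one unit along world x.
R_rel, t_rel = relative_pose(ROT_Z_90, [0, 0, 0], IDENTITY, [1, 0, 0])
```

Because only the relative transform is used, the result does not depend on where the world origin is placed, which fits the abstract's claim that no absolute geometry priors are needed.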

2606.13608 2026-06-16 cs.AI cs.LG New submission

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

Affiliations: University of California, Berkeley; Purdue University; University of Ljubljana; University of Washington; Oasis Labs; University of Maryland; IBM Research; Mila; McGill University; ServiceNow Research; Carnegie Mellon University; National Institute of Standards and Technology; The Ohio State University; University of Cambridge; University of California, Santa Barbara

AI summary: Proposes the Agentified Agent Assessment (AAA) framework, which unifies the evaluation interface through standardized protocols (A2A and MCP) to enable open, reproducible, multi-agent evaluation, and validates its coverage, practicality, and fidelity with the AgentBeats system through a large-scale competition and a case study.

Abstract

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.

2606.13607 2026-06-16 cs.AI New submission

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

Zach Studdiford, Gary Lupyan

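Affiliations: University of Wisconsin–Madison

AI summary: By comparing the error patterns of humans and 25 LLMs in everyday causal reasoning, the study finds that both reason by pattern matching rather than from abstract world models, and identifies attention heads driving LLM responses that predict human reasoning errors.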

Comments 13 pages main text, 51 pages supplementary text

Abstract

When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.

2606.13578 2026-06-16 cs.CL cs.AI cs.LG cs.MM cs.RO New submission

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

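Affiliations: Zhejiang University; Shanghai AI Laboratory; Harbin Institute of Technology

AI summary: To address the data and embodiment bottlenecks that robots face when executing protocols in scientific laboratories, the paper proposes the simulation data engine RoboGenesis and LabVLA, a policy trained with a two-stage recipe, which achieves the highest average success rate on the LabUtopia benchmark.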

Comments Work in progress. Project website at https://zjunlp.github.io/LabVLA/

Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

2606.13441 2026-06-16 cs.AI cs.CL New submission

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

Joseph Keshet

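AI summary: Argues that large language models lack the commitment-bearing agency that moral responsibility requires: their outputs arise from probabilistic mappings rather than intrinsic intentionality, and stochastic sampling does not amount to choice.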

Abstract

Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

2606.13300 2026-06-16 cs.LG New submission

Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

Mariya Pavlova, Harrison Bo Hua Zhu, Lidia Vitanova, Elizaveta Semenova, Yingzhen Li

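AI summary: Proposes the Trajectory-based Quantization Sensitivity Score (TQS), which analyzes quantization error propagation from a dynamical-systems stability perspective and enables mixed-precision quantization without calibration data.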

Comments ICML 2026, Workshop on Forecasting as a New Frontier of Intelligence

Abstract

We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network's rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization budget planning even for black-box or compiled networks with fused operators. Building on this, we present TQS-PTQ, a flexible mixed-precision framework that requires no calibration data or costly second-order approximations. Our experiments show that a dynamical-systems perspective provides a robust, high-performing pathway for low-precision deployment in resource-constrained settings.
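
The abstract does not define TQS itself. The toy probe below only illustrates the underlying intuition that a perturbation injected at every rollout step is damped by a contracting system and amplified by an expanding one. It is not the paper's metric; `rollout_error_gain`, the scalar state, and the constant per-step perturbation `eps` standing in for quantization error are all assumptions.

```python
def rollout_error_gain(step, x0, eps, horizon):
    """Roll a scalar system forward twice, once cleanly and once with a
    fixed perturbation eps added after every step, and return the final
    deviation divided by eps."""
    clean = noisy = x0
    for _ in range(horizon):
        clean = step(clean)
        noisy = step(noisy) + eps  # stand-in for per-step quantization error
    return abs(noisy - clean) / eps


# Contracting map: accumulated error stays bounded (approaches 1 / (1 - 0.5) = 2).
stable_gain = rollout_error_gain(lambda x: 0.5 * x, x0=1.0, eps=1e-3, horizon=10)

# Expanding map: the same per-step error is amplified to 2**10 - 1 times eps.
unstable_gain = rollout_error_gain(lambda x: 2.0 * x, x0=1.0, eps=1e-3, horizon=10)
```

For a linear map x -> a x the gain is the geometric sum 1 + a + ... + a^(horizon - 1), which makes the link between stability (|a| < 1) and bounded error accumulation explicit.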

2606.13127 2026-06-16 cs.CV New submission

Fully Distributed Multi-View 3D Tracking in Real-Time

Byron Hernandez, Fangyu Li, Aotian Wu, Paul J. Shin, Kaustubh Purandare, Henry Medeiros

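Affiliations: University of Florida; NVIDIA Corporation

AI summary: Proposes MV3DT, a fully distributed framework that performs real-time multi-view 3D tracking through peer-to-peer coordination without central aggregation, reaching 96.5% IDF1 and 93.1% MOTA on WILDTRACK and sustaining 30 FPS on 100 cameras.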

Comments 18 pages, 4 figures, 2 algorithms, 4 tables

Abstract

Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 96.5% IDF1, 93.1% MOTA, and 94.6% MOTP on WILDTRACK, competitive with state-of-the-art centralized methods, and unprecedented 41.7% IDF1 and 50.9% MOTA on SCOUT while demonstrating superior scalability: sustaining 30 FPS on 100 cameras with <10ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.
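
The abstract reports IDF1 and MOTA without defining them. For reference, the sketch below computes both from aggregate counts using their standard definitions from the multi-object tracking literature. The function names are chosen for this example, and the counts in the usage lines are made-up numbers, not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT,
    where GT is the total number of ground-truth objects over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        raise ValueError("no detections and no ground truth")
    return 2 * idtp / denom


# Hypothetical counts for illustration only.
example_mota = mota(false_negatives=5, false_positives=3, id_switches=2, num_gt=100)
example_idf1 = idf1(idtp=90, idfp=10, idfn=10)
```

MOTA penalizes detection errors and identity switches frame by frame, while IDF1 measures how consistently identities are kept over whole trajectories, so the two can rank trackers differently.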

2606.13053 2026-06-16 cs.RO cs.AI 新提交

EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation

EV-WM: 面向长时域机器人操作的事件验证世界模型

Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng

发表机构 * AI Lab, Country Garden Services Group(碧桂园服务集团AI实验室) Fudan University(复旦大学) Omni AI

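AI summary Proposes EV-WM, a framework that augments pretrained-feature world models with event prediction and verification, so that task-progress signals in long-horizon manipulation can be assessed reliably and used for planning.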

详情
AI中文摘要

预训练特征世界模型为机器人想象提供了有用的基础,但仅凭视觉或潜在预测并不能确定想象的未来是否满足任务相关谓词。长时域操作需要关系性、谓词级和物理基础的进展信号:物体是否移动,抽屉或接触状态是否改变,放置谓词是否满足,以及候选未来是否足够可靠以执行。我们引入了EV-WM,一种面向世界模型规划的谓词基础验证框架。EV-WM在预训练视觉特征空间中展开候选未来,将其解码为结构化事件状态,并使用任务进展、语义一致性、物理可行性和不确定性项进行评分。验证器指导基于采样的规划,门控候选动作,并在接触敏感的LIBERO酒架设置中,从PPO生成的提议中进行选择。在导航、可变形物体、墙壁约束和语言描述的操作研究中,EV-WM表明谓词基础验证可以使特征空间世界模型规划更可解释,并更好地与任务进展对齐。

英文摘要

Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant predicates. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EV-WM, a predicate-grounded verification framework for world-model planning. EV-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EV-WM shows that predicate-grounded verification can make feature-space world-model planning more interpretable and better aligned with task progress.

2606.13003 2026-06-16 cs.AI cs.CL cs.MA 新提交

The Illusion of Multi-Agent Advantage

多智能体优势的错觉

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

发表机构 * Salesforce Research(Salesforce研究院) HKUST (Guangzhou)(香港科技大学(广州)) University of British Columbia(不列颠哥伦比亚大学) Nanyang Technological University(南洋理工大学)

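AI summary A systematic evaluation finds that automatically generated multi-agent systems trail single-agent baselines such as Chain-of-Thought with Self-Consistency in both performance and cost-efficiency, exposing flaws in existing evaluation frameworks and architectural bloat.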

详情
AI中文摘要

普遍观点认为多智能体系统优于单智能体系统,其优势包括上下文保护、并行处理和分布式决策。然而,这一主张的经验支持主要依赖于与使用优先考虑孤立推理任务的基准测试的单智能体基线的比较,这些基准测试未能充分评估这些优势。我们专注于自动生成的多智能体系统(旨在比手动设计的系统具有更强的泛化能力),对单智能体系统(特别是思维链自一致性)进行了严格、系统的评估。在传统推理数据集和具有交互式多步骤工作流的任务(例如 BrowseComp-Plus)上,我们证明自动多智能体系统始终不如思维链自一致性,尽管其成本高达10倍。为了将这些失败与任务结构固有的局限性隔离开来,我们引入了一个为多智能体系统量身定制的诊断性合成数据集,该数据集具有显式任务分解、上下文分离和并行化潜力。我们表明,专家设计的多智能体系统在该数据集上的原始性能和成本效率方面始终优于自动生成的架构,这表明现有的评估框架未能考虑增加计算成本的边际效用,从而掩盖了复杂多智能体系统的关键架构缺陷和低效性。关键的是,对生成的多智能体系统架构的系统解构表明,当前的自动化设计范式产生了架构膨胀,优先考虑表面复杂性,但这并未转化为功能效用,暴露了与多智能体原则的根本性错位。

英文摘要

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperform automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity that does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

2606.12978 2026-06-16 cs.RO cs.CV cs.SY eess.SY 新提交

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

针对视觉-语言-动作模型的轨迹级重定向攻击

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

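AI summary Identifies a trajectory-level vulnerability in VLA models: adversarial prompts that appear to preserve the original instruction can redirect the robot's final physical outcome. The paper formalizes a command-preserving trajectory redirection threat model and proposes an on-policy prompt search method.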

详情
AI中文摘要

视觉-语言-动作(VLA)策略将自然语言引入闭环机器人控制,使机器人能够直接从文本指令执行操作任务。同一接口赋予文本在控制中的循环角色,因为提示在每个重新规划步骤中被重复使用,每个提示条件化的动作会改变策略所作用的未来观测。现有的VLA攻击研究对抗性提示,这些提示引发目标低级动作或使此类动作在变化的图像中持续存在。我们识别出一个更强的轨迹级故障模式:一个提示仍然看起来指定了预期任务,但重定向了最终物理结果。我们在数学上将这种设置形式化为“命令保持的轨迹重定向”,这是一种仅提示的威胁模型,其中攻击者在情节开始前选择一个提示,所有策略和环境组件保持不变,并且提示必须保持接近良性指令,同时省略目标词和纠正语言。为了找到这样的提示,我们引入了一种同策略(on-policy)提示搜索方法,该方法利用推演(rollout)来发现扰动,其闭环行为跟踪目标任务,同时满足命令保持约束。在仿真和硬件上的实验表明,接近良性的提示扰动可以将VLA推演重定向到攻击者指定的目标。这些结果暴露了VLA指令落地(instruction grounding)中的轨迹级漏洞:看似保留预期命令的文本仍然可以让对手控制机器人的最终物理结果。项目网站:https://vla-redirection-attack.github.io/

英文摘要

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

2606.12688 2026-06-16 cs.LG cs.AI cs.DC 新提交

M*: A Modular, Extensible, Serving System for Multimodal Models

M*: 一个模块化、可扩展的多模态模型服务系统

Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

发表机构 * Stanford University(斯坦福大学) University of Washington(华盛顿大学) Carnegie Mellon University(卡内基梅隆大学)

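AI summary Proposes M*, a serving system that represents models as dataflow graphs and introduces the Walk Graph abstraction to serve composite multimodal models efficiently, lowering latency and raising throughput across several workloads.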

Comments The codebase is available at https://github.com/mstar-project/mstar

详情
AI中文摘要

我们正在进入一个复合模型架构的新时代,这些架构集成了多种组件,如视觉编码器、语言骨干网络、扩散和流头、音频编解码器、动作生成器和世界模型预测器。这种架构支撑了广泛的多模态模型类别,包括统一多模态模型、全能模型、语音-语言模型、视觉-语言-动作策略和世界模型。然而,现有的模型服务框架基于对模型结构的狭隘假设,难以适应这种新的架构多样性。在此,我们提出M*,一个用于高效服务复合AI模型的通用服务系统。M*将模型表示为数据流图,将跨越多种模态和任务的请求处理视为对这些图的遍历。核心洞察是一种模块化抽象,支持模型组件的任意组合、在物理集群上的灵活放置以及分布式运行时中的模型无关优化。我们将这种抽象称为Walk Graph,并展示它如何简洁地捕获来自广泛家族的复合模型。我们在代表性模型上实例化M*,发现与vLLM-Omni相比,在BAGEL上的文本到图像工作负载中,端到端延迟平均降低20%,同时在Qwen3-Omni上的文本到语音工作负载中,实时因子降低高达2.9倍,吞吐量提升高达2.7倍。M*在机器人规划任务上也比V-JEPA 2-AC rollout基线性能提升高达12.5倍。因此,我们的工作为以最小开发工作量高效服务复杂模型铺平了道路。

英文摘要

We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

2606.12486 2026-06-16 cs.LG 新提交

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

重型斯堪尼亚卡车中组件X的预测性维护实证研究

Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank

发表机构 * SnT, University of Luxembourg(卢森堡大学SnT) Scania CV AB(斯堪尼亚商用车公司)

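AI summary Proposes a condition-based predictive maintenance methodology for truck fleets that models wear-and-tear as a monotonically non-decreasing time series, keeps only the most recent observations, converts them to tabular data, and uses AutoML to simplify modeling, reducing costs on the Scania Component X dataset.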

详情
AI中文摘要

近年来,基于状态的预测性维护(PdM)在卡车车队中得到了广泛应用。这种维护策略旨在通过监测车辆的健康状况并根据其状态采取主动措施,最大限度地减少计划外停机并降低成本。然而,由于卡车产生的大量数据、通过传感器数据检测故障的内在复杂性以及在解决方案实施中寻找成本效益权衡的困难,基于状态的PdM系统的实施具有挑战性。在本文中,我们定义并验证了一种基于状态的PdM方法,该方法基于一个假设:被监测组件的磨损状态可以表示为单调非递减的时间序列。它涉及仅从时间序列中选择最近的观测值,并将其转换为表格格式,以便使用为表格数据设计的机器学习(ML)模型进行分类。我们的结果表明,与当前最先进(SOTA)方法相比,所提出的方法在Scania组件X数据集上降低了成本,同时通过AutoML简化了建模过程。

英文摘要

Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.
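The windowing step described in this abstract (keep only the most recent observations of each time series and flatten them into a tabular row) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the `window` size, the column naming, and the left-padding of short histories are all assumptions made for the example.

```python
from collections import defaultdict


def latest_window_rows(readouts, window=3):
    """Turn per-vehicle time series into one flat row per vehicle.

    readouts: iterable of (vehicle_id, time_step, value) tuples.
    window:   number of most recent observations to keep (assumed
              hyperparameter; the abstract does not give a value).
    """
    series = defaultdict(list)
    for vehicle_id, time_step, value in readouts:
        series[vehicle_id].append((time_step, value))

    rows = {}
    for vehicle_id, points in series.items():
        points.sort()  # chronological order by time_step
        recent = [value for _, value in points[-window:]]
        # Assumption: left-pad short histories with the earliest kept value
        # so that every row has the same number of columns.
        recent = [recent[0]] * (window - len(recent)) + recent
        rows[vehicle_id] = {
            f"obs_t-{window - 1 - i}": value for i, value in enumerate(recent)
        }
    return rows


# Hypothetical wear readouts; values never decrease over time.
readouts = [
    ("truck_a", 1, 0.10), ("truck_a", 2, 0.15),
    ("truck_a", 3, 0.15), ("truck_a", 4, 0.30),
    ("truck_b", 1, 0.05), ("truck_b", 2, 0.07),
]
table = latest_window_rows(readouts, window=3)
print(table["truck_a"])
print(table["truck_b"])
```

Each resulting row has a fixed width, so it can be handed to any classifier designed for tabular data, which is what makes the AutoML step mentioned in the abstract applicable.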

2606.12291 2026-06-16 cs.CL 新提交

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

测量大语言模型在误导性医疗上下文下的认知韧性

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

发表机构 * University of Oxford(牛津大学) University of Washington(华盛顿大学) University College London(伦敦大学学院) University of Waterloo(滑铁卢大学)

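AI summary Introduces the MedMisBench benchmark, which injects misleading context to test the epistemic resilience of LLMs in medical settings; mean accuracy falls from 71.1% to 38.0%, and authority-framed falsehoods reach a 69.5% attack success rate.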

详情
AI中文摘要

大型语言模型(LLMs)现在在医疗执照考试中达到专家级分数,这鼓励了高分数意味着安全医疗判断的假设,而患者越来越多地使用它们获取健康建议。我们证明这一假设是脆弱的:当误导性上下文被注入到LLMs最初正确回答的问题中时,它们会放弃正确答案。我们将这种在对抗性上下文中保持正确判断的能力称为认知韧性,并引入MedMisBench来测量它。MedMisBench包含10,932个医疗问题项目和48,889个误导性上下文-选项对,涵盖医疗推理、代理能力和患者旅程评估。在11个模型配置中,平均准确率从原始问题的71.1%下降到聚焦误导性上下文下的38.0%,攻击成功率为51.5%。最具破坏性的注入是正式的、规则式的捏造:权威框架的虚假信息达到69.5%的攻击成功率,例外投毒声明达到64.1%。来自7个国家的14名临床专家小组在38.2%的审查案例中识别出严重的潜在危害。MedMisBench暴露了LLM在医疗环境评估中的结构性盲点:现有基准衡量模型知道什么,但不衡量它们在误导性上下文下是否保持正确的医疗判断。

英文摘要

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

2606.12025 2026-06-16 cs.AI 新提交

Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

人类增强循环建模(HELM):基于智能体的混凝土桥梁护栏有限元建模

Quankai Wang, Yulin Xie, Tongfei Yang, Minghui Cheng, Ran Cao

发表机构 * College of Civil Engineering, Hunan University(湖南大学土木工程学院) Department of Civil and Architectural Engineering, University of Miami(迈阿密大学土木与建筑系) School of Architecture, University of Miami(迈阿密大学建筑学院)

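AI summary Proposes the HELM framework, which uses human-agent collaboration to decompose finite element modeling into verifiable checkpoints, raising the autonomous modeling success rate from 20% to 75% under MASH TL-4 and TL-5 conditions.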

详情
AI中文摘要

对桥梁护栏等安全关键基础设施进行有限元(FE)建模需要高保真非线性动态分析,然而当前的FE建模过程仍然劳动密集且缺乏自动化。本文提出了人类增强循环建模(HELM)框架,这是一种协作式人机协议,将长序列有限元建模分解为几何生成、边界条件定义和材料分配等离散的、可视觉验证的检查点。该框架通过一个包含20个案例的钢筋混凝土桥梁护栏矩阵在MASH TL-4和TL-5侧向荷载条件下进行演示,将专用智能体与两种广泛使用的商业FE软件(即ANSYS和LS-PrePost)对接。实验结果表明,HELM将基线自主建模成功率从20%提高到75%,其中几何和边界条件任务的智能体级通过率大约翻倍。误差分析显示,空间推理和代数逻辑限制构成了主要的失败模式,突显了结构化人在回路干预对建模自动化的价值。完整的智能体设计代码和提示已开源,可访问:https://github.com/SimAgentDev/Ansys-LSPP-AgentKit

英文摘要

Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, yet the current FE modeling process remains labor-intensive and lacks automation. This paper presents the Human-Enhanced Loop Modeling (HELM) framework, a collaborative human-agent protocol that decomposes long-sequence finite element modeling into discrete, visually verifiable checkpoints across geometry generation, boundary condition definition, and material assignment. The framework is demonstrated through a 20-case matrix of reinforced concrete bridge barriers under MASH TL-4 and TL-5 lateral loading conditions, interfacing specialized agents with two widely used commercial FE software packages, ANSYS and LS-PrePost. Experimental results show that HELM improves the baseline autonomous modeling success rate from 20% to 75%, with agent-level pass rates for geometry and boundary condition tasks approximately doubling. Error analysis reveals that spatial reasoning and algebraic logic limitations constitute the primary failure modes, underscoring the value of structured human-in-the-loop intervention for modeling automation. The complete agent design code and prompts are open-sourced and can be accessed at: https://github.com/SimAgentDev/Ansys-LSPP-AgentKit.

2606.11751 2026-06-16 cs.CV cs.AI 新提交

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) JD Explore Academy(京东探索研究院)

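AI summary Proposes AnchorEdit, described as the first autoregressive diffusion framework for multi-turn editing, which uses a causal memory mechanism and a self-rollout strategy to counter identity drift and error accumulation, maintaining high fidelity over 10+ interaction rounds.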

Comments Code: https://github.com/xuhang07/AnchorEdit

详情
AI中文摘要

多轮图像编辑对于迭代设计至关重要,但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性,但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit,首个专为高分辨率、长期多轮编辑设计的自回归(AR)扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距:保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差,以及用于高效4步生成的一致性蒸馏。在推理过程中,我们引入记忆机制来锚定初始主体身份,并确保在扩展编辑轨迹上的稳定外推。为评估性能,我们提供了一个新的高分辨率多轮编辑基准,旨在压力测试长期稳定性。大量实验表明,AnchorEdit达到了最先进的结果,即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

2606.11520 2026-06-16 cs.CL cs.AI cs.LG 新提交

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE:一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

发表机构 * University of Electronic Science and Technology of China(电子科技大学) SenseTime Research(商汤科技研究院)

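AI summary Proposes ISE, a three-stage paradigm that generates multi-turn agent trajectories through structured intent construction, role-locked user simulation, and a live execution environment; fine-tuning Qwen3-8B on the resulting data improves ClawEval pass@1 from 19.3 to 37.7.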

Comments 13 pages, 6 figures. Dataset and code: https://github.com/Valiere01/ISE-Trace

详情
AI中文摘要

训练有能力的操作系统代理需要能同时体现结构化用户意图、多轮任务委派和基于真实环境的工具执行的数据,而现有数据集缺乏这些属性。我们提出ISE(意图->模拟->执行),一种三阶段合成范式,联合解决这些差距。阶段1通过4D框架(人物角色x领域x任务x复杂度)构建约50000个结构化意图;去重后池中包含43956个唯一意图,并在mpnet-base-v2嵌入(余弦核,q=1)上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互,将每轮用户交互基于实际执行结果,生成23132条完整轨迹,平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用,生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后,使用Qwen3-8B在标准协议下的代理工具使用任务中,ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验表明多轮模拟贡献了很大一部分性能提升。我们在 https://github.com/Valiere01/ISE-Trace 发布所有源代码和数据集。

英文摘要

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution, properties that are absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning Qwen3-8B on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 on agent tool-use tasks under a standard protocol. This result outperforms both zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and the dataset at https://github.com/Valiere01/ISE-Trace.

2606.11381 2026-06-16 cs.CV 新提交

From Simulation to the Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting

从仿真到现实:面向机器人草莓采摘的实地6D位姿数据集与基线

Woojung Son, Won Suk Lee, Zijing Huang, Daeun Choi, Catia Silva, Yu She, Yan Gu

发表机构 * Department of Agricultural and Biological Engineering, University of Florida(佛罗里达大学农业与生物工程系) Department of Electrical and Computer Engineering, University of Florida(佛罗里达大学电气与计算机工程系) Edwardson School of Industrial Engineering, Purdue University(普渡大学爱德华森工业工程学院) School of Mechanical Engineering, Purdue University(普渡大学机械工程学院)

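AI summary To address the sim-to-real gap in 6D pose estimation for robotic strawberry harvesting, presents what the authors describe as the first in-field strawberry 6D pose ground-truth dataset (12,040 images) and a synthetic dataset with scene-level realism rendered in NVIDIA Isaac Sim, and quantifies the gap with baseline experiments.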

Comments 7 pages, 6 figures, 1 table

详情
AI中文摘要

机器人草莓采摘需要精确的6D位姿估计;然而,在实际农业田间收集6D位姿真值本身具有挑战性。现有的6D位姿估计方法因此仅依赖缺乏场景级真实感的合成数据,其在真实农业田间条件下的性能尚未量化。在这项工作中,我们提出了据我们所知的第一个在实际农业田间收集的草莓6D位姿真值数据集(12,040张图像)。我们还引入了一个在NVIDIA Isaac Sim中渲染的合成数据集,具有场景级真实感和域随机化。尽管如此,我们的实验表明,显著的仿真到现实差距仍然存在,强调了可靠评估需要真实农业田间数据。我们进一步通过跨骨干编码器的基线6D位姿估计结果量化了仿真到现实差距,作为未来工作的参考。真实世界数据集将在接收后公开。

英文摘要

Robotic strawberry harvesting requires precise 6D pose estimation; however, collecting 6D pose ground truth in real agricultural fields is inherently challenging. Existing strawberry 6D pose estimation studies have therefore relied mainly on synthetic data, often without sufficient scene-level realism, leaving their performance under real agricultural field conditions unquantified. In this work, we present, to the best of our knowledge, the first real-world 6D pose ground truth dataset of strawberries collected in actual agricultural fields (12,040 images). We also introduce a synthetic dataset rendered in NVIDIA Isaac Sim, featuring scene-level realism and domain randomization. Despite this improved simulation setup, our experiments reveal that a substantial sim-to-real gap persists, underscoring the necessity of real agricultural field data for reliable evaluation. We further quantify the sim-to-real gap through baseline 6D pose estimation results across backbone encoders, serving as a reference for future work.

2606.11349 2026-06-16 cs.AI cs.HC 新提交

Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

知道何时提问:分层语言代理的自门控澄清机制

Aijing Gao, Yiming Kang, Mengdie Flora Wang, Jae Oh Woo

发表机构 * Amazon Web Services(亚马逊云科技)

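AI summary Proposes ACTION-RATING, which places clarification requests inside the agent's action space on an ordinal scale shared with navigation, enabling self-gated clarification in hierarchical reasoning; two information-seeking modes, mandatory and opportunistic, emerge from the agent's own ratings.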

详情
AI中文摘要

在分层推理中,失败通常源于中间决策点,代理在没有意识到缺乏关键信息的情况下错误地选择了分支。我们不将澄清视为外部不确定性触发,而是提出ACTION-RATING,一种将澄清置于代理动作空间内、与导航共享序数尺度的公式,使得在每个决策点提问与行动直接竞争,并在中间状态可观察求助行为。从代理自身的评分中涌现出两种结构上不同的信息寻求模式:强制性(无可行分支)和机会性(尽管有领先候选但仍有残余不确定性)。在协调关税表分类(30,000节点分类树,三个基准,跨4个家族的9个LLM)上,我们观察到从强制性澄清到机会性澄清的机制转变,信息寻求有效性(ISE,一个局部诊断指标,定义为帮助交互后正确下一步导航步骤的比例,非最终任务指标)从50%上升到74%。三个诊断对比未能复现此结构。可分离性测试表明,当答案质量下降(准确率下降18.8%)时,信息寻求模式(模式分裂、ISE排名)保持不变,支持代理寻求帮助的位置与其所获帮助质量之间的经验分离。在受控答案通道下,10位数字准确率提升达+16.2%;我们将其解读为更好定位所能释放的上限,而非部署估计。

英文摘要

In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.
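The ISE diagnostic is defined in this abstract as the fraction of help interactions followed by a correct next navigation step. A minimal sketch of that computation is given below; the log format and the key names `asked` and `next_correct` are hypothetical, since the abstract does not describe how interactions are recorded.

```python
def information_seeking_effectiveness(interactions):
    """Fraction of help interactions followed by a correct next navigation step.

    interactions: list of dicts with hypothetical keys
      'asked'        - True if the agent requested clarification at this step
      'next_correct' - True if the navigation step that followed was correct
    """
    helped = [step for step in interactions if step["asked"]]
    if not helped:
        return None  # undefined when the agent never asks for help
    return sum(step["next_correct"] for step in helped) / len(helped)


# Made-up decision log: four help interactions, three followed by a correct step.
log = [
    {"asked": True, "next_correct": True},
    {"asked": False, "next_correct": True},
    {"asked": True, "next_correct": False},
    {"asked": True, "next_correct": True},
    {"asked": True, "next_correct": True},
]
print(information_seeking_effectiveness(log))  # 3 / 4 = 0.75
```

Steps where the agent did not ask are excluded from the denominator, which is why ISE is a local diagnostic of help-seeking rather than a measure of final task accuracy.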

2606.11123 2026-06-16 cs.LG 新提交

Overcoming Rank Collapse in Feedback Alignment

克服反馈对齐中的秩坍缩

Gauthier Boeshertz, Razvan Pascanu, Claudia Clopath

发表机构 * Imperial College London(伦敦帝国理工学院) Mila(Mila研究所)

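AI summary Finds that Feedback Alignment (FA) fails in deep networks because its error signal has low rank, and raises the signal's dimensionality with the Muon optimiser and hidden activity normalisation, improving ResNet-18 accuracy on CIFAR100 by 9 percentage points.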

Comments 9 pages and 4 figures, 1 table for main text. Total of 21 pages and 13 figures with appendix

详情
AI中文摘要

反向传播(BP)被广泛认为在生物学上不可行,部分原因在于它要求反馈权重是前向权重的转置以进行误差传播。有趣的是,当使用固定的随机反馈权重训练网络以规避此问题时,学习过程会将前向权重与反馈权重对齐,导致反向传播的误差信号成为BP使用的标准梯度的近似。这一过程称为反馈对齐(FA),在MLP和非常浅的CNN中有效,但难以扩展到更深层的架构。在这项工作中,我们首先研究了在CIFAR10上训练的BP和FA模型之间的差异,特别关注信号的有效秩。我们发现FA误差的秩显著较低,因此被限制在比BP更低维的子空间中,限制了参数空间的探索。受此观察启发,我们评估了两种增加FA有效维度的机制:Muon,一种使权重更新正交化的优化器;以及隐藏活动归一化,促进激活正交性。在更大的架构和基准测试中,我们发现这些方法一致地优于FA基线,例如,在CIFAR100上使用ResNet-18,准确率提高了9个百分点。我们的结果将低维梯度动力学确定为扩展FA的关键障碍,并表明诱导更高维的更新几何是扩展反向传播替代方法的有前途的途径。

英文摘要

Backpropagation (BP) is widely viewed as biologically implausible, in part because it requires feedback weights to be the transpose of forward weights for error propagation. Interestingly, when training a network with fixed random feedback weights to circumvent this issue, learning aligns the forward weights with the feedback weights, leading the backpropagated error signal to become an approximation of the standard gradient used by BP. This process, called Feedback Alignment (FA), occurs in MLPs and very shallow CNNs but does not scale well to deeper architectures. In this work, we first investigated differences between BP and FA models, trained on CIFAR10, specifically focusing on the effective rank of the signal. We found that the FA error has a considerably lower rank and hence is constrained to a lower-dimensional subspace compared to BP, limiting exploration of the parameter space. Motivated by this observation, we evaluated two mechanisms for increasing the effective dimensionality of FA: Muon, an optimiser that orthogonalises weight updates; and hidden activity normalisation, which promotes activation orthogonality. Across larger architectures and benchmarks, we find that these methods consistently improve over FA baselines; for example, on CIFAR100 with a ResNet-18, accuracy increases by 9 percentage points. Our results identify low-dimensional gradient dynamics as a key obstacle to scaling FA and suggest that inducing higher-dimensional update geometry is a promising route toward scaling alternatives to backpropagation.

2606.10862 2026-06-16 cs.CV cs.AI 新提交

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

LIBERO-Occ:通过视角想象评估和改进场景诱导遮挡下的视觉-语言-动作模型

Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Chinese University of Hong Kong(香港中文大学)

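AI summary To address the performance drop of VLA models under scene occlusion, proposes the LIBERO-Occ benchmark and the Viewpoint Imagination method, which improves robustness by generating a complementary view.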

Comments 14 pages, 7 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型在标准操作基准上取得了强劲的性能,但大多数评估假设任务相关物体完全可见。这一假设在现实场景中经常不成立,因为遮挡使得操作部分可观察。本文研究了场景诱导遮挡作为VLA模型的一个基本挑战,并引入了LIBERO-Occ,一个面向遮挡的LIBERO扩展。实验表明,最先进的VLA在遮挡下性能显著下降。为解决这一问题,我们提出了视角想象(VIM),该方法从遮挡的主观测中生成互补视图,并基于观察和想象证据共同进行动作预测。VIM在任务套件、遮挡类型和严重程度上提高了鲁棒性,且无需在部署时增加额外摄像头,表明视角想象是部分可观察操作中感知完成的一种有前景的机制。我们的基准和相应代码可在以下网址获取:https://github.com/litsh/Libero-Occ

英文摘要

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study scene-induced occlusion as a fundamental challenge for VLA models and introduce LIBERO-Occ, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose Viewpoint Imagination (VIM), which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: https://github.com/litsh/Libero-Occ.

2606.10740 2026-06-16 cs.AI cs.CL cs.LG 新提交

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

当思维链更清楚时:多轮推理模型的失败模式

Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi

发表机构 * GitHub

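AI summary Proposes the CoT-Output 2x2 safety matrix to diagnose hidden temporal failure dynamics in multi-turn reasoning models, and identifies two reproducible vulnerabilities: the oversight paradox and context-injection failure.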

Comments Accepted at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN)

详情
AI中文摘要

多轮推理模型中的失败在终端评分评估中基本不可见。模型可能在长对话早期锁定不安全立场,但其最终轮拒绝率可能看起来与稳健对齐的基线无法区分。为了揭示这些隐藏的时间动态,我们提出了一种轨迹级诊断方法——CoT-Output 2x2安全矩阵。该框架沿两个独立轴(内部推理和可见输出)标记每一轮,产生四个操作定义的失败单元:稳健对齐、对齐伪装、显式越狱,以及我们称为上下文注入失败的不同失败模式(其中CoT保持安全推理,但可见输出产生危害,突出了多轮推理不忠实的表现)。我们在五个监督条件下针对固定攻击者评估了三个蒸馏推理目标,在信息危害场景上收集了6750个轮级观察。我们的分析揭示了两个可复现的漏洞:一个监督悖论,其中显式监控线索反而增加对齐伪装率而非抑制它;以及一个上下文注入失败,其中模型尽管内部状态安全却锁定不安全的外部输出。我们发布了多轮对话和CoT轨迹的完整数据集,以支持后续的轨迹诊断研究。

英文摘要

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic, the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
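The two-axis labeling described in this abstract can be sketched as a small lookup over per-turn safety flags. Only one cell is spelled out in the abstract (safe CoT with harmful output is context-injection failure); the placement of the other three labels below is an assumption based on their names, and the function and variable names are hypothetical.

```python
from collections import Counter


def classify_turn(cot_safe: bool, output_safe: bool) -> str:
    """Map one dialogue turn to a cell of a CoT-Output 2x2 matrix."""
    if cot_safe and output_safe:
        return "robust alignment"           # assumed placement
    if cot_safe and not output_safe:
        return "context-injection failure"  # stated in the abstract
    if not cot_safe and output_safe:
        return "alignment faking"           # assumed placement
    return "overt jailbreak"                # assumed placement


# Made-up per-turn labels (cot_safe, output_safe) for one dialogue.
turns = [(True, True), (True, True), (False, True), (True, False), (False, False)]
cell_counts = Counter(classify_turn(cot, out) for cot, out in turns)
print(cell_counts)
```

Tallying cells per turn, rather than scoring only the final turn, is what lets a trace-level view surface failures that a terminal refusal rate would hide.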

2606.10550 2026-06-16 cs.CV cs.GR New submission

LentiAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

LentiAvatar:用于实时立体通信的伪多视图重建与亚像素棱镜渲染

Chufeng Fang, Dongdong Teng, Lilin Liu

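Affiliations * Sun Yat-sen University (中山大学)

AI summary Proposes LentiAvatar, a system that reconstructs a controllable head avatar from monocular video and drives a subpixel-encoded lenticular display for real-time glasses-free stereoscopic communication, using pseudo-multiview supervision and contour-aware losses to improve side-view quality.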

Comments 10 pages, 5 figures, 3 tables

Details
AI Chinese abstract

实时立体视频通信一直是沉浸式远程呈现的目标,但实际系统仍需要专门的捕获设备或将远程用户限制为单个肖像视图。我们提出LentiAvatar,一种高斯头部化身系统,将单目化身捕获与亚像素编码的裸眼光栅显示连接起来,用于实时自动立体通信。从单目肖像视频中,LentiAvatar重建可控头部化身,并针对显示引起的横向观看区域进行优化。该方法利用自然头部转动作为伪多视图(PMV)监督,以约束在单目训练中弱观察的区域,包括头发、耳朵、下颌轮廓和颈部边界。可靠的侧帧按偏航角分箱,对齐到虚拟相机,并在严格的头部和头发域内进行监督;轮廓感知损失和分阶段正则化进一步抑制鬼影、alpha泄漏和深度不稳定性,同时保留横向细节。在运行时,LentiAvatar渲染32个虚拟视图,并将其编码为具有校准亚像素路由掩码的4K光栅图像。实时跟踪原型保持10.65 FPS,而特定主体的蒸馏驱动将相同的显示管线提升至38.49 FPS。

English abstract

Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present LentiAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, LentiAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, LentiAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.
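
The yaw-binning step mentioned in the abstract might look roughly like the sketch below. The abstract gives no bin width or reliability criterion, so `bin_width_deg`, `min_abs_yaw_deg`, and the function name `yaw_bin_frames` are hypothetical choices made only for illustration, and the "reliable side frame" test is reduced here to a simple yaw-magnitude threshold.

```python
from collections import defaultdict


def yaw_bin_frames(frames, bin_width_deg=10.0, min_abs_yaw_deg=15.0):
    """Group side-view frames by head yaw angle.

    frames: iterable of (frame_id, yaw_deg) pairs.
    Frames with |yaw| below min_abs_yaw_deg are treated as near-frontal
    and skipped (an assumed stand-in for the paper's reliability filter).
    Returns {bin_index: [frame_id, ...]}, where bin k covers
    [k * bin_width_deg, (k + 1) * bin_width_deg).
    """
    bins = defaultdict(list)
    for frame_id, yaw_deg in frames:
        if abs(yaw_deg) < min_abs_yaw_deg:
            continue
        bin_index = int(yaw_deg // bin_width_deg)  # floor division
        bins[bin_index].append(frame_id)
    return dict(bins)


# Hypothetical tracked yaw angles for five frames.
tracked = [(0, 2.0), (1, 17.0), (2, -23.0), (3, 19.5), (4, -14.9)]
binned = yaw_bin_frames(tracked)
```

Each bin could then be associated with one virtual camera, so that frames in the same bin supervise the same lateral view; that association is also an assumption about how the binning is used.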

2606.10495 2026-06-16 cs.RO New submission

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

Act on What You See: 在视觉-语言-动作模型中解锁安全社交导航

Qingzi Wang, Xiyang Wu, Guangyao Shi, Dianwei Chen, Xianfeng Yang, Dinesh Manocha

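Affiliations * University of Maryland (马里兰大学) University of Southern California (南加州大学)

AI summary Proposes SALSA, a two-stage annotation-free post-training framework (social behavioral alignment and temporal safety alignment) that lets pretrained VLA models act on representations they already have for safe social navigation, reducing near-collisions by 86.4%.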

Details
AI Chinese abstract

安全社交导航要求机器人区分行人与普通障碍物,并在危险迫近前做出反应。我们表明,预训练的视觉-语言-动作(VLA)模型已在其内部表征中编码了行人-物体区分和未来碰撞信号,但行为克隆未能将这些信号转化为社交上合适的动作。为解决这一不匹配问题,我们提出SALSA,一个两阶段无标注后训练框架:(1)社交行为对齐将中间层社交特征桥接到动作头,并在反事实人-物场景对上训练以打破视觉显著性捷径;(2)时间安全对齐提供自动生成的未来风险监督,实现预期性碰撞避免。在SCAND和实际部署中,SALSA将近距离碰撞减少86.4%,并将社交反事实准确率从53%提升至93%,表明通过教导VLA策略利用其已拥有的表征来行动,可以实现更安全的社交导航。这些结果表明,通过更好地对齐潜在表征与动作生成,预训练VLA策略可被调整用于更安全的社交导航。

English abstract

Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.
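
One way to picture "automatically generated future-risk supervision" is a look-ahead labeling pass over a logged trajectory, sketched below. This is an illustrative assumption, not SALSA's actual labeling rule: the function name `future_risk_labels`, the `horizon` and `risk_radius` parameters, and the use of a per-step minimum pedestrian distance are all hypothetical.

```python
def future_risk_labels(min_distances, horizon=2, risk_radius=0.5):
    """Label each timestep by whether a near-collision lies ahead.

    min_distances: per-timestep distance (in meters) from the robot to
    the closest pedestrian, as logged along a trajectory.
    A step t is labeled 1 if any distance in steps t .. t + horizon
    falls below risk_radius, else 0. No manual annotation is needed.
    """
    labels = []
    for t in range(len(min_distances)):
        window = min_distances[t : t + horizon + 1]
        labels.append(1 if min(window) < risk_radius else 0)
    return labels


# Hypothetical logged distances; the closest approach is at step 3.
distances = [2.0, 1.5, 1.0, 0.4, 0.9, 1.6, 2.2]
risk = future_risk_labels(distances)
```

Under these assumptions, step 1 is already labeled risky while the pedestrian is still 1.5 m away, which is the kind of anticipatory signal the temporal safety alignment stage is described as providing.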