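arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.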

2606.03620 2026-06-03 cs.LG cs.AI

Physics-Guided Policy Optimization with Self-Distillation

Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei

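Affiliations: Amazon

AI summary: To address the training instability caused by a fixed step size in self-distilled policy optimization, the paper proposes Physics-Guided Policy Optimization (PGPO), inspired by viscous-fluid dynamics, which adjusts the step size dynamically using a mutual-information estimate, improving performance on the Science-QA dataset while keeping training stable.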

Abstract

Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD, and incurs negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of the 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.
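
The abstract does not state PGPO's formula, so the following is only a minimal sketch of the general idea of scaling a step size by an information-derived multiplier. The plug-in mutual-information estimator over paired discrete answers, the `tanh` squashing, and the names `mutual_information`, `modulated_lr`, and `scale` are illustrative assumptions, not the authors' method. Whether a high estimate should raise or lower the step is a design choice the abstract leaves open.

```python
import math
from collections import Counter

def mutual_information(student, teacher):
    """Plug-in mutual-information estimate (in nats) from paired discrete labels."""
    n = len(student)
    joint = Counter(zip(student, teacher))
    ps = Counter(student)
    pt = Counter(teacher)
    mi = 0.0
    for (s, t), c in joint.items():
        # p(s,t) * log( p(s,t) / (p(s) * p(t)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (ps[s] * pt[t]))
    return mi

def modulated_lr(base_lr, mi, scale=1.0):
    """Hypothetical multiplier in [0, 1): shrink the step when the estimate is small."""
    return base_lr * math.tanh(scale * mi)
```

With perfectly aligned binary answers the estimate equals `log 2`, and with independent answers it is zero, so the sketched step vanishes on batches where the two signals share no information.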

2606.03618 2026-06-03 cs.AI

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

Mehmet Utku Colak

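Affiliations: GitHub

AI summary: Proposes a pre-flight, edge-side prompt-rewriting middleware that uses a local Llama 3.2 model for cross-lingual translation and structural rewriting, cutting prompt tokens by 34-47% and total tokens by up to 18.8% while preserving or improving task accuracy.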

Comments Submitted to EMNLP 2026

Abstract

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.
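
The rewrite-with-fallback safeguard described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `count_tokens` is a crude whitespace stand-in for a real tokenizer, `required_pattern` is a hypothetical validation regex, and `rewrite` stands for whatever local model call performs the rewriting. None of these names come from the paper.

```python
import re
from typing import Callable

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated chunks."""
    return len(text.split())

def rewrite_with_fallback(prompt: str,
                          rewrite: Callable[[str], str],
                          required_pattern: str = r"\S") -> str:
    """Return the rewrite only if it validates and is not larger than the original."""
    try:
        candidate = rewrite(prompt)
    except Exception:
        return prompt  # the rewriter failed: fall back to the original
    if not re.search(required_pattern, candidate):
        return prompt  # the rewrite failed the regex check
    if count_tokens(candidate) > count_tokens(prompt):
        return prompt  # the rewrite would grow the prompt
    return candidate
```

The guard only ever returns either the original prompt or a candidate that is no longer than it, which is the property the abstract attributes to the middleware.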

2606.03610 2026-06-03 cs.CV

SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition

Yanan Liu, Anqi Zhu, Jingmin Zhu, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Dan Xu, Qiuhong Ke

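Affiliations: University of Science and Technology of China

AI summary: Proposes the SkelHCC framework, which uses hyperbolic geometry to encode the skeleton hierarchy and combines CLIP with a training-free cache for one-shot action recognition, achieving state-of-the-art results on three datasets.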

Comments Accepted by ICML 2026

Abstract

Skeleton-based action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and compositional structure of human motion while aligning effectively with high-level action semantics under extreme data scarcity. Existing approaches, largely based on Euclidean embeddings and low-level motion cues, struggle to model the tree-like organization of skeleton data, limiting cross-modal alignment and generalization to unseen action categories. We propose SkelHCC, a unified skeleton hyperbolic CLIP-driven cache adaptation framework for one-shot skeleton-based action recognition. SkelHCC introduces an Explicitly Hierarchical Hyperbolic CLIP (EH-HCLIP) module that embeds skeleton sequences and action language into a shared hyperbolic space. By leveraging the negative curvature and exponential volume growth of hyperbolic geometry, EH-HCLIP naturally encodes the joint-part-body hierarchy of human anatomy and yields structurally consistent cross-modal representations. To support efficient one-shot adaptation, SkelHCC further integrates a training-free LLM-guided Multi-granularity Voting Cache (LMV-Cache) for context-aware inference. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD demonstrate that SkelHCC consistently outperforms state-of-the-art methods.
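
To make the appeal to negative curvature concrete, here is a small sketch of the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. The abstract does not say which model or curvature SkelHCC uses, so the unit ball with curvature -1 and the function name `poincare_distance` are assumptions made for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(a * a for a in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq_norm(u)) * (1 - sq_norm(v))))
```

Distances grow without bound as points approach the boundary of the ball, which is why tree-like hierarchies such as joint-part-body can be embedded with low distortion.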

2606.03608 2026-06-03 cs.LG cs.AI

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng

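Affiliations: Sun Yat-Sen University; University of Science and Technology of China; National University of Singapore

AI summary: Proposes the TTRL-CoCoV framework, whose confidence-adaptive mechanism addresses incorrect pseudo-labels and diversity collapse when optimizing Pass@k in label-free settings, significantly improving both Pass@1 and Pass@k.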

Abstract

Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Existing studies focus on Pass@1 performance, while optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in the label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.
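
The three-way branching on confidence can be sketched as below. This is an illustrative outline only: measuring confidence as the majority answer's share of the sampled answers, and the thresholds `low` and `high`, are assumptions that the abstract does not specify.

```python
from collections import Counter

def majority_confidence(answers):
    """Return the most common answer and its share of the samples."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)

def route(answers, low=0.3, high=0.8):
    """Pick a branch by confidence; `low` and `high` are hypothetical thresholds."""
    _, conf = majority_confidence(answers)
    if conf >= high:
        return "high"    # bootstrap the verifier, add an exploration reward
    if conf < low:
        return "low"     # let the verifier select the pseudo-label
    return "medium"      # bypass verification
```

For example, nine identical answers out of ten fall in the high branch, while five distinct answers fall in the low branch.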

2606.03604 2026-06-03 cs.CL

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Luyao Ye, Huimin Wang, Hanqi Yan, Binyang Li, Kam-Fai Wong, Yulan He

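Affiliations: The Chinese University of Hong Kong; Huawei; Central China Normal University; Shenzhen University; King’s College London; University of International Relations

AI summary: Large vision-language models (LVLMs) tend to describe the literal content of a meme rather than its pragmatic intent. The paper proposes the Intent Projection framework, which performs literal-pragmatic decomposition at the representation, output, and objective levels, outperforming open-source models on six benchmarks and narrowing the gap to proprietary models.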

Abstract

When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response. We reframe meme understanding as a problem of literal-pragmatic decomposition and propose Intent Projection, a framework that separates the two signals at the representation, output, and objective levels within a single LVLM backbone. At the representation level, an orthogonal projection module removes dominant unimodal directions from the fused image-text representation, retaining only the pragmatic residual, while a surface-real affect classifier anchors the decoder with a discrete tag that names the polarity gap. At the output level, the model externalizes a structured reasoning chain, and at the objective level a contrastive reward explicitly penalizes answers that restate the literal description. Across six multimodal benchmarks, Intent Projection consistently outperforms open-source baselines and narrows the gap to proprietary models, with the largest gains on high-divergence posts where literal collapse is most damaging.
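
The core linear-algebra step behind removing a direction from a representation is a projection onto the orthogonal complement. The sketch below shows that step for a single direction; how the paper picks its dominant unimodal directions is not stated in the abstract, and the name `remove_direction` is hypothetical.

```python
def remove_direction(x, u):
    """Project x onto the orthogonal complement of the (non-zero) direction u."""
    norm_sq = sum(a * a for a in u)
    coeff = sum(a * b for a, b in zip(x, u)) / norm_sq
    return [a - coeff * b for a, b in zip(x, u)]
```

The result has zero inner product with `u`. Applying the function once per direction removes several directions exactly only when those directions are mutually orthogonal.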

2606.03603 2026-06-03 cs.CV cs.CL

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

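Affiliations: Nanyang Technological University

AI summary: Proposes the controlled concrete reasoning framework and the PF-OPSD method, which combine the visual simulation of world models with the abstract reasoning of multimodal large language models to improve performance and robustness on spatial lookahead and open-domain physical prediction tasks.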

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.

2606.03602 2026-06-03 cs.LG cs.AI cs.CL

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

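Affiliations: Shanghai AI Laboratory; Shanghai Innovation Institute; Shanghai Jiao Tong University; Nanjing University; Tongji University

AI summary: Proposes the CauTion framework, which uses consensus filtering and LLM reliability estimation to reliably integrate LLM domain knowledge into multiple statistical causal discovery algorithms, addressing both the limitations of purely statistical methods and LLM errors.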

Abstract

Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical indistinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which is then used to govern a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee that the final causal graph is acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
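
The first stage, consensus filtering, can be sketched as follows. This is a minimal illustration that assumes unanimity as the consensus rule and represents each algorithm's output as a set of directed edges. The function name `consensus_split` is hypothetical, and the later trust-weighted arbitration and cycle repair are not shown.

```python
def consensus_split(edge_sets, candidates):
    """Split candidate directed edges into unanimous decisions and disputes."""
    resolved, disputed = {}, []
    for edge in candidates:
        votes = [edge in edges for edges in edge_sets]
        if all(votes):
            resolved[edge] = True    # every algorithm includes the edge
        elif not any(votes):
            resolved[edge] = False   # every algorithm omits the edge
        else:
            disputed.append(edge)    # left for a later arbitration stage
    return resolved, disputed
```

Only the disputed edges would then need any LLM involvement, which is how a scheme of this kind keeps token cost down.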

2606.03590 2026-06-03 cs.RO

CANMOT: Class-Aware Noise Modeling for Multi-Object Tracking in Autonomous Driving

Timo Osterburg, Stefan Schütte, Torsten Bertram

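Affiliations: Institute of Control Theory and Systems Engineering, TU Dortmund University

AI summary: For multi-object tracking in autonomous driving, proposes CANMOT, a class-aware and object-aligned noise modeling framework that introduces class-specific process and measurement noise covariance matrices, expressed in the object coordinate frame to preserve longitudinal-lateral anisotropy, improving tracking performance and substantially reducing identity switches.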

Comments submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

Kalman filter (KF)-based multi-object tracking (MOT) remains a strong baseline for autonomous driving due to its strong performance, computational efficiency and interpretability. In most practical systems, the process noise and measurement noise covariances are defined globally and shared across object classes, presuming identical uncertainty characteristics across heterogeneous traffic participants. This work revisits this assumption and proposes CANMOT, a class-aware and object-aligned noise modeling framework for KF-based 3D MOT. Class-specific diagonal process and measurement covariance matrices are introduced and optionally expressed in the object coordinate frame to preserve longitudinal-lateral anisotropy. Systematic experiments on the nuScenes benchmark show that class-aware and object-aligned noise modeling improves tracking performance and substantially reduces identity switches compared to state-of-the-art (SotA). In addition, the consistency of the estimated uncertainty is analyzed using the Average Normalized Estimation Error Squared (ANEES) and $\chi^2$-based violation tests. The results reveal severe overconfidence in standard KF-based MOT baselines. While the proposed formulation improves calibration without modifying the underlying filtering framework, it still exhibits substantial inconsistency, highlighting the need for further research in this area. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms.
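
Expressing a diagonal covariance in the object frame and then using it in a world-frame filter amounts to the rotation $R Q R^\top$ by the object's yaw. The sketch below works this out for the 2D position block. The class names and variance values in `CLASS_NOISE` are made-up placeholders, not values from the paper.

```python
import math

# Hypothetical per-class variances (longitudinal, lateral) in m^2.
CLASS_NOISE = {"car": (1.0, 0.1), "pedestrian": (0.3, 0.3)}

def world_frame_covariance(cls, yaw):
    """Rotate the diagonal object-frame covariance into the world frame: R Q R^T."""
    q_lon, q_lat = CLASS_NOISE[cls]
    c, s = math.cos(yaw), math.sin(yaw)
    xx = c * c * q_lon + s * s * q_lat
    yy = s * s * q_lon + c * c * q_lat
    xy = c * s * (q_lon - q_lat)
    return [[xx, xy], [xy, yy]]
```

For an isotropic class the off-diagonal term vanishes at every yaw, so object alignment only matters for classes whose longitudinal and lateral uncertainties differ.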

2606.03584 2026-06-03 cs.LG cond-mat.dis-nn cs.NE

Training a Predictive Coding Network on ImageNet using Equilibrium Propagation

Tugdual Kerjan, Rasmus Høier, Benjamin Scellier

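Affiliations: Rain AI

AI summary: Proposes a training method for predictive coding networks that combines centered Equilibrium Propagation with a novel equilibration scheme, training a 10-layer convolutional PCN on ImageNet to a 13.23% top-5 error rate, close to the backpropagation baseline.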

Abstract

Equilibrium Propagation (EP) is a physics-based training framework that has primarily been employed in energy-based models, including continuous Hopfield networks, nonlinear resistive networks and coupled phase oscillators. However, EP's practical applications have so far remained limited to relatively small-scale problems. Predictive coding networks (PCNs), another class of energy-based models rooted in computational neuroscience, are typically trained with a specialized algorithm and have likewise not yet been demonstrated at large scale. In this work, we develop an EP-based training method for PCNs which combines the centered variant of EP with a novel equilibration scheme for PCNs. Using this approach, we train a 10-layer convolutional PCN (VGG10) on full-size ImageNet, achieving 13.23% test error rate on the top-5 classification task, close to the 12.2% backpropagation baseline. To our knowledge, this is the first demonstration of both PCNs and EP-based training at ImageNet scale. These results significantly extend the scalability of both approaches and suggest that the primary challenges in scaling EP in other physical systems may come more from the computational properties of these systems than from inherent limitations of the EP framework.

2606.03581 2026-06-03 cs.CV cs.RO

UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene via Rendering Fusion

Ye Wu, Ruiqi Song, Baiyong Ding, Nanxin Zeng, Junjie Cheng, Yunfeng Ai

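Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Waytous Inc.

AI summary: Proposes UnsOcc, a multi-modal framework that uses a rendering-based fusion module and detail-aware auxiliary supervision based on Gaussian Splatting to address difficult cross-modal fusion and long-tail distributions in unstructured scenes, outperforming existing methods on an open-pit mine dataset and nuScenes.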

Comments 8 pages

Abstract

Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.

2606.03578 2026-06-03 cs.CV

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

Tianxiong Zhong, Xingye Tian, Xuebo Wang, Xin Tao, Pengfei Wan

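Affiliations: University of Science and Technology of China

AI summary: Systematically studies the diffusability of latent representations in latent diffusion models and proposes Velocity Irreducible Variance (VIV) as a stable predictor of generation quality.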

Abstract

Latent diffusion models leverage visual tokenizers to compress images into latent spaces for efficient generative modeling. However, better reconstruction quality of a tokenizer does not necessarily translate into better generation quality, suggesting that latent representations should be evaluated not only by fidelity but also by their diffusability. Recent studies have proposed diverse explanations for diffusion-friendly latent spaces, including semantic separability, affine equivariance, distribution uniformity, spatial structure, spectral smoothness, and manifold continuity. Yet these properties are often validated on a limited set of tokenizers, leaving it unclear which factors are most predictive of downstream generation quality and whether such conclusions hold beyond the specific settings in which they are introduced. In this work, we conduct a systematic study of latent diffusability by training a large collection of tokenizers with diverse regularization strategies, architectures, and latent configurations, and evaluating them with multiple downstream diffusion backbones. Our analysis identifies several latent properties that consistently correlate with generation quality and exhibit strong generalization across experimental settings. Beyond existing metrics, we introduce Velocity Irreducible Variance (VIV), a measure of velocity ambiguity induced by trajectory crossings. Extensive experiments show that VIV is one of the most stable predictors of generation quality.

2606.03577 2026-06-03 cs.CV

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen

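Affiliations: State Key Laboratory of CAD & CG, Zhejiang University; Ant Group; Westlake University

AI summary: Proposes the ReasonMatch-Bench benchmark and Dynamic Correspondence Reinforcement Learning (DCRL) to systematically evaluate and improve the spatial reasoning of multimodal large language models on wide-baseline matching.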

Comments CVPR 2026. Project page: https://aim-uofa.github.io/reasonmatch/ Code: https://github.com/aim-uofa/ReasonMatch

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

2606.03569 2026-06-03 cs.CV cs.AI

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang

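Affiliations: Shandong University; National University of Singapore (Suzhou) Research Institute

AI summary: Visual token pruning for vision-language model inference loses feature diversity when it relies on a single attention score. The paper proposes STS, a two-stage pruning framework that first maximizes structural diversity through repulsion-based sampling and then filters semantically irrelevant tokens with instruction-aware cross-attention, improving the structural diversity and fine-grained task alignment of the retained tokens.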

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.
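
A structure-then-semantics pruning pass can be sketched in two small functions. Note the substitution: the abstract does not define its repulsion-based sampler, so greedy farthest-point sampling is used here as a stand-in diversity sampler, and a precomputed `relevance` list stands in for instruction-aware cross-attention scores. All names are hypothetical.

```python
def farthest_point_sample(points, k):
    """Greedy diversity sampling: repeatedly pick the point farthest from those chosen."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    chosen = [0]  # start from the first token (an arbitrary choice)
    dist = [sq_dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # each point remembers its distance to the nearest chosen point
        dist = [min(d, sq_dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen

def prune(points, relevance, k_diverse, k_final):
    """Stage 1 keeps k_diverse spread-out tokens; stage 2 keeps the k_final most relevant."""
    kept = farthest_point_sample(points, k_diverse)
    kept.sort(key=lambda i: relevance[i], reverse=True)
    return sorted(kept[:k_final])
```

Running diversity selection first means that a cluster of near-duplicate, highly scored tokens can contribute at most a few survivors to the second stage.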

2606.03568 2026-06-03 cs.CV cs.AI cs.LG cs.RO

Learned Non-Maximum Suppression for 3D Object Detection

Timo Osterburg, Stefan Schütte, Torsten Bertram

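Affiliations: Institute of Control Theory and Systems Engineering, TU Dortmund University

AI summary: Proposes two learned filtering modules (D2D-Rescore and GossipNet3D) that replace heuristic NMS by exploiting relations among detections, improving 3D detection performance, especially accuracy on small objects and rare classes.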

Comments 6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026

Abstract

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms.
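
For context, the kind of heuristic being replaced looks roughly like the following greedy, distance-based suppression in bird's-eye view. This is a sketch in the spirit of circle-style NMS rather than the exact CircleNMS baseline used in the paper. The tuple layout `(x, y, score)` and the name `circle_nms` are assumptions.

```python
def circle_nms(detections, radius):
    """Greedy BEV suppression over (x, y, score) tuples; returns kept indices.

    A detection is kept only if no higher-scoring kept detection has its
    centre within `radius` of it.
    """
    order = sorted(range(len(detections)),
                   key=lambda i: detections[i][2], reverse=True)
    kept = []
    for i in order:
        x, y, _ = detections[i]
        if all((x - detections[j][0]) ** 2 + (y - detections[j][1]) ** 2 > radius ** 2
               for j in kept):
            kept.append(i)
    return kept
```

The rule looks at one pair of detections at a time and uses a single fixed radius, which is the limitation that learned, relation-aware filtering aims to address.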

2606.03557 2026-06-03 cs.AI cs.HC

From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

从提示到服务：基于SLM的AI驱动虚拟世界代理编排网关

Louis Nisiotis, Aimilios Hadjiliasi

发表机构 * University of Cambridge（剑桥大学）

AI总结本文提出一种基于小语言模型的代理编排网关，通过意图驱动的服务路由解耦虚拟世界客户端与异构AI后端，并在虚拟博物馆测试床中验证了其可行性和效率。

详情

AI中文摘要

随着生成式AI能力的扩展，AI驱动的虚拟世界面临日益增长的架构挑战。用户通过世界内界面以多模态方式进行交互，但其请求需要根本不同的AI后端模型和计算资源。将这些能力直接嵌入虚拟世界系统会降低可扩展性、增加维护复杂性，并限制协调分布在边缘和云基础设施上的服务的能力。本文提出一种基于SLM的代理编排网关，这是一种轻量级运行时协调机制，通过意图驱动的服务路由将虚拟世界客户端与异构AI后端解耦。边缘部署的SLM对每个用户提示的语义意图进行分类，可配置的服务注册表验证并解析路由决策，然后透明地调用所选后端，从而无需修改客户端应用即可在虚拟世界中引入新的AI能力。该网关在InterwovenXR虚拟博物馆测试床中实现并评估。评估表明，紧凑型SLM可以在边缘硬件上作为可靠的意图路由器，并且任务特定的微调可以将参数低于十亿的模型转化为实用的低延迟路由器。一种分层配置将微调后的十亿以下参数模型作为路由器，与用于对话响应生成的较大SLM配对，证明可以在中端边缘硬件上部署，并且比将两个职责委托给单个模型更高效。研究结果表明，SLM可以支持虚拟世界中实用的AI服务编排，并且该工作贡献了一种经过评估的架构，用于可伸缩、可扩展且支持边缘的AI交互，使虚拟代理成为分布式生成式AI服务的访问点。

英文摘要

As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend models and computational resources. Embedding these capabilities directly into virtual world systems reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples a virtual world client from heterogeneous AI backends through intent-driven service routing. An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is invoked transparently, enabling new AI capabilities to be introduced in the virtual world without modifying the client application. The gateway is implemented and evaluated within the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can serve as reliable intent routers on edge hardware, and that task-specific fine-tuning can transform sub-billion-parameter models into practical, low-latency routers. A layered configuration pairing a fine-tuned sub-billion-parameter model as router with a larger SLM for conversational response generation is shown to be deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings show that SLMs can support practical AI service orchestration in virtual worlds, and the work contributes an evaluated architecture for scalable, extensible, and edge-supported AI interaction, enabling virtual agents to become access points to distributed generative AI services.
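
The classify, validate, and invoke flow described in the abstract can be sketched as follows. This is an illustrative toy rather than the paper's gateway: `classify_intent` is a keyword stand-in for the edge-deployed SLM, and the registry contents, intent labels, and fallback rule are all assumptions.

```python
def classify_intent(prompt):
    """Stand-in for the SLM intent classifier: a keyword heuristic (assumption)."""
    text = prompt.lower()
    if "draw" in text or "image" in text:
        return "image_generation"
    if "translate" in text:
        return "translation"
    return "conversation"

# Hypothetical service registry: intent label -> backend callable.
REGISTRY = {
    "image_generation": lambda p: f"[image backend] {p}",
    "conversation": lambda p: f"[chat backend] {p}",
}

def route(prompt, registry=REGISTRY, fallback="conversation"):
    """Classify the prompt, validate the label against the registry,
    then invoke the resolved backend."""
    intent = classify_intent(prompt)
    if intent not in registry:  # validation: unregistered label falls back
        intent = fallback
    return intent, registry[intent](prompt)

print(route("Draw a vase from the museum collection"))
```

Because the client only ever calls `route`, registering a new backend changes the registry and not the client, which is the decoupling the abstract emphasizes.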

2606.03556 2026-06-03 cs.RO

Partially Observable Adversarial Patch Attacks on Vision-Language-Action Models in Robotics

部分可观测的对抗性补丁攻击在机器人视觉-语言-动作模型上的应用

Xiaofei Wang, Mingliang Han, Tianyu Hao, Yi Yang, Yun-Bo Zhao, Keke Tang

发表机构 * Department of Automation, University of Science and Technology of China（中国科学技术大学自动化系）； SmartMore Corporation（SmartMore公司）； Cyberspace Institute of Advanced Technology, Guangzhou University（广州大学网络空间先进技术研究院）； th Medical Center of Chinese PLA General Hospital（中国人民解放军总医院第八医学中心）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（合肥综合性国家科学中心人工智能研究院）

AI总结针对机器人VLA模型，提出部分可观测威胁模型下的两阶段攻击框架，利用注意力图定位关键区域并优化补丁以破坏语义接地和增加动作轨迹曲率，导致长期任务失败。

Comments Accepted by IEEE Robotics and Automation Letters, 2026

详情

AI中文摘要

视觉-语言-动作（VLA）模型在机器人领域受到关注，但其对对抗性攻击的鲁棒性仍鲜有探索。现有工作表明对抗性补丁可以误导基于VLA的机器人，但假设完全访问整个执行轨迹，这在实践中是不现实的。我们通过制定部分可观测威胁模型来解决这一限制，其中攻击者只能利用轨迹的短前缀来生成固定补丁，应用于所有后续帧。在此设置下，我们提出了一个两阶段框架。首先，我们使用模型的注意力图定位补丁，以识别与完整指令对应的视觉关键区域。然后，我们优化补丁以破坏目标对象的语义接地并增加动作轨迹的曲率，从而在感知和控制中复合故障。在模拟和真实机器人环境中的大量实验表明，我们的方法在部分可观测性下维持对抗效果，诱导长期中断并显著降低任务成功率。

英文摘要

Vision-language-action (VLA) models are gaining attention in robotics, yet their robustness to adversarial attacks remains largely unexplored. Existing work shows that adversarial patches can mislead VLA-based robots but assumes full access to the entire execution trajectory, an unrealistic requirement in practice. We address this limitation by formulating a partially observable threat model, where the adversary can exploit only a short prefix of the trajectory to generate a fixed patch applied to all subsequent frames. Under this setting, we propose a two-phase framework. First, we localize the patch using the model's attention maps to identify visually critical regions that correspond to the full instruction. Then, we optimize the patch to disrupt the semantic grounding of target objects and increase the curvature of action trajectories, thereby compounding failures in both perception and control. Extensive experiments in simulation and real-world robotic environments show that our method sustains adversarial effects under partial observability, inducing long-horizon disruptions and significantly reducing task success rates.

2606.03551 2026-06-03 cs.RO

NVIDIA Isaac Sim: Enabling Scalable, GPU-Accelerated Simulation for Robotics

NVIDIA Isaac Sim：实现可扩展的GPU加速机器人仿真

Sicong Gao, Maurice Pagnucco, Tomasz Bednarz, Yang Song

发表机构 * School of Computer Science and Engineering, The University of New South Wales（新南威尔士大学计算机科学与工程学院）； NVIDIA USA（NVIDIA美国公司）

AI总结本文系统综述了NVIDIA Isaac Sim的架构、应用模式及局限性，重点分析其GPU加速在大规模并行训练、合成数据生成和物理精确建模方面的优势，并探讨了未来方向。

详情

AI中文摘要

仿真已成为机器人研究的核心基础设施。与以往的仿真器不同，NVIDIA Isaac Sim利用GPU加速实现大规模并行训练和物理精确建模。其合成数据生成流水线缓解了高质量训练数据的稀缺性，支持数据驱动的机器人学习和大规模以仿真为中心的实验。然而，现有综述通常将其视为众多仿真器之一，缺乏对其架构特性、使用模式和局限性的系统分析。本文从系统和应用角度综述Isaac Sim，概述其架构并与广泛使用的仿真器进行比较。我们分析了五个主要领域的代表性研究，总结了常见的使用模式，特别是在数据生成和高保真仿真方面。我们还概述了关键的未来方向和挑战，包括物理开放世界学习、以仿真为中心的训练以及实际可用性约束。

英文摘要

Simulation has become a core infrastructure for robotics research. Unlike previous simulators, NVIDIA Isaac Sim leverages GPU acceleration to enable large-scale parallel training and physics-accurate modeling. Its synthetic data generation pipeline alleviates the scarcity of high-quality training data, supporting data-driven robot learning and large-scale simulation-centric experimentation. However, existing surveys often treat it as one simulator among many, without a systematic analysis of its architectural characteristics, usage patterns, and limitations. This survey reviews Isaac Sim from system and application perspectives, outlining its architecture and comparing it with widely used simulators. We analyze representative studies across five major domains and summarize common usage patterns, particularly in data generation and high-fidelity simulation. We also outline key future directions and challenges, including physics open-world learning, simulation-centric training and practical usability constraints.

2606.03549 2026-06-03 cs.LG math.PR

How Many Trees in a Random Forest? A Revisited Approach with Plateau Search and Optuna Integration

随机森林中需要多少棵树？一种结合平台搜索与Optuna集成的重新审视方法

Vadim Porvatov, Andrey Dukhovny, Andrey Lange

发表机构 * Sberbank ； Skolkovo Institute of Science and Technology (Skoltech)（Skoltech）； Federal Research Center "Computer Science and Control" of Russian Academy of Sciences (FRC CSC RAS)（俄罗斯科学院计算机科学与控制联邦研究中心）

AI总结提出一种基于三元组平台搜索的算法，通过监控袋外分数的相对变化自动确定随机森林的树数量，避免预设搜索范围，并提供了理论分析和实验验证。

详情

AI中文摘要

随机森林的超参数优化在调整树数量时面临一个特定困难：预测分数通常随集成规模单调提升，因此诸如树结构Parzen估计器（TPE）和Hyperband等标准方法需要预定义搜索范围，且往往将估计推向其右边界。早停策略避免了固定这样的范围，但对分数噪声敏感且容易过早停止。为解决此问题，我们提出一种集成的基于三元组的平台搜索算法，该算法将树数量从直接TPE搜索空间中移除，同时仍利用跨HPO试验积累的信息。该方法通过监控三个森林规模上的袋外（OOB）分数相对变化，自适应地跟踪接近最小的充分集成规模，并相应移动该三元组。这产生了一个基于容差参数的自动化且用户可解释的过程。我们还提供了理论分析：我们将所提出的相对OOB分数准则与当前分数和极限分数之间的差距联系起来，并推导了相应的基于OOB的绝对相对差异的渐近方差估计。实验表明，所选树数量可能与常见启发式方法有显著差异：对于大多数经典基准数据集，它更小；而对于一些高维生物信息学数据集（如Arcene和Dorothea），则更大。源代码和可重复实验可在以下网址获取：https://github.com/lange-am/rf_plateau_hpo

英文摘要

Hyperparameter optimization (HPO) for Random Forest faces a specific difficulty in tuning the number of trees: the predictive score typically improves monotonically with ensemble size, so standard methods such as Tree-structured Parzen Estimator (TPE) and Hyperband require a predefined search range and often drive the estimate toward its right boundary. Early-stopping strategies avoid fixing such a range, but can be sensitive to score noise and prone to premature stopping. To address this, we propose an integrated triplet-based plateau-search algorithm that removes the number of trees from the direct TPE search space and still exploits information accumulated across HPO trials. The method adaptively tracks a near-minimal sufficient ensemble size by monitoring relative changes in the out-of-bag (OOB) score across a triplet of forest sizes and shifting this triplet accordingly. This yields an automated and user-interpretable procedure based on a tolerance parameter. We also provide a theoretical analysis: we relate the proposed relative OOB-score criterion to the gap between the current and limiting scores, and derive an asymptotic variance estimate for the corresponding OOB-based absolute relative difference. Experiments show that the selected number of trees can differ substantially from the common heuristic: for most classical benchmark datasets it is smaller, whereas for some high-dimensional bioinformatics datasets, such as Arcene and Dorothea, it is larger. The source code and reproducible experiments are available at https://github.com/lange-am/rf_plateau_hpo.
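
The idea of monitoring relative score changes across a triplet of forest sizes can be sketched roughly as below. This is a simplified, one-directional illustration and not the authors' algorithm, which also shares information across HPO trials: the function name `plateau_search`, the fixed `step`, and the synthetic score curve are assumptions, and `oob_score` stands in for fitting a forest of a given size and reading its OOB score.

```python
def plateau_search(oob_score, start=50, step=50, tol=1e-3, max_trees=5000):
    """Shift a triplet of forest sizes (n, n+step, n+2*step) to the right
    until both consecutive relative score changes fall below `tol`.

    oob_score: callable mapping a tree count to an OOB score (hypothetical);
               scores are assumed to be non-zero.
    Returns the smallest size of the first triplet that looks flat,
    or `max_trees` if no such triplet is found.
    """
    n = start
    while n + 2 * step <= max_trees:
        s = [oob_score(n + k * step) for k in range(3)]
        rel = [abs(s[k + 1] - s[k]) / abs(s[k]) for k in range(2)]
        if max(rel) < tol:
            return n
        n += step
    return max_trees

# Synthetic, monotonically saturating score curve used only for illustration.
def toy_score(n_trees):
    return 0.9 - 0.5 / n_trees

print(plateau_search(toy_score, tol=1e-3))
```

The tolerance is the single user-facing knob here: a looser `tol` accepts an earlier, smaller triplet, which mirrors the interpretability argument made in the abstract.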

2606.03545 2026-06-03 cs.RO

Static and Dynamic Representations for Tactile Contact-Angle Estimation with Event-Based Sensors

基于事件传感器的触觉接触角估计的静态与动态表示

Yanhui Lu, Efi Psomopoulou, Benjamin Ward-Cherrier

发表机构 * School of Engineering Mathematics and Technology, University of Bristol（布里斯托大学工程数学与科技学院）

AI总结本文利用事件触觉传感器（NeuroTac）的事件流，比较了三种事件衍生的空间轮廓表示（动态、静态及其组合）用于接触角估计，并验证了其在机器人操作中实现高频、低延迟触觉角度估计的潜力。

Comments 8 pages, 8 figures. Submitted to IEEE Robotics and Automation Letters (RAL), under review

详情

AI中文摘要

基于事件的触觉传感为接触密集的机器人交互提供了低延迟信号采集。本文研究了使用来自事件触觉传感器（NeuroTac）的事件流进行接触角估计，并比较了三种事件衍生的空间轮廓表示：捕获近期事件活动的动态表示、恢复更持久接触状态的静态表示以及它们的组合表示。在评估的运动场景中，所有表示管道在所有测试采样间隔下的P99处理延迟均低于10毫秒，展示了它们在机器人操作中用于高频基于事件的触觉角度估计的潜力。在特定场景训练下，静态表示始终比动态和组合表示表现略好，在连续传感器滚动期间产生平均总体MAE为0.160°，在随机插入的运动中断期间停止阶段平均MAE为0.251°。它还在速度和压痕深度变化方面表现出比其他两种表示更小的性能波动。

英文摘要

Event-based tactile sensing offers low-latency signal acquisition for contact-rich robotic interaction. This paper investigates contact-angle estimation using event streams from an event-based tactile sensor (NeuroTac) and compares three event-derived spatial contour representations: a dynamic representation capturing recent event activity, a static representation recovering a more persistent contact state, and their combined representation. Across the evaluated motion scenarios, all representation pipelines exhibited P99 processing latency below 10 ms at all tested sampling intervals, demonstrating their potential for high-frequency event-based tactile angle estimation in robotic manipulation. The static representation consistently achieved marginally better performance than the dynamic and combined representations under scenario-specific training, yielding a mean overall MAE of 0.160° during continuous sensor rolling and a stop-phase mean MAE of 0.251° during randomly inserted motion interruptions. It also exhibited smaller performance fluctuations across speed and indentation depth variations than the other two representations.

2606.03544 2026-06-03 cs.AI cs.CL

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

SAGE: 智能体生态中社会化演化的定量评估

Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

发表机构 * Tsinghua University, China（清华大学, 中国）； Meituan, China（美团, 中国）

AI总结提出SAGE框架，通过对比社会演化（SocialEvo）与自我演化（SelfEvo）两种计算条件，在三个领域评估共享经验对智能体性能的影响，发现群体历史并非普遍放大器，但能帮助陷入停滞的智能体取得突破，且社会收益依赖于抽象能力而非暴露量。

Comments 13 pages, 5 figures

详情

AI中文摘要

自我改进的语言智能体通常被孤立评估：一个智能体尝试任务、接收反馈并迭代优化自身行为。然而，智能体越来越多地与同伴一起运作，其策略和结果公开可见。这引发了一个研究不足的问题：共享经验何时能产生自我改进无法单独实现的改进？我们引入了SAGE（社会智能体群体演化），一个评估框架，比较两种计算匹配的条件：SocialEvo，其中来自五个不同模型家族的智能体共同演化，可访问所有同伴的历史；以及SelfEvo，其中每个智能体获得相同数量的任务尝试，但只能看到自己的过去，这是自我改进智能体研究中的常规做法。我们在三个领域实例化SAGE：开放式机器学习研究、长期经济规划和战略多人游戏，并在多个演化轮次中进行评估。我们发现群体历史并非普遍放大器：最强的智能体并未超过其自我演化上限。然而，在自我改进下停滞的智能体，当同伴经验可用时，可以取得重大突破。在竞争环境中，反事实控制显示智能体普遍改进，而非发展针对对手的策略。在不同形式的共享历史中，过滤后的同伴轨迹和反思性摘要通常优于原始日志，表明社会收益依赖于抽象而非暴露量。这些发现表明，同伴历史收益是智能体特定的、领域依赖的，并取决于从公共轨迹中抽象可转移知识的能力。

英文摘要

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

2606.03539 2026-06-03 cs.CV

Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding

零空间中知识保留的模型调优用于鲁棒的时空视频定位

Haoxuan Chen, Xianqin Liu, Jian-Fang Hu

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, China（中山大学计算机科学与工程学院）； National Information Center of GACC (Guangdong), GuangZhou, China（广东省GACC国家信息中心）； Guangdong Province Key Laboratory of Information Security Technology, China（广东省信息安全技术重点实验室）； Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China（教育部机器智能与高级计算重点实验室）

AI总结针对低质量视频导致预训练知识被破坏的问题，提出零空间调优（NST）框架，通过将可学习残差限制在冻结权重的零空间内来保留预训练知识，同时利用质量自适应单元和双空间重参数化合成残差，在混合质量基准上达到最优性能。

Comments Accepted by ICME 2026

详情

AI中文摘要

时空视频定位旨在基于文本查询定位目标管。尽管近期方法取得了显著成功，但它们主要关注高质量输入，忽略了现实场景中广泛存在的低质量视频。虽然像LoRA这样的调优方法可以适应降质输入，但它们不可避免地破坏了预训练知识。为解决这一问题，我们提出了零空间调优（NST）。该框架利用了将冻结权重的零空间内的向量添加到层输入不会影响输出的几何性质。利用这一点，NST将可学习残差注入输入特征，这些残差可以选择性地对预训练骨干网络不可见。具体地，NST结合了质量自适应单元和双空间重参数化来合成这些残差，通过将高质量输入的组件限制在零空间内，同时将低质量输入的恢复组件引导至非零空间。由于冻结权重消除了零空间组件，我们有效地纠正了降质输入，同时保留了高质量输入的预训练知识。大量实验表明，NST在我们的混合质量基准上优于最先进的方法。

英文摘要

Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
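
The geometric property NST relies on, that adding a null-space vector of a weight matrix to the layer input leaves the output unchanged, can be checked on a toy example. The 2x3 matrix, the input, and the scale 0.7 below are arbitrary illustrative values, and the cross product is used only because it is a convenient way to obtain a null-space vector for a 2x3 matrix; nothing here reflects the paper's actual architecture.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cross(a, b):
    """Cross product of two 3-vectors; the result is orthogonal to both."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # toy "frozen" weight with two independent rows
x = [0.5, -1.0, 2.0]           # toy layer input

n = cross(W[0], W[1])          # orthogonal to both rows, so W @ n = 0
x_shifted = [xi + 0.7 * ni for xi, ni in zip(x, n)]

print(matvec(W, n))            # → [0.0, 0.0]
print(matvec(W, x), matvec(W, x_shifted))
```

Both products in the last line agree up to floating-point rounding, so a residual confined to the null space is invisible to the frozen layer, while any component outside it would change the output.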

2606.03536 2026-06-03 cs.RO

Bionic Human-Motion Style Transfer for Physically Executable Whole-Body Control of Humanoid Robots

仿人运动风格迁移用于人形机器人物理可执行全身控制

Tianchen Huang, Mingkuan Zhao, Yang Gao, Feiyang Yuan, Junchi Gu, Xiaohu Zhang, Dongdong Zhao, Shi Yan, Yu Wang, Wei Gao, Shiwu Zhang

发表机构 * Institute of Humanoid Robots, Department of Precision Machinery and Precision Instrumentation, University of Science and Technology of China（人形机器人研究院，精密机械与精密仪器系，中国科学技术大学）； School of Computer Science and Technology, Faculty of Electronic and Information Engineering, Xi’an Jiaotong University（计算机科学与技术学院，电子与信息工程学院，西安交通大学）； School of Information Science and Engineering, Lanzhou University（信息科学与工程学院，兰州大学）

AI总结提出一种仿生生成到控制框架，通过物理感知多条件潜扩散模型和预览式全身跟踪策略，将短时人体风格示例迁移到不同运动内容上，实现人形机器人可执行且表达性强的全身运动。

Comments Project page: https://huangtc233.github.io/bionic-style-transfer/

详情

AI中文摘要

表达性全身运动对于在人类环境中运行的人形机器人至关重要，机器人需要稳定移动的同时呈现可读且可调整的身体行为。然而，大多数表达性运动仍来自固定演示或手动设计的脚本，难以在不同运动内容间复用演示风格。受人体运动风格通过步态节奏、姿态、手臂摆动和身体摇摆传递情感和意图线索的启发，本文提出了一种仿生生成到控制框架，用于人形机器人上的示例驱动风格迁移。给定一个短时人体风格示例和目标内容运动，所提框架生成一个风格化全身参考，保留预期运动内容的同时迁移演示风格。开发了一个物理感知多条件潜扩散模型来融合风格、内容和轨迹条件，并使用无分类器引导在不重新训练的情况下调整风格强度。为提高硬件可执行性，在训练期间对解码后的运动施加接触一致性和时间平滑正则化。生成的参考随后转换为G1兼容的机器人参考，并由基于预览的全身跟踪策略执行，该策略采用聚类和蒸馏策略训练。仿真和Unitree G1实验表明，所提方法可以将短时人体风格示例迁移到多样化的机器人运动内容，与面向动画的风格迁移基线相比减少接触和抖动伪影，并在125次真实机器人试验中达到96.0%的成功率。结果证明了使用短时人体运动示例作为可复用的仿生源实现物理可执行表达性人形运动的可行性。

英文摘要

Expressive whole-body motion is important for humanoid robots operating in human environments, where robots are expected to move stably while presenting readable and adjustable body behaviors. However, most expressive motions are still obtained from fixed demonstrations or manually designed scripts, making it difficult to reuse a demonstrated style across different motion contents. Inspired by the way human motion styles convey affective and intentional cues through gait rhythm, posture, arm swing and body sway, this paper proposes a bionic generation-to-control framework for exemplar-driven style transfer on humanoid robots. Given a short human style exemplar and a target content motion, the proposed framework generates a stylized whole-body reference that preserves the intended motion content while transferring the demonstrated style. A physics-aware multi-condition latent diffusion model is developed to fuse style, content and trajectory conditions, and classifier-free guidance is used to adjust the style intensity without retraining. To improve hardware executability, contact-consistency and temporal-smoothness regularization are imposed on decoded motions during training. The generated references are then converted into G1-compatible robot references and executed by a preview-based whole-body tracking policy trained with a cluster-and-distill strategy. Simulation and Unitree G1 experiments show that the proposed method can transfer short human style exemplars to diverse robot motion contents, reduce contact and jitter artifacts compared with animation-oriented style-transfer baselines, and achieve a 96.0% success rate over 125 reported real-robot trials. The results demonstrate the feasibility of using short human motion exemplars as reusable bionic sources for physically executable expressive humanoid motion.
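
The way classifier-free guidance adjusts conditioning strength at sampling time, without retraining, can be sketched with the standard single-condition formula. The abstract does not give the paper's multi-condition variant, so the function below is only the textbook form, and the name `cfg_combine` and the toy vectors are assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance weight w.
    w = 0 ignores the condition, w = 1 uses it as is, w > 1 exaggerates it.
    """
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 0.0]   # toy prediction without the style condition
eps_style = [1.0, 2.0]    # toy prediction with the style condition

for w in (0.0, 1.0, 2.0):
    print(w, cfg_combine(eps_uncond, eps_style, w))
```

Since `w` is applied only when the two predictions are combined, style intensity becomes an inference-time dial, which is the property the abstract highlights.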

2606.03532 2026-06-03 cs.LG cs.AI

When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

教师何时应该移动？自在线策略蒸馏中的时间耦合与稳定性

Haowei Guo, Baolong Bi, Ruicheng Zhang, Bingqian Sun, Wentao Zhang

发表机构 * Peking University（北京大学）； University of Chinese Academy of Sciences（中国科学院大学）； Tsinghua University（清华大学）

AI总结研究自在线策略蒸馏中教师更新调度对稳定性的影响，提出基于隔离期和门控机制的CGTR方法，实现零崩溃和最佳性能。

详情

AI中文摘要

自在线策略蒸馏针对从自身参数历史派生的教师训练学生策略，但教师的更新调度（控制教师与学生之间的时间耦合）尚未作为稳定性变量被系统研究。通过对Qwen3-8B进行受控调度扫描，我们确定隔离期（定义为更新之间教师完全冻结）是实现稳定学习的关键结构属性，而非教师年龄。为了刻画这些底层训练动态，我们引入了一个诊断框架，包括时间KL结构、刷新冲击和长度尾部风险。该框架进一步揭示了状态遗忘崩溃：最优的短视固定调度在长视训练下灾难性失败，因为时钟驱动的刷新可以在单个不可逆步骤中将短暂漂移的学生复制到教师中。这种失败模式在短视评估下不可见，并且在机制上不同于EMA的慢性污染。为了解决这个问题，我们提出了巩固门控教师刷新（CGTR），它在保持隔离期的同时，基于奖励改进和长度尾部安全的联合证据对每次刷新进行门控，确保每次教师移动响应于真正的学生巩固而非时钟信号。使用单一共享参数集且无需每数据集重新调整，CGTR在所有四个任务（化学、生物学、物理学、工具使用）上实现了零崩溃和最佳最终分数，并自动调节其刷新频率以适应每个任务的学习动态。

英文摘要

Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule, which governs the temporal coupling between teacher and student, has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that isolation periods, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers state-oblivious collapse: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA's chronic contamination. To address this, we propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.

2606.03521 2026-06-03 cs.LG cs.AI

Post-Hoc Robustness for Model-Based Reinforcement Learning

基于模型的强化学习的后验鲁棒性

Siemen Herremans, Ali Anwar, Siegfried Mercelis

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结提出一种在推理时利用学习模型和名义策略进行鲁棒策略改进的后验鲁棒化方法，通过对抗性展开的模型预测控制提升鲁棒性，无需额外训练神经网络。

详情

AI中文摘要

为了提高强化学习（RL）在现实世界中的适用性，对抗鲁棒RL领域研究如何在对抗环境扰动下训练智能体。在该设置中，主角智能体在对手的环境扰动下优化策略，形成零和马尔可夫博弈。当对抗鲁棒RL与基于模型的RL结合时，对手可以针对学习到的转移模型而非训练环境。扩展这一思想，本文引入了深度RL智能体在推理时的后验鲁棒化。通过将学习模型与训练的名义策略结合使用，我们的方法执行鲁棒策略改进步骤。目标是提高鲁棒性而无需对神经网络进行额外训练。具体来说，我们利用对抗性展开下的模型预测控制，这些展开通过有界不确定性集内的投影梯度下降进行近似。此外，这些离线展开在执行时考虑并缓解了分布外问题。通过在扰动的Gymnasium MuJoCo环境中评估算法，同时考虑后验推理设置的计算限制，验证了所提方法在鲁棒性上的显著提升。

英文摘要

To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an adversary, resulting in a zero-sum Markov game. When adversarially robust RL is combined with model-based RL, the adversary can target a learned transition model instead of the training environment. Extending this idea, this work introduces post-hoc robustification of deep RL agents at inference time. By using the learned model in combination with a trained nominal policy, our approach performs a robust policy improvement step. The goal is to improve robustness without any additional training of neural networks. Specifically, we utilize model-predictive control under adversarial rollouts, which are approximated via projected gradient descent within a bounded uncertainty set. Furthermore, these offline rollouts are performed while considering and mitigating out-of-distribution issues. The proposed methodology is validated by demonstrating significant improvements in robustness when the algorithm is evaluated in perturbed Gymnasium MuJoCo environments, while considering the computational limitations of the post-hoc inference setting.
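
The inner adversarial step, projected gradient descent inside a bounded uncertainty set, can be illustrated on a toy objective. This is a generic PGD sketch over an L-infinity ball and not the paper's implementation: the function names, the quadratic stand-in for "return under perturbation", and the values of `eps`, `lr`, and `steps` are all assumptions.

```python
def pgd_minimize(grad_fn, dim, eps, lr=0.1, steps=100):
    """Projected gradient descent on a perturbation d constrained to the
    L-infinity ball {d : |d_i| <= eps}."""
    d = [0.0] * dim
    for _ in range(steps):
        g = grad_fn(d)
        d = [di - lr * gi for di, gi in zip(d, g)]   # gradient step
        d = [max(-eps, min(eps, di)) for di in d]    # projection by clipping
    return d

# Toy stand-in for the agent's return under perturbation d (assumption):
# J(d) = (d0 - 2)^2 + (d1 + 0.1)^2, with its gradient written in closed form.
def toy_grad(d):
    return [2.0 * (d[0] - 2.0), 2.0 * (d[1] + 0.1)]

worst = pgd_minimize(toy_grad, dim=2, eps=0.5)
print(worst)
```

The first coordinate ends on the boundary of the set, because the unconstrained minimizer at 2 lies outside it, while the second settles near its interior minimizer of -0.1. In the setting described above, the gradient would instead come from differentiating rollouts through the learned transition model.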

2606.03518 2026-06-03 cs.AI cs.CR

Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

覆盖治理：面向代理型人工智能的委托与范围的组合授权框架

Amjad Ibrahim, Yong Li

发表机构 * Huawei Heisenberg Research Center（华为海森堡研究中心）

AI总结针对代理型AI中传统授权框架无法处理递归委托、动态范围等问题，提出一种组合治理框架，通过定义委托类型、权限责任和资源范围衰减，并引入组合算子在不重写现有策略的情况下叠加代理语义，实现可问责的授权。

Comments 12 pages

详情

AI中文摘要

随着AI系统从被动模型演变为能够发起行动、协作和委托任务的自主主动代理，软件系统的传统边界变得模糊。围绕固定主体、显式请求和静态范围构建的传统授权和委托框架不足以治理代理系统。代理型AI需要更丰富的授权语义：代理必须继承和委托权限，在时间限制的权限下行动，并通过共享协议进行协调。现有的身份和访问管理（IAM）系统未能完全捕捉这种代理概念，缺乏递归委托、上下文边界和动态范围作为可执行治理原语的机制。与OAuth 2.0等访问委托标准不同，我们将委托视为合同条款，而不仅仅是基于静态令牌的同意凭证。本文提出一个组合治理框架，引入了代理型AI不可或缺的原语。我们定义了委托类型及其权限和问责含义，并引入了资源范围衰减的概念以限制代理访问范围。这些概念被表达为通用的关系定义，可以组合到现有的授权域（例如金融系统）中。为了操作化这种组合，我们定义了一个组合算子，将新的代理语义（例如递归委托链）叠加到现有关系策略上，而无需重写它们。我们通过形式化证明和实证评估来证实该框架，表明它为代理型AI系统中的可问责授权提供了形式化且实用的基础。

英文摘要

As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks, the traditional boundaries of software systems blur. Traditional authorization and delegation frameworks, built around fixed principals, explicit requests, and static scopes, are insufficient to govern agentic systems. Agentic AI demands richer authorization semantics: agents must inherit and delegate permissions, act under time-limited authority, and coordinate through shared protocols. Existing Identity and Access Management (IAM) systems fail to fully capture this notion of agency, lacking mechanisms for recursive delegation, contextual boundaries, and dynamic scoping as executable governance primitives. Unlike access delegation standards such as OAuth 2.0, we treat delegation as a contractual term rather than merely a static token-based consent credential. This paper proposes a compositional governance framework that introduces primitives indispensable for agentic AI. We define types of delegation and their permissions and accountability implications, and we introduce a notion of resource scope attenuation to bound agentic access envelopes. These concepts are expressed as general relational definitions that can be composed into existing authorization domains (e.g., financial systems). To operationalize this composition, we define a compositional operator that overlays new agentic semantics, such as recursive delegation chains, onto existing relational policies without rewriting them. We substantiate this framework through formal proofs and empirical evaluation, showing that it provides a formal yet practical foundation for accountable authorization in agentic AI systems.
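
The notions of recursive delegation and resource scope attenuation can be made concrete with a small sketch in which each delegated grant can only narrow the scope of its parent. The `Grant` type, the scope strings, and the chain-walking check are hypothetical illustrations and do not reproduce the paper's relational definitions or its compositional operator.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Grant:
    holder: str
    scope: FrozenSet[str]
    parent: Optional["Grant"] = None

def delegate(parent, holder, requested):
    """Issue a child grant whose scope is the requested scope intersected
    with the parent's scope, so authority can shrink but never grow."""
    return Grant(holder, frozenset(requested) & parent.scope, parent)

def is_authorized(grant, resource):
    """Allow a resource only if every grant in the delegation chain covers it."""
    g = grant
    while g is not None:
        if resource not in g.scope:
            return False
        g = g.parent
    return True

root = Grant("alice", frozenset({"read:ledger", "write:ledger", "read:hr"}))
agent_a = delegate(root, "agent-a", {"read:ledger", "write:ledger"})
agent_b = delegate(agent_a, "agent-b", {"read:ledger", "read:hr"})

print(sorted(agent_b.scope))  # → ['read:ledger']
```

Here `agent-b` asks for `read:hr`, but `agent-a` never held it, so it is dropped. Keeping the `parent` link preserves the chain needed to trace who delegated what.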

2606.03512 2026-06-03 cs.RO cs.AI

SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts

SPADE: 草图引导的路径规划增强扩散专家

Charbel Abi Hana, Tatiana Ghantous, Mikael Khalil, Anthony Rizk

发表机构 * IDEALworks GmbH ； IMT Atlantique ； IDEALworks GmbH & Saint Joseph University of Beirut（IDEALworks GmbH及贝鲁特圣约瑟夫大学）

AI总结提出一种结合扩散增强的框架，通过改进的标注工具和训练策略，在保持实时性的同时提升路径规划的泛化能力和鲁棒性，显著降低姿态误差和FID。

详情

DOI: 10.65109/RIHP6974

AI中文摘要

路径规划对于自主移动机器人（AMR）至关重要。将人类偏好纳入规划的常规方法通常依赖于复杂的奖励工程或硬件密集型解决方案。最近的最先进框架利用模仿学习从专家演示中训练特定行为的路径规划模型。然而，这些方法面临两个关键限制：对未见环境的泛化能力有限，以及演示收集中的鲁棒性较低。为了解决这些挑战，本文介绍了一个增强框架，专注于两个主要贡献：一个基于ROS 2重构的标注工具，以及一种新颖的训练策略，将基于扩散的数据增强集成到基线行为克隆模型中。提供了专家演示数据集，并通过消融研究评估所提出解决方案的鲁棒性。增强方法优于最先进的方法，绝对姿态误差（APE）降低39.1%，Fréchet Inception距离（FID）降低33.5%，同时可训练参数减少93.8%。此外，它达到了扩散级别的泛化能力，同时保留了最先进模型的实时、边缘端运行特性。

英文摘要

Path planning is essential for Autonomous Mobile Robots (AMRs). Conventional methods for incorporating human preferences into planning typically rely on either complex reward engineering or hardware-intensive solutions. Recent state-of-the-art frameworks leverage imitation learning to train behavior-specific path planning models from expert demonstrations. However, these approaches face two key limitations: limited generalization to unseen environments and low robustness in demonstration collection. To address these challenges, this work introduces an enhanced framework that focuses on two main contributions: an overhauled annotation tool built on ROS 2, and a novel training strategy that integrates diffusion-based augmentation into baseline behavioral cloning models. A dataset of expert demonstrations is provided and evaluated through ablation studies to assess the robustness of the proposed solution. The enhanced approach outperforms state-of-the-art methods with 39.1% lower Absolute Pose Error (APE) and 33.5% lower Fréchet Inception Distance (FID) while having 93.8% fewer trainable parameters. Moreover, it attains diffusion-level generalization while preserving the real-time, on-edge properties of state-of-the-art models.

2606.03508 2026-06-03 cs.CV

Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection

结构引导混合掩码预训练与空间连续性正则化用于印刷电路板缺陷检测

Peitong Wang, Nuo Wang, Enxin Qin, Chengjin Yu, Hanyu Xuan, Yuanting Yan

发表机构 * Anhui University（安徽大学）

AI总结提出两阶段PCB缺陷检测框架，通过结构引导混合掩码预训练学习PCB结构先验，并在微调阶段引入空间连续性正则化提升细长缺陷定位紧凑性，在DsPCBSD+数据集上达到85.5% mAP0.5。

Comments Preprint. 38 pages, 12 figures, 6 tables

详情

AI中文摘要

印刷电路板（PCB）缺陷检测是自动光学检测（AOI）的关键环节，但在实际应用中仍具挑战性，因为许多缺陷微小、低对比度且嵌入密集电路背景中。为解决这些问题，本文提出一种两阶段PCB缺陷检测框架，结合结构引导混合掩码预训练与空间连续性正则化。在预训练阶段，我们设计了一种稀疏卷积掩码预训练方案，利用无标签PCB图像，其中结构引导混合掩码用于构建信息丰富的掩码输入。稀疏卷积重建管道抑制掩码区域的无效响应，使检测器主干能够从可见导电模式推断缺失的PCB结构，从而学习PCB结构先验。在微调阶段，预训练主干被迁移到下游缺陷检测任务。针对该任务，在微调过程中引入空间连续性正则化项，该项约束分配给同一缺陷实例的分散正预测，并促进细长缺陷区域上更紧凑的定位。在DsPCBSD+数据集上的实验表明，所提方法达到85.5% mAP0.5和52.3% mAP0.5:0.95，优于多个强基线检测器。消融研究和定性结果进一步证实了所提框架在工业AOI场景中稳健PCB缺陷检测的有效性。

英文摘要

Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI); yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-phase PCB defect detection framework that combines structure-guided mixed masked pretraining with spatial continuity regularization. In the pretraining stage, we design a sparse convolutional masked pretraining scheme to exploit unlabeled PCB images, where structure-guided mixed masking is used to construct informative masked inputs. The sparse convolutional reconstruction pipeline suppresses invalid responses from masked regions and enables the detector backbone to infer missing PCB structures from visible conductive patterns, thereby learning PCB structural priors. In the fine-tuning stage, the pretrained backbone is transferred to the downstream defect detection task. For the task, a spatial continuity regularization term is introduced during fine-tuning. This term constrains dispersed positive predictions assigned to the same defect instance and promotes more compact localization on elongated defect regions. Experiments on the DsPCBSD+ dataset show that the proposed method achieves 85.5% mAP0.5 and 52.3% mAP0.5:0.95, outperforming several strong baseline detectors. Ablation studies and qualitative results further confirm the effectiveness of the proposed framework for robust PCB defect detection in industrial AOI scenarios.

2606.03506 2026-06-03 cs.CV cs.GR

AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization

AvatarMix: 保持身份特征的跨化身组合用于服装个性化

Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

发表机构 * University of Tsukuba（筑波大学）

AI总结提出AvatarMix方法，通过直接组合两个高保真高斯化身实现服装迁移，并采用SeamFix和FullbodyFix两级细化策略解决接缝伪影和身体重塑后的外观保真问题。

Comments CVPR 2026 Findings. 16 pages, including supplementary material

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 425-435

AI中文摘要

现有的3D化身服装迁移方法面临不同挑战：将2D编辑提升到3D的方法通常会导致服装或身份质量下降，而分别建模身体和服装层的方法则容易出现交叉伪影。我们提出AvatarMix，一种组合范式，通过直接组合两个高保真高斯化身的头部和身体来绕过这些问题。虽然这种范式固有地保留了服装质量并避免了交叉，但在创建无缝连接和保持身体重塑后的外观保真度方面带来了挑战。为此，我们提出两级细化策略：SeamFix，一个局部扩散模块，用于细化头发和颈部以确保无伪影连接；以及一个可选的全身细化模块FullbodyFix，当重定向导致穿衣身体退化时恢复服装外观。两者都在已经3D一致的高斯化身渲染上操作，与2D到3D提升相比，这限制了多视图伪影。为了保留用户的身体身份，我们的基于网格的高斯表示能够适配鲁棒的网格重定向技术，精确地将穿衣身体重塑为用户体型，并鲁棒地处理多样化的身体形状。大量实验表明，我们的方法在服装保真度和身份保持方面达到了最先进的结果，为逼真的3D服装个性化提供了新视角。项目页面：https://larsph.github.io/avatarmix/

英文摘要

Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user's body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user's physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: https://larsph.github.io/avatarmix/

2606.03499 2026-06-03 cs.CV

Characterizing Detectability in 3DGS Poisoning: A Stage-wise Benchmark

表征3DGS投毒中的可检测性：分阶段基准测试

Quoc-Anh Bui-Huynh, Thanh Duc Ngo, Xue Geng, Kaixin Xu, Wang Zhe, Xulei Yang, Ngai-Man Cheung

Affiliations * Temasek Laboratories, Singapore University of Technology and Design; Vietnam National University, Ho Chi Minh City; University of Information Technology, VNU-HCM; Agency for Science, Technology, and Research (A*STAR)

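AI summary: To address the vulnerability of 3DGS to diverse poisoning attacks, proposes Poison-3DGS, a stage-wise benchmark for systematically studying how detectability differs across pipeline stages. It finds that different attacks produce distinct forensic signals at different stages, and that later-stage signals (such as training dynamics and Gaussian parameter statistics) provide strong cues not observable at earlier stages.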

AI中文摘要

3D高斯泼溅（3DGS）已迅速成为实时新视角合成的主要表示方法，但近期研究表明它易受多种投毒攻击，包括虚幻物体注入、计算成本放大和事后模型水印。尽管威胁面不断扩大，现有研究主要关注攻击成功，而防御和检测仍探索不足。从检测角度看，3DGS重建流程的多阶段特性产生了异构的中间表示，这既是关键挑战也是机遇。检测投毒的取证信号本质上是阶段依赖的：在一个阶段引入的攻击可能仅在后续阶段产生信号。这促使我们采用超越单阶段评估的分阶段可检测性视角。我们引入Poison-3DGS，一个用于分阶段表征3DGS投毒检测的基准。它暴露了跨多种场景和攻击的阶段特定伪影，包括多视图图像、几何、训练动态和高斯参数。利用该基准，我们对流水线各阶段的可检测性进行了系统研究。分析揭示了若干见解。首先，可检测性在不同阶段间差异显著，且没有任何单一阶段在所有攻击类型中持续占优。其次，不同攻击表现出不同的阶段特定取证信号，因此检测有效性关键取决于信号在何处被观测到。第三，后期阶段的信号（如训练动态和高斯参数统计）提供了早期阶段不可观测的强线索。总体而言，我们的工作提供了一个原则性基准，并首次系统表征了3DGS中阶段依赖的可检测性，为未来研究鲁棒可靠的3DGS系统奠定了基础。

英文摘要

3D Gaussian Splatting (3DGS) has rapidly emerged as a leading representation for real-time novel view synthesis, but recent work shows it is vulnerable to diverse poisoning attacks, including illusory object injection, computation cost amplification, and post hoc model watermarking. Despite this expanding threat surface, existing studies focus mainly on attack success, while defense and detection remain underexplored. From a detection perspective, a key challenge and opportunity arise from the multi-stage nature of the 3DGS reconstruction pipeline, which produces heterogeneous intermediate representations. Forensic signals for detecting poisoning are inherently stage dependent: an attack introduced at one stage may produce signals that emerge only at later stages. This motivates a stage-wise view of detectability that goes beyond single-stage evaluation. We introduce Poison-3DGS, a benchmark for stage-wise characterization of poisoning detection in 3DGS. It exposes stage-specific artifacts, including multi-view images, geometry, training dynamics, and Gaussian parameters, across a diverse set of scenes and attacks. Using it, we conduct a systematic study of detectability across pipeline stages. Our analysis reveals several insights. First, detectability varies significantly across stages, and no single stage consistently dominates across attack types. Second, different attacks exhibit distinct stage-specific forensic signals, so detection effectiveness depends critically on where signals are observed. Third, later-stage signals such as training dynamics and Gaussian parameter statistics provide strong cues not observable at earlier stages. Overall, our work provides a principled benchmark and the first systematic characterization of stage-dependent detectability in 3DGS, offering a foundation for future research on robust and reliable 3DGS systems.

2606.03498 2026-06-03 cs.LG cs.DC

Demystifying Pipeline Parallelism: First Theory for PipeDream

揭秘流水线并行：PipeDream 的首个理论

Ivan Ilin, Peter Richtárik

Affiliations * KAUST (King Abdullah University of Science and Technology)

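AI summary: By introducing the Randomized PipeDream (RPD) abstraction, this paper gives the first nonconvex convergence guarantee for PipeDream-style methods, analyzes how the steady-state delay scales with the number of stages, and compares the method with LocalSGD.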

Comments 40 pages, 4 figures

AI中文摘要

训练现代机器学习模型越来越需要跨多个加速器进行分布式计算。数据并行仍然是默认选择，并且通常与张量并行分片相结合，但一旦参数、激活或优化器状态不再适合单个设备，模型并行就变得不可避免。本文通过 PipeDream (PD) (Harlap et al., 2018) 的视角研究流水线模型并行。我们的第一个贡献是理论性的：我们引入了随机化 PipeDream (RPD)，一种陈旧块-SGD 抽象，据我们所知，这为 PD 风格方法提供了第一个干净的非凸收敛保证。我们的第二个贡献是扩展诊断：我们证明了稳态 PD 引起的延迟随阶段数 S 增长为 $S^2 - S/2 + O(1)$，因此收敛定理中的陈旧读取贡献缩放为 $\Theta(\gamma^2 S^4)$，在调谐速率形式中等价于 $\Theta(S^4/K)$。我们的第三个贡献是与 LocalSGD 的比较，后者通过周期性模型平均来权衡权重陈旧性与同步气泡。在我们报告的模拟时间实验中，表现更好的方法取决于目标：PD 在二次目标和小型语言建模训练损失任务上表现更好，而对于逻辑回归，随着阶段数增加，LocalSGD 变得优越。

英文摘要

Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $\Theta(\gamma^2 S^4)$, equivalently as $\Theta(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.
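The scaling claim in the abstract can be checked numerically with a small sketch. It rests on two assumptions that the abstract only implies and does not state: that the stale-read term is proportional to the squared product of step size and delay, and that the tuned step size is proportional to $1/\sqrt{K}$ for $K$ iterations. The function names and the constant `c` are hypothetical.

```python
import math

def steady_state_delay(S):
    # Leading terms of the delay quoted in the abstract: S^2 - S/2.
    # The O(1) remainder is dropped, so this is only the asymptotic shape.
    return S * S - S / 2

def stale_term(S, gamma):
    # Assumption: stale-read contribution proportional to (gamma * delay)^2,
    # which reproduces the quoted Theta(gamma^2 S^4) growth.
    return (gamma * steady_state_delay(S)) ** 2

def tuned_stale_term(S, K, c=1.0):
    # Assumption: tuned step size gamma = c / sqrt(K), giving Theta(S^4 / K).
    return stale_term(S, c / math.sqrt(K))

# Doubling the number of stages should multiply the term by about 2^4 = 16.
stage_ratio = tuned_stale_term(2000, 10_000) / tuned_stale_term(1000, 10_000)

# Doubling the iteration budget K should halve the term.
budget_ratio = tuned_stale_term(1000, 20_000) / tuned_stale_term(1000, 10_000)
```

Under these assumptions the quartic dependence on $S$ dominates quickly: doubling the pipeline depth requires roughly sixteen times as many iterations to keep the stale-read term at the same level.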
