arXivDaily: arXiv Daily Digest, updated Monday through Friday

1. Robot Learning, Imitation, and Reinforcement Learning 11 papers

2606.17256 2026-06-17 cs.RO cs.CV New submission

Contrastive Action-Image Pre-training for Visuomotor Control

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu, Baifeng Shi, Stefano Saravalle, Boning Shao, Ruijie Zheng, Jing Wang, Konstantinos Kallidromitis, Yusuke Kato, Fabio Galasso, Yuke Zhu, Danfei Xu, Linxi "Jim" Fan, Jitendra Malik, Trevor Darrell, Roei Herzig

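Affiliations * UC Berkeley; NVIDIA; Sapienza University of Rome; Panasonic; ItalAI

AI summary Proposes CAIP, which uses 3D hand keypoints from large-scale egocentric video as a proxy action signal and learns a unified action-image representation through contrastive learning, substantially improving dexterous manipulation with only a small amount of robot data.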

Abstract

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
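
The abstract above says only that CAIP uses "a contrastive objective" over paired hand-keypoint actions and images. As a rough illustration, the sketch below assumes a CLIP-style symmetric InfoNCE loss; the function name `symmetric_info_nce`, the temperature value, and the toy 2-D embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(image_emb, action_emb, temperature=0.07):
    """Mean of image-to-action and action-to-image cross-entropy over a batch.

    Assumption: row i of image_emb is paired with row i of action_emb.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    acts = [l2_normalize(v) for v in action_emb]
    n = len(imgs)
    # logits[i][j] is the scaled cosine similarity of image i and action j.
    logits = [[sum(a * b for a, b in zip(imgs[i], acts[j])) / temperature
               for j in range(n)] for i in range(n)]
    loss_i2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_a2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2a + loss_a2i)


# Toy embeddings: matched pairs give a low loss, swapped pairs a high one.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a real encoder the two embedding batches would come from an image network and a keypoint network trained jointly; here they are fixed vectors so that only the loss computation is shown.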

2606.17408 2026-06-17 cs.RO cs.CV cs.LG New submission

Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

Meipo Dai, Qiyuan Zhuang, He-Yang Xu, Ying-Jie Shuai, Yijun Wang, Qi Dou, Xiu-Shen Wei

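Affiliations * Southeast University; The Chinese University of Hong Kong

AI summary Proposes LeaP, which uses a lightweight MLP to predict a proprioception-conditioned diagonal Gaussian as the source prior for action generation in place of the standard Gaussian, reaching an 81.6% average success rate on 15 RoboTwin tasks and outperforming baselines by 6.5 to 25.5 percentage points.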

Abstract

Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines (including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy) by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
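
To make the idea of a proprioception-conditioned diagonal Gaussian source concrete, here is a minimal sketch. It assumes a single hidden layer, random untrained weights, and illustrative sizes (7-D state, chunk of 8 actions with 7 dimensions each); the class name `SourcePrior` and every hyperparameter are hypothetical rather than the authors' implementation.

```python
import math
import random


class SourcePrior:
    """Hypothetical one-hidden-layer MLP mapping proprioception to (mean, log-variance)."""

    def __init__(self, state_dim, chunk_dim, hidden=16, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_hidden = init(hidden, state_dim)
        self.w_mean = init(chunk_dim, hidden)
        self.w_logvar = init(chunk_dim, hidden)

    @staticmethod
    def _matvec(weights, x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    def forward(self, state):
        h = [math.tanh(v) for v in self._matvec(self.w_hidden, state)]
        return self._matvec(self.w_mean, h), self._matvec(self.w_logvar, h)

    def sample(self, state, rng):
        mean, logvar = self.forward(state)
        # Reparameterized draw: x0 = mean + sigma * eps, with eps ~ N(0, I).
        return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                for m, lv in zip(mean, logvar)]


# Untrained weights and illustrative sizes only.
prior = SourcePrior(state_dim=7, chunk_dim=8 * 7)
x0 = prior.sample([0.1] * 7, random.Random(1))
mean_a, _ = prior.forward([0.1] * 7)
mean_b, _ = prior.forward([0.5] * 7)
```

The sample `x0` would replace the standard-normal draw that a flow-matching or diffusion generator normally starts from; the generator and solver themselves are untouched, which matches the abstract's description.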

2606.17493 2026-06-17 cs.RO New submission

When Robots Sleep: Offline Skill Consolidation for Shared-Policy Robot Learning

Nethmi Jayasinghe, Diana Gontero, Amit Ranjan Trivedi

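Affiliations * University of Illinois at Chicago

AI summary Proposes a wake-sleep framework that uses frozen skill memories and Nash-bargaining gradient combination to address skill-coupling collapse in multi-skill learning, improving average success and pairwise reliability on Meta-World MT5 and average success and backward transfer on SurgicAI.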

Abstract

Robots that learn over long deployments must add new skills without losing the shared policy structure that makes earlier skills reusable. We study sequential robot skill learning, where previous trajectories and task losses may be unavailable, and the deployed policy must remain a single shared controller without task-specific heads, routing, or adapters. We identify skill-coupling collapse, a failure mode in which individual skill success remains non-trivial while reliability among related skills deteriorates. We propose Sleeping Robots, a wake-sleep framework that learns each new skill during wake and consolidates the shared policy offline during sleep using compact frozen skill memories: frozen critics with unordered state buffers for reinforcement learning and frozen actor snapshots with unordered observation buffers for imitation learning. During sleep, these memories define differentiable surrogate objectives whose gradients are combined through Nash bargaining, with adaptive anchoring and local excitability for stable consolidation. On Meta-World MT5, Sleeping Robots improves average success by 64% and pairwise reliability by 2.0x over the strongest non-oracle baseline, and on SurgicAI it improves average success and backward transfer relative to continual imitation baselines while remaining competitive on pairwise reliability.

2606.17906 2026-06-17 cs.RO New submission

WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT

Zezhong Qian, Xiaowei Chi, Yu Qi, Haozhan Li, Zhi Yang Chen, Shanghang Zhang

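Affiliations * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Northeastern University; Tsinghua University

AI summary Proposes the WAM-RL framework, which jointly optimizes the world model and the action model through reinforcement learning, addressing the limited long-horizon gains of optimizing the action model alone; the authors describe it as the first work to bring reinforcement learning into the World-Action paradigm.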

Abstract

Recent World-Action (WA) models demonstrate strong generalization ability and data efficiency, but they typically rely on expert trajectories for training. This reliance limits their ability to acquire fine-grained manipulation skills beyond the demonstration distribution and prevents them from continuously improving through real-world interaction. To address these limitations, we propose WAM-RL, a reinforcement learning framework that enables joint optimization of the world model and the action model through online interaction with the environment. By allowing the two components to co-evolve, our approach enhances fine-grained control and adaptability. Specifically, a WA model consists of a world model and an actor. We design a tailored reinforcement learning method with hierarchical optimization to coordinate their improvement. On the methodological side, we systematically investigate the effects of applying reinforcement learning to the action model, as well as online training of the world model within an RL setting. Our experiments reveal a key insight: optimizing only the actor yields improvements on short-horizon tasks, but fails to provide significant gains on long-horizon tasks. In contrast, jointly optimizing both the world model and the actor is critical for achieving strong performance in long-horizon settings. Our work is the first to introduce reinforcement learning into the World-Action paradigm, and provides insights into how online optimization of both the action head and the world model impacts overall performance.

2606.18247 2026-06-17 cs.RO cs.AI New submission

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

Mingtong Zhang, Dhruv Shah

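Affiliations * Princeton University

AI summary Proposes the VERITAS framework, which uses a pre-trained generalist robot policy as the generator and pairs it with a gradient-free visual verifier that evaluates actions at inference time, enabling inference-time policy steering without additional training as well as offline policy improvement.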

Comments Website: https://veritas-improvement.github.io

Abstract

Robots deployed in the real world should learn from their experience and improve over time. This requires a mechanism of practicing and learning from feedback. In this paper, we propose VERITAS, a generator-verifier framework for generalist robot policies for inference-time policy steering and self-improvement. We use a pre-trained generalist robot policy as a "generator" and pair it with a gradient-free "visual verifier" that evaluates actions at inference time. This framework enables inference-time steering that improves policy performance without additional training. We demonstrate that inference-time verification consistently outperforms vanilla generalists without training on additional demonstration data. Additionally, we demonstrate that the verified rollouts provide effective supervision for offline policy improvement: policies fine-tuned on verified self-generated trajectories achieve consistent performance gains. Notably, we find that post-training with verified rollouts achieves comparable efficiency to expert demonstrations, while requiring no human interventions. Our results highlight inference-time verification as a practical and scalable mechanism for improving robotic policies during deployment.
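
The abstract does not say how verifier scores are turned into a steering decision. One common pattern for generator-verifier setups is best-of-N selection, sketched below under that assumption; `steer`, `toy_generator`, and `toy_verifier` are hypothetical stand-ins, and a real verifier would score visual observations rather than scalars.

```python
import random


def steer(generator, verifier, observation, num_candidates, rng):
    """Sample candidates from the generator and keep the one the verifier scores highest."""
    candidates = [generator(observation, rng) for _ in range(num_candidates)]
    scores = [verifier(observation, action) for action in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Toy stand-ins (hypothetical): a noisy 1-D "policy" and a scorer that
# prefers actions close to observation + 0.5.
def toy_generator(observation, rng):
    return observation + rng.gauss(0.0, 1.0)


def toy_verifier(observation, action):
    return -abs(action - (observation + 0.5))


# Same seed in both calls, so the single sample is also the first of the eight.
single_action, single_score = steer(toy_generator, toy_verifier, 0.0, 1, random.Random(0))
best_action, best_score = steer(toy_generator, toy_verifier, 0.0, 8, random.Random(0))
```

Neither function is trained or differentiated here, which is consistent with a gradient-free verifier: steering only requires forward evaluations.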

2506.17639 2026-06-17 cs.RO cs.AI Updated version

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

Yuxuan Chen, Yixin Han, Yize Huang, Xiao Li

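AI summary Proposes RLRC, a three-stage compression and recovery pipeline that combines structured pruning, recovery via SFT and reinforcement learning, and quantization, achieving up to an 8x memory reduction and a 2.3x inference speedup while maintaining task success rate.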

Comments 8 pages, 10 figures; accepted by RA-L 2026

Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 7, pp. 8864-8871, July 2026

Abstract

Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter sizes and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on these insights, we present RLRC, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via SFT and RL, and subsequent quantization. The RL stage incorporates a critic warm-up strategy and BC loss regularization to stabilize training and preserve policy behavior. RLRC achieves up to an 8x memory reduction and a 2.3x inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment. Project website: https://rlrc-vla.github.io

2604.00611 2026-06-17 cs.RO Updated version

Physical Imitation Learning: Distilling Control Policies into Passive Elasticity

Huyue Ma, Yurui Jin, Helmut Hauser, Rui Wu

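AI summary Proposes Physical Imitation Learning (PIL), which decomposes a reinforcement-learning control policy into active and passive parts and offloads the passive part to parallel elastic joints, substantially lowering energy consumption and offloading up to 95% of mechanical power on flat terrain in simulated quadrupeds.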

Abstract

Due to brain-body co-evolution, animals' intrinsic body dynamics play a crucial role in their energy-efficient locomotion. Specifically, the control effort is shared between active muscles and passive body dynamics, a principle often referred to as Physical Intelligence. As a result, the body dynamics are part of the solution. In contrast, robot bodies are typically designed to be as simple as possible, but the active control often fights the intrinsic body dynamics, resulting in low energy efficiency. We introduce Physical Imitation Learning (PIL), a novel approach that brings current robotics control closer to animals. PIL takes learned control policies obtained with Reinforcement Learning (RL) and systematically splits them up into an active and a passive control contribution. The passive part can then be directly offloaded to passive Parallel Elastic Joints (PEJs). As a result, the active control contribution is significantly reduced, lowering the overall energy consumption. Furthermore, the policy can be trained via RL to leverage the PEJ assistance by generating gaits that are more readily emulated by the PEJs. This enables co-design of the active and passive control components, shifting a greater share of actuation effort to the PEJs. Here we demonstrate the potential of this approach in simulated quadrupeds. Our results show that the proposed approach can offload up to 95% of mechanical power to passive body dynamics on flat terrain and 13% on rough terrain. PIL thereby provides a generalisable route to task-specific Physical Intelligence applicable to a wide range of joint-based robot morphologies.
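
The abstract does not specify how the policy is split or how offloaded power is measured. As a simplified illustration only, the sketch below assumes a single joint, a linear parallel spring fitted by least squares, and a power share computed from absolute torque-times-velocity sums; the function names and the synthetic trajectory are hypothetical.

```python
import math


def fit_parallel_spring(q, tau):
    """Least-squares fit of tau ~= -k * (q - q0); returns (k, q0)."""
    n = len(q)
    mean_q, mean_tau = sum(q) / n, sum(tau) / n
    cov = sum((a - mean_q) * (b - mean_tau) for a, b in zip(q, tau))
    var = sum((a - mean_q) ** 2 for a in q)
    slope = cov / var
    intercept = mean_tau - slope * mean_q
    k = -slope
    return k, intercept / k


def offloaded_power_fraction(q, qd, tau, k, q0):
    """Share of absolute mechanical power no longer supplied by the actuator."""
    # Active torque is what remains after subtracting the spring torque -k * (q - q0).
    active = [t + k * (a - q0) for t, a in zip(tau, q)]
    total = sum(abs(t * v) for t, v in zip(tau, qd))
    remaining = sum(abs(t * v) for t, v in zip(active, qd))
    return 1.0 - remaining / total


# Synthetic joint trajectory over one gait period (made-up numbers):
# a spring-like torque plus a small velocity-dependent term.
steps = 200
t = [2.0 * math.pi * i / steps for i in range(steps)]
q = [math.sin(v) for v in t]
qd = [math.cos(v) for v in t]
tau = [-20.0 * (a - 0.1) + 0.5 * v for a, v in zip(q, qd)]

k, q0 = fit_parallel_spring(q, tau)
fraction = offloaded_power_fraction(q, qd, tau, k, q0)
```

On this synthetic data the fit recovers the stiffness and rest angle used to generate it, and most of the power is attributed to the spring. The figures in the abstract come from the authors' quadruped simulations, not from a calculation like this one.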

2605.05172 2026-06-17 cs.RO cs.AI Updated version

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng

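Affiliations * Rai-Inst

AI summary Proposes the Q2RL algorithm, which extracts a Q-function from a behavior-cloning policy and uses Q-gating to switch between policies for efficient offline-to-online reinforcement learning, reaching success rates of up to 100% and up to a 3.75x improvement on robot manipulation tasks.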

Comments Robotics: Science and Systems, 2026

Abstract

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
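
The Q-Gating step described above, switching between BC and RL actions based on their Q-values, can be sketched in a few lines. The tie-breaking rule, the function name `q_gate`, and the toy Q-function below are assumptions for illustration and are not drawn from the paper's code.

```python
def q_gate(state, bc_policy, rl_policy, q_function):
    """Return the action with the higher estimated Q-value, plus its source."""
    bc_action = bc_policy(state)
    rl_action = rl_policy(state)
    if q_function(state, rl_action) > q_function(state, bc_action):
        return rl_action, "rl"
    return bc_action, "bc"  # ties fall back to the BC action (an assumption)


# Toy 1-D problem (hypothetical): the best action equals the state.
def toy_q(state, action):
    return -(action - state) ** 2


def bc_policy(state):
    return state + 0.2


def untrained_rl_policy(state):
    return state + 1.0


def improved_rl_policy(state):
    return state + 0.05


early = q_gate(1.0, bc_policy, untrained_rl_policy, toy_q)
late = q_gate(1.0, bc_policy, improved_rl_policy, toy_q)
```

Early in training the gate keeps the BC action; once the RL policy proposes something the Q-function rates higher, the gate hands over to it. That is one way to read the abstract's concern about replacing previously learned good actions.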

2606.14551 2026-06-17 cs.RO cs.AI Updated version

TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation

Zihao Li, Ranpeng Qiu, Yincong Chen, Guoqiang Ren, Weiming Zhi

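Affiliations * Zeno AI; Zhejiang University; Zhejiang University of Technology; The University of Sydney

AI summary Addresses observation ambiguity in visuomotor imitation caused by early cues that disappear, proposing the TRACE memory framework, which uses path signatures to store and retrieve task-relevant evidence and improves branch selection on long-horizon tasks.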

Abstract

Robots under autonomous operation may require decisions based on evidence that is no longer visible. We study delayed-evidence tasks, where an early cue disappears before a later decision point, so visually similar observations can require different actions. In these settings, the current observation is not a sufficient state for control. We introduce TRAjectory-routed Causal Evidence (TRACE), a memory framework for visuomotor imitation policies. TRACE stores task-relevant visual and robot-state evidence, such as object identity, target choice, or route-dependent state, in a fixed-size latent memory that remains bounded over long episodes. Instead of indexing memory by raw time or manually provided task labels, TRACE uses path signatures: compact, order-sensitive features of the executed robot-state trajectory. These signatures do not store the visual cue itself; rather, they provide trajectory-conditioned keys for writing and retrieving the evidence stored when the cue was visible. When the robot later reaches an ambiguous observation, the policy conditions on TRACE memory to recover the missing context and choose the correct branch. TRACE attaches through lightweight adapters to policies, without changing the policy backbone, action head, or imitation objective. Across real-world long-horizon manipulation tasks with visually ambiguous branch points, TRACE improves branch selection and task success over alternative baselines, including short-history and recurrent memory. Project page: https://jeong-zju.github.io/trace
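
To show why path signatures are "order-sensitive", the sketch below computes the standard depth-2 signature of a piecewise-linear path. The truncation depth, the 2-D toy paths, and the function name `signature_depth2` are illustrative assumptions; the abstract does not state which signature terms TRACE uses or how they are turned into memory keys.

```python
def signature_depth2(path):
    """Depth-2 path signature of a piecewise-linear path given as a list of points.

    Level 1 is the total displacement. Level 2 holds the iterated integrals
    S[i][j], accumulated segment by segment with Chen's identity.
    """
    dim = len(path[0])
    level1 = [0.0] * dim
    level2 = [[0.0] * dim for _ in range(dim)]
    for prev, cur in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        for i in range(dim):
            for j in range(dim):
                # Cross term with the path so far, plus the segment's own term.
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(dim):
            level1[i] += delta[i]
    return level1, level2


# Two robot-state paths with identical start and end points but different order.
right_then_up = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
up_then_right = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
sig_a = signature_depth2(right_then_up)
sig_b = signature_depth2(up_then_right)
```

Both paths have the same level-1 term, so a key built from net displacement alone could not tell them apart. Their level-2 terms differ, which is the property that lets a signature act as a trajectory-conditioned key.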

2606.15148 2026-06-17 cs.RO cs.AI Updated version

MimicIK: Real-Time Generative Inverse Kinematics from Teleoperation with FK Consistency

Jiahao Yang, Shenhao Yan, Fan Feng, Chengsi Yao, Ge Wang, Zhixin Mai, Yiming Zhao, Yatong Han

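Affiliations * Ising AI; CUHK-Shenzhen

AI summary Proposes the MimicIK framework, which uses conditional flow matching to learn smooth, robust joint-space motion priors from teleoperation data and solves inverse kinematics in real time through two-step iterative refinement and a forward-kinematics consistency loss, reaching a 4.65 mm mean position error and a 92.01% 10 mm success rate on a 6-DOF robot dataset.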

Abstract

Inverse kinematics (IK) remains a critical bottleneck for real-time robot manipulation. Classical numerical solvers achieve high geometric precision but often suffer from discontinuous branch switching and unstable behavior near kinematic singularities during closed-loop deployment. Meanwhile, learned IK approaches frequently struggle to balance spatial accuracy, motion smoothness, and real-time efficiency, particularly when trained on noisy human teleoperation data. We present MimicIK, a real-time generative inverse kinematics framework that learns smooth and robust joint-space motion priors from teleoperation demonstrations through conditional flow matching. Given the current joint configuration and a target end-effector pose, MimicIK predicts continuous delta-joint commands using an efficient two-step iterative refinement process based on a Minimal Iterative Policy (MIP) backbone. To enforce physical consistency, we further introduce an FK consistency loss, a differentiable forward-kinematics regularization that penalizes task-space deviations from the target pose during training. We evaluate MimicIK on a real-world 6-DOF robot dataset containing 8,848 teleoperation demonstrations. MimicIK achieves a mean position error of 4.65 mm, a 10 mm success rate of 92.01%, and a trajectory spike rate of only 7.99%. Compared with a UNet diffusion baseline, our method improves both spatial accuracy and motion smoothness while reducing inference latency from 21.66 ms to 6.74 ms. Furthermore, unlike deterministic MLP baselines that catastrophically diverge under out-of-distribution deployment, MimicIK remains stable near singular configurations and enables robust 20 Hz real-time control on deployment hardware.
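
The FK consistency loss penalizes task-space deviation between the pose reached by the predicted joint command and the target. The sketch below evaluates that idea for a planar two-link arm and position only. This is a deliberate simplification, since the paper's robot is 6-DOF and targets full poses, and the link lengths and function names are hypothetical.

```python
import math


def fk_planar(q, link_lengths=(0.3, 0.25)):
    """End-effector (x, y) of a planar serial arm with revolute joints."""
    x = y = angle = 0.0
    for joint_angle, length in zip(q, link_lengths):
        angle += joint_angle
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def fk_consistency_loss(q_current, delta_q, target_xy):
    """Squared task-space error after applying the predicted delta-joint command."""
    q_next = [a + b for a, b in zip(q_current, delta_q)]
    x, y = fk_planar(q_next)
    return (x - target_xy[0]) ** 2 + (y - target_xy[1]) ** 2


target = fk_planar([0.5, 0.3])
on_target = fk_consistency_loss([0.4, 0.2], [0.1, 0.1], target)
off_target = fk_consistency_loss([0.4, 0.2], [0.0, 0.0], target)
```

During training this term would be computed inside an autodiff framework so that gradients flow back into the network that predicts `delta_q`; only the forward value is shown here.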

2606.16917 2026-06-17 cs.RO Updated version

Unified Motion-Action Modeling for Heterogeneous Robot Learning

Yunhao Cao, Shitong Liu, Chao Feng, Meryl Zhang, Xuanchen Lu, Andrew Owens, Kuan Fang

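Affiliations * Cornell University

AI summary Proposes the UMA model, which uses 3D object motion trajectories as a shared interface and a masked generative objective to unify visuomotor control and dynamics modeling, enabling multi-task pretraining across heterogeneous data sources and supporting multiple inference modes at deployment.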

Comments https://uma-manipulation.github.io/

Abstract

We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.

2. Motion Planning, Control, and Dynamics 3 papers

2606.17317 2026-06-17 cs.RO cs.AI math.OC New submission

Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators

Yuji Takubo, Maximilian Adang, Mac Schwager, Simone D'Amico

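Affiliations * Stanford University

AI summary For real-time trajectory generation in the terminal approach of a space manipulator toward a tumbling target, proposes a causal-Transformer warm-start method that decomposes planning and warm-starts the attitude-manipulator torque-allocation stage, cutting iterations by up to 28% and runtime by 23% across 300 held-out scenarios while preserving the control-cost distribution.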

Comments 8 pages, 4 figures

Abstract

Real-time trajectory generation for on-orbit robotic servicing is challenging due to the nonlinear coupling between spacecraft bus motion, manipulator dynamics, visibility cone, and trajectory-level safety constraints. This paper studies learning-based warm-starting for sequential convex programming (SCP) in the terminal approach of a space manipulator toward a tumbling target. The proposed framework decomposes the problem into a system center-of-mass translational planning stage and a coupled attitude-manipulator torque-allocation stage, and applies a causal transformer warm-start to the latter, which constitutes the dominant computational bottleneck. Linear and flow matching action decoders are compared under different action-chunking and training dataset sizes, and the resulting warm-starts are evaluated under both cost-optimal and feasibility projection using SCP. Across 300 held-out scenarios, the learned warm-start reduces the second-stage SCP iteration count by up to 28% and the runtime by 23% while preserving the final control-cost distribution. When the learned warm-starts are used for nonconvex feasibility projection, they nearly halve the runtime relative to cost-optimal SCP, while avoiding the catastrophic high-cost tail behavior observed when initialized heuristically. These results indicate that sequence-model warm-starts can improve both the computational efficiency and trajectory robustness of optimization-based terminal guidance for space manipulation.

2606.17630 2026-06-17 cs.RO New submission

FLAP: FOV-Constrained Active Perception Planning for Prior-Map-Free 3D Navigation

Mengke Zhang, Sitong Li, Tiancheng Lai, Ruitian Pang, Mingxuan Zhang, Qingcheng Chen, Fei Gao, Chao Xu, Yanjun Cao

Affiliations * The State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University; Huzhou Institute, Zhejiang University; Huzhou Key Laboratory of Autonomous System; Shanghai Institute of Special Equipment Inspection and Technical Research Co., Ltd

AI summary Proposes a planning framework that integrates active perception directly into trajectory optimization, using FOV geometry constraints formulated in the sensor coordinate frame and a velocity-triggered activation mechanism to improve safety while preserving efficiency, and supporting arbitrary 3D maneuvers.

Comments 18 pages, 19 figures

Abstract

Safe and efficient trajectory planning in unknown, cluttered 3D environments constitutes a critical bottleneck for deploying Unmanned Aerial Vehicles (UAVs) in real-world applications. This challenge is further exacerbated by the limited field-of-view (FOV) and sensing range of onboard sensors. Many existing methods either make simplistic assumptions about unexplored space or rely on conservative heuristics such as speed limits or fixed perception patterns, reducing efficiency and generalizing poorly across different sensor types. In this work, we propose a novel planning framework that directly integrates active perception into trajectory optimization, thereby improving safety while preserving efficiency. The perception constraints are derived from the UAV's dynamic model and formulated in the sensor coordinate frame, which enables precise handling of FOV geometry. The velocity-triggered activation mechanism enables the planner to balance perception and motion efficiency. We introduce an active perception sub-trajectory segment with parametric start-time optimization, mitigating collision risks from late obstacle detection. Our formulation enables active perception during arbitrary 3D maneuvers, extending beyond prior methods designed mainly for horizontal motion. All constraints and penalties are incorporated into a differentiable optimization problem, so the planner requires only a simple front-end global path for guidance, rather than a computationally expensive perception-aware path generator. Extensive simulations and real-world experiments demonstrate robust performance across diverse unknown environments with varying sensor configurations.

2512.13009 2026-06-17 cs.RO Updated version

K-VARK: Kernelized Variance-Aware Residual Kalman Filter for Sensorless Force Estimation in Collaborative Robots

Oğuzhan Akbıyık, Naseem Alhousani, Fares J. Abu-Dakka

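AI summary Proposes K-VARK, which uses Kernelized Movement Primitives to learn the predictive mean and heteroscedastic variance of residual torques and adaptively adjusts the Kalman filter noise covariances, achieving sensorless force estimation on a 6-DoF collaborative manipulator with an RMSE reduction of over 20%.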

Abstract

Reliable estimation of contact forces is crucial for ensuring safe and precise interaction of robots with unstructured environments. However, accurate sensorless force estimation remains challenging due to inherent modeling errors and complex residual dynamics and friction. To address this challenge, in this paper, we propose K-VARK (Kernelized Variance-Aware Residual Kalman filter), a novel approach that integrates a kernelized, probabilistic model of joint residual torques into an adaptive Kalman filter framework. Through Kernelized Movement Primitives trained on optimized excitation trajectories, K-VARK captures both the predictive mean and input-dependent heteroscedastic variance of residual torques, reflecting data variability and distance-to-training effects. These statistics inform a variance-aware virtual measurement update by augmenting the measurement noise covariance, while the process noise covariance adapts online via variational Bayesian optimization to handle dynamic disturbances. Experimental validation on a 6-DoF collaborative manipulator demonstrates that K-VARK achieves over 20% reduction in RMSE compared to state-of-the-art sensorless force estimation methods, yielding robust and accurate external force/torque estimation suitable for advanced tasks such as polishing and assembly.
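
The variance-aware measurement update can be illustrated with a scalar Kalman filter. The sketch assumes a 1-D state observed directly (measurement matrix equal to 1) and additive inflation of the measurement noise by the model's predictive variance. The function name and all numbers are hypothetical, and the variational Bayesian adaptation of the process noise is not shown.

```python
def variance_aware_update(x_pred, p_pred, z, r_base, model_var):
    """Scalar Kalman measurement update with variance-inflated measurement noise.

    Assumption: the effective noise is r_base plus the learned model's
    predictive variance at the current input.
    """
    r_eff = r_base + model_var
    gain = p_pred / (p_pred + r_eff)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new, gain


# Same prediction and virtual measurement; only the model's confidence differs.
x_conf, p_conf, gain_conf = variance_aware_update(0.0, 1.0, 2.0, r_base=0.1, model_var=0.01)
x_unc, p_unc, gain_unc = variance_aware_update(0.0, 1.0, 2.0, r_base=0.1, model_var=10.0)
```

When the residual-torque model is far from its training data and reports a large variance, the gain shrinks and the estimate stays closer to the prediction. This is the qualitative behaviour the abstract attributes to distance-to-training effects.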

3. 操作、抓取与灵巧手 10 篇

2606.17309 2026-06-17 cs.RO 新提交

Abstention-Aware Personalized Object Rearrangement via Uncertainty-Guided LLM Assistance

基于不确定性引导的LLM辅助的弃权感知个性化物体重排

Sam Collin, Ali Ayub

发表机构 * Concordia University(康考迪亚大学)

AI总结 提出APOLLO框架,结合轻量级个性化嵌入模型与选择性大语言模型辅助,通过不确定性估计在模糊决策时调用LLM,实现高效、隐私保护的弃权感知物体重排。

Comments Accepted at the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

详情
AI中文摘要

家庭环境中的机器人辅助不仅需要预测物体应放置的位置,还需要推理何时不应放置物体。现有的个性化物体重排方法主要假设观测清晰且完全可操作,限制了其在现实、杂乱且部分错误环境中的适用性。本文提出APOLLO,一个用于弃权感知个性化物体重排的混合框架,结合了轻量级个性化嵌入模型(PEM)与选择性大语言模型(LLM)辅助。PEM针对每个用户-环境对使用少量演示进行训练,完全在CPU上运行,并产生不确定性估计,用于仅对模糊决策选择性调用基于LLM的推理,平衡效率、隐私和推理能力。为了在现有基准之外评估这一问题设定,我们引入了APOR,一个合成的、由LLM生成的数据集,捕捉房间级、多家具环境、多样化的组织配置文件、明确的弃权行为和嘈杂的部分场景上下文。在PARSEC和APOR上的大量实验初步表明,APOLLO在受控基准设置中优于先前基于LLM的基线,同时大幅减少LLM的使用。代码可在 https://github.com/PaInt-Lab/APOLLO 获取。

英文摘要

Robotic assistance in household environments requires not only predicting where objects should be placed, but also reasoning about when objects should not be placed at all. Existing approaches to personalized object rearrangement primarily focus on placement decisions under the assumption of clean observations and complete actionability, limiting their applicability in realistic, cluttered, and partially erroneous settings. In this paper, we introduce APOLLO, a hybrid framework for abstention-aware personalized object rearrangement that combines a lightweight, personalized embedding model (PEM) with selective large language model (LLM) assistance. PEM is trained for each user-environment pair using a small number of demonstrations, operates entirely on CPU, and produces uncertainty estimates, which are used to selectively invoke LLM-based reasoning only for ambiguous decisions, balancing efficiency, privacy, and reasoning capability. To evaluate this formulation beyond existing benchmarks, we introduce APOR, a synthetic, LLM-generated dataset that captures room-level, multi-furniture environments, diverse organizational profiles, explicit abstention behavior, and noisy partial scene context. Extensive experiments on both PARSEC and APOR provide initial evidence that APOLLO improves over prior LLM-based baselines in controlled benchmark settings while substantially reducing LLM usage. Code is available at https://github.com/PaInt-Lab/APOLLO.
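
The selective-invocation idea can be sketched as a simple gate: a cheap model answers when it is confident and defers to an LLM otherwise. The abstract does not say how uncertainty is measured, so the entropy criterion, the threshold, and the names `place_object` and `fake_llm` below are all assumptions for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def place_object(probs, labels, ask_llm, threshold=0.8):
    """Return (label, used_llm); defer to the LLM only when uncertain."""
    if entropy(probs) > threshold:
        return ask_llm(labels), True
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], False


labels = ["shelf", "drawer", "abstain"]


def fake_llm(options):
    """Hypothetical stand-in for an LLM call; always picks the last option."""
    return options[-1]


confident = place_object([0.90, 0.05, 0.05], labels, fake_llm)
ambiguous = place_object([0.34, 0.33, 0.33], labels, fake_llm)
```

Only the near-uniform distribution crosses the threshold, so the LLM is consulted for the ambiguous case alone.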

2606.17418 2026-06-17 cs.RO 新提交

DexLink Hand: A Compact, Affordable, 16-DOF Linkage-Driven Hand with Human-Like Dexterity

DexLink Hand:一款紧凑、经济、16自由度连杆驱动且具有类人灵巧性的手部

Hao Wu, Yanzhe Wang, Yu Feng, Jian Liu, Jihao Li, Jianshu Zhou, Huixu Dong

发表机构 * Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学)

AI总结 提出一种紧凑、低成本的连杆驱动仿人手,通过混合平面与空间连杆机构实现16个独立驱动、20个关节的高灵巧性,重320g、成本低于400美元,达到最大Kapandji评分并复现全部33种Feix抓取类型。

详情
AI中文摘要

灵巧机器人手在灵巧性、紧凑性和经济性之间长期面临权衡。特别是,高自由度设计通常需要复杂的驱动和传动,阻碍了其集成到人形尺寸中。为解决这些挑战,本文提出一种紧凑、低成本的连杆驱动仿人手,实现了高灵巧性、结构集成和类人手功能。该手集成了由16个独立驱动器驱动的20个关节,所有驱动、传感和传动组件紧凑地嵌入人手大小的结构中。最终原型仅重320g,总成本低于400美元。为实现这些目标,提出了一种结合平面和空间连杆机构的混合机械架构,实现了解耦的多向运动、仿生关节协同和高被动承载能力。拇指进一步采用了支持类人重构和对掌运动的仿生特征。通过这些机构和结构布局的协调集成,原型实现了具有仿生灵巧性的高度集成设计。实验评估表明,该手达到了最大Kapandji评分,复现了所有33种Feix抓取类型,并在多种日常物品和工具上实现了稳定抓取和灵巧操作。这些结果验证了所提出的手作为面向以人为中心环境中灵巧操作、遥操作和机器人学习的低成本、紧凑且机械高效的平台。

英文摘要

Dexterous robotic hands face a longstanding trade-off among dexterity, compactness, and affordability. Particularly, high-degree-of-freedom designs typically demand complex actuation and transmission, hindering integration into human-scale forms. To address these challenges, this work presents a compact, low-cost linkage-driven anthropomorphic hand that achieves high dexterity, structural integration, and human-hand-like functionality. The hand integrates 20 joints driven by 16 independent actuators, with all actuation, sensing, and transmission components compactly embedded within a human-hand-sized structure. The resulting prototype weighs only 320g at a total cost below USD 400. To meet these objectives, a hybrid mechanical architecture combining planar and spatial linkage mechanisms is proposed, enabling decoupled multidirectional motion, biomimetic joint synergies, and high passive load-bearing capability. The thumb further incorporates biomimetic features supporting human-like reconfiguration and opposition movements. Through the coordinated integration of these mechanisms and structural layout, the prototype achieves a highly integrated design with anthropomorphic dexterity. Experimental evaluations demonstrate that the hand achieves the maximum Kapandji score, reproduces all 33 Feix grasp types, and performs stable grasping and dexterous manipulation across a wide variety of daily objects and tools. These results validate the proposed hand as an affordable, compact, and mechanically efficient platform for dexterous manipulation, teleoperation, and robot learning in human-centered environments.

2606.17982 2026-06-17 cs.RO 新提交

LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation

LAGO策略:面向平滑操作的延迟感知异步扩散策略与目标导向无碰撞规划

Guowei Shi, Xupeng Xie, Yiming Luo, Jian Guo, Jun Ma, Boyu Zhou

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) International Digital Economy Academy(国际数字经济学院) The University of Hong Kong(香港大学) Southern University of Science and Technology(南方科技大学)

AI总结 提出LAGO策略,通过延迟感知条件引导和时空轨迹优化,解决异步扩散策略的间断和碰撞问题,实现平滑安全的操作。

Comments 8 pages, 8 figures

详情
AI中文摘要

基于扩散的视觉运动策略在异步推理部署时,常出现片段间不连续,且缺乏显式的障碍物感知机制,导致运动抖动和碰撞,阻碍了在真实场景中的可靠操作。为解决这些问题,我们提出LAGO策略,一个统一的异步动作生成框架,将轨迹优化与扩散策略相结合,实现平滑安全的执行。LAGO策略通过基于未来动作的延迟感知无分类器引导条件,提高了片段间一致性。它进一步通过从演示中预测任务相关的交互目标,实现目标导向的无碰撞轨迹规划。最后,时空轨迹优化细化待执行的动作,以实现低抖动和可行的运动。大量真实世界实验表明,LAGO策略在具有挑战性的操作任务中,实现了平滑无碰撞的执行和高任务成功率。项目网站:https://lago-policy.github.io/

英文摘要

Diffusion-based visuomotor policies deployed with asynchronous inference often exhibit inter-chunk discontinuities and lack explicit mechanisms for obstacle-aware execution, leading to jerky motions and collisions that hinder reliable manipulation in real-world scenes. To address these issues, we propose LAGO Policy, a unified asynchronous action-generation framework that integrates trajectory optimization with diffusion policy for smooth and safe execution. LAGO Policy improves inter-chunk consistency via latency-aware classifier-free guidance conditioning on future actions. It further enables goal-directed collision-free trajectory planning by predicting a task-relevant interaction goal from demonstrations. Finally, spatial-temporal trajectory optimization refines the actions to be executed for low-jerk and feasible motion. Extensive real-world experiments demonstrate that LAGO Policy achieves smooth collision-free execution with high task success across challenging manipulation tasks. Project Website: https://lago-policy.github.io/
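
For readers unfamiliar with classifier-free guidance, the sketch below shows one common form of the blend between an unconditional and a conditional denoiser output. It is a generic illustration, not LAGO Policy's exact formulation; the function name `guided_prediction` and the toy vectors are assumptions.

```python
def guided_prediction(uncond, cond, weight):
    """Blend unconditional and conditional denoiser outputs element-wise.

    weight = 0 returns the unconditional output, weight = 1 the conditional
    one, and weight > 1 extrapolates further toward the condition.
    """
    return [u + weight * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, 2.0]   # prediction without the future-action condition
cond = [1.0, 1.0, 0.0]     # prediction conditioned on pending actions
blended = guided_prediction(uncond, cond, weight=2.0)
```

In an asynchronous setting, the condition would plausibly carry the actions still queued from the previous chunk, so a stronger weight pulls the new chunk toward continuity with them.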

2606.18053 2026-06-17 cs.RO 新提交

A Hybrid Optimization Framework for Grasp Synthesis under Partial Observations

一种用于部分观测下抓取合成的混合优化框架

Wenzheng Zhang, Fahira Afzal Maken, Tin Lai, Fabio Ramos

发表机构 * School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) Data61, CSIRO(澳大利亚联邦科学与工业研究组织Data61) NVIDIA(英伟达)

AI总结 提出结合基于学习的能量模型与解析迭代最近点方法的混合框架,从部分观测点云生成鲁棒抓取,在67个物体5360次抓取尝试中平均成功率达60.9%,优于现有方法。

详情
Journal ref ICRA 2026
AI中文摘要

我们提出一种混合抓取合成框架,该框架将基于学习的能量模型(EBM)与解析迭代最近点(ICP)方法相结合,以从部分观测的点云生成鲁棒抓取。学习到的能量函数在Stein变分梯度下降(SVGD)框架中充当先验,指导抓取配置的迭代优化。在67个物体上的5360次抓取尝试评估中,我们的方法实现了60.9%的平均成功率,优于AnyGrasp(31.1%)、抓取姿态检测(48.4%)和AS-ICP(56.6%)。这些结果突显了我们方法的强泛化能力,并展示了将数据驱动学习与几何优化相结合如何解决单独使用任一策略的局限性。

英文摘要

We propose a hybrid grasp synthesis framework that combines a learning-based Energy-Based Model (EBM) with an analytical Iterative Closest Point (ICP) method to generate robust grasps from partially observed point clouds. The learned energy function acts as a prior within a Stein Variational Gradient Descent (SVGD) framework, guiding iterative refinement of grasp configurations. Evaluated on 67 objects with 5,360 grasp attempts, our method achieves an average success rate of 60.9%, outperforming AnyGrasp (31.1%), Grasp Pose Detection (48.4%), and AS-ICP (56.6%). These results highlight the strong generalization ability of our approach and demonstrate how combining data-driven learning with geometric optimization addresses the limitations of either strategy in isolation.
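
To make the SVGD refinement step more concrete, here is a minimal one-dimensional sketch in which particles (standing in for candidate grasp parameters) are driven by the gradient of a toy energy and kept apart by an RBF kernel. The quadratic energy, the fixed bandwidth, and the step size are assumptions for illustration; the paper's learned energy model and ICP term are not reproduced.

```python
import math


def svgd_step(particles, grad_log_p, bandwidth=1.0, step=0.1):
    """One SVGD update for scalar particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            attract = k * grad_log_p(xj)             # pulls toward low energy
            repel = k * (xi - xj) / bandwidth ** 2   # keeps particles spread out
            phi += attract + repel
        updated.append(xi + step * phi / n)
    return updated


def grad_log_p(x):
    """Toy energy E(x) = (x - 2)^2 / 2, so grad log p(x) = -(x - 2)."""
    return -(x - 2.0)


particles = [-3.0 + 0.1 * i for i in range(20)]   # deterministic initial guesses
for _ in range(1000):
    particles = svgd_step(particles, grad_log_p)

mean = sum(particles) / len(particles)
spread = max(particles) - min(particles)
```

The particles drift toward the low-energy region around 2 while the repulsive term keeps them from collapsing onto a single point, which is what lets SVGD return a diverse set of candidates rather than one mode.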

2606.18092 2026-06-17 cs.RO cs.AI 新提交

EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning

EAGG: 通过几何感知图条件实现具身对齐的抓取生成

Wanhao Niu, Qiyan Ke, Yuan Sun, Hao Sun, Jie Xu, Muyuan Ma, Ruiqi Hu, Fuchun Sun

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Beijing Moce Future Technology Co., Ltd.(北京墨策未来科技有限公司)

AI总结 提出EAGG,一种通过拓扑感知末端执行器图和几何感知令牌实现跨末端执行器抓取生成的统一模型,在MultiGripperGrasp基准上达到56.17%平均成功率,并显著降低接触距离。

Comments 16 pages, 8 figures. Code is available at https://github.com/wanhaoniu/EAGG

详情
AI中文摘要

跨末端执行器抓取生成旨在寻求一个统一的模型,能够泛化到不同物体以及从平行夹爪到灵巧末端执行器的不同具身形态。现有的抓取生成器通常针对固定具身设计,或使用静态描述符编码具身身份,当拓扑结构、驱动耦合和接触几何差异较大时,这会削弱迁移能力。我们提出EAGG,一种具身对齐的抓取生成器,通过拓扑感知的末端执行器图和具身特定的低维末端执行器控制空间来表示每个具身。一个冻结的末端执行器认知骨干将当前关节状态转换为几何感知令牌,作为可复用的形态先验,并通过迭代几何注入在采样过程中刷新这些令牌,使条件与不断演变的末端执行器几何保持同步。在MultiGripperGrasp基准上,EAGG在六个训练末端执行器上达到56.17%的平均成功率,与专门训练的差距在1.10个百分点以内,同时保持对微调和零样本末端执行器的迁移能力。迭代几何注入进一步将合并中位接触距离从0.239厘米降低到0.189厘米。这些结果表明,通过在共享生成器内对齐具身结构而非抑制具身差异,可以增强跨末端执行器抓取生成。代码可在 https://github.com/wanhaoniu/EAGG 获取。

英文摘要

Cross-end-effector grasp generation seeks a unified model that generalizes across objects and across embodiments ranging from parallel grippers to dexterous end effectors. Existing grasp generators are typically designed for a fixed embodiment or encode embodiment identity with a static descriptor, which weakens transfer when topology, actuation coupling, and contact geometry differ substantially. We present EAGG, an embodiment-aligned grasp generator that represents each embodiment with a topology-aware end-effector graph and an embodiment-specific low-dimensional end-effector control space. A frozen end-effector-cognition backbone converts the current articulated state into geometry-aware tokens that act as a reusable morphology prior, and iterative geometry injection refreshes these tokens throughout sampling so that conditioning remains synchronized with the evolving end-effector geometry. On the MultiGripperGrasp benchmark, EAGG reaches 56.17% average success across six training end effectors, remaining within 1.10 percentage points of specialized training while preserving transfer to finetuning and zero-shot end effectors. Iterative geometry injection further reduces the pooled median contact distance from 0.239 cm to 0.189 cm. These results show that cross-end-effector grasp generation is strengthened by aligning embodiment structure inside a shared generator rather than suppressing embodiment differences. Code is available at https://github.com/wanhaoniu/EAGG.

2606.17463 2026-06-17 cs.CV cs.RO 交叉投稿

WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation

WeaveLA: 面向重复机器人操作的基于事件驱动的跨子任务潜在记忆编织

Shoujing Zhu, Zhenyang Liu, Fungmiu Wang, Jiafeng Wang, Bo Yue, Guiliang Liu, Simo Wu, Xiangyang Xue, Taiping Zeng

发表机构 * Fudan University(复旦大学) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) Shanghai Innovation Institute(上海创新研究院) Shenzhen Loop Area Institute(深圳环域研究院)

AI总结 针对短窗口VLA策略缺乏跨子任务信息传递的问题,提出WeaveLA,通过事件触发将完成子任务压缩为潜在令牌并注入下一子任务的动作生成路径,在保持基础策略短窗口接口的同时实现轻量级跨子任务通道,在困难重复任务上成功率从0%提升至47.8%。

详情
AI中文摘要

视觉-语言-动作(VLA)策略已实现显著的单步操作,但在每个阶段依赖于刚刚完成的任务时仍然脆弱。核心问题是结构性的:短窗口VLA缺乏明确的跨子任务信息路由通道,而现有的记忆增强变体要么在每一帧写入,要么从演示阶段检索,要么在子目标事件触发时未执行显式的子任务到子任务交接给动作专家。我们将子目标完成事件识别为跨子任务记忆交接的自然时间单元,并提出WeaveLA(为视觉-语言-动作策略编织潜在记忆),这是一种跨子任务记忆接口,在冻结的VLA骨干之上,通过查询驱动的注意力池化将每个完成的段压缩为潜在令牌,并直接路由到下一子任务的动作生成路径。这种事件触发、动作侧的设计保留了基础策略的短窗口接口,同时添加了轻量级跨子任务通道。通过在RoboMME上使用$\pi_{0.5}$骨干进行分层评估,WeaveLA的增益恰好出现在需要该通道的地方:在最难的重复切片(SwingXtimes,$N{=}3$)上,成功率从$0\%$提升至$47.8\%$,而单次执行片段保持不变。每集配对分析证实增益仅限于因果结构需要跨子任务信息的任务。

英文摘要

Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy's short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $\pi_{0.5}$ backbone, WeaveLA's gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N{=}3$), success rises from $0\%$ to $47.8\%$, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
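
Query-driven attention pooling can be pictured as a small set of query vectors, each taking a softmax-weighted average over the features of a finished segment. The sketch below is a generic single-head version under that assumption; the function name `attention_pool`, the toy features, and the hand-picked queries are illustrative and not taken from WeaveLA.

```python
import math


def attention_pool(queries, keys, values):
    """Compress a sequence (keys/values) into one output token per query."""
    dim = len(keys[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)                          # subtract max for stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        token = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        pooled.append(token)
    return pooled


# Four frame features from a finished sub-task, pooled into two latent tokens.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
queries = [[0.0, 0.0], [5.0, 0.0]]   # hypothetical learned queries
latent_tokens = attention_pool(queries, frames, frames)
```

The all-zero query averages every frame equally, while the second query concentrates on frames whose first feature is active; either way the segment length no longer matters to whatever consumes the tokens.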

2606.18243 2026-06-17 cs.CV cs.GR cs.RO 交叉投稿

MOCHI: Motion Enhancement of Collaborative Human-object Interactions

MOCHI: 协作人-物交互的运动增强

Jiye Lee, Yonghun Choi, Jungdam Won

发表机构 * Department of Computer Science and Engineering, Seoul National University(首尔国立大学计算机科学与工程系)

AI总结 针对多人-物交互数据中手物接触错位、运动抖动和手指细节缺失等问题,提出两阶段框架MOCHI,先通过优化生成物理合理的手部抓取,再基于扩散模型优化全身运动,有效增强噪声数据。

Comments SIGGRAPH 2026 Journal (ACM TOG); Project page: https://jiyewise.github.io/projects/MOCHI/

详情
AI中文摘要

协作人-物交互展示了动态且复杂的运动,需要参与者与共享对象之间的相互预期和持续调整。对此类协作多人-物交互(MHOI)场景进行建模需要高质量的数据采集作为基础步骤;然而,由于MHOI中人与人、人与物交互同时发生的内在复杂性,这一步骤具有挑战性。这种复杂性导致MHOI捕获数据存在噪声,表现为多种伪影:手与物体之间的接触错位、捕获序列中的运动抖动和时间不一致性,以及缺失或不完整的手指级关节细节。为了解决这些挑战,我们提出了MOCHI(协作人-物交互的运动增强),一个用于增强噪声MHOI数据的两阶段框架。我们的方法首先通过从噪声身体输入进行优化生成物理合理的手部抓取,产生既物理合理又与身体姿态语义一致的抓取,然后将这些优化后的抓取扩展为完整的手-物交互序列。随后,所有参与者的全身运动通过一个基于扩散的噪声优化框架进行细化,该框架使用单人运动先验。在优化过程中,我们引入优化目标以在这些单人先验中编码人-物和人与人交互信息。实验结果表明,我们的流程在多种MHOI数据(无论是通过现有捕获方法获取还是由生成模型合成)上均有效。我们进一步展示了系统在不同参与者数量和交互类型下的鲁棒性,并演示了包括基于关键帧的MHOI创建和通过改变物体几何形状进行数据增强在内的多种应用。

英文摘要

Collaborative human-object interaction involves dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI, where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose; these optimized grasps are then extended into complete hand-object interaction sequences. Subsequently, the full-body motion of all participants is refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show the robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.

2509.00064 2026-06-17 cs.RO cs.CV 版本更新

OpenTie: Open-vocabulary Sequential Rebar Tying System

OpenTie: 开放词汇的连续钢筋绑扎系统

Sai Fan, Mingze Liu, Haozhen Li, Haobo Liang, Yixing Yuan, Yanke Wang

AI总结 提出OpenTie,一种无需训练的3D钢筋绑扎框架,通过RGB到点云生成和开放词汇检测实现高精度连续绑扎,优于基于YOLO的方法。

Comments This article is accepted by The 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

详情
AI中文摘要

建筑工地的机器人实践因其应对复杂挑战的能力而备受关注,尤其是在涉及钢筋的场景中。现有产品和研究主要集中于需要模型训练的大量数据收集。为填补这一空白,我们提出OpenTie,一种利用RGB到点云生成和开放词汇钢筋检测的3D无训练钢筋绑扎框架,并在真实世界测试中实现。我们通过带有双目摄像头的机械臂实现OpenTie,并通过将基于提示的目标检测方法应用于经我们提出的后处理流程过滤的图像(用于图像到点云生成框架),保证了高精度。我们的流程无需训练,且在真实连续钢筋绑扎测试中优于基于训练的目标检测(即基于YOLO的方法)。该系统灵活适用于水平和垂直钢筋绑扎任务,并具有在真实建筑工地应用和商业化的潜力。

英文摘要

Robotic practices on construction sites are attracting attention owing to their capability of tackling complex challenges, especially in rebar-involved scenarios. Most existing products and research focus on collecting large amounts of data to meet model-training demands. To fill this gap, we propose OpenTie, a 3D training-free rebar tying framework that uses RGB-to-point-cloud generation and open-vocabulary rebar detection, validated in real-world tests. We implement OpenTie on a robotic arm with a binocular camera and ensure high accuracy by applying a prompt-based object detection method to images filtered by our proposed post-processing procedure for the image-to-point-cloud generation framework. Our pipeline requires no training and outperforms training-based object detection, i.e., a YOLO-based method, as verified in a real-world sequential rebar tying test. The system is flexible for horizontal and vertical rebar tying tasks and has potential for application on real construction sites and for commercialization.

2606.03177 2026-06-17 cs.RO 版本更新

ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control

ConTrack: 具有自适应权衡控制的约束手部运动跟踪

Yutong Liang, Quanquan Peng, Ri-Zhao Qiu, Xiaolong Wang

发表机构 * University of California San Diego(加州大学圣地亚哥分校)

AI总结 提出一种基于强化学习的框架ConTrack,通过将物体跟踪视为约束并利用双变量更新自适应调整任务-风格权衡,同时结合自适应中轨迹重置库,实现长时域、接触密集的手部运动跟踪,在仿真和真实机器人上显著提升成功率和物体位姿精度。

详情
AI中文摘要

人类演示为机器人操作提供了强大的先验,但由于运动学差距,将其转移到真实机器人上执行并非易事。在灵巧操作中,即使在仿真器中跟踪长时域、接触密集的序列仍然具有挑战性:参考跟踪策略必须保持物体在其目标轨迹上,同时保留演示的关节运动和接触时序。现有方法通常依赖于需要针对每个序列进行调整的手工奖励调节,并且在有限的交互预算下会失效。我们提出了ConTrack,一种随跟踪数据扩展的强化学习(RL)框架。ConTrack将物体跟踪视为约束,并将剩余控制权限分配给运动保真度,从而通过双变量更新在线适应任务-风格权衡。此外,ConTrack还通过一个自适应中轨迹重置库来稳定长时域学习,该库重用策略可达的仿真器状态。我们在仿真跟踪和真实机器人上的定性和定量结果表明,ConTrack在保持关节和接触保真度的同时,显著提高了成功率和物体位姿精度,优于现有技术。网站:https://www.lyt0112.com/projects/ConTrack

英文摘要

Human demonstrations provide strong priors for robot manipulation, yet transferring them to real robots is non-trivial due to the kinematic gap. In dexterous manipulation, it remains challenging to track long-horizon, contact-rich sequences even in simulators: a reference-tracking policy must keep objects on their target trajectories while preserving demonstrated joint motion and contact timing. Existing approaches often rely on hand-crafted reward shaping that requires per-sequence tuning and breaks under limited interaction budgets. We introduce ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates remaining control authority to motion fidelity, which allows it to adapt task-style trade-offs online using a dual-variable update. In addition, ConTrack stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulation tracking and on a real robot demonstrate that ConTrack improves success and object pose accuracy significantly over prior art while preserving joint and contact fidelity. Website: https://www.lyt0112.com/projects/ConTrack.
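
A dual-variable update of the kind the abstract alludes to is often written as projected gradient ascent on a Lagrange multiplier. The sketch below shows that textbook form; the learning rate, the tolerance, the error values, and the names `dual_update` and `shaped_reward` are assumptions, and the paper's actual update rule may differ.

```python
def dual_update(lam, tracking_error, tolerance, lr=0.5):
    """Projected gradient ascent on the multiplier of the tracking constraint."""
    return max(0.0, lam + lr * (tracking_error - tolerance))


def shaped_reward(style_reward, tracking_error, lam):
    """Lagrangian-style reward: motion fidelity minus weighted constraint cost."""
    return style_reward - lam * tracking_error


lam = 0.0
history = []
for err in [0.30, 0.25, 0.10, 0.02, 0.01]:   # made-up per-iteration object errors
    lam = dual_update(lam, err, tolerance=0.05)
    history.append(lam)
```

The multiplier grows while the object-tracking error exceeds the tolerance and decays once the constraint is met, so the weight on tracking adapts automatically instead of being tuned per sequence.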

2606.09337 2026-06-17 cs.RO 版本更新

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

TORL-VLA:触觉引导的在线强化学习用于接触丰富操作

Huaihang Zheng, Yi Yang, Kai Ma, Shenglin Xu, Tian Xie, Guozheng Li, Xiangyu Wang, Yiren Ma, Si Liu, Yinian Mao, Baoxu Liu

发表机构 * Meituan(美团) Beijing Institute of Technology(北京理工大学) Beihang University(北京航空航天大学) State Key Lab of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS(中国科学院自动化研究所多模态人工智能系统国家重点实验室) China University of Mining and Technology (Beijing)(中国矿业大学(北京))

AI总结 提出TORL-VLA框架,结合触觉反馈与在线强化学习,通过触觉导出的力矩感知VLA预测参考动作,并利用轻量在线RL模块优化动作,解决接触条件变化时的策略适应问题,在长时接触任务中提升成功率和执行效率。

Comments Project page: https://torl-vla.github.io/

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为机器人操作的有力框架,最近的研究将触觉或力反馈引入VLA以处理接触丰富的任务。然而,这些模型通常作为离线策略部署。当接触条件偏离训练分布时,策略无法进行在线适应,导致接触力不当和重试效率低下等问题。因此,我们提出TORL-VLA,一种触觉引导的在线强化学习框架,将触觉反馈与策略优化相结合用于接触丰富操作。我们的方法引入了一个触觉导出的力矩感知VLA来预测参考动作和未来的力矩序列,同时使用轻量级在线RL模块来优化参考动作。为了稳定地从混合的探索性策略生成和人工干预数据中学习,我们引入了一个干预审查评论家,防止干预后的成功被错误地归因于干预前的策略生成动作。在包括门闩操作、咖啡杯放置和鸡蛋处理等长时接触丰富任务上的真实机器人实验表明,TORL-VLA在子任务和完整任务级别上提高了成功率,并在时间约束的执行效率上优于强基线。

英文摘要

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines. Project page: https://torl-vla.github.io/

4. 导航、定位与SLAM 6 篇

2606.17183 2026-06-17 cs.RO 新提交

VL-MemKnG: Hybrid Memory with a Spatio-Temporal Knowledge Graph for Question Answering over Long Egocentric Navigation Trajectories

VL-MemKnG:结合时空知识图谱的混合记忆用于长程自我中心导航轨迹问答

Svetlana Lukina, Mohamad Al Mdfaa, Gloria Haro, Sergey Zagoruyko, Gonzalo Ferrer

发表机构 * Mobile Robotics Laboratory, Artificial Intelligence Center(移动机器人实验室,人工智能中心) Skoltech(斯科尔科沃科学技术学院) Intelligent Multimodal Vision Analysis Group, Department of Engineering, Universitat Pompeu Fabra(智能多模态视觉分析组,工程系,庞培法布拉大学) Independent Researcher(独立研究员)

AI总结 提出VL-MemKnG混合记忆框架,结合时空知识图谱与片段级上下文记忆,通过混合检索推理模块提升长程自我中心视频导航问答的准确性和效率。

详情
AI中文摘要

回答长程自我中心视频中的导航相关问题需要检索和组织分布在遥远时间瞬间的证据,同时保持空间和上下文一致性。尽管长上下文视觉-语言模型能够实现强大的答案质量,但对于长轨迹而言计算成本高昂,且对于重复查询效率低下。最近基于图的方法(如VL-KnG)通过持久化时空知识图谱解决了这一挑战,但仅依赖图检索可能不足以表达更广泛的时间连续性和上下文线索。我们提出了VL-MemKnG,一种混合记忆框架,它扩展了VL-KnG,将时空知识图谱与持久化片段级上下文记忆相结合。知识图谱捕获结构化关系信息和长程对象关联,而片段级记忆则保留更广泛的时间上下文以进行长程证据检索。混合检索与推理模块联合操作于两种记忆表示之上,生成基于证据的答案和时间上组织的支持证据。我们还引入了WalkieKnowledgeT+,这是WalkieKnowledge的扩展,用于长程导航导向的视频问答。该基准包括需要跨多个非共现时刻进行证据聚合的时间分布式推理任务。在WalkieKnowledgeT+上,VL-MemKnG将Top-1检索准确率从58%提升至67%,Recall@1从34.50%提升至40.55%,优于所有对比方法,包括Gemini 2.5 Pro和Qwen 3.5+。在时间全局和时间分散聚合问题上提升尤为显著,证明了将结构化关系记忆与片段级上下文记忆相结合的优势,同时保持高效的查询时推理。

英文摘要

Answering navigation-relevant questions over long egocentric videos requires retrieving and organizing evidence distributed across distant temporal moments while maintaining spatial and contextual consistency. Although long-context vision-language models can achieve strong answer quality, they are computationally expensive for long trajectories and inefficient for repeated querying. Recent graph-based approaches such as VL-KnG address this challenge through persistent spatio-temporal knowledge graphs, but graph-centric retrieval alone may underrepresent broader temporal continuity and contextual cues. We present VL-MemKnG, a hybrid memory framework that extends VL-KnG by combining a spatio-temporal knowledge graph with persistent segment-level contextual memory. The knowledge graph captures structured relational information and long-range object associations, while segment-level memory preserves broader temporal context for long-horizon evidence retrieval. A hybrid retrieval-and-reasoning module jointly operates over both memory representations to produce evidence-grounded answers and temporally organized supporting evidence. We also introduce WalkieKnowledgeT+, an extension of WalkieKnowledge for long-horizon navigation-oriented video question answering. The benchmark includes temporally distributed reasoning tasks requiring evidence aggregation across multiple non-cooccurring moments. On WalkieKnowledgeT+, VL-MemKnG improves Top-1 retrieval accuracy from 58% to 67% and Recall@1 from 34.50% to 40.55%, outperforming all compared methods, including Gemini 2.5 Pro and Qwen 3.5+. The gains are particularly pronounced on temporal-global and temporally scattered aggregation questions, demonstrating the benefits of combining structured relational memory with segment-level contextual memory while maintaining efficient query-time inference.

2606.17294 2026-06-17 cs.RO cs.LG 新提交

VISTA: Scale-Aware Visual Navigation via Action History Conditioning

VISTA:通过动作历史条件实现尺度感知的视觉导航

Maeva Guerrier, Koki Kobayashi, Simon Roy, Jana Pavlasek, Giovanni Beltrame

发表机构 * Polytechnique Montreal(蒙特利尔理工学院) MILA(MILA研究所) Institute of Science Tokyo(东京科学大学) CoRA Lab(CoRA实验室) Mist Lab(Mist实验室)

AI总结 针对视觉导航基础模型因动作归一化导致的尺度脆弱性,提出通过动作历史条件化提供物理位移上下文,并集成DINOv3编码器增强重复环境中的特征表示,实现零样本跨环境部署。

详情
AI中文摘要

视觉导航基础模型(VNMs)承诺能够实现端到端的学习导航策略,并能在不同实体和环境之间进行零样本部署。为了保持通用性,许多基于视觉的导航模型预测归一化动作。然而,这种归一化引入了一个关键的部署漏洞:对相同的归一化轨迹应用不同的缩放因子会改变其物理几何形状,从而降低导航性能并增加碰撞风险。我们通过将模型条件化于归一化动作历史以及图像观测来解决这一漏洞,为模型预测与机器人实际物理位移之间的关系提供显式上下文。此外,当前的VNMs在缺乏显著特征的视觉重复环境中常常表现不佳。为解决此问题,我们集成了DINOv3编码器,其更丰富的表示使我们的模型能够捕获观测之间的空间和几何维度。VISTA能够鲁棒地泛化到分布外环境,在户外、森林和办公室环境的零样本真实世界部署中实现了100%的目标预测准确率,平均95%的检查点被穿越,展示了在未见环境中的一致路径跟随能力。

英文摘要

Vision Navigation Foundation Models (VNMs) promise end-to-end learned navigation policies capable of zero-shot deployment across diverse embodiments and environments. To maintain generality, many vision-based navigation models predict normalized actions. However, this normalization introduces a critical deployment vulnerability: applying different scaling factors to the same normalized trajectory alters its physical geometry, which degrades navigation performance and increases collision risks. We address this vulnerability by conditioning the model on normalized action histories alongside image observations, providing explicit context on the relationship between the model's predictions and the robot's actual physical displacement. Furthermore, current VNMs often struggle in visually repetitive environments that lack distinct features. To resolve this issue, we integrate a DINOv3 encoder, whose richer representations enable our model to capture both spatial and geometric dimensions between observations. VISTA generalizes robustly to out-of-distribution environments, achieving 100% goal prediction accuracy in zero-shot, real-world deployment in Outdoor, Forest and Office settings, and an average of 95% checkpoints crossed, demonstrating consistent path following in unseen environments.
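
The scale vulnerability described above is easy to see numerically: the same normalized waypoints map to very different physical paths depending on the scale factor applied at deployment. The helper names and scale values below are hypothetical and only illustrate the effect.

```python
import math


def to_metric(normalized_waypoints, scale):
    """Convert normalized (x, y) waypoints to metres with a scale factor."""
    return [(scale * x, scale * y) for x, y in normalized_waypoints]


def path_length(waypoints):
    """Total length of the polyline that starts at the origin."""
    points = [(0.0, 0.0)] + list(waypoints)
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


normalized = [(0.5, 0.0), (1.0, 0.5)]           # one predicted trajectory
slow_robot = to_metric(normalized, scale=0.2)   # hypothetical scale factors
fast_robot = to_metric(normalized, scale=1.0)
ratio = path_length(fast_robot) / path_length(slow_robot)
```

A model that sees only images cannot tell which of these two physical paths its output will become, which is the gap that conditioning on past normalized actions is meant to close.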

2606.17534 2026-06-17 cs.RO 新提交

RICH-SLAM: Radar SLAM with Incremental and Continuous Hilbert Mapping

RICH-SLAM:基于增量连续希尔伯特映射的雷达SLAM

Bingbing Zhang, Huan Yin, Yang Xu, Shuo Liu, Shaojie Shen, Fumin Zhang, Wen Xu

发表机构 * State Key Laboratory of Ocean Sensing, Zhejiang University(浙江大学海洋传感国家重点实验室) Interdisciplinary Student Training Platform for Marine areas, Zhejiang University(浙江大学海洋交叉学科学生培养平台) Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology(香港科技大学电子与计算机工程系) School of AI and Robotics, Hunan University(湖南大学人工智能与机器人学院) Ocean College, Zhejiang University(浙江大学海洋学院) Institute of Deep-Sea Science and Engineering, Chinese Academy of Sciences(中国科学院深海科学与工程研究所)

AI总结 提出RICH-SLAM框架,采用Rao-Blackwellized粒子滤波后端和增量希尔伯特空间降秩高斯过程映射,从稀疏雷达测量中构建连续占用地图,并支持不确定性感知规划。

Comments 12 figures

详情
AI中文摘要

由于雷达对恶劣天气和光照条件具有固有的鲁棒性,使用雷达传感器进行同步定位与地图构建(SLAM)越来越受到关注。然而,与激光雷达和视觉数据相比,雷达测量具有稀疏和噪声大的特点,这给实现密集、连续且一致的地图表示带来了重大挑战。在本文中,我们提出了RICH-SLAM,一个旨在解决这些挑战的雷达SLAM框架。我们的方法采用基于Rao-Blackwellized粒子滤波的后端,使用粒子滤波进行位姿估计,卡尔曼滤波进行地图更新。我们提出了一种增量希尔伯特空间降秩高斯过程映射策略,能够在给定稀疏雷达输入的情况下实现连续且具有不确定性感知的地图表示。我们进一步引入了一种后验感知的粒子加权方案,利用地图参数的完整后验分布进行更鲁棒的似然评估。在自采集和公共ColoRadar数据集上的实验表明,RICH-SLAM能够从稀疏雷达测量中构建连续占用地图,并支持移动机器人的不确定性感知规划。

英文摘要

Simultaneous localization and mapping using radar sensors has gained increasing attention due to radar's inherent robustness to adverse weather and lighting conditions. However, radar measurements are characteristically sparse and noisy compared to LiDAR and visual data, posing significant challenges in achieving dense, continuous, and consistent map representations. In this paper, we present RICH-SLAM, a radar SLAM framework designed to address these challenges. Our approach features a Rao-Blackwellized particle filter-based back end that employs particle filtering for pose estimation and Kalman filtering for map updates. We propose an incremental Hilbert-space reduced-rank Gaussian process mapping strategy that enables continuous and uncertainty-aware map representations given sparse radar inputs. We further introduce a posterior-aware particle weighting scheme that leverages the full posterior distribution of map parameters for more robust likelihood evaluation. Experiments on self-collected and public ColoRadar datasets show that RICH-SLAM constructs continuous occupancy maps from sparse radar measurements and supports uncertainty-aware planning for mobile robots.
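
The combination of a Hilbert-space reduced-rank Gaussian process with Kalman-style map updates can be sketched in one dimension: the field is a weighted sum of fixed sinusoidal basis functions, and each new measurement updates the weight posterior recursively. This is a simplified regression example under stated assumptions (squared-exponential kernel, scalar input, the class name `IncrementalHilbertGP`, made-up hyperparameters); it is not the RICH-SLAM occupancy model or its particle-filter back end.

```python
import math


class IncrementalHilbertGP:
    """1-D reduced-rank GP regression updated one measurement at a time."""

    def __init__(self, half_width=4.0, n_basis=16, length_scale=1.0,
                 signal_var=1.0, noise_var=0.01):
        self.L, self.m, self.r = half_width, n_basis, noise_var
        self.mean = [0.0] * n_basis
        # Prior weight variances from the squared-exponential spectral density.
        self.cov = [[0.0] * n_basis for _ in range(n_basis)]
        for j in range(n_basis):
            omega = math.pi * (j + 1) / (2.0 * half_width)
            self.cov[j][j] = (signal_var * math.sqrt(2.0 * math.pi) * length_scale
                              * math.exp(-0.5 * (length_scale * omega) ** 2))

    def features(self, x):
        """Sinusoidal eigenfunctions of the Laplacian on [-L, L]."""
        return [math.sin(math.pi * (j + 1) * (x + self.L) / (2.0 * self.L))
                / math.sqrt(self.L) for j in range(self.m)]

    def update(self, x, y):
        """Kalman-style update of the weight posterior with one sample."""
        phi = self.features(x)
        p_phi = [sum(row[k] * phi[k] for k in range(self.m)) for row in self.cov]
        s = sum(phi[k] * p_phi[k] for k in range(self.m)) + self.r
        gain = [v / s for v in p_phi]
        resid = y - sum(phi[k] * self.mean[k] for k in range(self.m))
        self.mean = [mu + g * resid for mu, g in zip(self.mean, gain)]
        for a in range(self.m):
            for b in range(self.m):
                self.cov[a][b] -= gain[a] * p_phi[b]

    def predict(self, x):
        """Posterior mean and variance of the latent function at x."""
        phi = self.features(x)
        mean = sum(phi[k] * self.mean[k] for k in range(self.m))
        p_phi = [sum(row[k] * phi[k] for k in range(self.m)) for row in self.cov]
        return mean, sum(phi[k] * p_phi[k] for k in range(self.m))


gp = IncrementalHilbertGP()
for i in range(41):                     # sparse samples of a toy field
    x = -2.0 + 0.1 * i
    gp.update(x, math.sin(x))
near_mean, near_var = gp.predict(0.5)   # inside the observed region
_, far_var = gp.predict(3.5)            # far from any measurement
```

Because the basis is fixed, each update costs the same regardless of how many measurements came before, and the predictive variance stays larger away from the data, which is the kind of signal an uncertainty-aware planner could use.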

2606.18112 2026-06-17 cs.RO cs.CV 新提交

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Qwen-RobotNav 技术报告:为智能体导航系统设计的可扩展导航模型

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出 Qwen-RobotNav 可扩展导航模型,通过参数化接口支持多种任务模式和可调观测参数,在15.6M样本上训练,联合视觉语言数据防止行为坍缩,在多个导航基准上取得新最优结果,并展示零样本泛化能力。

详情
AI中文摘要

智能体导航系统需要一个基础导航模型,其观测策略可以在推理时从外部重新配置,因为指令跟随、目标搜索、目标跟踪和自动驾驶共享相同的感知规划主干,但对视觉流的消费方式有根本不同的要求。我们提出 Qwen-RobotNav,一个建立在 Qwen-RobotNav 上的可扩展导航模型,通过一个具有两个互补维度的参数化接口来解决这个问题:多个任务模式选择导航行为,以及可控的观测参数(例如,token 预算、每个摄像头的权重)控制视觉历史的编码方式。通过训练时对所有参数进行随机化,Qwen-RobotNav 对任何推理时配置都具有鲁棒性,无需对 Qwen-RobotNav 主干进行任何架构修改。我们在15.6M样本上训练 Qwen-RobotNav;与视觉语言数据联合训练防止了在仅轨迹训练中观察到的反应性动作序列映射器的坍缩。参数化接口也使 Qwen-RobotNav 成为智能体系统的自然构建块:对于长时域场景,上层规划器将目标分解为子任务,并在情节中动态切换 Qwen-RobotNav 的任务模式和上下文策略,通过重复调用同一模型组合出复杂行为。大量实验表明,Qwen-RobotNav 在主要导航基准上取得了新的最优结果。该模型从2B到8B参数展现出良好的扩展性,联合多任务训练发展出一个跨任务族迁移的共享空间规划基板,并在多样环境中对真实世界机器人展现出强大的零样本泛化能力。

英文摘要

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

2606.17897 2026-06-17 cs.AI cs.RO 交叉投稿

Learn to Quantify Social Interaction with Constraints for Pedestrian Walking

学习量化带约束的行人行走社交互动

Xiaodan Shi

发表机构 * Department of Computer and Systems Sciences, Stockholm University(斯德哥尔摩大学计算机与系统科学系)

AI总结 提出Learn to Cluster方法,通过概率潜变量生成模型从轨迹观测中无监督学习社交互动模式,并有效集成到行人轨迹预测中,提升预测鲁棒性。

详情
AI中文摘要

人群中的长期行人路径预测对于自主移动平台(如自动驾驶汽车和社交机器人)避免碰撞并做出高质量规划至关重要。尽管当前研究考虑了社交互动进行预测,但它们并未揭示人与人之间发生的具体社交互动类型以及社交互动如何影响行人的决策过程,这进一步限制了其鲁棒性。行人行走中的社交互动直观上大量存在且难以标注和量化。在本文中,我们通过提出Learn to Cluster创造性地探索量化和解释行人如何与他人互动。我们的聚类社交互动是概率潜变量生成模型,直接从序列轨迹观测中学习,可扩展到任意数量的行人。Learn to Cluster无需标签,可以自然地集成到预测模型的训练过程中。潜变量随后将作为“标签”对社交互动进行分类。在多个轨迹预测基准上的大量实验表明,我们的方法能够学习社交互动的模式,并将这些模式有效集成到行人轨迹预测中。

英文摘要

Long-term human path forecasting in crowds is critical for autonomous moving platforms (such as self-driving cars and social robots) to avoid collisions and plan well. Although current research takes social interactions into account for prediction, it does not reveal the exact kinds of social interactions that happen among people or how those interactions affect pedestrians' decision-making, which limits its robustness. Social interactions in pedestrian walking are intuitively numerous and hard to label and quantify. In this paper, we explore how to quantify and interpret the way pedestrians interact with others by proposing Learn to Cluster. Our social-interaction clustering is a probabilistic latent-variable generative model that learns directly from sequential trajectory observations and scales to an arbitrary number of pedestrians. Learn to Cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables then serve as 'labels' to categorize social interactions. Extensive experiments on several trajectory prediction benchmarks demonstrate that our method learns the patterns of social interactions and effectively integrates them into pedestrian trajectory prediction.

2606.03609 2026-06-17 cs.RO cs.LG 版本更新

A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

3D 等视域世界模型——揭示城市不可见几何及其涌现的跨城市特征

Xuhui Lin, Stephen Law, Nanjiang Chen, Kunyao Li, Tao Yang

发表机构 * The Bartlett School of Sustainable Construction, University College London, UK(英国伦敦大学学院巴特莱特可持续建设学院) Department of Geography, University College London, UK(英国伦敦大学学院地理系) School of Project Management, Faculty of Engineering, The University of Sydney, AU(澳大利亚悉尼大学工程学院项目管理学院) School of Engineering, Cardiff University, UK(英国卡迪夫大学工程学院) School of Architecture, Tsinghua University, Beijing, CN(中国北京清华大学建筑学院)

AI总结 提出一种预测3D等视域(球形可见性深度图)的具身世界模型,通过深度残差和自滚动调度采样训练,发现跨城市空间特征可从时间潜变量中线性解码。

详情
AI中文摘要

在城市中导航的具身智能体依赖于世界模型来预测其移动时周围环境的变化。但对于导航而言,重要的不是建筑物的外观,而是智能体可以到达的位置。尽管如此,大多数世界模型仍然预测外观,学习场景的外观而非智能体可穿行的空间。那些确实针对几何的模型,如鸟瞰占用网格,将三维环境压缩到地面平面,忽略了塑造真实导航的地上和多层结构。目前缺少的是一个能够捕捉智能体实际穿行的可导航几何的预测目标,既不受光度信息干扰,也不丢失第三维度。我们的核心思想是对建筑物之间的开放体积(负空间)进行建模,编码为3D等视域:一个球形可见性深度图,记录每个方向上到最近表面的距离。我们引入了一个具身世界模型,根据过去短时间内的等视域历史和运动动作预测下一个等视域。预测被公式化为深度残差,使解码器继承锐利的建筑边缘,通过自滚动调度采样进行训练以保持几何流形上的上下文,并配备持久潜鸟瞰空间图以实现跨路径一致性。我们的核心发现是涌现且出乎意料的:一个在曼哈顿和巴黎上训练的单一城市盲模型发展出了跨城市空间特征,其城市身份可从时间潜变量中线性解码,远高于单帧基线,因此该特征存在于学习到的动力学中而非外观中。该表示轻量、可解释且可复现,为具身AI、机器人和城市分析中的空间推理提供了几何基础,并随附开放数据集和流程发布。

英文摘要

Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.
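
To make the 3D isovist concrete, the sketch below computes a coarse spherical depth map for an agent standing inside an axis-aligned box, then applies a depth residual to move from one isovist to the next. The box scene, the grid resolution, and all function names are assumptions chosen for illustration; in the paper the residual is predicted by a learned model, whereas here it is computed directly from two known positions.

```python
import math

def ray_to_box(p, d, lo, hi):
    """Distance from interior point p along unit direction d to the box boundary."""
    t_min = math.inf
    for pi, di, l, h in zip(p, d, lo, hi):
        if di > 1e-12:
            t_min = min(t_min, (h - pi) / di)
        elif di < -1e-12:
            t_min = min(t_min, (l - pi) / di)
    return t_min

def isovist(p, lo, hi, n_az=8, n_el=4):
    """Spherical visibility-depth map: rows are elevations, columns are azimuths."""
    rows = []
    for i in range(n_el):
        el = -math.pi / 2 + math.pi * (i + 0.5) / n_el
        row = []
        for j in range(n_az):
            az = 2 * math.pi * j / n_az
            d = (math.cos(el) * math.cos(az),
                 math.cos(el) * math.sin(az),
                 math.sin(el))
            row.append(ray_to_box(p, d, lo, hi))
        rows.append(row)
    return rows

def apply_residual(prev, residual):
    """Next isovist = previous isovist + depth residual, element by element."""
    return [[a + r for a, r in zip(pr, rr)] for pr, rr in zip(prev, residual)]

lo, hi = (0.0, 0.0, 0.0), (10.0, 10.0, 4.0)
prev = isovist((5.0, 5.0, 2.0), lo, hi)
nxt = isovist((6.0, 5.0, 2.0), lo, hi)      # agent moved 1 m along x
residual = [[b - a for a, b in zip(pr, nr)] for pr, nr in zip(prev, nxt)]
```

Because most directions change little after a short move, the residual is small almost everywhere, which is one plausible reading of why a residual target helps preserve sharp edges.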

5. 人机交互与协作机器人 4 篇

2606.17455 2026-06-17 cs.RO 新提交

Continual Online Personalization of Exoskeleton Control via Manifold-Aware Experience Replay

基于流形感知经验回放的外骨骼控制持续在线个性化

Changseob Song, Inseung Kang

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出流形感知经验回放框架,通过回放缓冲区保留用户特定表征,避免在线适应中的灾难性遗忘,在模拟偏瘫步态中扭矩和步态相位跟踪精度分别提升40%和60%。

详情
AI中文摘要

个性化外骨骼控制对于步态障碍的临床用户仍然是一个关键挑战。在线适应(OA)通过实时适应受试者变异性、设备适配性和不同运动任务提供了一种有效解决方案。然而,OA涉及连续的用户状态数据流,可能导致先前学习的运动情境的灾难性遗忘。在此,我们开发了一种基于流形感知经验回放的在线个性化框架,旨在在外骨骼控制的OA过程中跨不同任务维护用户特定表征。通过从回放缓冲区重放先前经历的任务,我们保留了跨所有学习任务的个性化外骨骼辅助。此外,我们捕获了一个区分不同运动任务的步态流形,消除了在选择目标回放区间时对显式任务标签的需求。我们在模拟偏瘫步态(与健全模式有显著偏差)上评估了我们的框架,涉及速度和坡度转换的多个遗忘场景。与没有回放的基线框架(在任务转换期间表现出灾难性遗忘)相比,我们的流形感知回放框架在扭矩和步态相位跟踪精度上分别实现了40%和60%的提升。这表明我们提出的框架在临床人群的日常行走中跨不同运动情境实时个性化外骨骼控制。

英文摘要

Personalizing exoskeleton control remains a critical challenge for clinical users with gait disabilities. Online adaptation (OA) offers an effective solution by adapting in real time to subject variability, device fit, and diverse locomotor tasks. However, OA involves a continual stream of user state data, which can lead to catastrophic forgetting of previously learned locomotor contexts. Here, we develop a manifold-aware experience replay-based online personalization framework designed to maintain user-specific representations across diverse tasks during OA of exoskeleton control. By replaying previously experienced tasks from a replay buffer, we preserve the personalized exoskeleton assistance across all learned tasks. Furthermore, we capture a gait manifold that distinguishes between different locomotor tasks, removing the need for explicit task labeling when selecting target replay bins. We evaluated our framework on emulated hemiplegic gait, which largely deviates from able-bodied patterns, across multiple forgetting scenarios with speed and incline transitions. Our manifold-aware replay framework achieved 40% and 60% improvements in torque and gait phase tracking accuracy, respectively, compared to a baseline framework without replay, which exhibited catastrophic forgetting during task transitions. This demonstrates that our proposed framework personalizes exoskeleton control in real time across diverse locomotor contexts in daily ambulation of clinical populations.
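
The replay mechanism in this abstract can be sketched as a buffer whose bins are indexed by position on a gait manifold rather than by task label. Everything below is a hypothetical illustration: the 2D embedding, the bin centres, and the policy of replaying from every bin except the current one are assumptions, not the authors' implementation.

```python
import math
import random

class ManifoldReplayBuffer:
    """Replay buffer whose bins are keyed by position on a low-dimensional gait manifold."""

    def __init__(self, bin_centers, capacity_per_bin=100):
        self.bin_centers = bin_centers      # assumed manifold coordinates, one per bin
        self.capacity = capacity_per_bin
        self.bins = [[] for _ in bin_centers]

    def nearest_bin(self, z):
        """Index of the bin whose centre is closest to embedding z (no task label used)."""
        return min(range(len(self.bin_centers)),
                   key=lambda i: math.dist(z, self.bin_centers[i]))

    def add(self, z, sample):
        b = self.bins[self.nearest_bin(z)]
        b.append(sample)
        if len(b) > self.capacity:
            b.pop(0)                        # drop the oldest sample in a full bin

    def replay_batch(self, current_z, k, rng):
        """Draw up to k stored samples from bins other than the one currently occupied."""
        cur = self.nearest_bin(current_z)
        pool = [s for i, b in enumerate(self.bins) if i != cur for s in b]
        return rng.sample(pool, min(k, len(pool)))

buf = ManifoldReplayBuffer([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
buf.add((0.1, 0.0), "slow_walk_sample")
buf.add((0.9, 0.1), "fast_walk_sample")
buf.add((0.0, 1.2), "incline_sample")
batch = buf.replay_batch((0.05, 0.0), k=5, rng=random.Random(0))
```

Mixing `batch` into each online update is what would keep earlier locomotor contexts from being overwritten while the controller adapts to the current one.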

2606.17831 2026-06-17 cs.RO cs.HC 新提交

Accountability in Autonomous Drone-Based Firefighting: Insights From a Field Trial

自主无人机消防中的问责制:来自实地试验的见解

Dzmitry Katsiuba, Anna Katharina Boos, Robin Hany, Mateusz Dolata, Gerhard Schwabe

发表机构 * University of Zurich(苏黎世大学) Zeppelin University(齐柏林大学)

AI总结 通过实地试验,研究自主无人机在消防中对问责制的影响,发现角色不确定性和人机交互新问题,并提出建议以负责任地整合无人机。

Comments Accepted for Publication at International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/ethical_is/ethical_is/10/

详情
Journal ref
International Conference on Information Systems, 2025, ICIS2025-2162
AI中文摘要

有一个不断增长的研究领域探索自主无人机如何提高应急响应效率。将这些(人工)智能体整合到现有的应急团队和工作流程中,可能会显著影响既定的问责关系。本文研究了自主无人机如何在复杂的社会技术系统中影响问责归属。通过两次真实的消防实地试验,该研究揭示了当无人机在组织层面部署时,围绕问责制存在显著的不确定性。利用Bovens的问责框架,识别出两个挑战:(1)无人机在层级结构中的角色不确定性,导致问责归属混乱;(2)新形式的人机交互引入了额外的问责相关问题。基于这些见解,本文提出了可操作的建议,以支持在不损害问责制的前提下将自主无人机负责任地整合到消防行动中。这些发现为政策制定者提供了实用指导,并有助于进一步研究自主系统中的问责制。

英文摘要

There is a growing research field exploring how autonomous drones can enhance emergency response effectiveness. Integrating these (artificial) agents into existing emergency teams and workflows may significantly impact established accountability relationships. This paper examines how autonomous drones affect accountability attribution within complex socio-technical systems. Drawing on two real-life field trials in firefighting, the study reveals substantial uncertainty around accountability when drones are organizationally deployed. Using Bovens' accountability framework, two challenges are identified: (1) uncertainty about the role of drones within hierarchical structures, leading to confused accountability ascriptions; and (2) new forms of human-drone interactions introducing additional accountability-relevant issues. Based on these insights, the paper proposes actionable recommendations to support the responsible integration of autonomous drones into firefighting operations without undermining accountability. These findings offer practical guidance for policymakers and contribute to further research on accountability in autonomous systems.

2606.17839 2026-06-17 cs.RO cs.HC 新提交

From Ad Hoc Pilots to Repeatable Patterns: Structuring Drone Collaboration in Emergency Services with DroneLets

从临时试点到可重复模式:用DroneLets构建紧急服务中的无人机协作

Dzmitry Katsiuba, Samuel Brander, Mateusz Dolata, Gerhard Schwabe

发表机构 * University of Zurich(苏黎世大学) Zeppelin University(齐柏林大学)

AI总结 本文通过实地试验和访谈,提炼出44种交互模式并引入DroneLets设计构件,以结构化的方式实现紧急服务中无人机协作的可重复和可扩展。

Comments Presented at International Conference on Information Systems (ICIS) 2025: https://aisel.aisnet.org/icis2025/is_transformwork/is_transformwork/19/

详情
Journal ref
International Conference on Information Systems 2025: ICIS2025-2217
AI中文摘要

无人机有望支持紧急服务,但其融入工作流程仍具有临时性和协调密集型。本文探讨两个研究问题:紧急团队希望如何与无人机协作,以及如何将这些协作形式化为可重复的过程。基于四次实地试验和95次访谈,我们推导出44种交互模式,分为10个元模式,反映了侦察、通信和后勤支持等操作需求。为了构建这些实践,我们引入了DroneLets——一种新的设计构件类别,将协作工程扩展到具身代理。DroneLets捕获设置要求、无人机能力、环境约束以及人类和无人机代理之间的协调行动。它们提供了一个模块化框架,用于设计紧急服务中可重复、可扩展的协作过程,并通过向旁观者广播和火灾后监测等模式加以说明。这项工作扩展了协作工程的范围,并为将自主无人机集成到高风险现场操作中提供了结构化基础。

英文摘要

Drones hold promise for supporting emergency services, but their integration into workflows remains ad hoc and coordination-intensive. This paper addresses two research questions: how emergency teams want to collaborate with drones, and how to formalize these collaborations into repeatable processes. Based on four field trials and 95 interviews, we derive 44 interaction patterns grouped into 10 meta-patterns reflecting operational needs such as reconnaissance, communication, and logistical support. To structure these practices, we introduce DroneLets, a new class of design artifacts that extend Collaboration Engineering (CE) to embodied agents. DroneLets capture setup requirements, drone capabilities, environmental constraints, and coordinated actions across human and drone actors. They offer a modular framework for designing repeatable, scalable collaboration processes in emergency services, illustrated through patterns such as broadcasting to bystanders and post-fire monitoring. This work expands the scope of CE and provides a structured foundation for integrating autonomous drones into high-stakes field operations.

2606.18189 2026-06-17 cs.RO 新提交

Beyond Failure Recovery: An Engagement-Aware Human-in-the-loop Framework for Robotic Systems

超越故障恢复:一种面向机器人系统的参与感知人在回路框架

Jiaying Fang, Joyce Yang, Zhanxin Wu, Bohan Yang, Tapomayukh Bhattacharjee

发表机构 * Cornell University(康奈尔大学)

AI总结 提出一种参与感知模型预测控制(E-MPC)方法,通过规划交互频率和类型来维持用户参与度并控制工作负荷,在机器人辅助进食系统中验证了其提升用户体验且不降低任务成功率的效果。

Comments Project website at https://emprise.cs.cornell.edu/empc

详情
Journal ref
Robotics: Science and Systems 2026
AI中文摘要

传统的人机协同方法通常仅在机器人遇到故障或不确定性时才让用户介入,将人类主要视为提升机器人性能的工具。然而,在许多以人为中心的机器人环境中,交互应通过让用户参与决策来支持参与度,而非将其限制于故障驱动的干预。这在物理护理场景中尤为突出,因为行动受限会降低用户实时干预或调节机器人行为的能力。因此,故障驱动的交互策略可能使用户在任务的大部分时间里沦为被动观察者。例如,行动受限的用户在持续被动接受机器人喂食时可能感到参与度不足。同时,过于频繁的交互可能令人疲惫并增加用户工作负荷。为解决这一权衡,我们提出了一种用户参与感知方法——参与感知模型预测控制(E-MPC),该方法规划交互以在维持参与度的同时满足工作负荷约束。E-MPC利用一个用户交互动力学模型,该模型捕捉用户参与度如何随交互频率和类型变化。机器人并非仅在任务执行出现困难时才请求输入,而是主动考虑用户在整个任务中偏好的参与水平,平衡自主性与交互,同时确保任务成功。我们通过多项消融实验和基线对比在仿真中评估了E-MPC。结果表明,该方法在多种用户画像下均有效。此外,我们在一个机器人辅助咬取系统中,与模拟行动受限的真实参与者进行了用户研究,显示E-MPC在维持任务成功的同时改善了用户体验。

英文摘要

Conventional human-in-the-loop approaches typically involve users only when a robot encounters failure or uncertainty, treating humans primarily as tools for improving robot performance. However, in many human-centered robotics settings, interaction should support engagement by keeping users involved in decision-making rather than limiting them to failure-driven interventions. This is particularly compelling in physical caregiving, where mobility limitations can reduce users' ability to intervene or modulate the robot's behavior in the moment. As a result, failure-driven interaction policies may relegate users to passive observers for long stretches of the task. For example, a user with mobility limitations may feel less engaged when being continuously and passively fed by a robot. At the same time, overly frequent interaction can be tiring and increase the user's workload. To address this trade-off, we propose Engagement-aware MPC (E-MPC), a user-engagement-aware method that plans interaction to maintain engagement while respecting a workload constraint. E-MPC leverages a user interaction dynamics model that captures how user engagement evolves as a function of both the frequency and type of interaction. Rather than requesting input only when difficulties arise during task execution, the robot proactively considers the user's preferred level of engagement throughout the task, balancing autonomy and interaction while ensuring task success. We evaluate E-MPC in simulation with several ablations and baseline comparisons. Results demonstrate the effectiveness of our approach across diverse user personas. In addition, we conduct a real-world user study with participants with emulated mobility limitations on a robot-assisted bite acquisition system, showing that E-MPC improves user experience while maintaining task success.
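
The trade-off E-MPC plans over can be illustrated with a toy model: engagement decays when the user is passive, each interaction type raises it at some workload cost, and the planner searches short action sequences for the one that best tracks a preferred engagement level within a workload budget. The dynamics, action names, and numbers below are invented for illustration and are not the paper's learned user model.

```python
import itertools

# Hypothetical interaction types: (engagement gain, workload cost).
ACTIONS = {"none": (0.0, 0.0), "confirm": (0.15, 1.0), "choose": (0.35, 2.5)}
DECAY = 0.8  # assumed: engagement fades when the user is left passive

def step(engagement, action):
    """One step of the toy engagement dynamics, clipped at 1.0."""
    gain, _ = ACTIONS[action]
    return min(1.0, DECAY * engagement + gain)

def plan(engagement, preferred, horizon, workload_budget):
    """Exhaustively search action sequences; return the best one within the budget."""
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(ACTIONS, repeat=horizon):
        if sum(ACTIONS[a][1] for a in seq) > workload_budget:
            continue                        # violates the workload constraint
        e, cost = engagement, 0.0
        for a in seq:
            e = step(e, a)
            cost += (e - preferred) ** 2    # penalise deviation from the preferred level
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

# Receding-horizon use: execute only the first action, then replan.
next_action = plan(engagement=0.5, preferred=0.6, horizon=3, workload_budget=3.0)[0]
```

Exhaustive search is only workable for a horizon this short; it stands in here for whatever optimiser the actual method uses.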

6. 具身智能与视觉语言动作模型 7 篇

2606.17598 2026-06-17 cs.RO cs.CV 新提交

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

MuseVLA: 一种用于机器人操作的自适应多模态感知视觉-语言-动作模型

Xingyuming Liu, Ruichun Ma, Heyu Guo, Qixiu Li, Qingwen Yang, Lin Luo, Shiqi Jiang, Chenren Xu, Jiaolong Yang, Baining Guo

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) Microsoft Research Asia(微软亚洲研究院) Princeton University(普林斯顿大学) Tsinghua University(清华大学)

AI总结 提出MuseVLA模型,通过将传感器作为按需工具集成,实现自适应多模态感知;设计传感器图像统一表示,并引入数据合成流水线,在灵巧手操作任务中平均成功率80.6%,显著优于RGB-only和多模态基线。

详情
AI中文摘要

人类自然地利用多种感知模态与物理世界交互,而大多数用于机器人的视觉-语言-动作(VLA)模型仅依赖RGB观测。这限制了它们感知难以或无法从RGB相机推断的物理属性(如温度、声音或雷达响应)的能力。我们提出MuseVLA,一种自适应多模态感知VLA模型,将新型传感器作为按需工具集成到机器人操作中。给定任务指令和视觉上下文,MuseVLA首先生成一个传感器令牌和目标描述,选择要调用的感知模态和关注对象,类似于带参数的工具调用。然后,它将选定的传感器测量值转换为接地传感器图像,这是一种统一的中间表示,编码异构读数以进行多模态融合和动作生成。这种设计将传感器特定处理与VLA主干解耦,实现了多种模态的高效集成。为了减少对昂贵的多传感器机器人数据集的需求,我们进一步引入了一种数据合成流水线,用接地传感器图像增强现有的RGB视频数据集,从而实现对未见过的传感器引导任务的泛化。我们在真实机器人上评估了MuseVLA,涉及需要多模态感知输入的挑战性灵巧手操作任务,包括温度引导的拾取与放置、音频驱动的物体搜索和雷达辅助的隐藏物体检索。MuseVLA平均成功率达到80.6%,显著优于仅RGB和多模态VLA基线,并在未见任务上表现出强大的零样本能力。

英文摘要

Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.
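
The "sensor token plus target description" step reads like a tool call, and the sensor image acts as a common format that every modality is converted into. A minimal sketch of that idea follows, with loud caveats: the `SensorCall` type, the value ranges, and the simple min-max normalisation are assumptions, and the spatial grounding of the image to the named target is omitted entirely.

```python
from dataclasses import dataclass

@dataclass
class SensorCall:
    sensor: str   # which modality to invoke, e.g. "thermal"
    target: str   # what to attend to, e.g. "the hot cup" (unused in this sketch)

# Hypothetical per-sensor value ranges (degrees Celsius, decibels).
SENSOR_RANGES = {"thermal": (0.0, 100.0), "audio": (-60.0, 0.0)}

def to_sensor_image(readings, lo, hi):
    """Map a 2D grid of raw readings to 0..255 so every sensor yields an image-like array."""
    span = (hi - lo) or 1.0
    return [[round(255 * min(1.0, max(0.0, (v - lo) / span))) for v in row]
            for row in readings]

def invoke(call, raw_readings):
    """Dispatch on the sensor named in the call and return its sensor image."""
    lo, hi = SENSOR_RANGES[call.sensor]
    return to_sensor_image(raw_readings[call.sensor], lo, hi)

call = SensorCall(sensor="thermal", target="the hot cup")
image = invoke(call, {"thermal": [[0.0, 20.0], [100.0, 120.0]]})
```

Keeping the conversion outside the policy is what lets a new sensor be added by registering a range and a converter, without touching the backbone.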

2606.17924 2026-06-17 cs.RO cs.AI 新提交

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

PearlVLA:潜在空间中的渐进式具身动作计划精炼

Bochen Yang, Lianlei Shan

发表机构 * Imperial College London(帝国理工学院) Tsinghua University(清华大学)

AI总结 提出PearlVLA框架,通过在VLM潜在空间中进行迭代计划精炼,平衡动作生成效率与显式推理,在LIBERO基准上达到最先进性能。

Comments 21 pages, 2 figures. Preprint

详情
AI中文摘要

当前的视觉-语言-动作(VLA)模型在高效动作生成与显式推理之间存在权衡。直接从视觉-语言骨干表示解码动作可实现低延迟控制,而通过文本链、像素级子目标或动作搜索进行显式推理可以改善规划,但会带来大量延迟和计算成本。我们提出PearlVLA,一个将推理转移到视觉-语言模型(VLM)潜在空间中的VLA框架。PearlVLA将VLM元查询表示分离为固定的视觉接地分支和迭代的潜在计划分支。在每个精炼轮次中,一个计划条件的世界查询探测一个轻量级冻结的潜在世界模型,以获取无动作的未来观察潜在表示,该表示被反馈以指导计划精炼。然后,一个未来引导的RefineNet应用计划的残差更新,逐步将粗糙的语义草稿精炼为细粒度的潜在动作计划。经过K轮精炼后的计划被并行解码为动作块,用于低延迟执行。我们进一步引入因果精炼分组过程奖励强化学习,以优化潜在精炼过程,奖励来自由潜在计划编辑引起的更长视野想象未来。在LIBERO基准上的实证评估表明,PearlVLA在现有方法中达到了最先进的性能。

英文摘要

Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.
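
The iterative refinement loop can be written as a scheduled residual update, plan <- plan + alpha_k * delta_k, repeated for K rounds. The sketch below shows only that update rule; `toward_target` is a made-up stand-in for the future-guided RefineNet, and the step sizes are arbitrary.

```python
def refine_plan(draft, propose_update, step_sizes):
    """Apply one residual update per round: plan <- plan + alpha_k * delta_k."""
    plan = list(draft)
    for alpha in step_sizes:                # K = len(step_sizes) refinement rounds
        delta = propose_update(plan)        # stand-in for the learned refinement network
        plan = [p + alpha * d for p, d in zip(plan, delta)]
    return plan

# Hypothetical stand-in: pull the plan toward a fixed target latent.
TARGET = [1.0, -2.0, 0.5]

def toward_target(plan):
    return [t - p for t, p in zip(TARGET, plan)]

refined = refine_plan([0.0, 0.0, 0.0], toward_target, step_sizes=[0.5, 0.5, 0.5])
```

With a step size of 0.5 each round halves the remaining gap, so three rounds cover 87.5% of the distance from the coarse draft to the target.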

2606.17937 2026-06-17 cs.RO 新提交

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

ThinkingVLA:用于机器人操作的交叉视觉与语言推理

Tianyi Lu, Hui Zhang, Zijie Diao, Junke Wang, Shengqi Xu, Xingyao Lin, Guojin Zhong, Ziyi Ye, Peng Wang, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学)

AI总结 提出ThinkingVLA,通过统一的多Transformer架构实现前向与逆向推理交织,显著提升长时域操作任务性能。

详情
AI中文摘要

大多数视觉-语言-动作(VLA)模型直接将观测映射到动作,缺乏显式推理,限制了其在推理密集型长时域任务中的能力。为解决此问题,现有方法采用思维链(CoT)推理以实现子目标分解和空间预测。然而,这些方法缺乏用于有效跨模态推理的统一架构,并且未能显式包含基于目标状态的逆向推理能力。我们认为,操作规划自然分解为预测(预测下一个视觉状态)和逆向动力学(推断达到该状态的动作)。连接两者需要一个统一的、在单一生成过程中交织文本和视觉推理的自回归架构。我们提出ThinkingVLA,一种在统一的混合Transformer架构中实现此分解的生成模型。ThinkingVLA包含一个前向CoT,用于识别即时子目标并指导视觉预测;预测的图像随后作为目标状态,为逆向CoT提供基础,该逆向CoT基于预测图像推理空间关系和动作意图;最终动作基于完整的推理上下文生成。在仿真和真实世界基准上的大量实验表明,ThinkingVLA持续优于最先进的基线,在长时域操作任务上尤其有大幅提升。

英文摘要

Most Vision-Language-Action (VLA) models map observations directly to actions without explicit reasoning, limiting their capacity for reasoning-intensive long-horizon tasks. To address this, existing approaches adopt Chain-of-Thought (CoT) reasoning to enable subgoal decomposition and spatial anticipation. However, these methods lack a unified architecture for effective cross-modal reasoning and fail to explicitly include inverse reasoning ability based on the target state. We argue that manipulation planning naturally decomposes into prediction, anticipating the next visual state, and inverse dynamics, inferring the actions to reach it. Bridging both requires a unified autoregressive architecture that interleaves textual and visual reasoning in a single generation process. We propose ThinkingVLA, a generative model that realizes this decomposition within a unified Mixture-of-Transformers architecture. ThinkingVLA consists of a forward CoT that identifies the immediate subgoal and guides the visual forecasting; the predicted image then serves as the target state, grounding an inverse CoT that reasons about spatial relationships and action intent based on the predicted image; and the final action is generated conditioned on this full reasoning context. Extensive experiments on simulation and real-world benchmarks demonstrate that ThinkingVLA consistently outperforms state-of-the-art baselines, with particularly large gains on long-horizon manipulation tasks.

2606.17480 2026-06-17 cs.CV cs.RO 交叉投稿

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

GeneralVLA-2: 几何感知重建与受控记忆用于机器人规划

Haoyu Wang, Guoqing Ma, Zeyu Zhang, Yandong Guo, Boxin Shi, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) CASIA(中国科学院自动化研究所) AI 2 Robotics

AI总结 针对机器人规划中3D物体重建幻觉和记忆质量不可控的问题,提出GeoFuse-MV3D几何先验引导重建分支和受控长期记忆系统,在GSO-30和Terminal-Bench等基准上显著提升性能。

详情
AI中文摘要

通用视觉-语言-动作系统需要以物体为中心的3D证据和可复用的操作经验来规划可靠的机器人轨迹。GeneralVLA提供了一个层次化接口,用于将语言和RGB-D观测转换为3D末端执行器路径,但仍存在两个瓶颈。首先,单目SAM3D风格的物体重建可能产生姿态和未见几何的幻觉,而操作受益于在标定多视图观测可用时的稳定物体形状。其次,原始的KnowledgeBank主要检索语义相似的片段并附加新知识,这使得难以控制记忆质量、冲突、置信度和几何相关性。为了解决第一个挑战,我们引入了GeoFuse-MV3D,一个几何先验引导的MV-SAM3D重建分支,它用输入视图掩码验证外部几何线索,应用软视觉外壳支持,执行轴方向细化,并仅融合几何同时保留外观。为了解决第二个挑战,我们将KnowledgeBank升级为一个受控的长期记忆系统,具有明确的质量、置信度、生命周期、验证器和冲突元数据,以及面向精度的检索。最后,我们在GSO-30上评估重建分支,在Terminal-Bench 2.0和SWE-Bench Verified上评估记忆模块;GeoFuse-MV3D相比MV-SAM3D基线,CD和LPIPS分别降低2.20%和2.02%,PSNR和SSIM分别提高2.36%和1.03%;KnowledgeBank相比ReasoningBank,在Terminal-Bench SR上提高4.53%,在SWE-Bench解决率上提高3.73%,同时AS分别降低4.95%和5.65%。代码:https://github.com/AIGeeksGroup/GeneralVLA-2 网站:https://aigeeksgroup.github.io/GeneralVLA-2

英文摘要

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
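
One way to picture "governed" memory with precision-oriented retrieval is a store whose entries carry explicit metadata, filtered hard on that metadata before any ranking by relevance. The field names, thresholds, and example entries below are assumptions for illustration; conflict handling is left out, and this is not the KnowledgeBank API.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    text: str
    relevance: float          # similarity to the current query, assumed precomputed
    quality: float            # 0..1
    confidence: float         # 0..1
    status: str = "active"    # lifecycle state, e.g. "active" or "deprecated"
    verified: bool = False    # set by a verifier before the entry may be reused

def retrieve(entries, k=3, min_quality=0.6, min_confidence=0.7):
    """Filter on governance metadata first, then rank the survivors by relevance."""
    ok = [e for e in entries
          if e.status == "active" and e.verified
          and e.quality >= min_quality and e.confidence >= min_confidence]
    return sorted(ok, key=lambda e: e.relevance, reverse=True)[:k]

entries = [
    MemoryEntry("grasp mug by handle", 0.9, 0.8, 0.9, verified=True),
    MemoryEntry("grasp mug by rim", 0.95, 0.4, 0.9, verified=True),      # low quality
    MemoryEntry("old gripper offsets", 0.7, 0.9, 0.9, "deprecated", True),
    MemoryEntry("approach from above", 0.6, 0.7, 0.8, verified=True),
]
hits = [e.text for e in retrieve(entries)]
```

The most relevant entry is rejected here because its quality is below the threshold, which is the point of favouring precision over recall.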

2509.26633 2026-06-17 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

OmniRetarget:面向人形全身运动操控与场景交互的交互保持数据生成

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi

AI总结 提出OmniRetarget引擎,通过交互网格显式建模并保持智能体、地形和物体间的空间与接触关系,将人类运动重定向为机器人运动,生成高质量轨迹以训练强化学习策略,实现长时间跑酷和操控技能。

Comments Project website: https://omniretarget.github.io

详情
AI中文摘要

教授人形机器人复杂技能的主流范式是将人类运动重定向为运动学参考,以训练强化学习(RL)策略。然而,现有的重定向流程常常难以应对人与机器人之间的显著具身差异,产生物理上不可信的伪影,如脚滑和穿透。更重要的是,常见的重定向方法忽略了对于表达性运动及运动操控至关重要的丰富的人-物和人-环境交互。为解决这一问题,我们引入了OmniRetarget,一种基于交互网格的交互保持数据生成引擎,该网格显式建模并保持智能体、地形和操作对象之间的关键空间与接触关系。通过最小化人体与机器人网格之间的拉普拉斯变形同时施加运动学约束,OmniRetarget生成运动学上可行的轨迹。此外,保持任务相关的交互使得从单一示范到不同机器人本体、地形和物体配置的高效数据增强成为可能。我们通过将来自OMOMO、LAFAN1和我们内部MoCap数据集的运动进行重定向,全面评估了OmniRetarget,生成了超过8小时的轨迹,这些轨迹在运动学约束满足和接触保持方面优于广泛使用的基线。这种高质量数据使得本体感觉RL策略能够在Unitree G1人形机器人上成功执行长达30秒的长时间跑酷和运动操控技能,且仅使用5个奖励项和所有任务共享的简单域随机化进行训练,无需任何学习课程。

英文摘要

A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8 hours of trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.
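
The Laplacian deformation term can be illustrated on a tiny mesh: each vertex is described by its offset from the mean of its neighbours, and the energy compares those offsets between the human and robot meshes. The uniform neighbour weights, the four-vertex mesh, and the function names are assumptions for this sketch; the kinematic constraints of the actual method are not modelled.

```python
def laplacian_coords(vertices, neighbors):
    """Laplacian coordinate of each vertex: its offset from the mean of its neighbours."""
    coords = []
    for i, v in enumerate(vertices):
        nbrs = [vertices[j] for j in neighbors[i]]
        mean = [sum(c) / len(nbrs) for c in zip(*nbrs)]
        coords.append([a - m for a, m in zip(v, mean)])
    return coords

def deformation_energy(human_vertices, robot_vertices, neighbors):
    """Sum of squared differences between human and robot Laplacian coordinates."""
    lh = laplacian_coords(human_vertices, neighbors)
    lr = laplacian_coords(robot_vertices, neighbors)
    return sum((a - b) ** 2 for u, w in zip(lh, lr) for a, b in zip(u, w))

# A tetrahedron standing in for an interaction mesh (body, object and terrain points).
human = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
neighbors = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}

shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in human]   # same shape, moved
scaled = [(2 * x, 2 * y, 2 * z) for x, y, z in human]          # relative layout changed

e_shift = deformation_energy(human, shifted, neighbors)
e_scale = deformation_energy(human, scaled, neighbors)
```

The energy ignores a rigid translation but penalises changes in relative layout, which is why it suits preserving spatial and contact relationships between the agent, objects, and terrain.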

2605.23733 2026-06-17 cs.RO cs.AI 版本更新

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

发表机构 * LimX Dynamics(逐际动力)

AI总结 提出Any2Any范式,通过运动学对齐和动力学微调,实现预训练全身跟踪模型高效迁移至新的人形机器人本体,仅需少量数据和计算即可达到竞争性跟踪性能。

详情
AI中文摘要

全身跟踪(WBT)模型已成为人形机器人的关键基础,使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算,使得在新人形平台上快速部署成本高昂。这自然引发一个问题:预训练的WBT模型能否通过最小化适应跨本体迁移?为回答这个问题,我们提出Any2Any,一种范式,能够高效地将现有WBT专家迁移到新人形本体,仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐,对齐其输入和输出空间,使得预训练的源策略可以在目标本体上有意义地重用。然后,Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调(PEFT)组件进行动力学适应,保留有用的行为先验,同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明,与从头训练相比,Any2Any显著加速收敛并降低训练成本,同时实现具有竞争力或更优的跟踪性能。值得注意的是,仅使用完整训练所需计算和数据的1%,Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明,预训练的WBT专家可以跨本体高效重用,为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots.
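
Aligning input and output spaces between two humanoids can be pictured, in its simplest form, as a name-based joint mapping. The sketch below is a guess at the flavour of such an alignment rather than the Any2Any procedure: the joint lists are invented, joints missing on the target are padded with a default, and the PEFT stage is not shown.

```python
# Hypothetical joint orderings for a source and a target humanoid.
SOURCE_JOINTS = ["hip_pitch", "knee", "ankle_pitch", "waist_yaw"]
TARGET_JOINTS = ["knee", "hip_pitch", "ankle_pitch"]

def target_obs_to_source(target_q, default=0.0):
    """Reorder target joint readings into the source policy's input layout."""
    by_name = dict(zip(TARGET_JOINTS, target_q))
    return [by_name.get(name, default) for name in SOURCE_JOINTS]

def source_action_to_target(source_action):
    """Keep only the source policy outputs for joints the target actually has."""
    by_name = dict(zip(SOURCE_JOINTS, source_action))
    return [by_name[name] for name in TARGET_JOINTS]

policy_input = target_obs_to_source([0.3, 0.1, -0.2])
target_command = source_action_to_target([1.0, 2.0, 3.0, 4.0])
```

Once observations and actions pass through such a mapping, the pretrained policy can at least be run on the new robot, leaving the dynamics mismatch for the fine-tuning stage.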

2605.31286 2026-06-17 cs.RO cs.AI 版本更新

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

DeMaVLA:面向可泛化可变形物体操作的视觉-语言-动作基础模型

Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

发表机构 * Tongji University(同济大学)

AI总结 提出DeMaVLA模型,采用VLM骨干与动作专家结合流匹配生成连续动作,通过剪枝Transformer层提升效率,并利用大规模真实世界数据和人类反馈数据聚合训练,实现可变形物体折叠操作的多类别泛化。

Comments 14 pages, 2 figures

详情
AI中文摘要

现实家庭机器人需要视觉-语言-动作(VLA)基础模型,能够在不同物体、任务条件和家庭环境中获取可重复使用的操作技能。可变形物体折叠是一个代表性挑战,要求机器人处理来自随机初始状态的衣物,涉及不同类别、几何形状、材料和场景。然而,现有的VLA系统通常为不同物体类别训练独立的策略,而简单混合的多任务训练常常遭受任务干扰和性能下降。为了超越类别特定的折叠策略,我们引入了DeMaVLA,一个面向可泛化可变形物体操作的VLA基础模型。DeMaVLA采用VLM骨干网络和动作专家,并使用流匹配来公式化连续动作生成。为了提高效率,动作专家通过剪枝每隔一个Transformer层构建,同时保持与VLM骨干网络的逐层对齐,从而降低训练和推理成本。DeMaVLA首先在大约5000小时精选的真实世界双臂演示数据上进行预训练,以获得通用的操作先验。然后,它在混合折叠数据上进行后训练,这些数据通过人类参与的数据聚合(DAgger)流程,聚合了自我收集的演示和来自多个折叠任务中真实机器人失败的纠正轨迹。实验表明,DeMaVLA在RoboTwin 2.0上取得了有竞争力的性能,并在我们的家庭折叠基准测试中取得了强大的真实世界结果。这些结果突显了可扩展的真实世界数据、高效的动作生成和纠正学习对于可变形物体操作中的通用VLA策略的价值。

英文摘要

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
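
Two mechanisms in this abstract are easy to sketch in isolation: keeping every other layer while remembering which backbone layer each kept layer pairs with, and generating an action by integrating a velocity field as in flow matching. Both functions below are toy illustrations under stated assumptions: which layers are kept (even indices here) is a guess, and `straight_line_velocity` is a hand-written field standing in for the learned action expert.

```python
def prune_every_other(num_backbone_layers):
    """Map each action-expert layer index to the backbone layer it stays aligned with."""
    kept = list(range(0, num_backbone_layers, 2))   # assumed: keep even-indexed layers
    return dict(enumerate(kept))

def sample_action(velocity, noise, steps=10):
    """Euler-integrate da/dt = velocity(a, t) from t=0 (noise) to t=1 (action)."""
    a, dt = list(noise), 1.0 / steps
    for k in range(steps):
        v = velocity(a, k * dt)
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a

# Hypothetical velocity field whose straight-line flow ends at a fixed target action.
TARGET = [0.2, -0.4]

def straight_line_velocity(a, t):
    return [(g - ai) / (1.0 - t) for g, ai in zip(TARGET, a)]

alignment = prune_every_other(6)
action = sample_action(straight_line_velocity, noise=[1.0, 1.0])
```

Halving the expert's depth roughly halves its per-step cost, which matters because sampling calls the expert once per integration step.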

7. 多机器人与群体系统 2 篇

2606.17739 2026-06-17 cs.RO cs.AI cs.CV cs.MA 新提交

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

ED3R: 能量感知的分布式灾难检测——基于协作机器人智能体

Lina Magoula, Nikolaos Koursioumpas, Nancy Alonistioti, Ramin Khalili

发表机构 * Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens(雅典大学信息学与电信系) Huawei Heisenberg Research Center (Munich)(华为海森堡研究中心(慕尼黑))

AI总结 提出ED3R框架,通过机器人-远程控制器分层协作与分布式神经回归预测,在不确定性下以最低能耗实现野火检测,成功率达97.18%,能耗降低36.4%,检测速度提升41%。

Comments 14 pages, 9 figures

详情
AI中文摘要

机器人技术有望支持环境监测和自然灾害管理,在这些场景中,决策必须在不确定性、资源限制和严格操作约束下做出。在关键任务(如野火)中,机器人智能体不仅需要以足够置信度识别危险事件,还需管理能量成本和检测时间。本文介绍ED3R,一种用于不确定性下野火检测的能量感知分布式框架。ED3R实现了机器人与远程控制器之间的分层协作决策:远程控制器决定机器人的运动,而机器人感知环境并决定在何处(机载或远程)以及如何执行野火检测。共同目标是以所需置信度检测野火,同时最小化任何机器人操作消耗的能量。ED3R进一步集成了避免附近障碍物、防止冗余探索、实现自适应早期任务完成以及通过自定义惩罚函数确保可行性的机制。ED3R还引入了前瞻能力,通过分布式神经回归模型使智能体能够在执行前评估候选策略以预测未来。该框架通过逼真的机器人仿真、消融研究和基线比较进行评估。总体而言,ED3R的任务成功率高达97.18%。尤其是在最具挑战性的任务中,它比基线减少高达36.4%的能量消耗,并提前高达41%检测到野火。

英文摘要

Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.

2606.17216 2026-06-17 cs.MA cs.GT cs.RO 交叉投稿

Intermittent Strategic Cooperation of Two Selfish Agents on Graphs

两个自私智能体在图上的间歇性战略合作

Itay Shedlezki, Noa Agmon

发表机构 * Bar-Ilan University(巴伊兰大学)

AI总结 研究两个自私智能体在时间与空间约束下的战略合作问题,通过IC2PP模型刻画纯纳什均衡结构,证明均衡存在性并提出多项式时间枚举算法。

详情
AI中文摘要

我们通过间歇性战略合作双智能体路径规划(IC2PP)问题研究两个自私智能体在空间和时间约束下的战略合作,这是一个图上的最短路径博弈,其中智能体向各自目标导航,同时可选地在特定节点合作以减少自身旅行时间。尽管这种合作对双方都有严格利益,但战略上脆弱:智能体可能在其路径的任何点偏离。建模为双人博弈,我们刻画了IC2PP中纯纳什均衡(PNE)联合策略的结构,并表明稳定合作必须遵循高度受限的形式。我们进一步证明每个IC2PP实例中至少存在一个PNE,并提出一个多项式时间算法来枚举所有相关PNE。当出现多个均衡时,我们研究基于议价理论选择概念的协调机制,并根据个体旅行时间和社会福利经验性地比较均衡结果。

英文摘要

We study strategic space- and time-constrained cooperation between two self-interested agents through the Intermittent Strategic Cooperation-Based Two-Agent Path Planning (IC2PP) problem, a shortest-path game on graphs in which agents navigate toward individual targets while optionally cooperating at specific nodes to reduce their own travel times. Although such cooperation can strictly benefit both agents, it is strategically fragile: agents may deviate at any point along their paths. Modeled as a 2-player game, we characterize the structure of Pure Nash Equilibrium (PNE) joint strategies in IC2PP, and show that stable cooperation must follow a highly constrained form. We further prove that at least one PNE exists in every instance of IC2PP, and present a polynomial-time algorithm for enumerating all relevant PNEs. When multiple equilibria arise, we study coordination mechanisms based on bargaining-theoretic selection concepts and empirically compare equilibrium outcomes in terms of individual travel times and social welfare.
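
To make the notion of a pure Nash equilibrium concrete, here is a minimal brute-force check for a two-agent cost game in which each agent minimizes its own travel time. This is not the polynomial-time enumeration algorithm from the paper; the function name, the matrix encoding, and the travel times are all hypothetical.

```python
from itertools import product


def pure_nash_equilibria(cost_a, cost_b):
    """Return all (i, j) where neither agent can lower its own cost by deviating alone.

    cost_a[i][j] and cost_b[i][j] are the travel times of agents A and B when
    A plays strategy i and B plays strategy j (illustrative encoding).
    """
    rows, cols = len(cost_a), len(cost_a[0])
    equilibria = []
    for i, j in product(range(rows), range(cols)):
        a_cannot_improve = all(cost_a[i][j] <= cost_a[k][j] for k in range(rows))
        b_cannot_improve = all(cost_b[i][j] <= cost_b[i][k] for k in range(cols))
        if a_cannot_improve and b_cannot_improve:
            equilibria.append((i, j))
    return equilibria


if __name__ == "__main__":
    # Strategy 0 = travel alone, strategy 1 = detour to a meeting node.
    # Made-up travel times: cooperating helps both, waiting alone hurts.
    cost_a = [[10, 10], [12, 7]]
    cost_b = [[10, 12], [10, 7]]
    print(pure_nash_equilibria(cost_a, cost_b))
```

In this toy instance both "nobody cooperates" and "both cooperate" are stable, which is the kind of multiplicity that motivates an equilibrium-selection mechanism.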

8. 无人车、无人机与移动机器人 9 篇

2606.17082 2026-06-17 cs.RO cs.AI 新提交

ParkingTransformer: LLM-Enhanced End-to-End Trajectory Planning for Autonomous Parking

ParkingTransformer: 基于大语言模型增强的端到端自主泊车轨迹规划

Hauteng Wu, Xu Li, Dong Kong, Zihang Wang, Xieyuanli Chen, Benwu Wang, Wenkai Zhu

发表机构 * School of Instrument Science and Engineering, Southeast University(东南大学仪器科学与工程学院) School of Electronic and Information Engineering, Tongji University(同济大学电子与信息工程学院) College of Transportation, Shandong University of Science and Technology(山东科技大学交通学院) National University of Defense Technology(国防科技大学)

AI总结 提出ParkingTransformer框架,利用多视角感知和大语言模型场景理解能力,结合轨迹查询与隐状态特征,直接输出规划轨迹,无需密集BEV表示,通过3D位置编码、固定窗口流机制和粗到细解码策略提升性能,在CARLA和实车实验中验证有效性。

详情
AI中文摘要

端到端自主泊车已成为自动驾驶领域的关键任务。然而,现有方法存在黑箱特性,缺乏高层语义理解和可解释性,阻碍了从道路到目标点的无缝长距离自主泊车的实现。为解决这些限制,我们提出ParkingTransformer,一种利用多视角感知和大语言模型(LLMs)场景理解能力的新型框架。通过将轨迹查询与LLMs隐状态特征相结合,我们的方法直接与历史信息和原始传感器数据交互以输出规划轨迹,无需密集的鸟瞰图(BEV)表示。为补偿LLMs空间推理能力的不足,我们引入3D位置编码以显式注入空间几何感知。此外,设计了固定窗口流机制用于历史信息处理,显著提高了长期时间处理效率和推理速度。同时,采用粗到细解码策略逐步提升轨迹精度。在CARLA模拟器和真实车辆平台上进行了广泛的闭环实验。结果表明,我们的方法在CARLA模拟器中达到61.32的驾驶分数,在真实实验中平均成功率为88.70%,验证了所提算法的可行性和有效性。

英文摘要

End-to-end autonomous parking has emerged as a critical task within the realm of autonomous driving. However, existing methods suffer from black-box characteristics, lacking high-level semantic understanding and interpretability, which impedes the realization of seamless long-distance autonomous parking from the road to the target spot. To address these limitations, we propose ParkingTransformer, a novel framework that leverages multi-view perception and the scene understanding capability of Large Language Models (LLMs). By combining trajectory queries with the implicit state features of LLMs, our method interacts directly with historical information and raw sensor data to output planning trajectories, eliminating the need for dense Bird's-Eye-View (BEV) representations. To compensate for the inadequate spatial reasoning ability of LLMs, we introduce 3D positional encoding to explicitly inject spatial geometric awareness. Furthermore, a fixed-window streaming mechanism is designed for historical information processing, significantly improving long-term temporal processing efficiency and inference speed. Additionally, a coarse-to-fine decoding strategy is employed to progressively enhance trajectory precision. Extensive closed-loop experiments are conducted on the CARLA simulator and real-world vehicle platforms. The results demonstrate that our method achieves a driving score of 61.32 in the CARLA simulator and an average success rate of 88.70% in real-world experiments, validating the feasibility and effectiveness of the proposed algorithms.

2606.17376 2026-06-17 cs.RO cs.CV 新提交

Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

异构移动机器人上的非接触式呼吸监测:一种多模态边缘计算框架

Milind Rampure, Shadman Sakib, Haley Patel, Zahid Hasan, Nirmalya Roy

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)

AI总结 提出一种适用于异构移动机器人的多模态非接触式呼吸率监测框架,通过自适应传感器选择、关键点引导的ROI提取和信号质量过滤,在多种平台和光照条件下实现鲁棒监测,无需平台特定调参。

Comments 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026

详情
AI中文摘要

呼吸率监测是紧急响应、灾难恢复和传染病场景中远程分诊和受害者评估的关键组成部分,在这些场景中,最小化物理接触可以降低救援人员风险并提高操作安全性。然而,由于光照变化、姿势变化、平台异构性以及危险环境中可穿戴传感器的不实用性,非接触式呼吸率监测的现场部署仍然具有挑战性。在本文中,我们提出了一种适用于具有机载边缘计算的异构移动机器人的模态自适应非接触式呼吸率监测框架。所提出的系统结合了跨RGB、热成像、近红外和低光相机的亮度自适应传感器选择、用于姿势鲁棒监测的关键点引导胸部ROI提取,以及基于信号质量指数的滤波机制以实现可靠的呼吸估计。我们在三个机器人平台上实现并评估了该框架,涵盖四足和轮式运动以及多种边缘计算架构。在不同光照条件、受试者姿势和机器人到受试者距离下进行的实验表明,该框架无需针对每个平台进行算法重新调整即可跨平台泛化,同时揭示了模态特定的操作边界。RGB提供最广的覆盖范围,可达8米;近红外在6米内有效;热成像仅在短距离内可靠;低光传感支持在完全黑暗环境中监测,距离可达8米。总体而言,结果证明了在移动机器人上进行多模态非接触式呼吸率监测的可行性,并支持其作为危险搜救场景中自主分诊和受害者评估的基础。

英文摘要

Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8m, NIR remains effective up to 6m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
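
For readers unfamiliar with how a respiratory rate can be read off a chest-ROI intensity trace, the sketch below picks the dominant frequency inside a plausible breathing band using a plain discrete Fourier transform. It is a generic technique, not the framework's estimator: the function name, the 0.1 to 0.7 Hz band, and the synthetic test signal are assumptions, and the paper's SQI filtering and sensor selection are not modeled.

```python
import cmath
import math


def respiratory_rate_bpm(signal, fs, lo_hz=0.1, hi_hz=0.7):
    """Estimate breaths per minute as the strongest spectral peak in [lo_hz, hi_hz].

    signal: chest-ROI samples (e.g. mean pixel intensity per frame); fs: sampling rate in Hz.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]  # remove the DC offset before the DFT
    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if not lo_hz <= freq <= hi_hz:
            continue
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0


if __name__ == "__main__":
    fs = 10.0  # assumed frame rate in Hz
    # Synthetic 40 s trace breathing at 0.25 Hz.
    trace = [math.sin(2 * math.pi * 0.25 * i / fs) for i in range(400)]
    print(respiratory_rate_bpm(trace, fs))
```

The frequency resolution is fs divided by the number of samples, so longer observation windows give finer rate estimates at the cost of latency.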

2606.17936 2026-06-17 cs.RO 新提交

SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints

SPARK: 基于关键点的自动驾驶赛车低延迟单摄像头3D姿态估计

Dominic Ebner, Markus Lienkamp

发表机构 * Technical University of Munich(慕尼黑工业大学) School of Engineering & Design, Department of Mobility Systems Engineering, Institute of Automotive Technology(工程与设计学院,移动系统工程系,汽车技术研究所) Munich Institute of Robotics and Machine Intelligence (MIRMI)(慕尼黑机器人与机器智能研究所)

AI总结 提出SPARK算法,利用单摄像头和关键点检测实现自动驾驶赛车中低延迟、高精度的3D姿态估计,性能优于现有方法。

Comments 9 pages, 6 figures, ITSC 2026, Invited Session

详情
AI中文摘要

在自动驾驶赛车中,快速检测其他参与者的运动对于规划与非合作对手的安全无碰撞轨迹至关重要。LiDAR检测本质上比视觉方法更慢且更难部署在边缘设备上,导致检测延迟,限制了高动态机动中的目标跟踪性能。利用单目3D检测可以实现对赛道上其他参与者的易于部署、低延迟检测。我们提出了SPARK,一种基于关键点检测的自动驾驶赛车单摄像头姿态估计算法。它实现了高精度的远距离检测,超越了最先进的单目摄像头检测算法的性能,同时保持较低的延迟。通过使用经过良好优化的YOLO模型并利用自动驾驶赛车领域的固定几何结构,该算法还表现出低延迟和低资源使用率。我们在真实世界的自动驾驶赛车数据上评估了我们的方法性能,并将其与最先进的LiDAR和摄像头检测算法进行了比较。源代码可在以下网址获取:https://github.com/TUMFTM/SPARK-camera-det

英文摘要

In autonomous racing, fast detection of other participants' movements is required to plan safe, collision-free trajectories with non-cooperative opponents. LiDAR detection is inherently slower and harder to deploy on edge devices than vision methods, causing delayed detections that limit object tracking performance during high-dynamic maneuvering. Utilizing monocular 3D detection enables an easy-to-deploy, low-latency detection of other participants on the racetrack. We present SPARK, a single-camera pose-estimation algorithm for autonomous racing using keypoint detection. It achieves long-range detection with high accuracy, exceeding the performance of state-of-the-art monocular camera detection algorithms while maintaining lower latency. By employing well-optimized YOLO models and leveraging the fixed geometry in the autonomous racing domain, the algorithm also exhibits low latency and resource usage. We evaluate the performance of our approach on real-world autonomous racing data and compare it to state-of-the-art LiDAR and camera detection algorithms. The source code is available at: https://github.com/TUMFTM/SPARK-camera-det

2606.17241 2026-06-17 cs.CV cs.RO cs.SY eess.SY 交叉投稿

Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception

超越基准:面向细粒度路边感知的连续边缘推理

Aditya Mishra, Haroon Lone

发表机构 * Indian Institute of Science Education and Research Bhopal(印度科学教育与研究学院博帕尔分校)

AI总结 针对边缘推理在持续运行中的性能退化问题,提出Edge-TSR系统,集成检测、跟踪与轻量级时域稳定机制,在NVIDIA Jetson Orin Nano上实现实时路边感知,恢复高达10.16%的分类准确率。

详情
AI中文摘要

在资源受限的边缘硬件上进行连续AI推理会引入传统基准评估难以察觉的部署效应,包括流视频的时间不稳定性、持续负载下的热节流以及工作负载相关的性能变化。我们提出Edge-TSR,一个面向部署的连续边缘推理系统,用于在NVIDIA Jetson Orin Nano上进行持续的路边感知。Edge-TSR集成了检测、跟踪、细粒度分类以及轻量级的轨迹感知时域稳定机制,以最小的计算开销提高了流推理的一致性。我们的核心发现是,以基准为中心的评估系统性地高估了部署边缘推理的性能。在三个最先进的基线上,我们观察到从静态图像评估过渡到真实流部署时,性能一致下降20-30%。Edge-TSR通过时域推理稳定解决了这一差距,在持续运行下,相比逐帧推理基线,恢复了高达10.16%的分类准确率,同时保持了实时性能。我们在多种真实部署条件下评估了整个系统,联合表征了长时间运行期间的推理质量、延迟、吞吐量和热行为。在26公里路线上进行的55分钟车辆部署表明,在单个嵌入式设备上,无需云端卸载,即可在安全热限制内以16.18 FPS持续运行。我们的发现表明,部署感知评估和时域推理稳定是面向真实传感部署的持续运行边缘AI系统的必要组成部分。我们发布了一个带注释的流视频评估数据集样本和完整的系统实现,以支持可重复的以部署为中心的评估。

英文摘要

Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
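
One simple way to picture track-aware temporal stabilization is a per-track sliding-window majority vote over per-frame class labels, sketched below. The abstract does not describe Edge-TSR's actual mechanism, so the class name `TrackLabelStabilizer`, the window length, and the voting rule are assumptions.

```python
from collections import Counter, defaultdict, deque


class TrackLabelStabilizer:
    """Smooth flickering per-frame labels by voting within each object track."""

    def __init__(self, window: int = 5):
        # One bounded history of recent labels per track ID.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, track_id: int, frame_label: str) -> str:
        """Record this frame's label and return the most frequent recent label."""
        votes = self.history[track_id]
        votes.append(frame_label)
        return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    stabilizer = TrackLabelStabilizer(window=3)
    for label in ["stop", "stop", "yield", "stop"]:
        # The single misclassified "yield" frame is outvoted by its neighbours.
        print(stabilizer.update(track_id=1, frame_label=label))
```

The per-frame cost is one deque append and a count over a handful of labels, which is consistent with the idea of a stabilizer that adds little overhead on an embedded device.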

2606.17362 2026-06-17 cs.CV cs.AI cs.LG cs.RO 交叉投稿

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

DriveJudge: 用视觉-语言模型重新思考自动驾驶评估

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

发表机构 * NVIDIA(英伟达)

AI总结 提出DriveJudge,结合规则评估与VLM推理,通过选择性调用物理规则函数实现可解释且上下文感知的驾驶评估,在驾驶质量分类和轨迹偏好选择任务上超越现有方法。

Comments Under Review

详情
AI中文摘要

自动驾驶已转向端到端策略学习,其中可靠、可解释的策略评估是一个基本挑战,因为驾驶质量高度依赖于上下文。常用的基于规则的驾驶指标(如EPDMS)可解释但缺乏上下文感知,而近期基于VLM的评估虽具有上下文感知能力,但受限于模糊的VLM输出和较弱的物理基础。为了以既可解释又上下文感知的方式评估驾驶,我们引入了DriveJudge。DriveJudge是一个驾驶评估代理,它将规则基础评估与视觉-语言模型(VLM)推理相结合,并在解释环境上下文后有选择地调用基于物理的确定性规则函数。为了训练和评估DriveJudge,我们整理了一个包含33,577个具有挑战性的驾驶样本的大规模数据集,并附有人类标注,指示给定场景中的驾驶行为是否合理。利用该数据集,我们解决了驾驶指标评估中未被充分探索的问题,并引入了两个与人类对齐的基准任务:驾驶质量分类和轨迹偏好选择。DriveJudge在驾驶质量分类上比EPDMS高出21.23 AUC,在轨迹偏好选择上比近期基于VLM的DriveCritic高出6.5%,为可解释且精确的驾驶评估设立了新标准。

英文摘要

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

2606.17386 2026-06-17 cs.CV cs.AI cs.RO 交叉投稿

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

TerraTransfer: 无需专家示范的端到端驾驶策略学习

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition UCLA(加州大学洛杉矶分校) UC Berkeley(加州大学伯克利分校)

AI总结 提出一种无需专家示范的端到端驾驶方法,通过向量化模拟器中的自博弈预训练策略,再与预训练视觉骨干对齐,降低了数据成本并达到或超越现有方法。

详情
AI中文摘要

端到端自动驾驶在基准测试和实际部署中取得了最先进的性能。然而,其标准训练流程在所有阶段都成本高昂:收集和标注数百万驾驶帧代价昂贵,而在图像上进行闭环强化学习受限于每步的光真实感渲染和大视觉骨干的前向传播成本。在向量化模拟器中进行自博弈改变了经济性:每秒数百万次 rollout 步骤,状态分布自然包含碰撞、近碰撞和恢复等驾驶日志中不包含的情况。我们的方法通过解耦学习驾驶和学习视觉来利用这种不对称性。我们通过自博弈预训练单个策略,然后通过动作 KL 散度和批量关系低秩结构损失将其潜在空间与预训练视觉骨干对齐。动作目标来自自博弈策略,因此对齐从未对记录的轨迹进行监督:只需要一个(图像、场景状态)帧的配对数据集,无需模仿预训练所依赖的精心策划的专家示范。在光真实感 3D 高斯泼溅闭环场景中,得到的端到端策略匹配或超越了先前的端到端方法。

英文摘要

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
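
The "action KL divergence" term can be illustrated for the simplest case of two discrete action distributions, as below. The discrete action set, the direction KL(self-play policy || vision policy), and the function name are assumptions; the abstract does not state how the term is defined, and the low-rank structural loss is not shown.

```python
import math


def action_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over a discrete action set.

    teacher_probs: action distribution of the self-play policy (assumed target).
    student_probs: action distribution predicted from the image (assumed student).
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0  # terms with p == 0 contribute nothing
    )


if __name__ == "__main__":
    # Hypothetical distributions over (brake, keep, accelerate).
    print(action_kl([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))
```

Because the target distribution comes from a policy rather than a logged trajectory, a loss of this shape needs only paired (image, scene-state) frames, which matches the data requirement stated in the abstract.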

2308.14329 2026-06-17 cs.RO cs.AI 版本更新

SSIL: Self-Supervised Imitation Learning for End-to-End Driving

SSIL: 用于端到端驾驶的自监督模仿学习

Jin Bok Park, Jinkyu Lee, Muhyun Back, Hyun Min Han, Tianwei Ma, Sang Min Won, Sung Soo Hwang, Il Yong Chun

AI总结 提出自监督模仿学习框架SSIL,利用车辆位姿生成伪转向角数据,无需驾驶命令或预训练模型,结合交叉注意力条件方法CACA,在三个基准数据集上达到与监督学习相当的驾驶精度。

Comments 8 pages, 4 figures

详情
AI中文摘要

在自动驾驶中,直接从传感器数据预测车辆控制信号的端到端(E2E)驾驶方法正迅速受到关注。为了学习安全的E2E驾驶系统,需要大量的驾驶数据和人工干预。车辆控制数据由数小时的人类驾驶构建,构建大型车辆控制数据集具有挑战性。通常,公开可用的驾驶数据集是在有限的驾驶场景下收集的,而收集车辆控制数据仅由车辆制造商提供。为了解决这些挑战,本文提出了首个用于E2E驾驶的自监督学习框架——自监督模仿学习(SSIL)。所提出的SSIL框架可以在不使用驾驶命令数据或预训练模型的情况下学习基于视觉的E2E驾驶网络。为了构建伪转向角数据,提出的SSIL从当前和先前时间点通过激光雷达传感器估计的车辆位姿预测伪目标。此外,我们提出了一种新的基于交叉注意力的条件方法(CACA),用于E2E驾驶中的视觉编码器,其中高级指令作为视觉信息的条件信号。我们在三个不同基准数据集上的数值实验表明,所提出的SSIL框架实现了与监督学习对应方法非常相当的E2E驾驶精度。此外,所提出的伪标签预测器优于使用比例积分微分控制器的现有方法,并且所提出的CACA在现有条件方法中实现了优越的性能。

英文摘要

In autonomous driving, the end-to-end (E2E) driving approach that predicts vehicle control signals directly from sensor data is rapidly gaining attention. To learn a safe E2E driving system, one needs an extensive amount of driving data and human intervention. Vehicle control data is constructed from many hours of human driving, and it is challenging to construct large vehicle control datasets. Publicly available driving datasets are often collected with limited driving scenes, and vehicle control data can be collected only by vehicle manufacturers. To address these challenges, this paper proposes the first self-supervised learning framework, Self-Supervised Imitation Learning (SSIL), for E2E driving. The proposed SSIL framework can learn vision-based E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points, which are estimated with light detection and ranging (LiDAR) sensors. In addition, we propose a new cross-attention-based conditioning approach (CACA) for a vision encoder in E2E driving, where a high-level instruction serves as the conditioning signal for visual information. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves E2E driving accuracy comparable to that of its supervised learning counterpart. Furthermore, the proposed pseudo-label predictor outperformed an existing one using a proportional-integral-derivative controller, and the proposed CACA achieved superior performance over existing conditioning approaches.

2601.01762 2026-06-17 cs.RO cs.CV 版本更新

AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving

AlignDrive: 用于端到端自动驾驶的对齐横向-纵向规划

Yanhao Wu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan, Congpei Qiu, Liang Gao, Wei Ke, Tong Zhang

AI总结 本文提出一种 cascaded 框架,通过将纵向规划转化为路径条件推理过程,提升自动驾驶的协调性和安全性。方法引入锚点回归设计和规划导向的数据增强策略,实现在 Bench2Drive 上达到 SOTA 性能。

Comments Under review

详情
AI中文摘要

实用的自动驾驶需要能够通过时空可能性推理来排除不安全结果的模型。尽管最先进的方法使用并行规划架构,但它们未能明确将速度决策与路径上的代理行为联系起来,导致协调不优。为此,我们提出了一种级联框架,将纵向规划从独立预测任务转化为路径条件推理过程。在模型方面,我们引入基于锚点的回归设计,将纵向预测条件于横向驾驶路径,并将纵向规划重新表述为路径上的 1D 位移预测。这减少了几何不确定性,并使模型更专注于由交互驱动的动力学。在数据方面,我们引入了规划导向的数据增强策略,通过程序性插入代理和重标记纵向目标来模拟罕见的安全关键事件。在具有挑战性的 Bench2Drive 基准上评估,我们的方法在驾驶分数为 89.07 和成功率为 73.18% 的情况下实现了 SOTA 性能,证明了显著改进的协调性和安全性。进一步在 Fail2Drive 上的评估证实了在平行公式通常失败的罕见边缘情况下具有强大的泛化能力。项目页面:https://yanhaowu.github.io/AlignDrive/.

英文摘要

Practical autonomous driving requires models that generalize by reasoning through spatial-temporal possibilities to exclude unsafe outcomes. While state-of-the-art (SOTA) methods use parallel planning architectures, they fail to explicitly couple speed decisions with agent behavior along the driving path, leading to suboptimal coordination. To address this, we propose a cascaded framework that transforms longitudinal planning from an independent prediction task into a path-conditioned reasoning process. On the model side, we introduce an anchor-based regression design that conditions longitudinal prediction on the lateral drive path, and reformulate longitudinal planning as 1D displacement prediction along the path. This reduces geometric uncertainty and sharpens the model's focus on interaction-driven dynamics. On the data side, we introduce a planning-oriented data augmentation strategy that simulates rare safety-critical events by programmatically inserting agents and relabeling longitudinal targets to enforce collision avoidance. Evaluated on the challenging Bench2Drive benchmark, our method achieves SOTA performance with a driving score of 89.07 and a success rate of 73.18%, demonstrating significantly improved coordination and safety. Further evaluation on Fail2Drive confirms strong generalization to rare edge cases where parallel formulations typically fail. Project page: https://yanhaowu.github.io/AlignDrive/
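
The reformulation of longitudinal planning as "1D displacement prediction along the path" can be pictured as follows: once a lateral path is fixed, a single scalar (distance travelled along it) determines the 2D waypoint. The sketch below converts such a displacement into a point on a polyline; the function name and the polyline representation are assumptions, not the paper's implementation.

```python
import math


def point_at_arclength(path, s):
    """Return the (x, y) point reached after travelling distance s along a polyline.

    path: list of (x, y) vertices of the lateral path (assumed representation).
    s: predicted 1D displacement along the path; values past the end clamp to the last vertex.
    """
    remaining = max(s, 0.0)
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg > 0 and remaining <= seg:
            t = remaining / seg  # fraction of this segment to traverse
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        remaining -= seg
    return path[-1]


if __name__ == "__main__":
    lateral_path = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]  # hypothetical right-turn path
    # Two hypothetical longitudinal predictions, in metres along the path.
    for displacement in (1.5, 5.0):
        print(point_at_arclength(lateral_path, displacement))
```

With this parameterization every predicted waypoint lies on the chosen path by construction, so the speed decision cannot pull the vehicle off its lateral plan.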

2604.03120 2026-06-17 cs.CV cs.RO 版本更新

SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

SCC-Loc: 无人机热红外地理定位的统一语义级联共识框架

Xiaoran Zhang, Yu Liu, Jinyu Liang, Kangqiushi Li, Zhiwei Huang, Huaxin Xiao

AI总结 提出SCC-Loc框架,通过共享DINOv2骨干网络、语义引导视口对齐、级联空间自适应纹理结构滤波和共识驱动可靠性感知位置选择,解决热红外-可见光模态差异导致的特征模糊问题,实现零样本高精度绝对位置估计,平均定位误差9.37米。

Comments 17 pages, 5 figures. Submitted to IEEE J-STARS

详情
AI中文摘要

跨模态热红外地理定位(TG)为无人机在GNSS拒止环境中提供了鲁棒的全天候解决方案。然而,深刻的热红外-可见光模态差异引入了严重的特征模糊性,系统性地破坏了传统的由粗到精配准。为打破这一瓶颈,我们提出SCC-Loc,一个统一的语义-级联-共识定位框架。通过在全局检索和MINIMA$_{\text{RoMa}}$匹配中共享单个DINOv2骨干网络,它最小化内存占用并实现零样本、高精度的绝对位置估计。具体而言,我们通过引入三个协同组件来解决模态模糊性。首先,我们设计语义引导视口对齐(SGVA)模块,自适应优化卫星裁剪区域,有效校正初始空间偏差。其次,我们开发级联空间自适应纹理结构滤波(C-SATSF)机制,显式强制几何一致性,从而消除密集的跨模态离群点。最后,我们提出共识驱动可靠性感知位置选择(CD-RAPS)策略,通过物理约束位姿优化的协同作用推导出最优解。为解决数据稀缺问题,我们构建了Thermal-UAV数据集,提供11,890个多样化的热红外查询,并参考大规模卫星正射影像和相应的空间对齐数字表面模型(DSM)。大量实验表明,SCC-Loc建立了新的最先进水平,将平均定位误差抑制到9.37米,并在严格的5米阈值内比最强基线提供了7.6倍的精度提升。代码和数据集可在以下网址获取:https://github.com/FloralHercules/SCC-Loc

英文摘要

Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.

9. 软体机器人与硬件设计 3 篇

2606.17394 2026-06-17 cs.RO cs.LG 新提交

Damage Adaptation in Seconds for Architected Materials

结构材料的秒级损伤自适应

James Avtges, Jake Ketchum, Helena Young, Taekyoung Kim, Ryan Truby, Todd Murphey

发表机构 * Northwestern University(西北大学)

AI总结 提出LEAP方法,利用潜在损伤表示和集成学习,在软驱动系统中实现一分钟内对灾难性损伤的自适应,无需仿真。

Comments Proceedings of Robotics: Science and Systems

详情
AI中文摘要

对损伤的自适应和原位物理修复对于长期机器人自主性至关重要,但在狭义定义和良好预期的范围之外具有挑战性。在这项工作中,我们在软驱动系统中在一分钟内本体感知地适应灾难性损伤。结构材料非常适合自适应:执行器故障是逐渐发生而非急性,并且损伤可以在低维、离散坐标空间中描述。令人惊讶的是,潜在损伤表示加上简单而稳健的集成方法足以实时适应未见过的损伤。此外,我们确定了指数样本复杂度降低为线性样本复杂度的条件,用于结构材料的学习表示,这是相对于刚性组件或连续软机构的明显优势。我们通过基于手性剪切拉胀(HSA)执行器的6自由度软手腕的追踪任务,演示了我们的自适应本体感知方法LEAP。我们的算法能够适应切割、烧伤和执行器修复,实现了无仿真的实时自适应,这对于在实验室外实现软机器人的承诺至关重要。视频和更多信息请访问:https://murpheylab.github.io/leap

英文摘要

Adaptation to damages and in-situ physical repairs is essential for long-term robot autonomy, yet challenging outside of narrowly defined and well-anticipated bounds. In this work we proprioceptively adapt to catastrophic damage in soft-actuated systems in under one minute. Architected materials are well equipped for adaptation: actuator failure occurs gradually rather than acutely, and damage can be described in a low-dimensional, discrete coordinate space. Surprisingly, latent damage representations plus a simple yet robust ensemble method is sufficient for adapting to unseen damage in real-time. Moreover, we identify conditions under which exponential sample complexity collapses to linear sample complexity for learned representations of architected materials, a concrete advantage over rigid components or continuum soft mechanisms. We demonstrate LEAP, our method for adaptive proprioception, via a tracing task for a 6DoF soft wrist based on Handed Shearing Auxetic (HSA) actuators. Our algorithm is able to adapt to cuts, burns, and actuator repairs, enabling simulation-free real-time adaptation that is critical for realizing the promise of soft robots outside the lab. Videos and more information are available at https://murpheylab.github.io/leap.

2606.18144 2026-06-17 cs.AI cs.CY cs.LG cs.RO 交叉投稿

Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So

记忆作为消耗性资产:为具身智能体定价闪存耐久性及其局限性

Josef Liyanjun Chen

发表机构 * KAIKAKU

AI总结 本文提出将机器人闪存耐久性视为折旧资本,通过单一影子价格η进行定价,实现成本最优的存储层级分配,并基于真实机器人日志测量价值-写入关联χ的符号,发现其取决于部署场景。

详情
AI中文摘要

机器人的闪存耐久性是一种不可再生资源:每次持久化写入都会消耗数千次编程/擦除周期中的一次,且无法补充,然而目前没有实际部署的机器人内存系统对哪些记忆值得消耗一次擦除周期进行定价。我们将具身记忆视为折旧资本,并用单一耐久性影子价格η对该资源定价,这使得在RAM/板载NVM/云层级中进行成本最小化的放置成为一个在磨损增强的每字节索引中的阈值。无论价值-写入关联χ的符号如何,该索引都是成本最优的;只有当χ>0时,最优解才变为非单调,将机器人最有价值的记忆从闪存中移出。因此,关键点是经验性的,我们在预定义的关口上测量真实机器人日志中的χ:其符号是部署场景的一个属性——在重复的长时域操作中为正(χ̂≈+1.0×10^{-3},在全功率下可复现),在较短时域任务中为零,在非重复遥操作中为负。两个边界限制了该结果。在高端3,000 P/E TLC闪存按数据手册价格计算时,耐久性预算处于休眠状态;而在廉价边缘机器人使用的商用QLC/eMMC(约1,000 P/E)上则具有约束力。当约束生效时,学习到的磨损感知控制器仅在任务价值上与基于价格的路由持平,因为实现的价值在RAM、NVM和云层级之间是不变的:租金决定设备寿命和成本,而非任务性能。磨损感知放置是否能提高任务价值仍是一个开放问题——χ是针对价值代理测量的,而非单调最优解虽已被证明,但尚未在数据中观察到。

英文摘要

A robot's flash endurance is a non-renewable stock: every persisted write spends one of a few thousand program/erase cycles and never refills, yet no fielded robot memory system prices which memories are worth an erase cycle. We treat embodied memory as depreciating capital and price that stock with a single endurance shadow price $\eta$, which makes cost-minimizing placement across a RAM / on-board NVM / cloud hierarchy a threshold in a wear-augmented per-byte index. The index is cost-optimal whatever the sign of the value-write association $\chi$; only when $\chi > 0$ does the optimum turn non-monotone, sending a robot's most valuable memories off its flash. The pivot is thus empirical, and we measure $\chi$ on real robot logs at a pre-specified gate: its sign is a property of the deployment regime -- positive on recurrent long-horizon manipulation ($\hat{\chi} \approx +1.0 \times 10^{-3}$, replicated at full power), null on a shorter-horizon suite, and negative on non-recurrent teleoperation. Two boundaries scope the result. The endurance budget is dormant on premium 3,000-P/E TLC at datasheet prices and binding on the commodity QLC/eMMC ($\sim$1,000 P/E) that cheaper edge robots run. And where it binds, a learned wear-aware controller only ties price-based routing on task value, because realized value is tier-invariant across RAM, NVM, and cloud: the rent governs device lifetime and cost, not task performance. Whether wear-aware placement improves task value remains open -- $\chi$ is measured against a value proxy, and the non-monotone optimum, while proven, is not yet observed in data.

2601.19098 2026-06-17 cs.RO 版本更新

SimTO: A two-stage, simulation-driven topology optimization framework for bespoke soft robotic grippers

SimTO:一种面向定制软体机器人夹爪的两阶段仿真驱动拓扑优化框架

Kurt Enkera, Josh Pinskier, Marcus Gallagher, David Howard

AI总结 提出SimTO框架,通过两阶段仿真驱动拓扑优化自动提取接触载荷,为特征丰富物体定制软体夹爪,实验证明其抓取力优于传统方法且泛化性强。

Comments 15 pages, 9 figures. Published in Structural and Multidisciplinary Optimization

详情
AI中文摘要

软体机器人夹爪对于在制造业、医疗保健和农业中抓取精致、几何形状复杂的物体至关重要。然而,现有设计难以抓取具有高拓扑变异性、特征丰富的物体,包括汽车装配线上具有锋利齿廓的齿轮、带有脆弱突起的珊瑚,或像西兰花这样具有不规则分支结构的蔬菜。与立方体或球体等简单几何基元不同,特征丰富的物体缺乏明确的“最佳”接触表面,因此既难以抓取又容易受损。因此,安全处理此类物体需要专门设计的软体夹爪,其形态需针对物体特征进行定制。拓扑优化为生产专用夹爪提供了一种有前景的方法,但其效用受限于需要预定义载荷工况。对于软体夹爪,这些载荷来自抓取过程中数百种不可预测的夹爪-物体接触力,且先验未知。为解决此问题,我们引入了SimTO,这是一个两阶段、仿真驱动的拓扑优化框架,它能在执行经典拓扑优化之前,从动态、富含接触的抓取仿真中自动提取载荷工况,从而消除了手动指定载荷的需求。给定任意特征丰富的物体,SimTO能生成高度定制的软体夹爪,其细粒度形态特征针对物体几何形状进行定制。物理实验证实,我们的专用夹爪比传统拓扑优化方法生成的通用设计实现了更高的抓取力,而数值实验表明,它们在不同物体姿态下实现了高抓取成功率,并对一组未见过的物体具有很强的泛化能力。

英文摘要

Soft robotic grippers are essential for grasping delicate, geometrically complex objects in manufacturing, healthcare and agriculture. However, existing designs struggle to grasp feature-rich objects with high topological variability, including gears with sharp tooth profiles on automotive assembly lines, corals with fragile protrusions, or vegetables with irregular branching structures like broccoli. Unlike simple geometric primitives such as cubes or spheres, feature-rich objects lack a clear "optimal" contact surface, making them both difficult to grasp and susceptible to damage. Safe handling of such objects therefore requires specialized soft grippers whose morphology is tailored to the object's features. Topology optimization offers a promising approach for producing specialized grippers, but its utility is limited by the need for pre-defined load cases. For soft grippers, these loads arise from hundreds of unpredictable gripper-object contact forces during grasping and are unknown a priori. To address this problem, we introduce SimTO, a two-stage, simulation-driven topology optimization framework that automatically extracts load cases from a dynamic, contact-rich grasping simulation before performing classical topology optimization, eliminating the need for manual load specification. Given an arbitrary feature-rich object, SimTO produces highly customized soft grippers with fine-grained morphological features tailored to the object geometry. Physical experiments confirm that our specialized grippers achieve higher grasp forces than a generalist design produced by conventional topology optimization methods, while numerical experiments show that they achieve high grasp success rates across varying object poses and strong generalization to a set of unseen objects.

10. 仿真、数据集与评测 17 篇

2606.17080 2026-06-17 cs.RO cs.AI cs.CV 新提交

HRDX: A Large-Scale Vector HD-Map Dataset

HRDX:大规模矢量高清地图数据集

Sahith Reddy Chada, Isht Dwivedi, Nirav Savaliya

发表机构 * Honda Research Institute US(本田美国研究院)

AI总结 提出HRDX大规模矢量高清地图数据集,覆盖1400公里驾驶数据,含10类地图元素和20多种属性,并引入复合评分评估几何与属性准确性。

Comments https://usa.honda-ri.com/hrdx

详情
AI中文摘要

可靠的自动驾驶需要矢量化的高清地图,这些地图应具有几何精确性、语义丰富性,并能够扩展到长距离驾驶。然而,现有的公开高清地图数据集规模有限,提供的语义属性稀疏,并且缺乏诸如航拍图像等能够开启新研究方向的模态。我们提出了HRDX,一个用于矢量高清地图构建的大规模数据集,涵盖约40小时(1400公里)的最小重叠驾驶,比之前的公开高清地图数据集大数倍。数据使用六个同步环视摄像头、一个128线激光雷达和厘米级RTK GNSS/IMU捕获,并辅以精确对齐的航拍正射影像。标注涵盖10个矢量地图类别,并补充了20多个语义和拓扑属性。为了评估这一更丰富的本体,我们引入了复合评分(CS)来联合评估几何保真度和属性正确性。基准实验表明,HRDX的规模改善了在线矢量地图构建,并且对齐的航拍图像提供了有用的结构先验:在训练和/或推理中使用航拍图像可提高几何地图质量,而航拍增强的教师可以将部分优势转移给仅使用摄像头的学生,而无需增加推理时的传感器需求。HRDX旨在支持大规模高清地图学习、多模态BEV融合以及训练时特权信息的可重复研究。HRDX数据集和基准可在以下网址获取:https://github.com/honda-research-institute/HRDX

英文摘要

Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX's scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at https://github.com/honda-research-institute/HRDX

2606.17200 2026-06-17 cs.RO 新提交

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

ACE-Ego-0:统一第一人称人类与机器人数据用于VLA预训练

Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li

发表机构 * ACE Robotics CUHK MMLab(香港中文大学多媒体实验室) CUHK, Shenzhen(香港中文大学(深圳)) SJTU(上海交通大学) THU(清华大学)

AI总结 提出ACE-EGO-0框架,通过可扩展的第一人称视频到动作管道和可靠性感知训练目标,统一人类与机器人数据用于VLA预训练,在多个基准上达到最优性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型受益于大规模和多样化的具身数据,但收集机器人轨迹成本高昂且劳动密集。最近的进展表明,大规模第一人称人类视频在预训练中提供了互补的真实世界监督。然而,由于动作空间、具身结构、时间动态和监督质量的差异,联合训练人类和机器人数据仍然具有挑战性。我们引入了ACE-EGO-0,一个统一VLA预训练框架,联合利用异构数据源。为了从第一人称人类视频中提取大规模预训练监督,我们构建了一个可扩展的第一人称视频到动作管道,将原始人类视频转换为机器人格式的伪动作轨迹。为了使这些标签与机器人演示可比,ACE-EGO-0使用基于相机空间动作、形态条件化和时间对齐动作分块的统一动作表示。为了稳健地利用来自第一人称人类视频的噪声伪动作监督,我们制定了一个可靠性感知训练目标,并带有一项人类辅助损失,将监督集中在可靠信号上。我们在4.53K小时的机器人和模拟数据以及1.48K小时的伪动作标记的第一人称人类数据上实例化ACE-EGO-0。实验表明,在可靠性感知加权下纳入大规模人类监督一致地改进了统一联合预训练和监督微调。ACE-EGO-0在RoboCasa GR1 TableTop和RoboTwin 2.0上达到了最先进的性能,并展示了向真实世界双臂操作的强迁移能力。

英文摘要

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
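The abstract does not give the form of the reliability-aware objective, so the following Python sketch only illustrates the general idea of down-weighting noisy pseudo-action labels. The function name, the `aux_weight` parameter, and the normalized weighted mean are all assumptions, not the ACE-EGO-0 formulation.

```python
def reliability_weighted_loss(robot_losses, human_losses, human_reliability,
                              aux_weight=0.5):
    """Combine robot losses with a reliability-weighted human auxiliary term.

    robot_losses:      per-sample losses on robot demonstrations
    human_losses:      per-sample losses on pseudo-action-labeled human clips
    human_reliability: hypothetical per-sample weights in [0, 1]
    aux_weight:        hypothetical scale of the human auxiliary term
    """
    robot_term = sum(robot_losses) / len(robot_losses)
    total_weight = sum(human_reliability)
    if total_weight > 0:
        # Weighted mean: unreliable samples contribute little or nothing.
        human_term = sum(w * l for w, l in zip(human_reliability, human_losses))
        human_term /= total_weight
    else:
        human_term = 0.0
    return robot_term + aux_weight * human_term
```

In this toy version a clip with reliability 0 is ignored entirely, so a single badly labeled human sample cannot dominate the auxiliary term.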

2606.17385 2026-06-17 cs.RO 新提交

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

EgoInfinity: 一个面向任意视角机器人重定向与视频到动作机器人学习的网络规模4D手物交互数据引擎

Gaotian Wang, Kejia Ren, Andrew Morgan, Yiting Chen, Howard H. Qian, Podshara Chanrungmaneekul, Kaiyu Hang

发表机构 * Rice University(莱斯大学) Robotics and AI Institute(机器人与人工智能研究所)

AI总结 提出EgoInfinity引擎,从互联网视频自动生成4D手物交互数据,实现跨机器人形态的动作重定向与技能学习,无需人工标注。

Comments 24 pages. Project page: https://huggingface.co/spaces/Rice-RobotPI-Lab/EgoInfinity

详情
AI中文摘要

互联网视频构成了具身人类操作知识的最大储备,然而将任意RGB视频转化为可操作的机器人训练数据仍然是一个主要瓶颈。现有的实验室或工厂收集的数据集在规模和多样性上有限,限制了开放世界机器人学习。我们不提出静态数据集,而是引入EgoInfinity,一个通用的4D手物交互数据引擎,能够为机器人重定向和学习生成网络规模的数据。EgoInfinity是一个模块化引擎,集成了感知、分割、重建、交互感知精炼和重定向,以自动化这一传统上不可扩展的视频到动作问题,无需人工循环标注。其模块化设计使引擎能够持续受益于任何集成组件的进步。通过EgoInfinity,野外人类操作视频被提升为与智能体无关的度量4D手物表示,包括手部轨迹、6自由度物体姿态和接触相关状态。EgoInfinity不是简单连接独立组件,而是结合跨模块度量校准与交互感知精炼,以提高物理可靠性,减少纯视觉重建中常见的漂移和接触不一致。我们进一步提出一种新颖的运动重定向器,将恢复的3D手部运动编译为适用于不同机器人形态的可执行关节轨迹,从而实现从任意视角和镜头尺寸(例如,人体仅部分可见)下任意机器人的视频到动作重定向。我们在感知保真度、运动学可行性、接触一致性、跨形态泛化以及真实机器人技能获取(例如,抓取、切割、擦拭和倒水)方面验证了EgoInfinity,展示了从互联网视频到可执行机器人行为的可扩展桥梁,用于开放世界机器人学习。

英文摘要

Internet videos constitute the largest reservoir of embodied human manipulation knowledge, yet converting arbitrary RGB footage into actionable robot training data remains a major bottleneck. Existing lab- or factory-collected datasets are narrow in scale and diversity, limiting open-world robot learning. Instead of proposing a static dataset, we introduce EgoInfinity, a universal 4D hand-object interaction data engine that enables web-scale data generation for robot retargeting and learning. EgoInfinity is a modular engine integrating perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate this traditionally unscalable video-to-action problem without human-in-the-loop annotation. Its modular design lets the engine continuously benefit from advances in any incorporated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic, metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-relevant states. Rather than naively connecting standalone components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability, reducing drift and contact inconsistencies common in pure visual reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motions into executable joint trajectories for diverse robot morphologies, enabling video-to-action retargeting on any robot from arbitrary viewpoints and shot sizes (e.g., the human body is only partially visible). We validate EgoInfinity across perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (e.g., grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behavior for open-world robot learning.

2606.17446 2026-06-17 cs.RO cs.CV 新提交

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

AnnotateAnything:面向机器人操作的3D资产自动标注

Haoran Lu, Mutian Shen, Shuyang Yu, Yu Xiao, Songling Liu, Jianshu Zhang, Shang Wu, Yue Chen, Guo Ye, Jiayi Wang, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Peking University(北京大学)

AI总结 提出AnnotateAnything框架,通过视觉-语言标注和物理标注双流水线,自动为3D资产生成可执行操作标签,提升仿真数据收集效率和任务成功率。

详情
AI中文摘要

仿真使得可扩展的机器人数据收集成为可能,但原始3D资产仅提供几何信息,缺乏指定机器人应在何处以及如何操作的语义、交互和物理知识。在这项工作中,我们提出了AnnotateAnything,一个通用的自动标注框架,将被动3D资产转换为具有结构化、多样化和可执行操作标签的、可用于操作的资产。AnnotateAnything围绕两个互补的流水线构建。首先,一个统一的视觉-语言标注流水线,利用视觉-语言推理来推断对象语义、交互约束和3D接地线索,为识别有意义的交互区域提供人类先验指导。其次,一个全自动且大规模并行的物理标注流水线,通过候选生成、几何优化和轨迹生成,将这些先验知识嵌入每个资产的几何和物理约束中。该流水线生成多样且可执行的动作标注,包括抓取姿态、灵巧接触、关节运动路径点、插入方向、悬挂可供性和导航目标。利用生成的标注,我们进一步构建了一个跨不同对象、任务和机器人形态的异步并行仿真数据收集系统。实验表明,与现有的标注和数据生成流水线相比,AnnotateAnything在标注效率、数据收集效率和任务成功率方面均表现优越,同时支持下游任务如可供性检测、机器人VQA和视觉指令微调。我们在项目页面上提供项目材料,并计划发布完整代码、标注和基准以促进未来研究。视频、代码、演示资产和标注在补充材料中提供。项目页面:https://tourmaline-caramel-169490.netlify.app

英文摘要

Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified vision-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization, and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials. Project page: https://tourmaline-caramel-169490.netlify.app.

2606.17511 2026-06-17 cs.RO cs.AI cs.CV 新提交

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

MagicSim: 可执行具身交互的统一基础设施

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Peking University(北京大学) University of California, Berkeley(加州大学伯克利分校) ShanghaiTech University(上海科技大学)

AI总结 提出MagicSim,一个基于确定性批处理运行时和共享MDP的具身交互基础设施,通过YAML规范解耦内容、放置、行为和智能体暴露,统一世界构建、执行、评估和自动生成轨迹。

详情
AI中文摘要

机器人学习和具身智能体现在需要模拟作为连接控制、技能和规划的共享执行基底,而不仅仅是渲染器、控制器测试平台或固定任务环境。现有的流水线通过“魔法”动作、脱节的训练环境或仅前向渲染来分割这些层,无法重现、评估和标注同一情节。我们提出MagicSim,一个围绕确定性批处理运行时和共享马尔可夫决策过程(MDP)构建的具身交互基础设施。通过YAML优先的规范解耦内容、放置、行为和智能体暴露,MagicSim在单一重置-步进循环中构建多样化的可执行世界,涵盖任务族、交互模式、物理、布局、传感器、化身和机器人具身。一个通用的执行接口通过控制器、原子技能、规划器原语和异步规划将高级命令具体化,将其实现为机器人动作而非模拟器端的状态编辑。一个任务定义支持三种能力:基准测试和强化学习评估、自动收集接口(自动将命令转化为具体轨迹)以及面向智能体/VLM的交互。对于自动执行,命令流经Command->Skill->Planner->Robot->Record流水线,而每个环境的命令、技能、规划、重试、标注和情节状态在共享物理滴答之上独立推进。成功的展开被保存为结构化的多模态轨迹,将语言监督、动作表示、视觉/几何表示和任务级别状态与执行的情节对齐。因此,MagicSim在一个规划器在环运行时中统一了多样化的世界构建、具身执行、任务评估、自动展开生成和交互式智能体接口。

英文摘要

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.
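The Command->Skill->Planner->Robot->Record flow named in the abstract can be pictured with a toy Python sketch. Nothing here is MagicSim's API: the function names, the lookup table, and the record layout are hypothetical, and retries, batching, and the physics tick are left out.

```python
from typing import Callable


def command_to_skills(command: str) -> list[str]:
    # Hypothetical lookup from a language command to a sequence of atomic skills.
    table = {"put cup on shelf": ["pick cup", "place cup on shelf"]}
    return table.get(command, [])


def plan(skill: str) -> list[str]:
    # Hypothetical planner: expand one skill into two low-level actions.
    return [f"{skill}: waypoint {i}" for i in range(2)]


def run_episode(command: str, execute: Callable[[str], bool]) -> dict:
    """Run command -> skills -> planned actions -> robot, recording every step."""
    skills = command_to_skills(command)
    record = {"command": command, "steps": [], "success": bool(skills)}
    for skill in skills:
        for action in plan(skill):
            ok = execute(action)  # the "robot" stage, supplied by the caller
            record["steps"].append({"skill": skill, "action": action, "ok": ok})
            if not ok:
                # Stop at the first failed action; this sketch has no retry logic.
                record["success"] = False
                return record
    return record
```

Calling `run_episode("put cup on shelf", lambda action: True)` yields a record with four steps aligned to the command, which mirrors the abstract's point that the saved trajectory stays tied to the executed episode.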

2606.17520 2026-06-17 cs.RO cs.CV 新提交

GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

GASE:基于高斯溅射的自动化系统用于重建具身仿真环境

Jiawei Zhang, Yiming Yan, Chao Liang, Nuo Xu, Seson Sun, Qichen Zhang, Yuhao Xu, Yantai Yang, Yingqiao Wang, Qin Jin, Zhipeng Zhang

发表机构 * AutoLab, SAI, Shanghai Jiao Tong University(上海交通大学SAI学院AutoLab实验室) AIM3 Lab, School of Information, Renmin University of China(中国人民大学信息学院AIM3实验室) Research Lab, Anyverse Dynamics(Anyverse Dynamics研究实验室)

AI总结 提出GASE系统,利用全景相机阵列和多视图视频流,通过相机位姿策略提取前景物体并修复场景,独立重建后导入物理仿真器,实现高效高保真仿真环境构建,分割精度提升超10%,真实机器人部署性能差距小于10%。

详情
AI中文摘要

在现实世界中训练具身代理需要熟练的操作人员和昂贵的硬件。仿真环境通过实现大规模、低成本的数据增强提供了一种引人注目的替代方案。因此,快速构建具有最小仿真到现实差距的高保真仿真场景已成为机器人学习的关键目标。尽管基于重建的方法提供了优越的视觉质量,但当前的工作流程受到低效的数据采集和次优的前景物体提取的阻碍。因此,我们提出了GASE,一个高度自动化的仿真场景构建系统。GASE利用全景相机阵列的多视角视频流实现快速环境扫描。为确保高质量的资产生成,我们的流程引入了一种基于相机位姿的策略,在2D域中跨帧鲁棒地提取物体,随后进行高保真场景修复。前景物体和静态背景随后被独立重建,并无缝导入物理仿真器用于策略训练。大量实验表明,GASE在分割精度上比现有的基于3D高斯的方法提高了超过10%,同时实现了最先进的修复质量。此外,在操作和导航任务中的真实机器人部署保持了与纯真实世界数据训练策略相比低于10%的性能差距。这些结果证实GASE为弥合仿真到现实差距提供了高效且高度有效的解决方案。代码将发布。

英文摘要

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

2606.17833 2026-06-17 cs.RO 新提交

HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning

HumanoidArena: 以自我为中心的层级全身学习基准

Taowen Wang, Zikang Xie, Bin Yang, Yunheng Wang, Zizhao Yuan, Yuetong Fang, Yixiao Feng, Yichi Wang, Xingyu Chen, Haodong Chen, Qiwei Wu, Weisheng Xu, Lihan Chen, Lusong Li, Zecui Zeng, Renjing Xu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Beijing University of Technology(北京工业大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen MSU-BIT University(深圳北理莫斯科大学) JD Explore Academy(京东探索研究院)

AI总结 提出HumanoidArena基准,通过层级控制(高层策略输出全身动作,低层通用运动跟踪器执行)解决人形机器人全身交互学习问题,设计7个腿部关键任务评估策略的泛化与迁移能力。

Comments 29 pages, 13 figures, 10 tables

详情
AI中文摘要

人形机器人有望在人类中心环境中实现全身交互,但由于任务级决策与全身动态执行紧密耦合,可扩展的策略学习仍然困难。一个实用的解决方案是层级控制,其中高层策略预测中间全身动作,低层通用运动跟踪器(GMT)将其执行为稳定的人形运动。然而,现有基准很少评估策略-跟踪器接口本身,因此尚不清楚中间全身动作是否可执行、在任务分布变化下是否鲁棒以及是否可跨不同GMT后端迁移。我们引入HumanoidArena,一个以自我为中心的层级全身学习的仿真优先基准。该基准将策略学习形式化为一个层级决策问题:高层策略将自我中心视觉、本体感觉和指令转换为紧凑的全身动作,随后由低层GMT执行。HumanoidArena不将腿部视为平面运输工具,而是强调下肢协调在任务完成中结构上必要的交互。因此,我们设计了7个腿部关键的人-物交互/人-场景交互(HOI/HSI)任务,其中成功需要足部放置、平衡维持、姿势调整和全身重新定向。为了进一步诊断层级系统,我们从两个互补角度评估策略:扰动条件泛化和GMT条件迁移。实验表明,层级控制使学习策略能够解决多样的腿部关键交互,但性能强烈依赖于跟踪器,且跨GMT迁移仍然脆弱。这些结果使HumanoidArena成为研究可迁移中间动作表示和可扩展的自我中心全身策略学习的基准。

英文摘要

Humanoid robots promise whole-body interaction in human-centered environments, but scalable policy learning remains difficult because task-level decision-making and whole-body dynamic execution are tightly coupled. A practical solution is hierarchical control, where a high-level policy predicts intermediate whole-body actions and low-level general motion trackers (GMTs) execute them as stable humanoid motion. However, existing benchmarks rarely evaluate the policy-tracker interface itself, leaving open whether intermediate whole-body actions are executable, robust under task distribution shifts, and transferable across different GMT backends. We introduce HumanoidArena, a simulation-first benchmark for egocentric hierarchical whole-body learning. The benchmark formulates policy learning as a hierarchical decision making problem: a high-level policy converts egocentric vision, proprioception, and instructions into a compact whole-body action, which is subsequently executed by a low-level GMT. Instead of treating the legs as planar transport tools, HumanoidArena emphasizes interactions where lower-body coordination is structurally necessary in task completion. We therefore design 7 leg-critical HOI/HSI tasks in which success requires foot placement, balance maintenance, posture adjustment, and whole-body reorientation. To further diagnose the hierarchical system, we evaluate policies from two complementary perspectives: perturbation-conditioned generalization and GMT-conditioned transfer. Experiments show that hierarchical control enables learned policies to solve diverse leg-critical interactions, but performance is strongly tracker-conditioned and cross-GMT transfer remains fragile. These results position HumanoidArena as a benchmark for studying transferable intermediate action representations and scalable egocentric whole-body policy learning.

2606.18097 2026-06-17 cs.RO 新提交

WireCraft: A Simulation Benchmark for Industrial DLO Manipulation

WireCraft:工业DLO操作仿真基准

Chongyu Zhu, Ramy ElMallah, Hyegang Kim, Zachary Tang, Jiachen Rao, Artem Arutyunov, Seungyeon Ha, Chi-Guhn Lee

发表机构 * Department of Mechanical and Industrial Engineering, University of Toronto(多伦多大学机械与工业工程系) Department of Computer Science, University of Toronto(多伦多大学计算机科学系) CREFLE Inc.(CREFLE公司)

AI总结 针对工业中可变形线性物体(DLO)操作缺乏统一基准的问题,提出WireCraft仿真基准,支持可配置难度和资产,涵盖三种任务族,并评估强化学习、模仿学习和视觉-语言-动作策略。

详情
AI中文摘要

可变形线性物体(DLO),如电线和电缆,是工业装配的核心。与刚体不同,刚体的状态由6自由度位姿捕获,而DLO具有无限维配置空间,并在与夹爪、夹具和工作空间的接触下连续变形,使其成为通用灵巧操作的一个高要求基准。尽管其重要性,策略开发和比较仍然困难:现有基准通常绑定到特定硬件设置,缺乏模块化和可定制的任务资产,或者研究没有真实世界工业线缆操作相关夹具的通用可变形物体任务。很少有基准将仿真、真实世界数据和共享评估协议对齐。为弥合这一差距,我们引入了WireCraft,一个用于工业DLO操作的仿真基准,具有可配置的难度和资产,涵盖三个任务族:连接器插入、夹子布线和通道就位。它支持两种互补的DLO物理模型——铰接式和可变形式,轨迹来自仿真和物理UR5。我们在共享指标下对强化学习(RL)、模仿学习(IL)和视觉-语言-动作(VLA)策略进行基准测试。基于特权状态的RL在每个任务族的一个代表性设置中实现了超过82%的成功率,确认了任务的良好定义。然而,对于连接器插入,从到达插座到接触丰富的对齐的过渡仍然是视觉RL、IL和VLA策略的关键瓶颈。这些结果表明,工业DLO操作虽然在特权状态下可处理,但对于当前基于视觉的学习仍然是一个开放的挑战。基准、数据和工具将在接收后开源。

英文摘要

Deformable Linear Objects (DLOs), such as wires and cables, are central to industrial assembly. Unlike rigid objects, whose state is captured by a 6-DoF pose, DLOs have an infinite-dimensional configuration space and deform continuously under contact with grippers, fixtures, and the workspace, making them a demanding benchmark for general dexterous manipulation. Despite their importance, policy development and comparison remain difficult: existing benchmarks are often tied to specific hardware setups, lack modular and customizable task assets, or study generic deformable-object tasks without the fixtures relevant to real-world industrial wire manipulation. Few benchmarks align simulation, real-world data, and shared evaluation protocols. To bridge this gap, we introduce WireCraft, a simulation benchmark for industrial DLO manipulation with configurable difficulty and assets, spanning three task families: connector insertion, clip routing, and channel seating. It supports two complementary DLO physics models, articulated and deformable, and the trajectories come from both simulation and a physical UR5. We benchmark reinforcement learning (RL), imitation learning (IL), and vision-language-action (VLA) policies under shared metrics. Privileged state-based RL solves a representative setting in each task family with over 82% success, confirming the tasks are well-posed. For connector insertion, however, the transition from reaching the socket to contact-rich alignment remains a key bottleneck for vision RL, IL, and VLA policies. These results indicate that industrial DLO manipulation, though tractable under privileged state, remains an open challenge for current vision-based learning. The benchmark, data, and tools will be open-sourced upon acceptance.

2606.18239 2026-06-17 cs.RO 新提交

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

EBench: 通用移动操作策略的要素诊断

Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Xi’an Jiaotong University(西安交通大学) Institute for AI Industry Research (AIR), Tsinghua University(清华大学智能产业研究院) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学)

AI总结 提出EBench基准,从5个能力和4个泛化维度诊断通用移动操作模型,揭示不同模型在成功率相近时能力差异显著。

详情
AI中文摘要

我们提出EBench,一个仿真基准,用于诊断通用移动操作策略,超越单一的成功率标量。EBench包含26个多样且具有挑战性的操作任务,沿5个能力维度和4个泛化维度进行标注。我们评估了最先进的通用操作模型,包括$\pi_0$、$\pi_{0.5}$、XVLA和InternVLA-A1,并揭示出成功率相近的模型展现出截然不同的能力轮廓:$\pi_{0.5}$实现了最高的测试成功率和最佳的训练-测试保持率,而InternVLA-A1在移动操作上占主导地位,但在灵巧任务上崩溃,XVLA与其他策略相比在一组不相交的原子技能上表现出优势。除了能力轮廓分析,EBench还从4个代表性角度分析了泛化能力,识别了不同分布偏移因素的影响。结果揭示了模型在总体得分背后的优势和弱点。我们希望这个基准能提供广泛的诊断信号,以指导通用操作模型的迭代。

英文摘要

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
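The point that one success-rate scalar can hide very different capability profiles is easy to show with a small Python sketch. The tag names and the two result lists below are invented for illustration and are not EBench data or its scoring code.

```python
from collections import defaultdict


def capability_profile(results):
    """Per-capability success rate from (capability_tags, success) episode pairs."""
    hits = defaultdict(int)
    trials = defaultdict(int)
    for tags, success in results:
        for tag in tags:
            trials[tag] += 1
            hits[tag] += int(success)
    return {tag: hits[tag] / trials[tag] for tag in trials}


def overall_success(results):
    """Single scalar: fraction of successful episodes."""
    return sum(int(success) for _, success in results) / len(results)


# Two hypothetical policies with the same overall score.
model_a = [(["mobile"], True), (["mobile"], True),
           (["dexterous"], False), (["dexterous"], False)]
model_b = [(["mobile"], True), (["mobile"], False),
           (["dexterous"], True), (["dexterous"], False)]
```

Both toy policies succeed in half of their episodes, yet the first one fails every dexterous episode while the second is uniformly mediocre, which is the kind of difference a per-dimension breakdown exposes.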

2606.18231 2026-06-17 cs.CV cs.LG cs.RO 交叉投稿

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

自适应体积力学属性场:分辨率无关

Rishit Dagli, Donglai Xiang, Vismay Modi, Xuning Yang, Gavriel State, David I. W. Levin, Maria Shugrina

发表机构 * NVIDIA(英伟达)

AI总结 提出AdaVoMP方法,利用稀疏自适应体素结构和自回归Transformer编解码器,为3D物体预测高分辨率空间变化的杨氏模量、泊松比和密度,相比现有技术分辨率提升16^3倍且更准确。

Comments Project Page and hi-res paper: https://research.nvidia.com/labs/sil/projects/adavomp/. ICML 2026

详情
AI中文摘要

精确的力学属性(或材料)杨氏模量($E$)、泊松比($ν$)和密度($ρ$)对于数字世界的可靠物理模拟至关重要,但大多数3D资产缺乏这些信息。我们提出AdaVoMP,一种预测输入3D物体跨表示形式的精确密集空间变化($E$,$ν$,$ρ$)的方法,在分辨率、准确性和内存效率上优于现有技术。我们技术的基础是一种稀疏自适应体素结构SAV,它能高效地表示输入3D形状和材料场输出。我们将最准确的先前方法VoMP的固定体素模型替换为一种新颖的稀疏Transformer编码器-解码器模型,该模型学习为每个输入形状自回归地生成唯一的SAV来表示其材料,实现比先前技术高$16^3$倍的分辨率。实验表明,即使测试时计算量少于所有先前技术,AdaVoMP也能估计出更准确的体积属性。这使得我们能够将高分辨率复杂3D物体转换为可模拟的资产,从而实现逼真的可变形模拟。

英文摘要

Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($ν$), and density ($ρ$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate dense spatially-varying ($E$, $ν$, $ρ$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state-of-the-art. The foundation of our technique is a sparse and adaptive voxel structure SAV that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

2603.25937 2026-06-17 cs.RO cs.LG 版本更新

Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

视觉基础模型能否导航?零样本真实世界评估与经验教训

Maeva Guerrier, Karthik Soma, Jana Pavlasek, Giovanni Beltrame

发表机构 * Polytechnique Montreal(蒙特利尔理工学院)

AI总结 本文对五种视觉导航模型在真实环境中进行零样本评估,发现其存在几何理解不足、感知混淆和分布漂移等系统性问题,并公开评估代码与数据集。

详情
AI中文摘要

视觉导航模型(VNMs)通过从大规模视觉演示中学习,有望实现通用化的机器人导航。尽管在真实世界部署日益增多,现有评估几乎完全依赖成功率(机器人是否到达目标),这掩盖了轨迹质量、碰撞行为以及对环境变化的鲁棒性。我们针对五种最先进的VNMs(GNM、ViNT、NoMaD、NaviBridger和CrossFormer)在两个机器人平台和五个室内外环境中进行了真实世界评估。除了成功率,我们结合了基于路径的指标与基于视觉的目标识别分数,并通过受控图像扰动(运动模糊、太阳眩光)评估鲁棒性。我们的分析揭示了三个系统性问题:(a) 即使是架构复杂的扩散和Transformer模型也频繁发生碰撞,表明几何理解有限;(b) 模型无法区分感知相似但存在语义差异的不同位置,导致在重复环境中出现目标预测错误;(c) 在分布偏移下性能下降。我们将公开发布评估代码和数据集,以促进VNMs的可重复基准测试。

英文摘要

Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between different locations that are perceptually similar despite semantic differences, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.

2506.05797 2026-06-17 cs.LG cs.CE cs.RO 版本更新

EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator

EqCollide: 等变且碰撞感知的可变形物体神经模拟器

Qianyi Chen, Tianrun Gao, Chenbo Jiang, Tailin Wu

发表机构 * Westlake University(西湖大学) Fudan University(复旦大学) Tongji University(同济大学) McGill University(麦吉尔大学)

AI总结 提出首个端到端等变神经场模拟器EqCollide,通过等变编码器和碰撞感知消息传递的图神经网络常微分方程,实现可变形物体碰撞的准确、稳定和可扩展模拟。

Comments SIGKDD 2026 Oral AI4S Track. 20 pages, 16 figures

详情
AI中文摘要

模拟可变形物体的碰撞是一项基础但具有挑战性的任务,因为涉及固体力学和多体相互作用的复杂性。现有的数据驱动方法通常缺乏对物理对称性的等变性、对碰撞处理不足以及可扩展性有限。本文介绍EqCollide,这是首个用于可变形物体及其碰撞的端到端等变神经场模拟器。我们提出一个等变编码器,将物体几何和速度映射到潜在控制点。随后,基于等变图神经网络的神经常微分方程通过碰撞感知消息传递建模控制点之间的相互作用。为了重建速度场,我们查询一个以控制点特征为条件的神经场,实现连续且分辨率无关的运动预测。在2D和3D场景上的实验结果表明,EqCollide在不同物体配置下实现了准确、稳定且可扩展的模拟。与最佳基线模型相比,其滚动均方误差降低了24.34%至57.62%。此外,EqCollide能够泛化到更多碰撞物体和更长的时间范围,并对群作用下的输入变换保持鲁棒。代码可在以下网址获取:https://github.com/AI4Science-WestlakeU/EqCollide

英文摘要

Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from a lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce EqCollide, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results on 2D and 3D scenarios show that EqCollide achieves accurate, stable, and scalable simulations across diverse object configurations. It achieves 24.34% to 57.62% lower rollout MSE than the best-performing baseline model. Furthermore, EqCollide generalizes to more colliding objects and extended temporal horizons, and stays robust to inputs transformed by group actions. Code is available at: https://github.com/AI4Science-WestlakeU/EqCollide
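Equivariance to rotations, the property this abstract emphasizes, means that rotating the input and then applying the model gives the same result as applying the model and then rotating the output. The Python sketch below checks that property numerically on a toy 2D map; `toy_velocity` is a stand-in chosen for illustration and has nothing to do with the EqCollide network.

```python
import math


def rotate(points, angle):
    """Rotate a list of 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def toy_velocity(points):
    """Toy equivariant map: every point gets a velocity toward the centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(cx - x, cy - y) for x, y in points]


def equivariance_error(f, points, angle):
    """Largest gap between f(rotate(points)) and rotate(f(points))."""
    a = f(rotate(points, angle))
    b = rotate(f(points), angle)
    return max(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b))


POINTS = [(1.0, 0.0), (0.0, 2.0), (-1.0, -1.0)]
```

For the centroid map the error is zero up to floating-point rounding, whereas a map that discards the y coordinate, such as `lambda pts: [(x, 0.0) for x, _ in pts]`, gives a clearly nonzero error on the same points.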

2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

详情
AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2603.03485 2026-06-17 cs.CV cs.AI cs.RO 版本更新

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Phys4D: 从视频扩散模型实现细粒度物理一致的4D建模

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

AI总结 提出Phys4D流水线,通过三阶段训练(伪监督预训练、物理监督微调、强化学习校正)从视频扩散模型学习物理一致的4D世界表示,显著提升细粒度时空与物理一致性。

详情
AI中文摘要

最近的视频扩散模型作为大规模生成式世界模型已经取得了令人印象深刻的能力。然而,这些模型通常难以保持细粒度的物理一致性,随时间表现出物理上不合理的动态。在这项工作中,我们提出了Phys4D,一个从视频扩散模型中学习物理一致的4D世界表示的流水线。Phys4D采用三阶段训练范式,逐步将外观驱动的视频扩散模型提升为物理一致的4D世界表示。我们首先通过大规模伪监督预训练引导出稳健的几何和运动表示,为4D场景建模奠定基础。然后,我们使用模拟生成的数据进行基于物理的监督微调,强制执行时间一致的4D动态。最后,我们应用基于模拟的强化学习来纠正难以通过显式监督捕获的残留物理违规。为了评估超越外观指标的细粒度物理一致性,我们引入了一套4D世界一致性评估,探测几何一致性、运动稳定性和长期物理合理性。实验结果表明,与外观驱动的基线相比,Phys4D显著改善了细粒度时空和物理一致性,同时保持了强大的生成性能。我们的项目页面位于 https://sensational-brioche-7657e7.netlify.app/

英文摘要

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

ThinkJEPA:赋予潜在世界模型大型视觉-语言推理能力

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

AI总结 提出ThinkJEPA框架,结合密集JEPA分支与稀疏VLM思考者分支,通过分层金字塔表示提取模块,实现细粒度运动建模与长程语义引导,在手部操作轨迹预测任务上超越基线。

Comments 10 pages, 5 figures

详情
AI中文摘要

潜在世界模型(如V-JEPA2)的最新进展展示了从视频观测预测未来世界状态的能力。然而,短观测窗口的密集预测限制了时间上下文,可能导致预测偏向局部低层次外推,难以捕捉长程语义并降低下游效用。相比之下,视觉-语言模型(VLM)通过对均匀采样帧进行推理,提供强大的语义基础和通用知识,但由于计算驱动的稀疏采样、语言输出瓶颈(将细粒度交互状态压缩为文本导向表示)以及适应小规模动作条件数据集时的数据分布不匹配,它们不适合作为独立的密集预测器。我们提出了一种VLM引导的JEPA风格潜在世界建模框架,通过双时间路径结合密集帧动态建模与长程语义指导:一个密集JEPA分支用于细粒度运动和交互线索,以及一个均匀采样的VLM“思考者”分支,具有更大的时间步长以提供知识丰富的指导。为了有效传递VLM的渐进推理信号,我们引入了一个分层金字塔表示提取模块,将多层VLM表示聚合成与潜在预测兼容的指导特征。在手部操作轨迹预测实验上,我们的方法优于强VLM-only基线和JEPA预测器基线,并展现出更鲁棒的长程展开行为。

英文摘要

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

2604.27583 2026-06-17 q-bio.NC cs.RO 版本更新

Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids

通过从婴儿到类人机器人的运动重定向模拟婴儿第一人称感觉运动经验

Francisco M. López, Hoshinori Kanazawa, Ondrej Fiala, Yakov Balashov, Valentin Marcel, Lukas Rustler, Miles Lenz, Dongmin Kim, Yasuo Kuniyoshi, Jochen Triesch, Matej Hoffmann

AI总结 提出一种从单视频重建婴儿3D姿态并映射到物理/虚拟类人平台的方法,实现亚厘米级精度的多感觉流模拟,为发育研究和神经发育障碍早期检测提供新工具。

Comments Accepted at IEEE ICDL 2026. 8 pages, 6 figures. Cite as: F. M. López, H. Kanazawa, O. Fiala, Y. Balashov, V. Marcel, L. Rustler, M. Lenz, D. Kim, Y. Kuniyoshi, J. Triesch, and M. Hoffmann, "Simulating infant first-person sensorimotor experience via motion retargeting from babies to humanoids", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-8

详情
AI中文摘要

随着人形机器人能力的增强,从人类到类人人工体的运动重定向变得越来越重要。然而,现有方法大多只关注运动学再现,而忽略了与人类运动相关的丰富感觉运动经验。在这项工作中,我们提出了一个框架,使用物理和虚拟类人机器人模拟婴儿的多模态感觉运动经验。从单个视频中,我们的方法通过提取骨骼结构并从每一帧估计完整的3D姿态来重建婴儿的身体配置。然后,我们将重建的运动映射到几个发育平台上:物理iCub机器人和虚拟模拟器pyCub、EMFANT和MIMo。在这些实体上重放重定向的运动会产生模拟的多感觉流,包括本体感觉(关节和肌肉)、触觉和视觉。对于最佳匹配的实体,重定向实现了亚厘米级的精度,并能够对婴儿发育进行丰富的多模态分析,以及增强的行为自动标注。该框架为婴儿的感觉运动经验提供了一个独特的窗口,为机器人学、发育科学和神经发育障碍的早期检测提供了新工具。代码可在 https://github.com/ctu-vras/motion-retargeting/ 获取。

英文摘要

Motion retargeting from humans to human-like artificial agents is becoming increasingly important as humanoid robots grow more capable. However, most existing approaches focus only on reproducing kinematics and ignore the rich sensorimotor experience associated with human movement. In this work, we present a framework for simulating the multimodal sensorimotor experiences of infants using physical and virtual humanoids. From a single video, our method reconstructs the infant's body configuration by extracting its skeletal structure and estimating the full 3D pose from each frame. Then we map the reconstructed motion onto several developmental platforms: the physical iCub robot and the virtual simulators pyCub, EMFANT and MIMo. Replaying the retargeted motions on these embodiments produces simulated multisensory streams including proprioception (joints and muscles), touch, and vision. For the best-matching embodiment, the retargeting achieves sub-centimeter accuracy and enables a rich multimodal analysis of infant development as well as enhanced automated annotation of behaviors. This framework provides a unique window into the infant's sensorimotor experience, offering new tools for robotics, developmental science, and early detection of neurodevelopmental disorders. The code is available at https://github.com/ctu-vras/motion-retargeting/.

2605.29563 2026-06-17 cs.AI cs.CV cs.RO 版本更新

Planning with the Views

利用视图进行规划

Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li

发表机构 * Northwestern University(西北大学) University of Washington(华盛顿大学) Microsoft(微软) University of Oxford(牛津大学) Stanford University(斯坦福大学)

AI总结 提出ViewSuite基准测试揭示VLM在多步视图规划中的不足,并设计迭代框架通过自我探索和视图图蒸馏将Qwen2.5-VL-7B的交互式视图规划准确率从2.5%提升至47.8%。

详情
AI中文摘要

VLM能否预测每个相机移动如何改变视图,并提前规划许多这样的移动?我们称这种能力为视图规划,需要(1)理解单个动作如何变换视图,以及(2)在多步规划中组合许多这样的变换以识别目标视图。我们在提出的ViewSuite中探测了这两种能力,ViewSuite是一个基于真实ScanNet场景的3D点云环境。在13个前沿VLM中,出现了一个关键的规划差距:它们具备基本的视图-动作知识,但无法在多步规划中组合这些知识,并且随着视点距离的增加,差距扩大。为了缩小这一差距,我们提出了一个迭代框架,交替进行自我探索和视图图蒸馏。关键洞察是,所有探索轨迹,无论其结果如何,共同形成一个视图图,紧凑地捕捉了场景中视点如何连接。将这个图蒸馏到多样化的监督任务中,重塑了策略分布,并克服了使纯RL停滞的稀疏奖励。这将Qwen2.5-VL-7B在交互式视图规划上的准确率从2.5%提升到47.8%,超过了GPT-5.4 Pro(18.5%)和Gemini 3.1 Pro(21.4%)。自我探索成为VLM在3D空间中主动推理和规划的一条有前景的路径。

英文摘要

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and data are at https://viewsuite.github.io.

11. 安全、鲁棒性与可信机器人 3 篇

2606.18043 2026-06-17 cs.RO cs.LG 新提交

Uncertainty Quantification for Flow-Based Vision-Language-Action Models

基于流的视觉-语言-动作模型的不确定性量化

Ralf Römer, Maximilian Seeliger, Saida Liu, Ben Sturgis, Marco Bagatella, Daniel Marta, Andreas Krause, Angela P. Schoellig

发表机构 * TU Munich(慕尼黑工业大学) ETH Zurich(苏黎世联邦理工学院) MPI IS Tübingen(马克斯·普朗克智能系统研究所)

AI总结 提出利用速度场差异(VFD)量化流匹配模型中的认知不确定性,用于故障检测和主动微调,在LIBERO基准上实现高效任务适应。

Comments Project page: tum-lsy.github.io/uq_vla/. 28 pages, 12 figures

详情
AI中文摘要

视觉-语言-动作模型(VLAs)将视觉-语言骨干网络与通过大规模机器人数据集上的流匹配训练的生成式动作头相结合。尽管在机器人操作中表现出强大的经验性能,但VLAs缺乏量化其预测置信度和检测动作何时可能不可靠的机制。这对于在非平稳环境中的实际部署构成了关键限制,因为模型不可避免地会遇到其预训练分布之外的场景,并可能在没有警告的情况下失败。为了解决这个问题,我们通过利用小规模集成中的速度场差异(VFD),推导出一种量化流匹配模型中认知不确定性的高效方法。我们成功地将这种不确定性估计用于部署期间的故障检测和基于流的VLA的主动微调。为此,我们提出了SAVE,一个不确定性引导的主动多任务微调框架,减少了将VLA适应新任务所需的高成本专家演示数量。通过在LIBERO基准上的广泛实验,我们证明VFD能产生校准更好的、可预测下游性能的不确定性估计,VFD在检测故障方面表现出色,并且使用SAVE进行不确定性引导的数据采集所需的样本比基线至少少22%。总之,我们的工作表明,量化基于流的VLA中的认知不确定性既提高了故障感知能力,也提高了适应性。项目网站:tum-lsy.github.io/uq_vla/。

英文摘要

Vision-language-action models (VLAs) combine vision-language backbones with expressive generative action heads trained via flow matching on large-scale robotic datasets. Despite their strong empirical performance in robotic manipulation, VLAs lack mechanisms to quantify confidence in their predictions and to detect when their actions may be unreliable. This presents a critical limitation for real-world deployment in non-stationary environments, where models inevitably encounter scenarios outside their pretraining distribution and may fail without warning. To address this, we derive an efficient method for quantifying epistemic uncertainty in flow-matching models by leveraging velocity-field disagreement (VFD) across a small ensemble. We successfully use this uncertainty estimate for failure detection during deployment and active fine-tuning of flow-based VLAs. To this end, we propose SAVE, a framework for uncertainty-guided active multitask fine-tuning that reduces the number of costly expert demonstrations required to adapt VLAs to new tasks. Through extensive experiments on the LIBERO benchmark, we demonstrate that VFD yields better-calibrated uncertainty estimates predictive of downstream performance, that VFD achieves strong performance in detecting failures, and that uncertainty-guided data acquisition with SAVE requires at least 22% fewer samples than baselines. In summary, our work shows that quantifying epistemic uncertainty in flow-based VLAs improves both failure awareness and adaptation. Project website: tum-lsy.github.io/uq_vla/.
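
The abstract does not give the exact VFD formula, so the following is only a minimal sketch under an assumption: disagreement is taken as the mean per-dimension variance of the velocity vectors that the ensemble members predict for the same input. The function names and the threshold are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def velocity_field_disagreement(member_velocities):
    """Mean per-dimension variance of velocity predictions across an ensemble.

    member_velocities holds one velocity vector (list of floats) per ensemble
    member, all assumed to be evaluated at the same input and flow time.
    """
    if len(member_velocities) < 2:
        raise ValueError("need at least two ensemble members")
    dims = len(member_velocities[0])
    per_dim = [pvariance([v[d] for v in member_velocities]) for d in range(dims)]
    return sum(per_dim) / dims


def flag_unreliable(member_velocities, threshold):
    """Hypothetical failure flag: disagreement above a user-chosen threshold."""
    return velocity_field_disagreement(member_velocities) > threshold
```

Identical predictions give a disagreement of zero; the more the members diverge, the larger the score, which is the intuition behind using it as an epistemic-uncertainty signal.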

2606.17451 2026-06-17 cs.LG cs.RO 交叉投稿

Credibility-Weighted Pricing of Autonomous Vehicle Liability Under Operational Design Domain Shift

操作设计域转移下自动驾驶汽车责任的可信度加权定价

Doyeon Jang

AI总结 针对自动驾驶系统部署中经验稀疏、ODD转移及风险非平稳问题,提出分层贝叶斯可信度框架,通过ODD相似性核进行部分池化,在Waymo数据上验证其有效性。

详情
AI中文摘要

自动驾驶系统的部署带来了一个基础性的费率制定挑战:稀疏的经验、不断变化的操作设计域以及跨软件版本的非平稳风险。我们提出了一个分层贝叶斯可信度框架,通过学习的ODD相似性核汇集城市、软件版本和区域的信息,将Buhlmann-Straub作为极限情况嵌套其中。基于NHTSA Standing General Order数据库中美国四个大都市区的648起Waymo已验证碰撞事件与1.16亿匹配里程的演示表明,城市聚合可信度权重适中(0.12-0.46),部分池化明显优于无池化,且功效分析显示,学习核的优势在大约十二个部署城市时变得可检测。

英文摘要

Automated Driving System deployments create a foundational ratemaking challenge: sparse experience, shifting operational design domains, and non-stationary risk across software releases. We propose a hierarchical Bayesian credibility framework pooling across cities, software versions, and territories via a learned ODD-similarity kernel, nesting Buhlmann-Straub as a limiting case. Demonstrated on 648 verified-engaged Waymo crashes across four U.S. metros from the NHTSA Standing General Order database against 116 million matched miles, city-aggregate credibility weights are moderate (0.12-0.46), partial pooling decisively outperforms no pooling, and a power analysis shows the learned kernel's advantage becomes detectable at approximately twelve deployed cities.
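
For readers unfamiliar with the limiting case named above, here is a minimal sketch of the classical Bühlmann-Straub credibility estimate, not of the paper's hierarchical model. The credibility weight is Z = m / (m + k), where m is the exposure and k is the ratio of expected process variance to the variance of hypothetical means. All numbers in any call are illustrative assumptions.

```python
def credibility_weight(exposure, k):
    """Classical credibility weight Z = m / (m + k)."""
    if exposure < 0 or k <= 0:
        raise ValueError("exposure must be >= 0 and k must be > 0")
    return exposure / (exposure + k)


def credibility_rate(own_rate, collective_rate, exposure, k):
    """Blend a city's own crash rate with the pooled rate using weight Z."""
    z = credibility_weight(exposure, k)
    return z * own_rate + (1.0 - z) * collective_rate
```

With little exposure the estimate stays close to the pooled rate; as exposure grows, Z approaches 1 and the city's own experience dominates.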

2606.14438 2026-06-17 cs.RO cs.AI 版本更新

CADET: Physics-Grounded Causal Auditing and Training-Free Deconfounding of End-to-End Driving Planners

CADET: 基于物理的因果审计与无训练去混杂的端到端驾驶规划器

Zikun Guo

发表机构 * School of Electronics Engineering, Kyungpook National University(庆北国立大学电子工程学院)

AI总结 提出CADET框架,无需重新训练即可审计和修复预训练端到端驾驶规划器中的虚假关联,通过物理因果图识别混杂因素并干预测试时输入。

Comments 8 pages, 4 figures

详情
AI中文摘要

通过模仿学习训练的端到端自动驾驶规划器容易产生统计捷径:它们将仅与专家动作共现的场景元素(如路边物体、建筑立面)与驾驶决策关联,而非因果决定驾驶的变量。这种因果混淆在长尾场景中悄然损害可靠性,且难以检测,因为常见的开环指标(L2位移和碰撞率)受自车状态主导,无法指示规划器是否依赖虚假线索。现有的基于因果干预训练的修复方法需要重新训练大型模型,且无法审计已部署的规划器。我们提出CADET,一个无需训练的框架,可以在不更新任何参数的情况下审计、基准测试和修复预训练端到端规划器中的虚假依赖。

英文摘要

End-to-end (E2E) autonomous-driving planners trained by imitation are prone to statistical shortcuts: they associate scene elements that merely co-occur with expert actions (a roadside object, a building facade) with driving decisions, rather than the variables that causally determine them. Such causal confusion silently compromises reliability in long-tail scenarios, and it is difficult to detect, because prevailing open-loop metrics (L2 displacement and collision rate) are dominated by ego status and do not indicate whether a planner depends on spurious cues. Existing remedies based on causal-intervention training require retraining large models and cannot audit a planner that is already deployed. We present CADET, a training-free framework that audits, benchmarks, and repairs spurious reliance in pretrained E2E planners without any parameter update.

12. 其他/综合机器人 7 篇

2606.17073 2026-06-17 cs.RO cs.AI 新提交

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

提取语义:从URDF自动构建机器人本体的LLM引导方法

Bastien Dussard, Guillaume Sarthou

发表机构 * LAAS-CNRS, Department of Robotics, Toulouse, France(法国图卢兹机器人系CNRS实验室)

AI总结 提出利用大语言模型从URDF文件自动生成机器人语义本体,通过多数投票和语法验证确保与现有本体对齐,初步实验表明该方法能有效桥接低层描述与高层知识表示。

详情
Journal ref
18th International Conference on Social Robotics (ICSR 2026), University of London, Jul 2026, London, United Kingdom
AI中文摘要

虽然常识知识可能足以满足虚拟代理的需求,但与人类交互的具身机器人需要对其环境和自身物理形态具有基于现实的、语义丰富的表示。在认知机器人学中,本体论能够有效整合这种异构知识,以支持可解释的推理,即使在持续知识更新过程中也是如此。然而,手动构建本体仍然是一个瓶颈。我们提出了一种初步方法,通过将统一机器人描述格式(URDF)模型转换为填充的本体,自动生成机器人语义抽象。尽管URDF文件提供了结构和运动学描述,但其标识符通常需要常识解释才能恢复有意义的语义,而大语言模型(LLM)擅长此任务。我们的流程利用LLM,通过用现有本体中的概念提示它们来推断语义关系,确保最终分类与形式模型保持一致。为了提高可靠性,该流程结合了跨多个LLM查询的多数投票以及语法和模式级验证,以确保生成的输出符合预期的表示格式和本体约束。我们在多个机器人描述上评估了该方法,并讨论了生成的抽象。初步结果表明,所提出的方法能够有效弥合低层机器人描述与人机交互所需的结构化、基于现实的知识表示之间的差距。

英文摘要

While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.
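
The voting-plus-validation step described above could look roughly like the sketch below. It assumes each LLM query returns a single concept name and that validation amounts to checking membership in the set of concepts from the existing ontology. The function name and the strict-majority rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def majority_vote(answers, allowed_concepts):
    """Return the concept chosen by a strict majority of valid answers, else None.

    Answers that are not concepts of the existing ontology are discarded first,
    which stands in for the schema-level validation step.
    """
    valid = [a for a in answers if a in allowed_concepts]
    if not valid:
        return None
    label, count = Counter(valid).most_common(1)[0]
    return label if count > len(valid) / 2 else None
```

Returning `None` when there is no clear winner leaves room for a fallback such as re-querying or manual review.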

2606.17388 2026-06-17 cs.RO cs.CG cs.SY eess.SY 新提交

Agent Utilities over Generalized Voronoi Regions and their Gradients

广义Voronoi区域上的智能体效用及其梯度

Andre N. Costa, Petter Ögren, Carlos H. C. Ribeiro

发表机构 * Royal Institute of Technology (KTH)(皇家理工学院(KTH)) Aeronautics Institute of Technology(航空技术研究所)

AI总结 本文通过引入成本诱导Voronoi区域,将智能体效用定义为效用密度在该区域上的积分,并利用雷诺输运定理推导效用梯度,在足球双队示例中验证了方法,计算时间比有限差分法减少约一个数量级。

Comments Under review at IEEE Control Systems Letters (L-CSS)

详情
AI中文摘要

在本文中,我们推广了Voronoi区域的概念,将智能体效用定义为相应Voronoi区域上效用密度的积分,推导了效用的梯度,并在足球双队示例中说明了该方法。Voronoi区域的推广形式为所谓的成本诱导Voronoi(CIV)区域,其中智能体状态空间可能与划分的空间不同。这类区域的一个例子是当成本由LQR控制问题的最优解给出时。此时,智能体状态包括位置和速度,而划分的空间仅包括位置。智能体效用通过将某个效用密度在智能体的CIV区域上积分来定义。该效用密度可能是某个有益事件(例如在足球中接球)的概率密度。那么效用就是接球的总体概率,梯度表示提高该概率的方法。我们展示了如何使用流体力学中的雷诺输运定理计算该效用梯度,并且该方法在达到类似精度的同时,计算时间比基准有限差分近似减少约一个数量级。

英文摘要

In this paper, we generalize the concept of Voronoi regions, define agent utility as the integral of a utility density over the corresponding Voronoi region, derive gradients of the utility, and illustrate the approach in a two-team example from soccer. The generalization of Voronoi regions is in the form of so-called Cost-Induced Voronoi (CIV) regions, where the agent state space may differ from the space being partitioned. One example of such regions is when the cost is given by the optimal solution of an LQR control problem. Then the agent states include position as well as velocity, while the partitioned space only includes positions. The agent utility is defined by integrating some utility density over the CIV region of the agent. This utility density might be the probability density of some beneficial event, such as receiving a pass in soccer. The utility is then the overall probability of receiving a pass and the gradient represents a way to improve that probability. We show how this utility gradient can be computed using the Reynolds Transport Theorem from fluid mechanics, and that this approach achieves similar accuracy while reducing computation time by about an order of magnitude compared to a baseline finite-difference approximation.
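
A one-dimensional toy case shows why a boundary-flux formula can replace finite differences. The sketch assumes an ordinary Euclidean Voronoi partition (not the paper's cost-induced regions), two agents at positions a < b, and a standard normal utility density. The region of agent a is then everything left of the midpoint (a + b) / 2, and the Leibniz rule (the 1D form of the Reynolds transport theorem) gives dU/da = 0.5 * density(midpoint).

```python
import math


def density(x):
    """Utility density: standard normal pdf (an illustrative choice)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def utility(a, b):
    """Integral of the density over agent a's region (-inf, (a + b) / 2)."""
    boundary = 0.5 * (a + b)
    return 0.5 * (1.0 + math.erf(boundary / math.sqrt(2.0)))


def utility_gradient(a, b):
    """Boundary term only: density at the boundary times the boundary's velocity (1/2)."""
    return 0.5 * density(0.5 * (a + b))


def finite_difference_gradient(a, b, h=1e-5):
    """Baseline central-difference approximation of dU/da."""
    return (utility(a + h, b) - utility(a - h, b)) / (2.0 * h)
```

The analytic gradient needs one density evaluation on the boundary, whereas the finite-difference baseline re-integrates the region for each perturbed agent state, which is consistent with the speed-up the abstract reports.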

2606.17456 2026-06-17 cs.RO q-bio.NC 新提交

Embodiment Shapes Rolling Behavior in a Multimodal Infant Model

具身形态塑造多模态婴儿模型中的翻滚行为

Leon Philipp, Francisco M. López, Jochen Triesch

发表机构 * Frankfurt Institute for Advanced Studies(法兰克福高等研究院) Goethe University Frankfurt(法兰克福大学) University of New South Wales(新南威尔士大学)

AI总结 通过虚拟婴儿MIMo学习仰卧到俯卧翻滚,研究婴儿运动发展中的具身形态变化如何影响行为,发现与真实婴儿一致的发育趋势和协调模式。

Comments 7 pages, 7 figures. Accepted at the 2026 IEEE ICDL Conference. Cite as: L. Philipp, F. M. López, and J. Triesch, "Embodiment Shapes Rolling Behavior in a Multimodal Infant Model", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-7

详情
AI中文摘要

翻身是婴儿运动发展中最早期的里程碑之一,反映了协调的全身感觉运动控制的出现。在这里,我们使用MIMo(一个配备本体感觉和前庭感觉的虚拟婴儿具身模型)对婴儿翻滚进行计算研究。MIMo通过强化学习学习从仰卧到俯卧的翻滚。有趣的是,学习到的行为捕捉到了与真实婴儿报告一致的发育趋势和协调模式,包括随着年龄增长表现提升和执行速度加快。我们的结果解释了婴儿的能力和限制如何能在人工代理中产生逼真的行为,特别强调了运动发展如何受到不断变化的身体形态的影响。这项工作突出了具身计算模型作为研究感觉运动发展的强大工具的作用。

英文摘要

Rolling over is one of the earliest milestones in infant motor development, reflecting the emergence of coordinated, whole-body sensorimotor control. Here, we conduct a computational study of infant rolling using MIMo, a virtual infant embodiment equipped with proprioception and vestibular sensation. MIMo learns supine-to-prone rolls with reinforcement learning. Interestingly, the learned behaviors capture developmental trends and coordination patterns consistent with those reported in real infants, including improved performance and faster execution with age. Our results explain how infant capabilities and constraints can give rise to realistic behaviors in artificial agents, with a particular emphasis on how motor development is shaped by the changing body morphology. This work highlights the role of embodied computational models as a powerful tool for studying sensorimotor development.

2605.12220 2026-06-17 cs.CV cs.AI cs.LG cs.RO 版本更新

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

TriBand-BEV:基于高度感知的鸟瞰图与高分辨率特征融合的实时仅LiDAR三维行人检测

Mohammad Khoshkdahan, Alexey Vinel

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文提出TriBand-BEV方法,通过高度感知的鸟瞰图与高分辨率特征融合实现实时LiDAR-only三维行人检测,采用轻量级鸟瞰图张量映射,单网络一次通过检测车辆、行人和自行车,提升检测精度与速度。

Comments Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
Journal ref
Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
AI中文摘要

安全的自动驾驶代理和移动机器人需要快速的实时三维感知,尤其是对于行人等易受伤害道路使用者。我们介绍了一种新的鸟瞰图(BEV)编码方法,将完整的三维LiDAR点云映射到具有三个高度带的轻量级二维BEV张量中。我们明确地将三维检测重新表述为二维检测问题,然后从BEV输出中重建三维框。单个网络在一次前向传播中检测车辆、行人和自行车。骨干网络在深层阶段使用区域注意力,层次化的双向颈部网络在P1到P4之间融合上下文和细节,检测头通过针对边偏移的分布焦点学习和旋转IoU损失来预测定向框。训练时在通道空间中应用小幅垂直重新分箱和温和的反射率抖动,以抑制记忆化。我们在三维重建过程中使用四分位距(IQR)过滤器去除噪声和离群的LiDAR点。在KITTI数据集上,TriBand-BEV在单个消费级GPU上以49 FPS实现了简单、中等和困难样本的行人BEV AP分别为58.7/52.6/47.2%,优于Complex-YOLO,分别提升了+12.6%、+7.5%和+3.1%。定性场景显示在遮挡下检测稳定。该流程紧凑且适用于实时机器人部署。我们的源代码在GitHub上公开可用。

英文摘要

Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
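
The IQR filtering step can be sketched as follows. The abstract does not say which coordinate is filtered or which fence multiplier is used, so this example assumes the height coordinate (index 2) and the conventional Tukey factor of 1.5. Both choices and the function name are assumptions.

```python
from statistics import quantiles


def iqr_filter(points, axis=2, k=1.5):
    """Keep points whose coordinate along `axis` lies within the Tukey fences.

    points is a list of (x, y, z) tuples; at least two points are required.
    """
    values = [p[axis] for p in points]
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p in points if low <= p[axis] <= high]
```

Applied to the LiDAR points inside a detected box, a stray return far above the pedestrian would be dropped before the 3D box height is estimated.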

2511.06500 2026-06-17 cs.RO 版本更新

Cross-Platform Learnable Fuzzy Gain-Scheduled Proportional-Integral-Derivative Controller Tuning via Physics-Constrained Meta-Learning and Reinforcement Learning Adaptation

JiaHao Wu, ShengWen Yu

发表机构 * The University of Hong Kong(香港大学) Guangzhou College of Commerce(广州商学院)

Comments 24 pages, 15 tables, 6 figures

详情
英文摘要

Motivation and gap: PID-family controllers remain a pragmatic choice for many robotic systems due to their simplicity and interpretability, but tuning stable, high-performing gains is time-consuming and typically non-transferable across robot morphologies, payloads, and deployment conditions. Fuzzy gain scheduling can provide interpretable online adjustment, yet its per-joint scaling and consequent parameters are platform-dependent and difficult to tune systematically. Proposed approach: We propose a hierarchical framework for cross-platform tuning of a learnable fuzzy gain-scheduled PID (LF-PID). The controller uses shared fuzzy membership partitions to preserve common error semantics, while learning per-joint scaling and Takagi-Sugeno consequent parameters that schedule PID gains online. Combined with physics-constrained virtual robot synthesis, meta-learning provides cross-platform initialization from robot physical features, and a lightweight reinforcement learning (RL) stage performs deployment-specific refinement under dynamics mismatch. Starting from three base simulated platforms, we generate 232 physically valid training variants via bounded perturbations of mass (+/-10%), inertia (+/-15%), and friction (+/-20%). Results and insight: We evaluate cross-platform generalization on two distinct systems (a 9-DOF serial manipulator and a 12-DOF quadruped) under multiple disturbance scenarios. The RL adaptation stage improves tracking performance on top of the meta-initialized controller, with up to 80.4% error reduction in challenging high-load joints (12.36 degrees to 2.42 degrees) and 19.2% improvement under parameter uncertainty. We further identify an optimization ceiling effect: online refinement yields substantial gains when the meta-initialized baseline exhibits localized deficiencies, but provides limited improvement when baseline quality is already uniformly strong.
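
The bounded perturbations used to synthesize virtual robots can be illustrated with a short sketch. It assumes independent uniform sampling within the stated bounds (mass ±10%, inertia ±15%, friction ±20%); the abstract does not specify the sampling distribution or the physical-validity checks, so those parts are assumptions, as are all names. A second helper checks the reported error reduction from the two angles given in the abstract.

```python
import random

# Relative bounds taken from the abstract.
BOUNDS = {"mass": 0.10, "inertia": 0.15, "friction": 0.20}


def perturb(base, rng):
    """Scale each physical parameter by a random factor within its bound."""
    return {
        name: value * (1.0 + rng.uniform(-BOUNDS[name], BOUNDS[name]))
        for name, value in base.items()
    }


def make_variants(base, count, seed=0):
    """Generate `count` perturbed copies of one base platform (uniform sampling assumed)."""
    rng = random.Random(seed)
    return [perturb(base, rng) for _ in range(count)]


def relative_reduction(before, after):
    """Fractional error reduction, e.g. tracking error in degrees."""
    return (before - after) / before
```

For the high-load joint quoted above, `relative_reduction(12.36, 2.42)` is about 0.804, which matches the stated 80.4%.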

2509.19525 2026-06-17 cs.RO 版本更新

Real-Time Reinforcement Learning for Dynamic Tasks with a Parallel Soft Robot

动态任务的实时强化学习与并行软机器人

James Avtges, Jake Ketchum, Millicent Schlafly, Helena Young, Taekyoung Kim, Allison Pinosky, Ryan L. Truby, Todd D. Murphey

发表机构 * Department of Mechanical Engineering, Northwestern University(西北大学机械工程系) Department of Materials Science and Engineering, Northwestern University(西北大学材料科学与工程系)

AI总结 本文提出基于课程学习的实时强化学习方法,用于在单次部署中实现软机器人的动态平衡,通过并行软执行器和HSA结构实现高可靠性控制。

Comments Published at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

详情
AI中文摘要

闭环控制仍然是软机器人领域的开放挑战。在动态负载条件下,软执行器的非线性响应限制了分析模型在软机器人控制中的应用。传统方法在控制软机器人时未能充分利用其配置空间,以避免非线性、迟滞、大变形和执行器损坏的风险。此外,基于经验的数据驱动控制方法,如强化学习(RL),通常受到样本效率和初始化不一致的限制。在本工作中,我们展示了RL在实时单次硬件部署中可靠地学习动态平衡任务的控制策略。我们使用由并行3D打印软执行器构建的可变形斯图尔特平台,基于电机驱动的 handed shearing auxetic(HSA)结构。通过引入基于已知平衡点扩展邻域的课程学习方法,我们实现了在任意坐标处的可靠单次部署平衡。除了对基于模型和无模型方法的性能进行基准测试外,我们还证明了在单次部署中,最大扩散RL能够在半数执行器有效失效的情况下学习动态平衡,通过诱导屈曲并用切割器破坏执行器。训练无需先验数据,可在15分钟内完成,性能几乎与完整平台相同。单次硬件学习使软机器人系统能够可靠地在现实世界中学习,并将使更多样化和有能力的软机器人成为可能。

英文摘要

Closed-loop control remains an open challenge in soft robotics. The nonlinear responses of soft actuators under dynamic loading conditions limit the use of analytic models for soft robot control. Traditional methods of controlling soft robots underutilize their configuration spaces to avoid nonlinearity, hysteresis, large deformations, and the risk of actuator damage. Furthermore, episodic data-driven control approaches such as reinforcement learning (RL) are traditionally limited by sample efficiency and inconsistency across initializations. In this work, we demonstrate RL for reliably learning control policies for dynamic balancing tasks in real-time single-shot hardware deployments. We use a deformable Stewart platform constructed using parallel, 3D-printed soft actuators based on motorized handed shearing auxetic (HSA) structures. By introducing a curriculum learning approach based on expanding neighborhoods of a known equilibrium, we achieve reliable single-deployment balancing at arbitrary coordinates. In addition to benchmarking the performance of model-based and model-free methods, we demonstrate that in a single deployment, Maximum Diffusion RL is capable of learning dynamic balancing after half of the actuators are effectively disabled, by inducing buckling and by breaking actuators with bolt cutters. Training occurs with no prior data, in as fast as 15 minutes, with performance nearly identical to the fully-intact platform. Single-shot learning on hardware facilitates soft robotic systems reliably learning in the real world and will enable more diverse and capable soft robots.

2506.19277 2026-06-17 cs.RO cs.SY eess.SY 版本更新

Ontology Neural Network and ORTSF: A Framework for Topological Reasoning and Delay-Robust Control

本体神经网络与ORTSF:一种用于拓扑推理和延迟鲁棒控制的框架

Jaehong Oh

发表机构 * Department of Mechanical Engineering, Soongsil University, Seoul, Korea

AI总结 本文提出Ontology Neural Network和ORTSF框架,解决现有方法在关系语义表示和动态环境中协作所需认知透明度的不足,通过统一架构实现语义认知与鲁棒控制的统一。

Comments 12 pages, 5 figures, includes theoretical proofs and simulation results

详情
AI中文摘要

自主机器人系统的进步在感知、定位、建图和控制方面取得了显著成果,但存在根本性缺口:现有框架在几何推理和动态稳定性方面表现优异,但在关系语义表示、上下文推理和认知透明度方面存在不足,这些是动态、以人为中心环境中协作的关键。本文提出包含本体神经网络(ONN)和本体实时语义织体(ORTSF)的统一架构,以解决这一缺口。ONN将关系语义推理形式化为动态拓扑过程。通过将Forman-Ricci曲率、持续同调和语义张量结构嵌入统一的损失公式中,ONN确保随着场景随时间演变,关系完整性和拓扑一致性得以保持。ORTSF将推理轨迹转化为可操作的控制命令,同时补偿系统延迟。它整合了预测性和延迟感知的操作符,确保在显著延迟条件下相位边距的保持和控制信号的连续性。实证研究展示了ONN + ORTSF框架在统一语义认知和鲁棒控制方面的能力,提供了一种数学上严谨且实际可行的解决方案,用于认知机器人学。

英文摘要

The advancement of autonomous robotic systems has led to impressive capabilities in perception, localization, mapping, and control. Yet, a fundamental gap remains: existing frameworks excel at geometric reasoning and dynamic stability but fall short in representing and preserving relational semantics, contextual reasoning, and cognitive transparency essential for collaboration in dynamic, human-centric environments. This paper introduces a unified architecture comprising the Ontology Neural Network (ONN) and the Ontological Real-Time Semantic Fabric (ORTSF) to address this gap. The ONN formalizes relational semantic reasoning as a dynamic topological process. By embedding Forman-Ricci curvature, persistent homology, and semantic tensor structures within a unified loss formulation, ONN ensures that relational integrity and topological coherence are preserved as scenes evolve over time. The ORTSF transforms reasoning traces into actionable control commands while compensating for system delays. It integrates predictive and delay-aware operators that ensure phase margin preservation and continuity of control signals, even under significant latency conditions. Empirical studies demonstrate the ONN + ORTSF framework's ability to unify semantic cognition and robust control, providing a mathematically principled and practically viable solution for cognitive robotics.