arXivDaily: arXiv daily academic digest, updated Monday through Friday
1. Robot Learning, Imitation and Reinforcement Learning (9 papers)

2606.10025 2026-06-10 cs.RO cs.CV cs.LG New submission

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan, Shubham Tulsiani, David Held

AI summary: Proposes GHOST, a framework that factorizes control into high-level sub-goal prediction and a low-level goal-conditioned controller so that visuomotor manipulation policies generalize, and that uses human demonstrations to adapt to novel objects and task variations.

Comments: Accepted at RSS 2026

Abstract

We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
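The spatial interface described in the abstract (project a predicted 3D goal into the image plane, then represent it as a heatmap) can be illustrated with a minimal sketch. It assumes a pinhole camera model and a Gaussian heatmap; the function names, intrinsics, and `sigma` value are hypothetical choices for illustration and are not taken from the GHOST implementation.

```python
import math

def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame, metres) to pixel (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return fx * x / z + cx, fy * y / z + cy

def gaussian_heatmap(u, v, width, height, sigma=2.0):
    """Return a height x width grid whose values peak at pixel (u, v)."""
    return [
        [math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]

# Hypothetical intrinsics and goal position, chosen only for illustration.
u, v = project_point((0.1, -0.05, 0.5), fx=50.0, fy=50.0, cx=16.0, cy=12.0)
heatmap = gaussian_heatmap(u, v, width=32, height=24)
```

A heatmap of this kind can be stacked with the RGB channels as an extra input to an image-based policy. How GHOST actually encodes orientation or goal uncertainty is not specified in the abstract and is not modelled here.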

2606.10267 2026-06-10 cs.RO cs.AI cs.LG New submission

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

Jiaheng Hu, Mohit Shridhar, Caden Lu, Dhruv Shah, Hao-Tien Lewis Chiang, Jie Tan, Annie Xie

Affiliations: Google DeepMind

AI summary: A systematic study of design principles for hierarchical vision-language-action (Hi-VLA) systems. Using a unified framework, it analyzes how planners, controllers, and interface mechanisms affect performance on short-horizon, long-horizon, and reasoning-intensive tasks, and distills practical principles for building more robust hierarchical VLA agents.

Abstract

Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-level VLM planners to decompose tasks into language subgoals executed by low-level VLA controllers. Despite recent empirical progress, there is a lack of unified design principles for these systems: existing Hi-VLA systems differ in how they choose and connect planners, controllers, mechanisms to switch between the two, and how observations and memory are represented in the planner. In this paper, we present a systematic study of Hi-VLA design for robot manipulation. We unify representative Hi-VLA agents under an options-style control framework and benchmark core design choices across short-horizon, long-horizon, and reasoning-intensive tasks. Our analysis distills practical principles for building Hi-VLA systems, showing how model choices and interface mechanisms jointly shape performance. Applying these principles yields a substantially stronger system than either flat VLA control or a naively designed hierarchy, across experiments both in simulation and on a real ALOHA robot. Overall, our results provide a foundation for building more capable, robust, and principled hierarchical VLA agents. More information and video at jiahenghu.github.io/hi-vla.

2606.10305 2026-06-10 cs.RO New submission

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

Qianzhong Chen, Hau Zheng, Justin Yu, Suning Huang, Jiankai Sun, Ken Goldberg, Chuan Wen, Pieter Abbeel, Yide Shentu, Philipp Wu, Mac Schwager

Affiliations: Stanford University; UC Berkeley; Shanghai Jiao Tong University

AI summary: Proposes SARM2, a multi-task stage-aware reward model that combines an action-primitive stage estimator with a multi-gate Mixture-of-Experts value head to provide dense per-step rewards for robot manipulation tasks. Building on it, the SPIRAL framework improves VLA policies from cheap autonomous rollouts and substantially raises success rates on a 10-task benchmark.

Abstract

Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense, accurate, and general. Existing methods fall short: task-specific stage-aware models are accurate but require per-task annotations, while general vision-language-model (VLM) reward models are broadly applicable but too coarse for fine-grained long-horizon progress. We introduce SARM2, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to produce dense per-step rewards across manipulation tasks. Building on SARM2, we further propose SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that improves VLA policies from cheap autonomous rollouts. On a 10-task benchmark, SARM2 reduces value-estimation MSE by 80% over the strongest baselines; when used in SPIRAL, it improves task success from around 50% to near-perfect performance on Folding Shorts (58% to 100%) and Cleaning Whiteboard (50% to 90%), showing that high-quality dense rewards are key to a stable robot data flywheel. Project website: https://qianzhong-chen.github.io/sarm2.github.io/.

2606.10363 2026-06-10 cs.RO New submission

HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation

Xiaoquan Sun, Ruijian Zhang, Chen Cao, Yihan Sun, Jiahui Chen, Zetian Xu, Bo Chen, Haijier Chen, Zhen Yang, Jiarun Zhu, Yijun Hong, JingZhe Xu, Jingrui Pang, Mingqi Yuan, Jiayu Chen

Affiliations: The University of Hong Kong; INFIFORCE; Huazhong University of Science and Technology; Tsinghua University; Wuhan University; Southern University of Science and Technology

AI summary: Proposes HiMem-WAM, a hierarchical memory-gated world action model that uses a hierarchical latent-action framework and boundary-triggered memory updates to improve task-relevant memory, generalization, and robustness in long-horizon robot manipulation.

Abstract

World Action Models (WAMs) have emerged as a new powerful paradigm for embodied intelligence, learning action-relevant visual dynamics that significantly enhance generalization and robustness. However, existing WAMs still struggle with task-relevant memory in long-horizon robotic manipulation. To address this, we present HiMem-WAM, a Hierarchical Memory-Gated WAM that integrates motion-centric latent actions, high-level skill latents, and boundary-triggered memory updates. Specifically, we develop a hierarchical latent action framework that jointly learns low-level motion and high-level skill latents, providing structured temporal abstraction. Meanwhile, a boundary-aware memory gate writes compact task states at predicted skill transitions, enabling causal inference without test-time generation of future video or optical flow estimation. Evaluated on LIBERO, LIBERO-PLUS, RMBench and real-world tasks, HiMem-WAM shows that hierarchical latents improve robustness under deployment perturbations, and the memory module substantially benefits memory-dependent long-horizon manipulation.

2606.10918 2026-06-10 cs.RO cs.LG New submission

Task Robustness via Re-Labelling Vision-Action Robot Data

Artur Kuramshin, Özgür Aslan, Cyrus Neary, Glen Berseth

Affiliations: Mila - Quebec AI Institute; Université de Montréal; The University of British Columbia

AI summary: Proposes TREAD, a framework that uses large vision-language models to decompose robot datasets into semantic sub-tasks and generate diverse instructions, improving policy generalization to unseen tasks without additional data collection.

Comments: Project website: https://akuramshin.github.io/tread

Abstract

The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via Re-Labelling Vision-Action Robot Data (TREAD), a scalable framework that leverages large Vision-Language Models (VLMs) to augment existing robotics datasets without additional data collection, harnessing the transferable knowledge embedded in these models. Our approach leverages a pretrained VLM through three stages: generating semantic sub-tasks from original instruction labels and initial scenes, segmenting demonstration videos conditioned on these sub-tasks, and producing diverse instructions that incorporate object properties, effectively decomposing longer demonstrations into grounded language-action pairs. We further enhance robustness by augmenting the data with linguistically diverse versions of the text goals. Evaluations on LIBERO demonstrate that policies trained on our augmented datasets exhibit improved performance on novel, unseen tasks and goals. Our results show that TREAD enhances both planning generalization through trajectory decomposition and language-conditioned policy generalization through increased linguistic diversity.

2606.10321 2026-06-10 cs.LG cs.AI cs.RO math.OC Cross-list

Baseline-Free Policy Optimization for Neural Combinatorial Optimization

Carlos S. Sepúlveda, Gonzalo A. Ruz

AI summary: Uses the GRPO algorithm to remove the baseline dependence in neural combinatorial optimization, avoiding training collapse and reaching performance close to POMO on TSP and CVRP.

Abstract

Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORCE with a rollout baseline, requires maintaining and periodically updating a frozen copy of the policy for variance reduction. This baseline introduces a structural vulnerability: on harder instances, a poor baseline produces noisy gradient estimates that can destabilize training. We evaluate Group Relative Policy Optimization (GRPO), an algorithm from large language model alignment that eliminates the baseline entirely by normalizing advantages within groups of sampled trajectories. In a controlled comparison of five RL algorithms on TSP and CVRP benchmarks within the RL4CO framework, we find that: (i) GRPO avoids the training collapse observed with REINFORCE on TSP-100, where performance degrades from cost 9.8 to 52.1 immediately after the warmup phase and does not recover under extended training; (ii) at matched gradient updates, GRPO achieves solution quality within 2% of POMO, a strong AM-based multi-start baseline, while requiring no external baseline; and (iii) P3O, a pairwise preference algorithm also from the alignment literature, is competitive on TSP but shows higher variability on CVRP. These results identify GRPO as a promising baseline-free alternative for NCO, particularly in settings where baseline-dependent training becomes fragile.
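The core idea named in the abstract, normalizing advantages within a group of sampled trajectories instead of subtracting a learned or rollout baseline, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the reward is the negative tour cost, the population standard deviation is used for normalization, and `eps` is a hypothetical stabilizer. The paper's RL4CO configuration may differ in all three.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards of trajectories sampled for the SAME problem instance.

    The group mean plays the role of the baseline, so no frozen policy copy
    or critic network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical tour costs for four sampled solutions of one TSP instance.
tour_costs = [9.8, 10.4, 9.5, 11.1]
advantages = group_relative_advantages([-c for c in tour_costs])
```

Shorter tours receive positive advantages and longer tours negative ones, and the advantages sum to approximately zero within each group.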

2606.10771 2026-06-10 astro-ph.IM cs.LG cs.RO Cross-list

On-sky demonstration of reinforcement learning for adaptive optics control

Jalo Nousiainen, Vincent Chambouleyron, Benoit Neichel, Sylvain Cetre, Jean-Francois Sauvage, Angelie Alagao, Markus Kasper, Jonathan Dray, Romain Fetick, Byron Engler

Affiliations: European Southern Observatory; Aix Marseille University; CNRS; CNES; LAM; Wakea Consulting; Bertin Alpao

AI summary: Reports the first on-telescope demonstration of PO4AO, a reinforcement-learning-based adaptive optics controller, which outperforms a conventional integrator controller across a range of conditions and demonstrates robustness and high performance.

Comments: 11 pages, 12 figures, accepted by A&A

Abstract

Reinforcement learning (RL)-based algorithms have recently emerged as a promising approach for adaptive optics (AO) control. In simulations and laboratory experiments, they have demonstrated robustness to real-world effects such as photon and detector noise, misregistration, vibrations, and rapid variations in seeing conditions. However, their performance has not yet been validated on sky. We report the first on-sky demonstration of a reinforcement learning controller for adaptive optics, named Policy Optimization for AO (PO4AO). We further analyze its on-sky behavior and identify directions for improving the algorithm and its implementation. PO4AO was implemented and deployed on the Papyrus adaptive optics system installed at the Coudé focus of the 1.52 m telescope (T152) at the OHP. A Python-based implementation was interfaced with the existing real-time controller (DAO RTC) via shared-memory buffers. The performance of PO4AO was compared to that of a standard integrator controller over several nights, covering a range of flux levels and atmospheric conditions. PO4AO consistently outperformed the standard integrator in all tested configurations. The controller successfully learned and compensated for vibration patterns and demonstrated strong robustness to measurement noise. Once tuned for Papyrus, PO4AO operated in a turnkey fashion, using a single set of hyperparameters across varying observing conditions and science targets. These performance gains were achieved despite a non-optimized Python implementation introducing approximately 750 μs of additional latency, along with control jitter and occasional frame drops. When properly implemented and optimized, PO4AO constitutes a robust and high-performance turnkey controller for single-conjugate adaptive optics systems, paving the way for broader adoption of reinforcement learning strategies in on-sky AO operations.
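For readers unfamiliar with the baseline that PO4AO is compared against, a textbook integrator control law can be sketched as follows. This is a generic illustration, not the Papyrus or DAO RTC implementation: the gain, the optional leak factor, the three-mode toy disturbance, and the absence of loop latency are all simplifying assumptions.

```python
def integrator_step(command, residual, gain=0.4, leak=1.0):
    """One integrator update per mode: c_k = leak * c_{k-1} + gain * residual_k."""
    return [leak * c + gain * r for c, r in zip(command, residual)]

# Toy closed loop: a static disturbance expressed in three hypothetical modes.
disturbance = [0.5, -0.2, 0.1]
command = [0.0, 0.0, 0.0]
for _ in range(30):
    # The sensor is assumed to measure what remains after the current correction.
    residual = [d - c for d, c in zip(disturbance, command)]
    command = integrator_step(command, residual)
```

With a fixed gain the residual shrinks by a constant factor each frame. A controller of this form cannot anticipate temporal structure such as vibrations, which is the kind of behaviour the abstract reports PO4AO learning to compensate.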

2603.05291 2026-06-10 cs.RO Replacement

Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies

Clemence Grislain, Mathilde Kappel, Olivier Sigaud, Mohamed Chetouani

Affiliations: ISIR, Sorbonne Université, CNRS

AI summary: Proposes ORCHID, a self-training framework that filters rollouts using environment feedback and distills them back into the planner and controller, giving stable online improvement of hierarchical diffusion policies. On the CALVIN benchmark, a lightweight model surpasses purely offline methods.

Comments: Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation (DEMO)

Abstract

Hierarchical policies decompose language-conditioned long-horizon robotic manipulation into a high-level (HL) planner and a low-level (LL) controller. However, effective coordination between the HL planner and the LL controller requires that both components operate on compatible subgoal distributions. We propose ORCHID, a self-training framework that enables stable online improvement of hierarchical diffusion policies by aligning planning and control through iterative refinement. By filtering policy samples via environment feedback, ORCHID identifies trajectories where the planner and controller are jointly successful and distills them back into both modules via supervised learning. This process induces a bidirectional co-adaptation: the planner grounds its subgoals in the actual reaching capabilities of the controller, while the controller specializes in the trajectory structures the planner produces. By relying on supervised distillation of filtered on-policy samples, ORCHID avoids the instability typical of online hierarchical gradient-based RL training with diffusion models. On the CALVIN benchmark, ORCHID allows a lightweight, initially weak model to outperform pure offline methods, including a Vision-Language-Action model twice its size.

2606.06323 2026-06-10 cs.RO Replacement

VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies

Robert Ramirez Sanchez, Daniel J. Evans, Dylan P. Losey, Siddarth Jain

Affiliations: Collab, Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061; Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139

AI summary: Proposes VOLT, which segments demonstration trajectories using vision and language cues, selectively downsamples the parts that can be safely sped up, and preserves the slow segments that need fine manipulation, yielding robot policies that run faster than the demonstrations.

Abstract

Humans often take longer to demonstrate a task than a robot would need to execute it. Rather than learning to replicate the demonstration at the same pace, many industrial and practical applications require robots to perform tasks as quickly as possible. In this paper, we investigate several hypotheses for learning policies that operate faster-than-demonstrations. Our experiments show that the most effective strategy is to downsample recorded demonstrations and train the robot's policy on this accelerated data. However, uniformly downsampling an entire trajectory can be problematic. Some parts of a task can be safely sped up (e.g., unconstrained motion), while others demand slower, more precise motion (e.g., object interactions or fine manipulation). To address this challenge, we introduce VOLT, a vision-and-language trajectory segmentation method that reasons over video demonstrations, and leverages contextual cues to determine when acceleration is appropriate and when careful precision is required. VOLT identifies segments where slow, deliberate motion is necessary, then selectively downsamples the remaining segments. The resulting reformatted trajectories can be used with standard imitation learning approaches, such as diffusion policies. Our results highlight that segmentation quality is critical -- baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. Compared to state-of-the-art alternatives, VOLT allows robots to execute tasks faster while maintaining strong performance.
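The selective downsampling step described in the abstract can be sketched as below. The sketch assumes the slow segments are already known as half-open index ranges; in VOLT they would come from the vision-and-language segmentation, which is not reproduced here. The function name and the `factor` parameter are hypothetical.

```python
def selective_downsample(trajectory, slow_segments, factor=2):
    """Keep every step inside slow segments; keep every `factor`-th step elsewhere.

    slow_segments: list of half-open (start, end) index ranges that need
    slow, precise motion and are therefore left at full resolution.
    """
    slow = set()
    for start, end in slow_segments:
        slow.update(range(start, end))

    kept = []
    fast_run = 0  # number of fast steps seen since the last slow step
    for i, step in enumerate(trajectory):
        if i in slow:
            kept.append(step)
            fast_run = 0
        else:
            if fast_run % factor == 0:
                kept.append(step)
            fast_run += 1
    return kept

# Ten recorded steps; steps 4..6 are a hypothetical fine-manipulation segment.
demo = list(range(10))
fast_demo = selective_downsample(demo, slow_segments=[(4, 7)], factor=2)
# fast_demo == [0, 2, 4, 5, 6, 7, 9]
```

Replaying the shortened trajectory at the original control rate makes the unconstrained portions run roughly `factor` times faster while the marked segment keeps its demonstrated pace.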

2. Motion Planning, Control, and Dynamics (12 papers)

2606.09958 2026-06-10 cs.RO cs.AI New submission

Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

Ming Cheng, Hao Chen, Ziyi Yang, Ziluowen Luo, Senzhang Wang

AI summary: Proposes Uncertainty-Aware Motion Planning (UAMP), which quantifies uncertainty in human intent and introduces uncertainty-calibrated value learning to improve the safety and comfort of autonomous driving in mixed traffic.

Abstract

In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires anticipating the future behaviors of surrounding human drivers. Existing reinforcement learning-based methods generally incorporate the predicted human intents directly into the observation to enable proactive planning. However, human intent is inherently uncertain due to behavioral diversity, perception noise, and partial observability. Treating predicted intents as deterministic states can result in unsafe decisions for autonomous vehicles. To address this problem, we propose Uncertainty-Aware Motion Planning (UAMP), which incorporates uncertainty in human intent prediction into AV decision-making. Specifically, UAMP first introduces a proximity-aware uncertainty estimator to quantify the interaction-conditioned intent uncertainty and constructs an uncertainty-guided joint intent distribution over surrounding human-driven vehicles. Within this uncertainty set, UAMP further introduces Uncertainty-Calibrated Value Learning (UCVL) to correct value function learning biases arising from directly incorporating uncertain human intent predictions into the observation. Extensive experiments in various mixed-traffic scenarios show that UAMP significantly improves safety and driving comfort, while maintaining traffic efficiency compared with existing approaches. The code is released at https://anonymous.4open.science/r/UAMP-5638.

2606.10273 2026-06-10 cs.RO New submission

Locomotion analysis of a quadruped interacting with the lunar granular surface

Yash J Vyas

Affiliations: Department of Industrial Engineering, University of Padua

AI summary: Trains a quadruped with reinforcement learning to walk on a simulated lunar granular surface and compares gait and energy use between rigid-contact and soft-contact environments, finding that soft contact makes training harder, changes the gait, and raises energy consumption.

Abstract

Deploying legged robots in extraterrestrial environments poses many challenges, arising from complex terrain interactions and from energy and thermal constraints. Effective mechanical design of a lunar exploration quadrupedal robot requires careful consideration of motor torques, energy expenditure, and cost of transport. The lunar surface is composed of granular regolith, which affects the locomotion and performance of legged robots. Locomotion algorithms trained under rigid-contact assumptions are also ineffective in environments with soft contacts, such as granular surfaces, where they can suffer instability and poor tracking. In this report, a physical model of the contact between the granular lunar surface and the robot foot is applied to a simulation environment in which locomotion is trained using Reinforcement Learning. Policies trained in rigid-contact and soft-contact environments are compared by analysing gait and locomotion performance metrics. The analysis demonstrates that soft contacts simulating regolith surfaces pose additional challenges for Reinforcement Learning based training, result in a qualitatively different gait, and increase the overall energy expenditure.
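The abstract names cost of transport as a design metric without defining it. A common dimensionless definition is energy divided by weight times distance travelled, sketched below. Whether the report uses this exact form, and whether it uses lunar or terrestrial gravity in the denominator, is an assumption here; the numbers in the example are made up for illustration.

```python
LUNAR_GRAVITY = 1.62  # m/s^2, approximate surface gravity of the Moon

def cost_of_transport(energy_j, mass_kg, distance_m, gravity=LUNAR_GRAVITY):
    """Dimensionless cost of transport: E / (m * g * d)."""
    if mass_kg <= 0 or distance_m <= 0 or gravity <= 0:
        raise ValueError("mass, distance, and gravity must be positive")
    return energy_j / (mass_kg * gravity * distance_m)

# Hypothetical example: a 50 kg robot spends 810 J to travel 10 m on the Moon.
cot = cost_of_transport(energy_j=810.0, mass_kg=50.0, distance_m=10.0)
```

Under this definition, a gait that spends more energy over the same distance has a proportionally higher cost of transport, which gives one way to quantify the increased energy expenditure the abstract reports for soft contacts.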

2606.10288 2026-06-10 cs.RO New submission

MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds

Codrin Crismariu, Ryan K. Cosner

Affiliations: Department of Mechanical Engineering

AI summary: Proposes a model-assisted reinforcement learning framework that combines safe reference trajectories from simplified models, teacher-policy training guided by a control-Lyapunov-function reward, and distillation into a vision-based student policy, achieving robust perceptive walking for humanoids over sparse footholds.

Abstract

Perceptive bipedal locomotion over sparse terrain remains a difficult challenge: model-based methods are precise but brittle to uncertainty, while model-free methods are robust but struggle to discover the precise, constrained motions required for safety-critical locomotion where small errors can cause catastrophic failures. We propose a model-assisted reinforcement learning (RL) framework that combines both perspectives in three steps: (1) generate a safe reference trajectory using simplified models; (2) train a privileged teacher policy guided by a control Lyapunov function (CLF) reward built around the safe reference trajectory; and (3) distill the teacher into a vision-based student policy. We show that this model-assistance procedure produces physically grounded locomotion, improving sample efficiency, reducing the need for a complex learning curriculum, and achieving smoother locomotion behavior alongside stepping stone performance comparable to model-free baselines. We validate our approach in simulation and demonstrate successful deployment on a Unitree G1 humanoid robot navigating sparse footholds with lateral constraints.

2606.10579 2026-06-10 cs.RO cs.SY eess.SY New submission

LieIPM: Lie Group Interior Point Method for Direct Trajectory Optimization of Rigid Bodies

Sangli Teng, Ruiqi Zhang, Tzu-Yuan Lin, William A Clark, Mark Mueller, Ram Vasudevan, Maani Ghaffari, Koushil Sreenath

Affiliations: University of California, Berkeley; MIT; Ohio University; University of Michigan, Ann Arbor

AI summary: Proposes LieIPM, a constrained trajectory optimization framework built on Lie group structure that uses second-order rigid-body models and variational integrators to obtain singularity-free, fast-converging Newton-type updates.

Abstract

Designing dynamically feasible trajectories for rigid bodies is a fundamental problem in robotics. While direct methods are widely used, existing constrained optimizers typically operate in Euclidean space and ignore the manifold structure of rigid body motions. This mismatch may introduce singularities or lead to poorly conditioned optimization problems. To bridge this gap, we develop a structure-aware framework for constrained trajectory optimization directly on matrix Lie groups. Our approach is based on second-order rigid body models that exploit Lie group structure, which enables efficient Newton-type updates while preserving the underlying geometry. Building on this model, we propose a line-search Lie Group Interior Point Method (LieIPM) to handle constraints on the manifolds. We instantiate the framework for rigid body motion planning using Lie group variational integrators and derive closed-form intrinsic derivatives that exploit group symmetries. The LieIPM preserves the topology of rotation motions by construction and avoids singularities. Numerical results demonstrate superior robustness and faster convergence compared to general-purpose solvers and structure-exploiting optimal control methods.
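To illustrate what updating "directly on matrix Lie groups" means in practice, the sketch below applies a step to a rotation through the SO(3) exponential map rather than by adding to Euler angles or quaternion coordinates. It shows only the standard retraction R <- R exp(hat(step)) via Rodrigues' formula. It is not the LieIPM algorithm, and the helper names are hypothetical.

```python
import math

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def so3_exp(w):
    """Rodrigues' formula: exp(hat(w)) = I + (sin t / t) K + ((1 - cos t) / t^2) K^2."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:  # small-angle limit of the two coefficients
        a, b = 1.0, 0.5
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def retract(R, step):
    """Apply a tangent-space step to a rotation; the result stays in SO(3)."""
    return matmul(R, so3_exp(step))

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R = retract(identity, (0.0, 0.0, math.pi / 2))  # quarter turn about the z axis
```

Because every iterate is produced by a group operation, it remains a valid rotation matrix up to floating-point error, with no renormalization step and no coordinate singularity of the kind Euler angles exhibit.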

2606.10841 2026-06-10 cs.RO cs.SY eess.SY math.OC New submission

Gradient based Bilevel for Inverse Optimal Control, a Riemannian approach

Ahmed-Manaf Dahmani, Vincent Bonnet, David Daney, François Charpillet

AI summary: Proposes a Riemannian inverse optimal control method that treats the set of optimal trajectories as a manifold and optimizes on it, avoiding violations of standard constraint qualifications and cutting computation time by about a factor of four.

Comments: 6 Pages, 4 Figures. To be published in a control journal

Abstract

Inverse Optimal Control (IOC) aims to recover the cost function that explains observed trajectories as solutions of an optimal control problem. Classical IOC formulations rely on bilevel optimization, which repeatedly solves a nested optimal control problem and quickly becomes computationally prohibitive for realistic systems. Recent projection-based approaches offer a promising alternative but suffer from numerical instability when solved with gradient-based methods due to violations of standard constraint qualifications. In this paper, we show that these difficulties stem from the geometric structure of the IOC feasible set. We demonstrate that the set of trajectories satisfying the optimality conditions naturally forms a manifold and reformulate IOC as an optimization problem on this manifold. Based on this insight, we propose a Riemannian Inverse Optimal Control (RIOC) method that projects observed trajectories onto the manifold of optimal solutions while preserving feasibility by construction. Experiments on real human arm trajectories show that the proposed method achieves comparable or better reconstruction accuracy than classical bilevel IOC while reducing computation time by about a factor of four. These results highlight the potential of geometric optimization methods to improve the scalability and reliability of IOC for robotics and human motion analysis.

2606.11034 2026-06-10 cs.RO cs.NE New submission

A Spiking Neural Architecture for Coordinating Arm and Locomotor Control

Lea Steffen, Kathryn Simone, Graeme Damberger, Travis DeWolf, Hudson Ly, Chris Eliasmith

Affiliations: Centre for Theoretical Neuroscience; Dept. of Systems Design Engineering, University of Waterloo; Applied Brain Research; Dept. of Nanotechnology Engineering, University of Waterloo; Dept. of Philosophy, University of Waterloo

AI summary: Proposes a spiking neural network (SNN) architecture that uses the Neural Engineering Framework (NEF) and Semantic Pointer Architecture (SPA) to coordinate force-based arm control and bipedal locomotion in a humanoid robot, with a basal ganglia model for high-level action selection. It is described as the first integrated controller of this kind on a full-scale humanoid platform.

Abstract

Spiking Neural Networks (SNNs) coupled with neuromorphic hardware offer energy-efficient solutions for humanoid robot control. However, existing SNN-based motor control systems address bipedal locomotion and arm control in isolation, leaving integrated control of both unaddressed. We present a spiking architecture that coordinates force-based arm control and bipedal locomotion in a simulated humanoid, using the Neural Engineering Framework (NEF) and Semantic Pointer Architecture (SPA). High-level action selection between locomotor and arm control is mediated by a biologically grounded spiking basal ganglia model. We validate the system through co-simulation of Nengo, for the neural control, and Isaac Sim, demonstrating successful target reaching, continuous digit drawing, path-following locomotion, and finally, switching between walking and arm control via basal ganglia disinhibition. To our knowledge, this is the first integrated spiking controller to combine bipedal locomotion and arm control on a full-scale humanoid platform. The full spike-based implementation enables future deployment on low-power neuromorphic hardware.

2504.17080 2026-06-10 cs.RO cs.SY eess.SY Updated version

Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators

基于SE(3)的机器人统一力-阻抗控制的几何公式化

Joohwan Seo, Nikhil Potu Surya Prakash, Soomi Lee, Arvind Kruthiventy, Megan Teng, Jongeun Choi, Roberto Horowitz

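Affiliations: University of California, Berkeley, USA

AI summary: Proposes an impedance control framework on the SE(3) manifold that uses energy tank augmentation to achieve force tracking with guaranteed passivity, resolves a non-causal implementation problem in UFIC, and inherits SE(3) invariance and equivariance, which helps sample efficiency when a learning algorithm is incorporated into the control law.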

详情
AI中文摘要

在本文中,我们提出了一个在SE(3)流形上的阻抗控制框架,该框架在保证无源性的同时实现力跟踪。基于统一力-阻抗控制(UFIC)和我们先前在几何阻抗控制(GIC)上的工作,我们开发了几何统一力-阻抗控制(GUFIC),以使用微分几何视角在控制器公式中考虑SE(3)流形结构。与UFIC一样,GUFIC利用能量罐增强进行力跟踪和阻抗控制,以保证机械臂相对于外力的无源性。这确保了末端执行器与不确定环境保持安全的接触交互,并跟踪期望的交互力。此外,我们通过引入速度场和力场解决了UFIC公式中的非因果实现问题。由于其在SE(3)上的公式化,所提出的GUFIC继承了GIC理想的SE(3)不变性和等变性,这有助于在将学习算法纳入控制律的机器学习应用中提高样本效率。所提出的控制律在需要跟踪包含位置和方向的SE(3)轨迹同时对表面施加力的场景下的仿真环境中得到了验证。代码可在以下网址获取:https://github.com/Joohwan-Seo/GUFIC_mujoco。

英文摘要

In this paper, we present an impedance control framework on the SE(3) manifold, which enables force tracking while guaranteeing passivity. Building upon the unified force-impedance control (UFIC) and our previous work on geometric impedance control (GIC), we develop the geometric unified force impedance control (GUFIC) to account for the SE(3) manifold structure in the controller formulation using a differential geometric perspective. As in the case of the UFIC, the GUFIC utilizes energy tank augmentation for both force-tracking and impedance control to guarantee the manipulator's passivity relative to external forces. This ensures that the end effector maintains safe contact interaction with uncertain environments and tracks a desired interaction force. Moreover, we resolve a non-causal implementation problem in the UFIC formulation by introducing velocity and force fields. Due to its formulation on SE(3), the proposed GUFIC inherits the desirable SE(3) invariance and equivariance properties of the GIC, which helps increase sample efficiency in machine learning applications where a learning algorithm is incorporated into the control law. The proposed control law is validated in a simulation environment under scenarios requiring tracking an SE(3) trajectory, incorporating both position and orientation, while exerting a force on a surface. The codes are available at https://github.com/Joohwan-Seo/GUFIC_mujoco.

2512.08280 2026-06-10 cs.RO cs.AI cs.SY eess.SY Updated version

Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making

基于模型扩散采样的离线决策预测控制

Haldun Balim, Na Li, Yilun Du

发表机构 * GitHub

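AI summary: Proposes MPDiffuser, a framework that composes a diffusion planner with a dynamics diffusion model and alternates their updates during sampling to generate task-aligned, dynamically plausible trajectories; a lightweight ranking module then selects the best trajectory. Validated on the D4RL and DSRL benchmarks and on a quadrupedal robot.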

详情
AI中文摘要

通过扩散模型进行离线决策通常会产生与系统动力学不对齐的轨迹,限制了其在控制中的可靠性。我们提出了模型预测扩散器(MPDiffuser),一种组合扩散框架,它将扩散规划器与动力学扩散模型相结合,以生成任务对齐且动力学可行的轨迹。MPDiffuser在采样过程中交替进行规划器和动力学更新,逐步修正可行性同时保留任务意图。然后,一个轻量级排序模块选择最能满足任务目标的轨迹。组合设计通过使动力学模型能够独立于规划器利用多样且未见过的数据,提高了样本效率和适应性。实验上,我们在无约束(D4RL)和约束(DSRL)基准上展示了相对于先前基于扩散的方法的一致改进,并通过在真实四足机器人上的部署验证了实用性。

英文摘要

Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We propose Model Predictive Diffuser (MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.
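The interleaving idea in the abstract can be pictured with a deliberately simple toy. The sketch below is not MPDiffuser: it replaces both diffusion models with hand-written update rules on a 1-D trajectory, and every name in it (`planner_update`, `dynamics_update`, `interleaved_sample`, `rank`) is hypothetical. It only shows the control flow of alternating a task-driven update with a feasibility-driven update, then ranking candidates.

```python
def planner_update(traj, goal, rate=0.5):
    """Stand-in for a planner step: pull every non-initial state toward the goal."""
    return [traj[0]] + [x + rate * (goal - x) for x in traj[1:]]


def dynamics_update(traj, max_step=1.0):
    """Stand-in for a dynamics step: clamp each transition to |x[t+1] - x[t]| <= max_step."""
    out = [traj[0]]
    for x in traj[1:]:
        lo, hi = out[-1] - max_step, out[-1] + max_step
        out.append(min(hi, max(lo, x)))
    return out


def interleaved_sample(traj, goal, n_iters=20, max_step=1.0):
    """Alternate the two updates; the dynamics update runs last, so the result is feasible."""
    for _ in range(n_iters):
        traj = planner_update(traj, goal)
        traj = dynamics_update(traj, max_step)
    return traj


def rank(candidates, goal):
    """Toy ranking module: prefer the candidate whose final state is closest to the goal."""
    return min(candidates, key=lambda tr: abs(tr[-1] - goal))


if __name__ == "__main__":
    goal = 5.0
    candidates = [interleaved_sample([0.0] * n, goal) for n in (3, 4)]
    print(rank(candidates, goal))
```

In this toy the planner alone would jump straight to the goal, while the dynamics step alone would never move; alternating them yields the fastest approach that respects the step limit.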

2602.05791 2026-06-10 cs.RO Updated version

Scalable and General Whole-Body Control for Cross-Humanoid Locomotion

可扩展且通用的全身控制:跨人形机器人运动

Yufei Xue, YunFeng Lin, Wentao Dong, Yang Tang, Jingbo Wang, Jiangmiao Pang, Ming Zhou, Minghuan Liu, Weinan Zhang

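Affiliations: Tsinghua University

AI summary: Proposes XHugWBC, a framework that combines morphological randomization, semantically aligned observation and action spaces, and an effective policy architecture so that a policy trained once generalizes zero-shot across many humanoid robots.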

详情
AI中文摘要

基于学习的全身控制器已成为人形机器人的关键驱动力,但现有方法大多需要针对特定机器人进行训练。本文研究了跨实体人形控制问题,并表明单一策略通过一次性训练即可稳健地泛化到各种人形机器人设计。我们提出了XHugWBC,一种新颖的跨实体训练框架,通过以下方式实现通用人形控制:(1) 物理一致的形态随机化,(2) 跨不同人形机器人的语义对齐观测和动作空间,以及(3) 建模形态和动力学特性的有效策略架构。XHugWBC不依赖于任何特定机器人,而是在训练过程中内化广泛的形态和动力学特性分布。通过从多样化的随机实体中学习运动先验,策略获得了强大的结构偏差,支持对未见过的机器人进行零样本迁移。在12个模拟人形机器人和7个真实世界机器人上的实验证明了所得通用控制器的强泛化性和鲁棒性。

英文摘要

Learning-based whole-body controllers have become a key driver for humanoid robots, yet most existing approaches require robot-specific training. In this paper, we study the problem of cross-embodiment humanoid control and show that a single policy can robustly generalize across a wide range of humanoid robot designs with one-time training. We introduce XHugWBC, a novel cross-embodiment training framework that enables generalist humanoid control through: (1) physics-consistent morphological randomization, (2) semantically aligned observation and action spaces across diverse humanoid robots, and (3) effective policy architectures modeling morphological and dynamical properties. XHugWBC is not tied to any specific robot. Instead, it internalizes a broad distribution of morphological and dynamical characteristics during training. By learning motion priors from diverse randomized embodiments, the policy acquires a strong structural bias that supports zero-shot transfer to previously unseen robots. Experiments on twelve simulated humanoids and seven real-world robots demonstrate the strong generalization and robustness of the resulting universal controller.

2605.31405 2026-06-10 cs.RO Updated version

Adaptive Artificial Time-Delay Control with Barrier Lyapunov Constraints for Euler-Lagrange Robots

具有障碍李雅普诺夫约束的自适应人工时延控制用于欧拉-拉格朗日机器人

Saksham Gupta, Rishabh Dev Yadav, Sarthak Mishra, Amitabh Sharma, Sourish Ganguly, Wei Pan, Spandan Roy, Simone Baldi

Affiliations: Robotics Research Center, International Institute of Information Technology Hyderabad, India; Department of Computer Science, University of Manchester, UK; Autonomous Systems and Automatic Control, School of Engineering, Newcastle University, UK; Self-Organizing Mobility Lab, School of Mathematics, Southeast University, Nanjing 210096, China

AI summary: Targets state-dependent uncertainties and time-varying state constraints in Euler-Lagrange systems with an adaptive control framework that combines artificial time-delay estimation with a barrier Lyapunov function. It estimates the uncertainty bound online, enforces position and velocity constraints, and is validated experimentally on a five-degree-of-freedom manipulator.

详情
AI中文摘要

本文解决了欧拉-拉格朗日系统中同时补偿状态相关不确定性和强制执行时变状态约束的挑战,这是机器人学中的常见需求,但现有控制设计尚未充分满足。开发了一种新颖的自适应控制框架,将基于人工时延的不确定性估计策略(也称为时延估计)与障碍李雅普诺夫函数相结合,以实现约束感知控制设计。具体而言,分析性地推导了时延估计近似误差的状态相关上界,并构造了自适应律在线估计其参数,从而无需先验模型知识即可实现实时状态相关不确定性补偿。为确保约束满足,基于障碍李雅普诺夫函数的控制器对位置和速度施加时变界限。通过李雅普诺夫分析证明了所提架构的稳定性。在五自由度机械臂上的实验结果验证了该框架相较于现有技术,在动态不确定性下保持严格遵守安全关键约束的能力。

英文摘要

This paper addresses the challenge of simultaneously compensating for state-dependent uncertainties and enforcing time-varying state constraints in Euler-Lagrange systems, a common requirement in robotics that remains underserved by existing control designs. A novel adaptive control framework is developed that combines an artificial time-delay-based uncertainty estimation strategy, also known as time-delay estimation, with a barrier Lyapunov function to enforce constraint-aware control design. Specifically, a state-dependent upper bound on the time-delay estimation approximation error is analytically formulated, and an adaptive law is constructed to estimate its parameters online, enabling real-time state-dependent uncertainty compensation without relying on prior model knowledge. To ensure constraint compliance, the barrier Lyapunov function-based controller enforces time-varying bounds on both position and velocity. The resulting architecture is provably stable via Lyapunov analysis. Experimental results on a five-degree-of-freedom robotic manipulator validate the framework's capability, compared with the state of the art, in maintaining strict adherence to safety-critical constraints under dynamic uncertainties.
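For orientation, the two ingredients named in the abstract have well-known textbook forms, sketched below. The symbols ($\bar{M}$ for a constant design matrix, $L$ for the artificial delay, $k_b$ for the constraint bound, $e$ for a tracking error) are assumptions chosen for illustration; the abstract does not give the paper's exact formulation, which additionally bounds the estimation error adaptively.

```latex
% Euler-Lagrange dynamics, rewritten around a constant design matrix \bar{M}
M(q)\ddot{q} + N(q,\dot{q}) = \tau
\;\Longrightarrow\;
\bar{M}\ddot{q} + H = \tau,
\qquad
H = \bigl(M(q) - \bar{M}\bigr)\ddot{q} + N(q,\dot{q})

% Time-delay estimation: approximate the lumped uncertainty H by its value one delay L ago
\hat{H}(t) = H(t-L) = \tau(t-L) - \bar{M}\,\ddot{q}(t-L)

% Log-type barrier Lyapunov function with a time-varying bound k_b(t)
V(e,t) = \frac{1}{2}\ln\frac{k_b^2(t)}{k_b^2(t) - e^2},
\qquad |e| < k_b(t)
```

Because $V$ grows without bound as $|e| \to k_b(t)$, keeping $V$ bounded along closed-loop trajectories keeps the error strictly inside the constraint.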

2606.06493 2026-06-10 cs.RO cs.AI cs.LG Updated version

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

HANDOFF: 通过蒸馏互补教师实现人形机器人任务空间全身控制

Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames

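Affiliations: California Institute of Technology; The Institute for Human & Machine Cognition

AI summary: Proposes HANDOFF, which uses multi-teacher KL distillation with context-conditioned gating to fuse three specialist policies (whole-body motion tracking, locomotion, and fall recovery) into a mixture-of-experts student that provides whole-body control through a compact, explicit interface. On the Unitree G1 it matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces.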

Comments 22 pages, 9 figures, Project page: https://lzyang2000.github.io/HANDOFF/

详情
AI中文摘要

对于要在现实世界中部署的人形机器人,命令空间(即任务规划与全身控制之间的接口)的选择至关重要。现有的全身控制器通常需要密集的运动学或空间参考,而规划器难以从任务语义中合成这些参考。我们提出了一种紧凑、显式的接口,该接口直观、通用、模块化且具有足够的表达能力,适用于多种操作技能。为此,我们引入了HANDOFF,这是一个单一的人形全身控制器,遵循该接口,并通过多教师KL蒸馏,在上下文条件门控方案下,从三个互补专家(具有安全过滤数据的全身运动跟踪、行走和跌倒恢复)中蒸馏出混合专家学生。在Unitree G1上,HANDOFF达到了最先进的速度跟踪性能,并提供了最大的鲁棒操作工作空间之一。我们进一步通过多个自然语言驱动的任务执行演示了硬件可行性,这些任务由VLM驱动的智能体规划器提供支持,无需特定任务数据或控制器微调。

英文摘要

For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
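A gate-weighted multi-teacher distillation loss can be sketched as follows. This is an illustrative assumption about the general shape of such a loss, not HANDOFF's implementation: it uses small discrete distributions instead of continuous action distributions, and the function names and the way the gate weights are supplied are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def gated_distillation_loss(student, teachers, gates):
    """Gate-weighted sum of KL(teacher_k || student) over all teachers.

    `gates` plays the role of a context-conditioned weighting: it says how much
    each specialist should supervise the student in the current context.
    """
    if abs(sum(gates) - 1.0) > 1e-9:
        raise ValueError("gate weights should sum to one")
    return sum(g * kl_divergence(t, student) for g, t in zip(gates, teachers))


if __name__ == "__main__":
    tracking = [0.7, 0.2, 0.1]      # hypothetical motion-tracking teacher
    locomotion = [0.1, 0.8, 0.1]    # hypothetical locomotion teacher
    recovery = [0.1, 0.1, 0.8]      # hypothetical fall-recovery teacher
    student = [0.3, 0.5, 0.2]
    # Context says "mostly walking": the locomotion teacher dominates the loss.
    print(gated_distillation_loss(student, [tracking, locomotion, recovery], [0.1, 0.8, 0.1]))
```

With a one-hot gate the loss reduces to ordinary single-teacher distillation, which is why a gating scheme lets complementary specialists supervise one student without contradicting each other.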

2605.09595 2026-06-10 cs.NE cs.RO Updated version

Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain

用于不平地形四足运动控制的神经形态强化学习

Zhuangyu Han, Abhronil Sengupta

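Affiliations: School of Electrical Engineering and Computer Science

AI summary: Proposes an equilibrium-propagation-based PPO framework that combines a CPG policy with a residual adjustment policy and uses local learning for quadruped locomotion on uneven terrain. Locomotion performance is comparable to a backpropagation-trained baseline, with 4.3× better GPU memory efficiency than backpropagation through time.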

详情
AI中文摘要

强化学习(RL)已实现复杂地形上的鲁棒四足运动,但大多数学习控制器通过反向传播在大量并行仿真中离线训练,并作为固定策略部署,限制了在地形变化、负载变化、执行器磨损以及其他实际条件下的适应能力,且受限于机载功耗。局部学习通过用局部神经状态驱动的更新替代全局反向传播图,为能量感知的机上自适应提供了潜在路径,使学习规则更兼容神经形态和内存计算基底。本文提出一种基于平衡传播(EP)的近端策略优化(PPO)框架,用于不平地形四足运动。控制器结合了仿生中枢模式发生器(CPG)策略和残余姿态调整策略,同时用支持EP的局部学习替代传统的反向传播训练的策略和价值网络。为了用EP训练随机连续控制策略,我们推导了与EP兼容的PPO输出扰动信号,并引入了一种双边比率裁剪机制,在松弛过程中稳定策略更新。在12自由度A1四足机器人上的实验表明,所提控制器在两阶段不平地形运动任务中实现了稳定的策略收敛。其运动性能在成功率、速度跟踪、执行器功率和身体稳定性方面与反向传播训练的PPO基线相当,同时与通过时间反向传播(BPTT)相比,GPU内存效率提高了4.3倍。这些结果表明,基于局部平衡的学习可以支持高维具身运动,并为低功耗机上自适应和微调提供算法基础。

英文摘要

Reinforcement learning (RL) has enabled robust quadruped locomotion over complex terrain, but most learned controllers are trained offline with backpropagation in massively parallel simulation and deployed as fixed policies, limiting adaptation to terrain variation, payload changes, actuator wear, and other real-world conditions under onboard power constraints. Local learning provides a potential path toward energy-aware on-robot adaptation by replacing global backpropagation graphs with updates driven by local neural states, making the learning rule more compatible with neuromorphic and in-memory computing substrates. This work proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for uneven-terrain quadruped locomotion. The controller combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, while replacing conventional backpropagation-trained policy and value networks with EP-enabled local learning. To train stochastic continuous-control policies with EP, we derive an EP-compatible PPO output-nudging signal and introduce a two-sided ratio clipping mechanism that stabilizes policy updates during relaxation. Experiments on a 12-DoF A1 quadruped show that the proposed controller achieves stable policy convergence in a two-stage uneven terrain locomotion task. Its locomotion performance is comparable to a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability, while improving GPU memory efficiency by 4.3× compared with backpropagation through time (BPTT). These results suggest that local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.
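For readers unfamiliar with ratio clipping, the snippet below computes the standard PPO clipped surrogate objective. It is the textbook form only; the paper's EP-compatible nudging signal and its two-sided clipping variant are not specified in the abstract and are not reproduced here. The function name and the default `eps=0.2` are illustrative choices.

```python
def ppo_clipped_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate from standard PPO (a quantity to be maximized).

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for sample i
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(1.0 + eps, r))
        # Taking the minimum removes any incentive to push the ratio outside
        # [1 - eps, 1 + eps] in the direction that would increase the objective.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


if __name__ == "__main__":
    print(ppo_clipped_objective([1.5, 0.5, 1.0], [1.0, -1.0, 2.0]))
```

In the example call, the first sample is capped at 1.2 × 1.0, the second at 0.8 × (−1.0), and the third is left unchanged at 2.0, so the mean is approximately 0.8.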

3. Manipulation, Grasping, and Dexterous Hands (13 papers)

2606.10039 2026-06-10 cs.RO New submission

Robotic Nonprehensile Object Transportation with a Hanging Tray

使用悬挂托盘的机器人非抓取式物体运输

Adam Heins, Angela P. Schoellig

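AI summary: For the robotic waiter's problem, proposes suspending the tray by ropes so that it moves like a three-dimensional pendulum, which reduces sliding and sloshing while actuating only a 3-DOF mobile base. Experiments confirm the benefit, and the tray is integrated into an interactive demonstration.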

Comments 8 pages, 11 figures. IEEE/ASME International Conference on Advanced Intelligent Mechatronics, 2026

详情
AI中文摘要

我们考虑称为服务员问题的非抓取式物体运输任务,其中机器人必须将平衡在托盘上的物体从一个位置移动到另一个位置。与先前关于机器人服务员问题的工作(使机器人倾斜由末端执行器刚性握持的托盘)不同,我们使用由绳索从末端执行器悬挂的托盘,使其行为类似于三维摆。一些先前的工作驱动机器人使末端执行器模拟摆的行为,因为摆运动减少了作用在运输物体上的剪切力,从而最小化刚性物体的滑动和液体容器中的泼洒。相比之下,我们使用真实的悬挂托盘,使得我们能够获得摆运动的益处,同时仅驱动3自由度移动基座,而不需要完整的6自由度机械臂。我们在仿真和真实硬件上的实验表明,与静态、刚性握持的托盘相比,悬挂托盘显著减少了滑动和泼洒。此外,我们将悬挂托盘集成到交互式机器人服务员演示中,该演示使用计算机视觉识别举手的人,并通过视觉伺服引导机器人朝向它们,使它们能够接触托盘。

英文摘要

We consider the nonprehensile object transportation task known as the waiter's problem, in which a robot must move an object balanced on a tray from one location to another. In contrast to prior works on the robotic waiter's problem, which make the robot tilt a tray rigidly held by its end effector (EE), we use a tray suspended from the EE by ropes, such that it behaves like a three-dimensional pendulum. Some prior works have actuated the robot so that the EE simulates the behavior of a pendulum, because pendular motion reduces the shear forces acting on the transported objects, minimizing the sliding of rigid objects and sloshing in containers of liquid. In contrast, our use of a real hanging tray allows us to obtain the benefits of pendular motion while only actuating a 3 degree-of-freedom (DOF) mobile base, rather than requiring a full 6-DOF manipulator arm. Our experiments in simulation and on real hardware show that the hanging tray substantially reduces both sliding and sloshing compared to a static, rigidly-grasped tray. Furthermore, we integrate the hanging tray into an interactive robot waiter demonstration, which uses computer vision to identify people with a raised hand and visual servoing to steer toward them and allow them to access the tray.
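The claim that pendular motion reduces shear forces can be checked with a short idealized calculation. The model below is an assumption made for illustration (point-mass tray and object of total mass $m$, a massless taut rope along the unit vector $\hat{r}$, tray surface perpendicular to the rope, friction coefficient $\mu$); it is not the paper's model.

```latex
% Rigidly held horizontal tray accelerating horizontally at a:
% friction alone must accelerate the object, so it stays put only if
m_o |a| \le \mu\, m_o g
\;\Longleftrightarrow\;
|a| \le \mu g

% Ideal hanging tray: the only forces on the bob are gravity and rope tension T
m\,\ddot{p} = m\,\mathbf{g} + T\,\hat{r}

% Contact force the tray must exert on an object of mass m_o that moves with it
F_c = m_o\bigl(\ddot{p} - \mathbf{g}\bigr) = \frac{m_o}{m}\,T\,\hat{r}

% The tray normal is \hat{r}, so the tangential (shear) component vanishes
F_c - \bigl(F_c \cdot \hat{r}\bigr)\hat{r} = 0
```

In this idealization no friction is needed at all, whatever the base acceleration; a real tray with distributed mass and several ropes only approximates this, which is consistent with the abstract's claim of a substantial reduction rather than elimination.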

2606.10244 2026-06-10 cs.RO cs.AI New submission

YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

YUBI:面向大规模双手灵巧操作的通用双指接口

Takehiko Ohkawa, Jumpei Arima, Yuki Noguchi, Masatoshi Tateno, Makoto Sugiura, Takuya Okubo, Kengo Ikeuchi, Yuma Shin, Hiroki Nishizawa, Naoaki Kanazawa, Yuki Wakayama, Daiki Fukunaga, Koshi Makihara, Tomohiro Motoda, Floris Erich, Yukiyasu Domae, Tatsuya Matsushima, Yohishiro Okumatsu, Kei Ota

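AI summary: Proposes YUBI, a finger-aligned gripper whose yielding, finger-driven actuation maps finger motion to jaw motion for intuitive, ergonomic data collection in bimanual dexterous manipulation. The resulting dataset covers 8,434 hours, 1.20M episodes, and 119 tasks, and a single policy transfers across multiple robots.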

Comments Project page: https://yubi.airoa.io/

详情
AI中文摘要

我们引入了Yielding Universal Bidigital Interface (YUBI),一种手指对齐的夹爪,旨在实现双手灵巧操作的直观、符合人体工学且可扩展的数据采集。虽然手持数据采集系统(如Universal Manipulation Interface (UMI))实现了低成本数据采集,但其笨重的手枪式握把设计可能给精细灵巧操作任务带来人体工学和使用性挑战。为此,YUBI提出了一种独特的设计原则:屈服式手指驱动,将人类手指运动直接映射到夹爪钳口运动。使用YUBI设备,我们建立了一个集成基于VR的6自由度夹爪跟踪的数据采集系统,确保高保真轨迹数据获取。我们整理了一个前所未有的基于UMI的数据集:8434小时,涵盖120万集和119个任务。实验表明,YUBI在复杂双手任务的通用性、灵巧性和操作效率方面优于UMI夹爪。通过在多个平台上安装夹爪,在YUBI数据集上训练的单一策略可迁移到多个双手机器人(UR、Franka和ELEY),证实采集的数据可直接作为策略监督执行。我们发布了夹爪硬件、数据采集软件和数据集作为集成堆栈,为开放社区提供可复现的大规模数据采集路径,以推动机器人基础模型的发展。

英文摘要

We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.

2606.10614 2026-06-10 cs.RO cs.CV cs.LG New submission

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

灵巧点策略:从人类演示中学习基于点的灵巧手策略

Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

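Affiliations: KAIST

AI summary: Proposes Dexterous Point Policy, a framework that learns dexterous manipulation policies from human videos through a unified 3D keypoint representation, with no robot demonstrations, and reaches 75% success on real-robot tasks.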

详情
AI中文摘要

基于人类演示视频预训练的机器人基础模型显示出潜力,但当策略部署到真实机器人时仍存在显著的具身差距。常见的补救措施是在机器人特定演示上微调这些模型。然而,机器人数据收集可能过于昂贵和耗时,这在灵巧操作中尤为突出,例如,即使是单个原子任务,遥操作多指手也可能需要数天。为了解决这个问题,我们引入了Dexterous Point Policy,一个直接从人类视频学习灵巧操作策略且无需机器人演示的框架。我们的核心见解是,统一的3D关键点表示在用于观察和动作时,可以桥接人类和机器人的具身。具体来说,我们从原始视频中提取任务相关物体和人类手的3D关键点,并训练一个自回归变换器来处理这些关键点。我们观察到,在关键点层面,特别是手腕和指尖,人类和机器人的行为紧密对齐,从而实现直接策略迁移。在一套包括拾取放置和工具使用的真实机器人任务中,Dexterous Point Policy达到了75.0%的成功率,而最先进的VLA基线仅达到1.0%。此外,我们的方法对未见过的场景具有很强的泛化能力,包括多物体环境和新型物体类别。

英文摘要

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

2606.10743 2026-06-10 cs.RO New submission

Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization

基于开放世界接触定位的以手为中心的人到机器人轨迹迁移

Yitian Shi, Di Wen, Zhengqi Han, Zicheng Guo, Yu Hu, Edgar Welte, Kunyu Peng, Rainer Stiefelhagen, Rania Rayyes

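Affiliations: Karlsruhe Institute of Technology (KIT)

AI summary: Proposes HOWTransfer, a framework that uses contact localization to extract contact-aware robot trajectories from human videos without object-specific descriptions, reaching 86% success across diverse manipulation tasks.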

详情
AI中文摘要

由于嘈杂的手-物体交互、部分观测下的未知物体以及跨实体差异,从人类视频演示中学习仍然具有挑战性。为了解决这些问题,我们提出了HOWTransfer(Hand-Object Open-World Transfer),这是一个以手为中心的框架,将人类演示提炼为接触感知、分类学信息丰富且多样化的机器人轨迹。HOWTransfer不依赖于物体特定描述、视觉语言查询或显式物体状态跟踪,而是通过推理观测到的手-物体交互线索,恢复时间一致的三维手部运动并定位时间接触区间。然后,利用定位的接触起始点将人类抓取意图重定向到多模态平行颚抓取假设,这些假设沿恢复的手腕轨迹传播以生成机器人可执行的运动。最后,轨迹编辑阶段细化接触对齐,并从单个演示生成多样化的可执行变体。跨多种操作任务的实验表明,HOWTransfer能够实现准确的接触定位和高质量的机器人运动重定向,成功率为86%,在盲选偏好研究中优于遥操作轨迹。

英文摘要

Learning from human video demonstrations remains challenging due to noisy hand-object interactions, unseen objects with partial observation, and cross-embodiment discrepancy. To address these challenges, we present HOWTransfer (Hand-Object Open-World Transfer), a hand-centric framework that distills human demonstrations into contact-aware, taxonomy-informed, and diverse robotic trajectories. Instead of relying on object-specific descriptions, vision-language queries, or explicit object-state tracking, HOWTransfer recovers temporally consistent 3D hand motion and localizes temporal contact intervals by reasoning over observed hand-object interaction cues. The localized contact onsets are then used to retarget human grasp intent into multi-modal parallel-jaw grasp hypotheses, which are propagated along the recovered wrist trajectory to generate robot-executable motions. Finally, a trajectory editing stage refines contact alignment and produces diverse executable variants from a single demonstration. Experiments across diverse manipulation tasks show that HOWTransfer enables accurate contact localization and high-quality robot motion retargeting with 86% success, which is preferred over teleoperated trajectories in a blinded preference study.

2606.10808 2026-06-10 cs.RO New submission

Bridging Semantics and Physical Execution: A Neuro-Symbolic Framework for Multi-Pair Robotic Assembly

桥接语义与物理执行:面向多对机器人装配的神经符号框架

Xinyi Li, Aiguo Song, Linhu Wei, Huijun Li

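Affiliations: School of Instrument Science and Engineering, Southeast University

AI summary: Proposes an end-to-end neuro-symbolic framework that works hierarchically (generating an optimal subgraph per pair, decoupling generality from edge cases, and coordinating a global sequence) to handle spatial interference and contact uncertainty in multi-pair assembly in unstructured environments. It reaches 97% global executability on 100 real-world scenes and a 90% success rate when deployed on a UR3 arm.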

Comments Corresponding author: Aiguo Song (a.g.song@seu.edu.cn)

详情
AI中文摘要

非结构化环境中的多对机器人装配面临空间干扰和接触不确定性。现有范式无法桥接认知决策与物理执行,要么遭遇状态空间爆炸和知识瓶颈,要么遭受逻辑幻觉和拓扑冲突。我们提出一种端到端神经符号框架,分层解决该挑战:为每对生成最优子图,将通用性与边缘情况解耦,然后解决跨对干扰。给定眼在手RGB-D装配场景,框架提取语义实例身份和状态,同时量化场景以计算散度。对于每对,通过LLM使用基本动作生成最优子图以减轻幻觉。边缘情况的支撑动作通过轻量级判别器推理并插入。由量化基线与当前场景之间的散度驱动,该框架易于以低成本扩展。增强的子图在拓扑上协调为全局序列,同时保持内部行为一致性。嵌入原子技能的动态行为树闭环力感知执行循环。在100个真实场景上的离线评估达到97.00%的全局可执行性,优于经典和最新规划器。在UR3机械臂上的真实机器人部署在强干扰下达到90%的成功率,公差0.5毫米,展示了复杂自主装配的统一且可验证解决方案。

英文摘要

Multi-pair robotic assembly in unstructured environments faces spatial interference and contact uncertainties. Existing paradigms fail to bridge cognitive decision-making and physical execution, as they either encounter state-space explosion and knowledge bottlenecks or suffer from logical hallucinations and topological conflicts. We propose an end-to-end neuro-symbolic framework that solves the challenge hierarchically: generating optimal subgraphs for each pair, decoupling generality from edge cases, and then resolving cross-pair interferences. Given an eye-on-hand RGB-D assembly scene, the framework extracts semantic instance identity and state while quantifying the scene for divergence calculation. For each pair, an optimal subgraph is generated by an LLM using only basic actions to mitigate hallucinations. Supportive actions for edge cases are reasoned and inserted with a lightweight discriminator. Because this step is driven by the divergence between the quantified baseline and the current scene, the framework is easily extensible at low cost. Augmented subgraphs are topologically coordinated into global sequences while preserving internal behavioral coherence. Dynamic behavior trees embedding atomic skills close the force-aware execution loop. Offline evaluation on 100 real-world scenes achieves 97.00% global executability, outperforming classical and state-of-the-art planners. Real-robot deployment on a UR3 arm attains a 90% success rate with 0.5 mm tolerance under strong interference, demonstrating a unified and verifiable solution for complex autonomous assembly.

2606.10818 2026-06-10 cs.RO cs.CV New submission

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

IMPACT:面向强力机器人操控的内部模型预测控制学习

Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen, Yilun Du

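Affiliations: Harvard University; Stanford University

AI summary: Proposes IMPACT, a framework that decouples forceful manipulation tasks into task planning and internal-model-based predictive control. Simulation and real-world experiments show advantages in success rate, generalization, safety, and energy efficiency.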

Comments Project website: https://gao-jiawei.com/IMPACT/

详情
AI中文摘要

现实世界中的机器人操控任务通常涉及与环境的有力交互,例如使用不同重量的工具、运输不同质量的物体以及执行接触密集任务(如擦桌子)。先前的基于学习方法通常采用模仿学习策略,输出由低级阻抗控制器跟踪的目标末端执行器姿态。在这些系统中,有力交互要么通过稳态跟踪误差隐式实现,要么使用腕部力/扭矩或触觉传感器显式命令。然而,隐式方法在不同物体重量下泛化能力差,而显式方法需要专用硬件并增加系统复杂性。在这项工作中,我们提出了IMPACT,一个将这些有力任务解耦为任务规划和基于内部模型的预测控制的框架。广泛的仿真和真实世界实验表明,所提出的框架实现了更高的成功率、对未见物体重量的更好泛化性,以及更好的安全性和能效。

英文摘要

Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.

2606.10899 2026-06-10 cs.RO New submission

MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

MV-Actor:对齐多视角语义与空间感知以实现双臂操作

Yinchen Tian, Huan Li, Muyao Peng, Xi Wang, Yan Wang, You Yang

Affiliations: School of Electronic Information and Communications, Huazhong University of Science and Technology; Institute for AI Industry Research (AIR), Tsinghua University; AIR Wuxi Innovation Center, Tsinghua University

AI summary: Proposes MV-Actor, a framework that unifies semantic and spatial representations through multi-view semantic interaction and semantic-spatial token interaction, and uses a guided metric depth repair module to handle depth noise. It reaches an 87.8% average success rate on the PerAct2 benchmark.

Comments 14 pages, 9 figures

详情
AI中文摘要

机器人操作已广泛应用于工业场景。与单臂操作相比,双臂操作配备多个摄像头以从不同视角捕获信息。然而,现有的多视角策略独立编码每个视角或浅层融合视角特征,导致语义感知共享有限且空间感知不可靠。本文提出MV-Actor,一种为双臂操作构建统一语义-空间表示的多视角感知框架。首先,MV-Actor执行多视角语义交互以跨视角共享语义感知。然后,它使用语义-空间令牌交互将视觉语义与前馈重建模型特征对齐,并获取可靠的空间感知。最后,引导度量深度修复模块在消费级深度噪声下细化退化的传感器深度,以提供更可靠的度量锚点。在PerAct2双臂基准上进行的仿真实验中,MV-Actor达到了87.8%的最先进平均成功率。在视角变化更频繁且消费级深度不稳定的真实世界评估中,MV-Actor优于RGB和RGB-D基线,进一步证明了共享语义感知和可靠空间感知对双臂操作的好处。

英文摘要

Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm setups, bimanual manipulation systems are equipped with multiple cameras to capture information from different viewpoints. However, existing multi-view policies encode each view independently or fuse view features only shallowly, resulting in limited sharing of semantic perception and unreliable spatial awareness. In this paper, we propose MV-Actor, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Then it uses Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features and acquire reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments conducted on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of shared semantic perception and reliable spatial awareness for bimanual manipulation.

2606.11151 2026-06-10 cs.RO New submission

JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation

JOIN:通过对抗、推理和导航实现基于锚点抓取条件的双臂辅助操作连接

Drake Moore, Matt Cheng, Xiang Zhi Tan, Taşkın Padır

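Affiliations: Northeastern University

AI summary: Proposes JOIN, a heterogeneous, on-demand bimanual system in which a wheelchair-mounted anchor arm is joined by a mobile complement arm, conditioned on the anchor's existing grasp. It combines a vision-language model with geometric tools to solve representative bimanual activities of daily living, completing more attempts and needing fewer operator corrections in experiments.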

Comments Xiang Zhi Tan and Taşkın Padır share equal advising

详情
AI中文摘要

辅助移动和操作平台作为恢复残疾人独立性的手段已受到越来越多的关注。虽然对于许多基本的日常生活活动(ADL)有效,但诸如开罐、倒液体、端托盘或基本餐食准备等大量日常任务本质上是双臂的,任何单臂系统都无法完成。由于额外的功耗、成本以及转移和移动所需空间的损失,在轮椅上增加第二只手臂是不切实际的。我们提出了一种异构的按需双臂系统,其中安装在轮椅上的锚点臂在需要时与一个被召唤的移动操作器(作为补臂)连接。我们称之为双臂连接的核心技术问题是有条件的:锚点臂已经确定了抓取,补臂必须选择站立位置和抓取对象以完成任务。我们将双臂连接分解为三个阶段(规划、驱动、抓取),并表明视觉语言模型(VLM)结合标准几何工具,提供了足以解决代表性双臂ADL类别的任务级知识。我们的系统JOIN贡献了(i)一个轮椅参考的对抗分数,以及(ii)任务条件方向可操作性。我们在Kinova Gen3锚点臂和Hello Robot Stretch 3补臂上对代表性的同对象和异对象任务进行了评估。JOIN完成了更多尝试(19/20),优于最先进的方法(14/20),并且需要的操作员修正明显更少。

英文摘要

Assistive mobility and manipulation platforms have received increasing attention as a means of restoring independence to individuals with disabilities. While these platforms are effective for many basic activities of daily living (ADLs), a significant percentage of everyday tasks, such as opening a jar, pouring a liquid, lifting a tray, or basic meal preparation, are fundamentally bimanual and remain out of reach for any single-arm system. Adding a second arm to a wheelchair is impractical because of the additional power draw, cost, and the loss of space required for transfers and mobility. We instead propose a heterogeneous, on-demand bimanual system, in which a wheelchair-mounted anchor arm is joined when needed by a summoned mobile manipulator that serves as a complement arm. The central technical problem, which we call bimanual joining, is conditional: the anchor has already committed to a grasp, and the complement arm must choose where to stand and what to grasp to complete the task. We formulate bimanual joining as a three-phase decomposition (plan, drive, grasp) and show that a vision-language model (VLM), coupled with standard geometric tools, provides task-level knowledge sufficient to solve a representative class of bimanual ADLs. Our system, JOIN, contributes (i) a wheelchair-referenced opposition score, and (ii) task-conditioned directional manipulability. We evaluate JOIN on a Kinova Gen3 anchor and a Hello Robot Stretch 3 complement on representative same-object and different-object tasks. JOIN accomplished more attempts (19/20) than state-of-the-art methods (14/20) and required markedly less correction by the operator.
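As background for the "directional manipulability" term, the sketch below computes the classical manipulability-ellipsoid extent along a chosen direction for a planar two-link arm. It is the standard textbook measure, used here as an assumption; the abstract does not define JOIN's task-conditioned variant, and the arm model, link lengths, and function names are hypothetical.

```python
import math


def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm with joint angles q1, q2."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def directional_manipulability(J, u):
    """Extent of the velocity manipulability ellipsoid along direction u.

    For unit-norm joint velocities, reachable end-effector velocities v satisfy
    v^T (J J^T)^{-1} v <= 1, so the extent along unit u is 1 / sqrt(u^T (J J^T)^{-1} u).
    """
    a = J[0][0] ** 2 + J[0][1] ** 2                  # entries of A = J J^T
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2
    det = a * d - b * b
    if det <= 1e-12:
        return 0.0  # near-singular pose: treated as zero here for simplicity
    norm = math.hypot(u[0], u[1])
    ux, uy = u[0] / norm, u[1] / norm
    quad = (d * ux * ux - 2.0 * b * ux * uy + a * uy * uy) / det
    return 1.0 / math.sqrt(quad)


if __name__ == "__main__":
    J = planar_2link_jacobian(0.0, math.pi / 2)
    print(directional_manipulability(J, (1.0, 0.0)))
    print(directional_manipulability(J, (0.0, 1.0)))
```

A measure like this lets a planner prefer standing positions from which the arm can move easily in the direction the task demands, rather than maximizing dexterity in all directions at once.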

2606.11184 2026-06-10 cs.RO New submission

TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation

TacForeSight:面向接触丰富操作的力引导触觉世界模型

Yujie Zang, Yuhang Zheng, Xian Nie, Yupeng Zheng, Shuai Tian, Songen Gu, Chen Gao, Zining Wang, Shuicheng Yan, Wenchao Ding

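Affiliations: TARS Robotics; National University of Singapore; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences; Fudan University

AI summary: Proposes TacForeSight, a framework in which a force-conditioned tactile world model predicts tactile latent dynamics and a predictive tactile-conditioned policy uses them for proactive contact reasoning during high-frequency manipulation. It outperforms existing methods under dynamic contact disturbances.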

详情
AI中文摘要

接触丰富操作要求机器人在动态接触过渡或复杂表面几何下持续感知和调节演变的物理交互。最近的模仿学习方法通过整合触觉或力反馈改善了接触感知控制,但很少对全局力和局部触觉感知的非对称时空角色进行建模。为此,我们提出TacForeSight,一种轻量级的力条件触觉预测框架,用于实时操作。核心组件是TacForceWM,一个触觉世界模型,它从双指触觉观测中预测短时域触觉潜动态,并以高频腕部力和力矩信号为条件。另一个关键组件,预测性触觉条件策略,利用预测的潜变量作为预期接触先验,通过交叉注意力建模当前到未来的触觉演化,并通过触觉引导门控模块自适应融合视觉-触觉特征。通过在紧凑潜空间内进行预测,TacForeSight实现了主动接触推理,并具有适用于高频操作控制的高效实时推理。在五个代表性任务和三种过程扰动设置上的真实机器人实验表明,TacForeSight在动态接触干扰下始终优于现有基线。所有模型和数据集将在项目网站上公开。

英文摘要

Contact-rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact-aware control by incorporating tactile or force feedback, but they rarely model the asymmetric spatiotemporal roles of global force and local tactile sensing. To address this, we propose TacForeSight, a lightweight force-conditioned tactile foresight framework for real-time manipulation. The core component is TacForceWM, a tactile world model that predicts short-horizon tactile latent dynamics from dual-finger tactile observations conditioned on high-frequency wrist force and torque signals. Another key component, the Predictive Tactile-Conditioned Policy, leverages the predicted latents as anticipatory contact priors, models the current-to-future tactile evolution via cross-attention, and adaptively fuses visuo-tactile features through a tactile-guided gating module. By forecasting purely within a compact latent space, TacForeSight enables proactive contact reasoning with efficient real-time inference suitable for high-frequency manipulation control. Real-robot experiments on five representative tasks and three in-process perturbation settings show that TacForeSight consistently outperforms existing baselines, particularly under dynamic contact disturbances. All models and datasets will be made publicly available on the project website at https://tacforesight.github.io/ProjectPage.

2505.08213 2026-06-10 cs.RO 版本更新

HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands

HandCept: 用于灵巧手精确本体感知的视觉-惯性融合框架

Huang Junda, Honghao Guo, Hao Wu, Zhengyang Liu, Marcelo H Ang, Jianshu Zhou

发表机构 * The Chinese University of Hong Kong(香港中文大学) National University of Singapore(新加坡国立大学)

AI总结 提出HandCept,首个视觉-惯性本体感知框架,通过零样本学习和无延迟扩展卡尔曼滤波融合腕部RGB-D相机与9轴IMU,实现2°-4°关节角估计误差且无漂移,优于纯视觉或纯惯性方法。

Comments 8 pages, 7 figures, conference

AI中文摘要

随着机器人向通用操作发展,灵巧手变得越来越关键。然而,由于体积和通用性的限制,灵巧手的本体感知仍然是一个瓶颈。在这项工作中,我们提出了HandCept,这是第一个旨在克服传统灵巧手关节角估计方法挑战的视觉-惯性本体感知框架。HandCept解决了在动态环境中实现准确且鲁棒的关节角估计的难题,在这种环境中,视觉和惯性测量都容易受到噪声和漂移的影响。它利用零样本学习方法,使用腕部RGB-D相机和9轴IMU,通过无延迟扩展卡尔曼滤波器(EKF)实时融合。我们的结果表明,HandCept实现了通常在$2^{\circ}$到$4^{\circ}$之间的关节角估计误差,且没有可观察到的漂移,优于纯视觉和纯惯性方法。此外,我们验证了IMU系统的稳定性和均匀性,表明IMU之间的公共基帧简化了系统标定。为了支持仿真到现实的迁移,我们还开源了我们的高保真渲染管线,这对于在没有真实世界真值的情况下进行训练至关重要。这项工作为灵巧手的本体感知提供了一种鲁棒、可泛化的解决方案,对机器人操作和人机交互具有重要意义。https://github.com/huangjund/blenderYCB

英文摘要

As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, the first visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods for dexterous hands. HandCept addresses the difficulty of achieving accurate and robust joint angle estimation in dynamic environments where both visual and inertial measurements are prone to noise and drift. It leverages a zero-shot learning approach using a wrist-mounted RGB-D camera and 9-axis IMUs, fused in real time via a latency-free Extended Kalman Filter (EKF). Our results show that HandCept achieves joint angle estimation errors generally between $2^{\circ}$ and $4^{\circ}$ without observable drift, outperforming visual-only and inertial-only methods. Furthermore, we validate the stability and uniformity of the IMU system, demonstrating that a common base frame across IMUs simplifies system calibration. To support sim-to-real transfer, we also open-source our high-fidelity rendering pipeline, which is essential for training without real-world ground truth. This work offers a robust, generalizable solution for proprioception in dexterous hands, with significant implications for robotic manipulation and human-robot interaction. https://github.com/huangjund/blenderYCB
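The fusion idea in this abstract can be illustrated with a much simpler filter. The sketch below is a scalar, linear Kalman filter for a single joint angle, not the latency-free EKF from the paper. It assumes the IMU supplies an angular rate for the prediction step and the camera supplies a direct angle measurement for the update step; the function names and noise values are hypothetical.

```python
def predict(angle_deg, variance, rate_deg_s, dt_s, process_var):
    """Prediction step: integrate the (assumed) IMU angular rate over dt."""
    return angle_deg + rate_deg_s * dt_s, variance + process_var


def update(angle_deg, variance, measured_deg, measurement_var):
    """Update step: blend in the (assumed) visual angle measurement."""
    gain = variance / (variance + measurement_var)  # Kalman gain in [0, 1]
    fused_angle = angle_deg + gain * (measured_deg - angle_deg)
    fused_variance = (1.0 - gain) * variance
    return fused_angle, fused_variance


# Hypothetical numbers: start at 10 degrees, rotate at 2 deg/s for 0.5 s,
# then correct with a visual estimate of 12 degrees.
angle, var = predict(10.0, 1.0, rate_deg_s=2.0, dt_s=0.5, process_var=0.5)
angle, var = update(angle, var, measured_deg=12.0, measurement_var=1.5)
```

The inertial prediction alone would accumulate variance without bound; each visual update shrinks it again, which is the intuition behind combining the two sensors.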

2601.06997 2026-06-10 cs.RO cs.CV 版本更新

ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction

ObjSplat: 几何感知的高斯面元用于主动物体重建

Yuetao Li, Zhizhou Jia, Yu Zhang, Qun Hao, Shaohui Zhang

发表机构 * School of Optics and Photonics, Beijing Institute of Technology(光学与光子学学院,北京理工大学) School of Optoelectronic Engineering, Changchun University of Science and Technology(光电工程学院,长春理工大学)

AI总结 提出ObjSplat框架,利用高斯面元统一表示,通过几何感知视点评估和下一最佳路径规划器,实现高效高保真的主动物体重建。

Comments Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat , Project Page: https://li-yuetao.github.io/ObjSplat-page/

AI中文摘要

自主高保真物体重建是创建数字资产和弥合机器人模拟与现实差距的基础。我们提出ObjSplat,一个主动重建框架,利用高斯面元作为统一表示,逐步重建未知物体,同时具有逼真的外观和准确的几何。针对传统基于不透明度或深度线索的局限性,我们引入了几何感知视点评估管线,明确建模背面可见性和遮挡感知的多视图共视性,即使在几何复杂的物体上也能可靠地识别未重建区域。此外,为了克服贪婪规划策略的局限性,ObjSplat采用下一最佳路径(NBP)规划器,在动态构建的空间图上执行多步前瞻。通过联合优化信息增益和移动成本,该规划器生成全局高效的轨迹。在仿真和真实世界文化遗物上的大量实验表明,ObjSplat在几分钟内生成物理一致的模型,与最先进方法相比,实现了卓越的重建保真度和表面完整性,同时显著减少了扫描时间和路径长度。项目页面:https://li-yuetao.github.io/ObjSplat-page/ 。

英文摘要

Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/ .

2602.07413 2026-06-10 cs.RO 版本更新

Going with the Flow: Koopman Behavioral Models as Pseudo Planners for Visuo-Motor Dexterity

随流而行:Koopman行为模型作为视觉运动灵巧性的伪规划器

Yunhai Han, Jiaqi Fu, Linhao Bai, Ziyu Xiao, Zhaodong Yang, Yogita Choudhary, Krishna Jha, Chuizheng Kong, Shreyas Kousik, Harish Ravichandar

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出统一行为模型(UBM),将灵巧技能建模为耦合动力系统,确保时间一致性;基于Koopman算子实现线性潜空间,通过在线重规划实现反应性和适应性,在模拟和真实任务中达到或超越现有方法。

Comments Website: https://k-ubm.github.io/

AI中文摘要

当代视觉运动灵巧性模型通常依赖于具有扩散和Transformer骨干的表达性策略类来实现强性能。然而,这些架构需要大量数据和计算资源,并且仍远不够可靠,特别是对于多指灵巧性。重要的是,它们将技能建模为反应性映射,并依赖于固定视界的动作分块,在时间一致性和反应性之间造成了刚性权衡。为了解决这些问题,我们首先引入统一行为模型(UBMs),这是一个将灵巧技能表示为耦合动力系统的框架,捕捉环境视觉特征(视觉流)和机器人本体感受状态(动作流)如何共同演化。因此,UBMs通过构造而非启发式平均来确保时间一致性。与试图预测任意机器人动作对环境影响的世界模型不同,UBMs针对行为动力学,编码演示的机器人行为如何与对环境期望的影响相关。UBM可以视为一个伪规划器:给定初始条件,它计算整个技能视界上的期望机器人行为,同时“想象”视觉特征的流。为了实现UBMs,我们提出Koopman-UBM,作为UBMs的第一个实例化,即结构化的潜在线性系统。K-UBM计算高效,通过在线重规划策略实现反应性和适应性:模型充当自身的运行时监控器,当预测和观察到的视觉流偏离超过阈值时自动触发重规划。在七个模拟任务和四个真实世界任务中,我们的方法匹配或超过了最先进基线的性能,同时提供了更快的推理、平滑的执行、对遮挡的鲁棒性和灵活的重规划。

英文摘要

Contemporary visuo-motor dexterity models often rely on expressive policy classes with diffusion and transformer backbones to achieve strong performance. However, these architectures require significant data and computational resources, and remain far from reliable, particularly for multi-fingered dexterity. Importantly, they model skills as reactive mappings and rely on fixed-horizon action chunking, creating a rigid trade-off between temporal coherence and reactivity. To address these issues, we first introduce Unified Behavioral Models (UBMs), a framework to represent dexterous skills as coupled dynamical systems that capture how visual features of the environment (visual flow) and proprioceptive states of the robot (action flow) co-evolve. As such, UBMs ensure temporal coherence by construction rather than heuristic averaging. Unlike world models that attempt to predict the impact of arbitrary robot actions on the environment, UBMs target behavioral dynamics that encode how demonstrated robot behavior is related to desired impacts on the environment. A UBM can be viewed as a pseudo planner: given an initial condition, it computes the desired robot behavior over the entire skill horizon, while simultaneously "imagining" the resulting flow of visual features. To operationalize UBMs, we propose Koopman-UBM, a first instantiation of UBMs as a structured latent linear system. K-UBM is computationally efficient, enabling reactivity and adaptation via an online replanning strategy: the model acts as its own runtime monitor, automatically triggering replanning when predicted and observed visual flow diverge beyond a threshold. Across seven simulated tasks and four real-world tasks, our approach matches or exceeds the performance of state-of-the-art baselines, while offering considerably faster inference, smooth execution, robustness to occlusions, and flexible replanning.
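The two mechanisms named in this abstract, rolling out a latent linear system and triggering replanning on divergence, can be sketched in a few lines. This is only an illustration under assumed details: the 2x2 matrix `K`, the Euclidean distance, and the threshold are hypothetical, and nothing here reflects how K-UBM actually learns or lifts its latent state.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rollout(K, z0, horizon):
    """Roll the latent linear system z_{t+1} = K z_t forward for `horizon` steps."""
    trajectory = [list(z0)]
    for _ in range(horizon):
        trajectory.append(matvec(K, trajectory[-1]))
    return trajectory


def needs_replan(predicted, observed, threshold):
    """Return True when the Euclidean gap between prediction and observation exceeds the threshold."""
    gap = sum((p - o) ** 2 for p, o in zip(predicted, observed)) ** 0.5
    return gap > threshold


K = [[0.9, 0.1], [0.0, 0.8]]           # hypothetical latent dynamics matrix
plan = rollout(K, [1.0, 1.0], horizon=3)  # whole-horizon plan from one initial condition
```

Because the rollout is a chain of matrix products, the full horizon is computed once up front; the monitor only has to compare each predicted state with what is observed and call `rollout` again from the observed state when `needs_replan` fires.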

2603.20850 2026-06-10 cs.CV cs.RO 版本更新

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Glove2Hand:从多模态传感手套合成自然的手-物体交互

Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan

发表机构 * Meta Reality Labs(Meta现实实验室) Rutgers University(罗格斯大学)

AI总结 提出Glove2Hand框架,将多模态传感手套视频转化为逼真的裸手,并保留物理交互动态;引入3D高斯手模型和扩散手恢复器,创建HandSense数据集,提升下游任务性能。

Comments CVPR 2026 Highlight. This version includes the motion retarget process in the appendix

AI中文摘要

理解手-物体交互(HOI)是计算机视觉、机器人和AR/VR的基础。然而,传统手部视频通常缺乏接触力和运动信号等关键物理信息,并且容易频繁遮挡。为了解决这些挑战,我们提出了Glove2Hand,一个将多模态传感手套HOI视频转化为逼真裸手的框架,同时忠实保留底层物理交互动态。我们引入了一种新颖的3D高斯手模型,确保时间渲染一致性。使用基于扩散的手部恢复器将渲染的手无缝集成到场景中,该恢复器有效处理复杂的手-物体交互和非刚性变形。利用Glove2Hand,我们创建了HandSense,这是第一个多模态HOI数据集,包含手套到手的视频以及同步的触觉和IMU信号。我们证明HandSense显著增强了下游裸手应用,包括基于视频的接触估计和严重遮挡下的手部跟踪。

英文摘要

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

4. 导航、定位与SLAM 8 篇

2606.10348 2026-06-10 cs.RO 新提交

Rethinking Embodied Navigation via Relational Inductive Bias

通过关系归纳偏差重新思考具身导航

Weitao An, Chenghao Xu, Xu Yang, Cheng Deng

发表机构 * School of Electronic Engineering, Xidian University(西安电子科技大学电子工程学院) School of Information Science and Engineering, Hohai University(河海大学信息科学与工程学院)

AI总结 提出DB-Nav框架,利用激活偏置和抑制偏置双关系偏置重塑搜索空间,通过关系激活-抑制探索图调节前沿探索,显著提升目标导航成功率和路径效率。

AI中文摘要

目标导航要求智能体通过视觉观察在未知环境中定位目标。现有方法通常依赖开放词汇检测器或视觉语言模型(VLM)来回答在哪里搜索,但往往忽略了什么不可信——哪些语义线索不可靠。开放词汇感知容易产生系统性误导证据:误报、过时的静态先验以及由于缺乏具身验证而导致的重复失败探索,这会污染地图构建和决策制定。此类错误根植于真实场景中的结构化对象关系。为解决此问题,我们提出DB-Nav,一个通过双关系偏置重塑搜索空间的框架。它将目标中心关系分解为激活偏置(传播上下文证据)和抑制偏置(通过感知混淆和动作级证伪抑制不可靠区域)。这些偏置统一到一个关系激活-抑制探索图中,该图利用在线观察和失败访问来调节前沿探索值。在ObjectNav基准上的实验表明,DB-Nav在成功率(SR)和路径长度加权成功率(SPL)上显著优于现有方法,提供了一个轻量级、可解释且鲁棒的导航框架,无需昂贵的在线VLM推理。

英文摘要

Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust, that is, which semantic cues are unreliable. Open-vocabulary perception is prone to systematically misleading evidence: false positives, outdated static priors, and repeated failed exploration due to a lack of embodied verification, all of which contaminate mapping and decision-making. Such errors are rooted in structured object relations in real-world scenes. To address this, we propose DB-Nav, a framework that reshapes the search space via dual relational biases. It factorizes target-centric relations into an Activation Bias (which propagates contextual evidence) and an Inhibition Bias (which suppresses unreliable regions via perceptual confusion and action-level falsification). These biases are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values using online observations and failed accesses. Experiments on ObjectNav benchmarks show that DB-Nav significantly outperforms existing methods in success rate (SR) and Success weighted by Path Length (SPL), offering a lightweight, interpretable, and robust navigation framework without costly online VLM reasoning.

2606.10442 2026-06-10 cs.RO 新提交

Information-Preserving Continuous Occupancy Mapping with Variance-Weighted Submap Joining

基于方差加权子图拼接的信息保持连续占据地图构建

Zhuhua Bai, Yingyu Wang, Liang Zhao, Shoudong Huang

发表机构 * University of Technology Sydney(悉尼科技大学) University of Edinburgh(爱丁堡大学)

AI总结 提出首个连续概率子图拼接框架,通过信息保持稀疏贝叶斯公式压缩观测数据为充分统计量,联合优化子图位姿与全局占据场,实现高精度位姿估计与全局一致性地图。

Comments 12 pages, 7 figures

AI中文摘要

大规模SLAM由于累积轨迹漂移和维护全局一致性的计算成本增加而仍然具有挑战性。子图拼接通过构建局部一致子图并随后将其融合为全局地图来缓解这些问题。然而,现有的基于占据的子图拼接方法在离散网格上操作,导致优化过程中梯度不光滑,并忽略了占据估计的不确定性。我们提出了第一个连续概率子图拼接框架,该框架在潜在对数几率空间中联合优化子图位姿和全局占据场。该框架采用信息保持的稀疏贝叶斯公式,将原始占据观测压缩为充分统计量的对数几率元组,同时保留原始观测的后验信息。这为占据地图构建提供了闭式预测均值和方差估计,直接实现了具有解析雅可比矩阵的子图拼接公式,从而得到更精确的子图拼接,并在位姿收敛时产生闭式最优全局地图。在模拟和大规模真实世界数据集上的实验表明,所提方法比最先进的基于网格的子图拼接方法实现了更高的位姿精度和更好的全局一致性,同时比现有的连续占据地图构建方法产生了更紧凑的地图表示和更校准的不确定性估计。

英文摘要

Large-scale SLAM remains challenging due to accumulated trajectory drift and the increasing computational cost of maintaining global consistency. Submap joining alleviates these issues by constructing locally consistent submaps and subsequently fusing them into a global map. However, existing occupancy-based submap joining methods operate on discrete grids, resulting in non-smooth gradients during optimization and neglecting the uncertainty associated with occupancy estimates. We propose the first continuous probabilistic submap joining framework that jointly optimizes submap poses and a global occupancy field in the latent log-odds space. The framework employs an information-preserving sparse Bayesian formulation that compresses raw occupancy observations into sufficient-statistic log-odds tuples while retaining the posterior information of the original observations. This yields closed-form predictive mean and variance estimates for occupancy mapping, which directly enable a submap joining formulation with analytical Jacobians, leading to more accurate submap joining and yielding a closed-form optimal global map upon pose convergence. Experiments on both simulated and large-scale real-world datasets demonstrate that the proposed method achieves higher pose accuracy and improved global consistency than state-of-the-art grid-based submap joining approaches, while producing more compact map representations and better-calibrated uncertainty estimates than existing continuous occupancy mapping methods.
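To make "variance-weighted" fusion in log-odds space concrete, here is a generic inverse-variance fusion of Gaussian log-odds estimates for a single map location. It is a textbook-style sketch and an assumption on our part, not the sparse Bayesian formulation or the sufficient-statistic tuples described in the abstract; the function names are hypothetical.

```python
import math


def fuse_log_odds(estimates):
    """Fuse (mean, variance) log-odds estimates by inverse-variance weighting.

    Lower-variance (more certain) estimates receive proportionally more weight.
    """
    precision = sum(1.0 / variance for _, variance in estimates)
    mean = sum(m / variance for m, variance in estimates) / precision
    return mean, 1.0 / precision


def occupancy_probability(log_odds):
    """Map a log-odds value back to an occupancy probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two hypothetical submaps disagree about the same location; the more
# certain one (variance 0.5) dominates the fused estimate.
fused_mean, fused_var = fuse_log_odds([(2.0, 0.5), (-1.0, 2.0)])
p_occupied = occupancy_probability(fused_mean)
```

Working in log-odds keeps the fused quantity unbounded and additive, and the fused variance is never larger than the smallest input variance.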

2606.10577 2026-06-10 cs.RO 新提交

AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

AgenticNav:零样本视觉与语言导航作为工具调用框架

Yijian Li, Changze Li, Hantian Shi, Jiaying Luo, Jiyuan Cai, Ming Yang, Tong Qin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Huawei Technologies Ltd(华为技术有限公司)

AI总结 提出AgenticNav,通过将动作、深度和记忆作为可调用工具暴露给VLM,实现零样本连续环境导航,在R2R-CE基准上达到SOTA性能。

AI中文摘要

连续环境中的零样本视觉与语言导航(VLN-CE)最近随着大型视觉语言模型(VLM)的出现而变得可行。然而,现有方法通常依赖学习到的航点预测器来提出可导航动作,这严重限制了模型的动作空间,并且未能有效利用深度输入。此外,记忆通常通过累积包含大量无关上下文的冗长文本或视觉历史,或通过检索跨回合经验来处理,这削弱了零样本设置。在本文中,我们将零样本VLN-CE重新思考为VLM与环境之间的代理接口,并提出了AgenticNav,这是一个轻量级导航框架,将动作、深度和记忆暴露为可调用的工具。动作工具允许VLM直接选择RGB观测中的目标像素,并将其转换为可执行运动,而不是从预测的航点中选择。深度通过按需像素深度工具暴露,使VLM能够在需要的地方请求精确的度量距离。对于记忆,AgenticNav提供了一个紧凑的地图图像,总结历史轨迹,并配有一个召回工具,允许VLM有选择地重新访问过去的视觉观测,而不会使提示上下文过载。在R2R-CE基准上,AgenticNav在相同VLM骨干下,在零样本方法中建立了新的最先进(SOTA)性能。真实世界验证进一步突显了其相比先前方法的零样本泛化能力。消融实验表明,我们的动作工具设计优于传统航点预测器,并且深度工具和代理记忆进一步促进了导航性能。

英文摘要

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute to navigation performance.
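Turning a selected pixel plus a queried depth into a motion target is, at its simplest, a pinhole back-projection. The sketch below shows that standard geometry only; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the function name are hypothetical, and the abstract does not say how AgenticNav's action tool performs the conversion.

```python
def pixel_to_camera_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame.

    Uses the standard pinhole model: x points right, y points down,
    z points forward along the optical axis.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical 640x480 camera: a pixel right of the image centre at 2 m depth.
target = pixel_to_camera_point(u=420, v=240, depth_m=2.0,
                               fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Querying depth only for the chosen pixel, rather than passing a full depth map to the model, matches the on-demand pattern the abstract describes.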

2606.10832 2026-06-10 cs.RO 新提交

GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation

GUIDE: 目标初始化的定向理解用于端到端视觉导航

Liang Wang, Jin Jin, KanZhong Yao, YiBin Wu, Fangqiang Ding, Jin Wang, Jun Wu, Zhe Sun, Qiuguo Zhu

发表机构 * Institute of Cyber-Systems and Control, Zhejiang University(浙江大学控制科学与工程学院) Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院(TeleAI)) Oxford Robotics Institute, University of Oxford(牛津大学牛津机器人研究所) Center for Robotics, University of Bonn(波恩大学机器人中心) Department of Mechanical Engineering, Massachusetts Institute of Technology(麻省理工学院机械工程系)

AI总结 提出GUIDE框架,通过空间锚点预测器利用多频率本体感受历史提取自运动表示,结合深度流感知局部几何,实现无需后续目标更新的端到端四足机器人导航。

Comments https://guide-navigation.github.io/

AI中文摘要

基于学习的足式机器人视觉导航通常依赖于层次状态估计的连续目标更新,以提供持久的定向参考。这种依赖引入了额外的感知和计算开销,偏离了完全端到端的移动自主性。此外,在部分可观测性下,策略容易学习短视行为,容易陷入死角和复杂结构布局。为了解决这些限制,我们研究了一种目标初始化的导航设置,其中目标仅在情节开始时提供一次,要求机器人基于内在空间记忆运行,无需来自外部模块的后续目标更新。在这项工作中,我们提出了GUIDE,一个完全端到端的强化学习框架,旨在培养内部定向意识。具体来说,GUIDE包含一个空间锚点预测器,利用多频率本体感受历史来提取自运动表示,从而为导航维持持久的长期空间上下文。同时,它利用原始深度流感知局部环境几何。我们在仿真和真实场景中对四足机器人进行了评估。实验表明,GUIDE学习了可靠的自运动和定向意识,使得完全端到端部署的策略能够在没有后续目标引导或先验地图的情况下,安全地穿越密集杂乱和结构化迷宫。

英文摘要

Learning-based visual navigation for legged robots typically relies on continuous goal updates from hierarchical state estimation to provide a persistent directional reference. This reliance incurs additional sensory and computational overhead and deviates from fully end-to-end mobile autonomy. Furthermore, under partial observability, policies are prone to learn myopic behaviors, easily becoming trapped in dead ends and complex structural layouts. To address these limitations, we investigate a goal-initialized navigation setting, where the target is provided only once at the beginning of an episode, requiring the robot to operate based on intrinsic spatial memory without subsequent goal updates from external modules. In this work, we propose GUIDE, a fully end-to-end reinforcement learning framework designed to cultivate internal directional awareness. Specifically, GUIDE incorporates a spatial anchor predictor that leverages multi-frequency proprioceptive history to extract egomotion representations, thereby maintaining a persistent long-horizon spatial context for navigation. Concurrently, it utilizes raw depth streams to perceive local environmental geometry. We evaluate the proposed framework across both simulation and real-world scenarios on a quadruped robot. Experiments show that GUIDE learns reliable egomotion and directional awareness, enabling a fully end-to-end deployed policy to safely navigate through dense clutter and structured mazes without subsequent goal guidance or prior maps.

2606.10903 2026-06-10 cs.RO 新提交

AgniNav: Configuration-Driven Cross-Embodiment Local Planning for Robot Navigation

AgniNav:配置驱动的跨具身局部规划机器人导航

Tianhao Zang, Siwei Cheng, Haidong Huang, Shanze Wang, Wei Zhang

发表机构 * Eastern Institute of Technology, Ningbo, China(东方理工(宁波)) University of Nottingham, Nottingham, UK(诺丁汉大学) University of Science and Technology of China, Hefei, China(中国科学技术大学)

AI总结 提出AgniNav框架,通过可配置的四参数安全包络实现单目视觉导航在轮式、四足和人形机器人间的零重训练迁移,联合调节感知与规划。

AI中文摘要

单目局部导航对轻量级机器人具有吸引力,但现有的基于视觉的策略通常将感知耦合到特定机体、相机高度和足迹,使得从轮式底盘到腿式平台的迁移依赖于重新训练或主动深度硬件。本文介绍了AgniNav,一个配置驱动的局部导航框架,在碰撞包络层面标准化跨具身迁移。每个机器人由一个可测量的四参数安全包络指定:碰撞相关高度、前长、后长和半宽。高度参数条件化一个图像到扫描网络,从单目彩色图像预测一维、碰撞相关的伪激光扫描,而剩余的足迹参数配置一个维度感知的局部规划器用于碰撞检测。训练使用从配对的彩色-深度数据生成的高度条件化列最小扫描标签,允许同一图像监督不同的安全包络,无需收集机器人特定数据。据我们所知,AgniNav是第一个单目局部导航框架,它联合调节感知和规划于共享的碰撞包络配置,实现跨轮式、四足和人形平台的零重训练部署。在Turtlebot2、Unitree Go2和Accelerated Evolution K1上的真实机器人实验分别实现了39/40、18/20和18/20的成功率,碰撞次数分别为0/40、1/20和2/20,同时在Jetson Orin上以30 Hz运行。

英文摘要

Monocular local navigation is attractive for lightweight robots, but existing vision-based policies often couple perception to a specific body, camera height, and footprint, making transfer from wheeled bases to legged platforms dependent on retraining or active depth hardware. This paper introduces AgniNav, a configuration-driven local navigation framework that standardizes cross-embodiment transfer at the collision-envelope level. Each robot is specified by a measurable four-parameter safety envelope: collision-relevant height, front length, rear length, and half width. The height parameter conditions an image-to-scan network to predict a one-dimensional, collision-relevant pseudo-laserscan from a monocular color image, while the remaining footprint parameters configure a dimension-aware local planner for collision checking. Training uses height-conditioned column-minimum scan labels generated from paired color-depth data, allowing the same image to supervise different safety envelopes without collecting robot-specific data. To the best of our knowledge, AgniNav is the first monocular local-navigation framework that jointly conditions perception and planning on a shared collision-envelope configuration for zero-retraining deployment across wheeled, quadruped, and humanoid platforms. Real-robot experiments on a Turtlebot2, Unitree Go2, and Accelerated Evolution K1 achieve 39/40, 18/20, and 18/20 successes with 0/40, 1/20, and 2/20 collisions, respectively, while running at 30 Hz on Jetson Orin.
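The phrase "height-conditioned column-minimum scan labels" can be read as: for each image column, keep the nearest depth among pixels whose 3D height falls inside the robot's collision-relevant band. The sketch below implements that reading under stated assumptions (a level pinhole camera at a known height, y axis pointing down, a small ground clearance to ignore the floor); the parameter names are hypothetical and the paper's actual labelling procedure may differ.

```python
def column_min_scan(depth, cam_height_m, robot_height_m, fy, cy,
                    ground_clearance_m=0.05, max_range_m=10.0):
    """Build a 1D pseudo-laserscan from a depth image (list of rows, metres).

    A pixel contributes only if its height above the ground lies in
    (ground_clearance_m, robot_height_m]; each column keeps its minimum depth.
    """
    cols = len(depth[0])
    scan = [max_range_m] * cols
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0.0:  # treat non-positive depth as invalid
                continue
            # Image y grows downward, so rows below the centre lie below the camera.
            height = cam_height_m - (v - cy) * d / fy
            if ground_clearance_m < height <= robot_height_m:
                scan[u] = min(scan[u], d)
    return scan


# The same tiny depth image labelled for a tall and a short robot.
depth_image = [[2.0, 0.3],
               [3.0, 4.0],
               [0.4, 1.0]]
tall = column_min_scan(depth_image, cam_height_m=0.5, robot_height_m=1.0, fy=1.0, cy=1.0)
short = column_min_scan(depth_image, cam_height_m=0.5, robot_height_m=0.3, fy=1.0, cy=1.0)
```

The two calls produce different scans from one image, which is the property that lets a single colour-depth pair supervise several safety envelopes.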

2606.10927 2026-06-10 cs.RO 新提交

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

AllDayNav: 通过真实世界强化学习实现终身导航

Hang Yin, Yinan Liang, Jiazhao Zhang, Jiahang Liu, Minghan Li, Zhizheng Zhang, He Wang

发表机构 * Tsinghua University(清华大学) Galbot Robotics Peking University(北京大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出AllDayNav框架,利用自进化多模态记忆和强化学习隐式编码场景动态,在跨房间、跨回合和跨任务场景中实现接近100%的成功率,超越基于地图、VLM和RL的基线方法。

Comments Project Page: https://bagh2178.github.io/AllDayNav/

AI中文摘要

在动态环境中进行终身具身导航需要机器人从碎片化观察中形成持久的场景理解,这对于依赖显式地图或场景图且难以泛化到结构化设置之外的现有方法仍然困难。我们提出AllDayNav,一个终身自学习导航框架,通过强化学习将场景动态隐式编码到大模型的十亿级参数中,并由一个自进化的多模态记忆驱动,该记忆维护和更新视觉关键帧、语义描述和时间上下文,同时自主生成开放词汇指令、图像目标和结构化奖励。在合成和真实环境中的跨房间、跨回合和跨任务场景实验表明,AllDayNav实现了接近100%的成功率,并在路径效率和鲁棒性上持续超越基于地图、VLM和RL的强基线,证明了隐式、记忆驱动的强化学习作为可靠终身导航的可扩展替代方案。

英文摘要

Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle to generalize beyond structured settings. We propose AllDayNav, a lifelong self-learning navigation framework that implicitly encodes scene dynamics into the billion-scale parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains and updates visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards. Experiments in both synthetic and real-world environments across cross-room, cross-episode, and cross-task scenarios show that AllDayNav achieves success rates approaching $100\%$ and consistently surpasses strong map-based, VLM, and RL baselines in path efficiency and robustness, demonstrating implicit, memory-driven reinforcement learning as a scalable alternative to explicit mapping for reliable lifelong navigation.

2606.10019 2026-06-10 cs.CV cs.AI cs.RO 交叉投稿

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

广义CVO:基于二阶黎曼优化的快速无对应局部点云配准

Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits

AI总结 提出一种基于几何表面结构和再生核希尔伯特空间嵌入的无对应局部点云配准方法,采用二阶流形优化实现高达10倍加速,在LiDAR和RGB-D跟踪及物体配准中显著降低漂移并提升鲁棒性。

Comments 16 pages, 12 figures

AI中文摘要

我们提出了一种快速且无需对应关系的局部点云配准方法,该方法利用了几何表面结构和再生核希尔伯特空间(RKHS)嵌入。该方法将点云表示为具有逐点各向异性核的连续函数,这些核编码了局部几何信息。这种公式化在沿表面法线方向改善对齐的同时,放松了沿切线方向的对齐。为了解决由此产生的配准问题,我们提出了一种具有近似黎曼海森矩阵的二阶流形优化方案,与先前基于无对应RKHS方法中使用的一阶求解器相比,实现了高达10倍的加速。我们展示了在多种室内外数据集上改进的帧到帧LiDAR和RGB-D跟踪精度。在驾驶领域的LiDAR跟踪配准任务中,我们在具有挑战性的特征稀疏环境下实现了平移和旋转漂移均减少超过55%。在物体配准基准测试中,我们展示了相比基于ICP的方法更强的鲁棒性,并且在优化全局初始化时(尤其是在中等错位情况下)获得了进一步的提升。

英文摘要

We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of $>55\%$ in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.

2605.25371 2026-06-10 cs.RO 版本更新

FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand

FOUND-IT: 基于基础模型优先、按需粒度的任务驱动3D场景图

Dominic Maggio, Nicolas Gorlo, Kris Hauser, Luca Carlone

发表机构 * Laboratory for Information & Decision Systems, Massachusetts Institute of Technology(信息与决策系统实验室,麻省理工学院) Samsung Research America(三星美国研究院)

AI总结 提出首个基于未标定单目相机实时构建任意室内外环境分层任务驱动3D场景图的方法,通过几何基础模型和可调整粒度支持动态任务,并在ASHiTA SG3D基准上提升79%准确率。

AI中文摘要

我们提出了首个使用未标定单目相机实时构建任意室内外环境分层任务驱动3D场景图的方法。我们利用几何基础模型来估计场景图的几何属性(例如,物体边界框),但也观察到可通行性信息(场景图的“地点”层)可以通过向现有几何基础模型(如VGGT)添加额外头部直接重建。我们的方法是任务驱动的,即根据任务调整地图中物体和区域的粒度;例如,在操作任务中,我们的方法能够分辨炉子上的小旋钮,而在导航任务中则可以关注大物体(如整个炉子)。然而,与相关工作的重要区别在于,我们考虑了任务列表并非预定义固定,而是随着机器人运行而演变的现实情况。这自然允许处理复杂的移动操作任务,机器人可以在任务展开时动态调整其表示。我们将由此产生的方法称为FOUND-IT。FOUND-IT还包括一种代理方法来查询场景图中的信息。除了在ASHiTA SG3D任务定位基准上实现79%的更高准确率外,我们展示了FOUND-IT在Jetson Thor上实时运行于地面机器人。此外,为了突出我们方法的鲁棒性,我们演示了在YouTube上随意拍摄的房地产公寓游览中构建3D场景图。代码将在发表后提供。

英文摘要

We present the first approach to build hierarchical task-driven 3D scene graphs of arbitrary indoor or outdoor environments using an uncalibrated monocular camera in real-time. We leverage geometric foundation models to estimate geometric attributes of the scene graph (e.g., object bounding boxes), but we also observe that traversability information (the "places" layer of a scene graph) can be directly reconstructed by adding an extra head to existing geometric foundation models, like VGGT. Our approach is task-driven in the sense that we adjust the granularity of the objects and regions in the map depending on the task; for instance, during a manipulation task, our approach is able to resolve small knobs on a stove, while during a navigation task it can focus on large objects (e.g., the entire stove). However, in a major departure from related work, we consider the realistic case where the list of tasks is not predefined and fixed, but evolves as the robot operates. This naturally allows dealing with complex loco-manipulation tasks, where the robot can dynamically adjust its representation as the task unfolds. We dub the resulting approach FOUND-IT. FOUND-IT also includes an agentic approach to query information in the scene graph. In addition to achieving 79% higher accuracy on the ASHiTA SG3D task grounding benchmark, we demonstrate FOUND-IT runs in real-time on a ground robot using a Jetson Thor. Furthermore, to highlight the robustness of our method, we demonstrate constructing 3D scene graphs on casually captured realtor apartment tours from YouTube. Code will be made available upon publication.

5. 人机交互与协作机器人 5 篇

2606.10180 2026-06-10 cs.RO cs.AI cs.HC 新提交

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

流控制:通过简单实时输入引导视觉-语言-动作模型

Jonathan C. Kao, Jason Chan, Andy Wang

AI总结 提出流控制方法,利用键盘等通用实时输入引导VLA模型动作,无需重新训练,能提升任务成功率和完成速度。

Comments 10 pages, 5 figures

AI中文摘要

我们引入了视觉-语言-动作(VLA)模型的流控制,这是一种简单有效的方法,通过通用输入(如键盘)实时引导VLA动作。该方法可直接使用,无需重新训练或微调VLA。它允许相对粗糙的用户输入引导VLA与用户意图对齐。VLA将这些输入转换为从训练期间学习的VLA专家动作分布中采样的动作样本,从而生成高质量(符合动作专家分布)和高保真度(反映用户意图)的动作。我们证明流控制具有许多理想特性:(1)流控制能准确、响应地通过用户输入引导机器人动作;(2)它对次优用户输入具有鲁棒性;(3)它使用户能够引导VLA实现显著更高的成功率和更快的任务完成;(4)在流控制轨迹上微调VLA可提高自主策略性能。这些结果共同为用户提供了一种简单直观的方式来帮助引导VLA动作,提升任务性能。

英文摘要

We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conformity to the action expert distribution) and high fidelity (reflecting the user's intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.

2606.10276 2026-06-10 cs.RO cs.AI 新提交

Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

基于语言和自我中心人类信号的分层策略用于自然人机交互

Dongjun Lee, Juheon Choi, Dong Kyu Shin, Sinjae Kang, Kimin Lee

发表机构 * KAIST(韩国科学技术院) Seoul National University(首尔国立大学)

AI总结 提出EDITH框架,通过智能眼镜捕捉人类第一人称视角、注视和语言信号,设计分层策略将非语言信号与语言指令结合,实现更自然的人机交互,减少用户表达意图的负担。

Comments We provide video demos and code in: https://project-edith.github.io

详情
AI中文摘要

为了实现自然的人机交互,机器人必须理解人类不仅通过语言,还通过手势和注视等非语言信号表达的意图。然而,当前的机器人策略仅依赖语言指令作为传达意图的唯一接口,忽略了非语言信号,将全部沟通负担放在语言上。在这项工作中,我们提出了EDITH,一个机器人框架,通过智能眼镜的连续第一人称视角和注视流捕捉人类的非语言信号,并将其与语言指令一起作为机器人策略的输入。我们的硬件系统实时将人类的第一人称视角、注视和语音传输给机器人,并将语音转录为语言指令。为了处理这些丰富但嘈杂的信号,我们设计了一个分层策略,其中高层策略推断人类的意图并生成一系列子任务,每个子任务表示为一个细粒度指令,配有一个关键帧,将意图锚定在场景中(例如,人类指向目标物体的帧)。然后低层策略执行这些子任务。在我们的人机交互任务实验中,即使意图仅被短暂表达,EDITH也能使机器人根据人类的非语言信号行动,并且与仅使用语言指令相比,显著减少了用户传达意图的努力。请访问我们的项目页面获取源代码和真实机器人演示视频。

英文摘要

For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication on language. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.

2606.11109 2026-06-10 cs.RO 新提交

EM-Fall: Embodied mmWave Sensing for Day-and-Night Fall Detection on Humanoid Robots

EM-Fall: 用于人形机器人昼夜跌倒检测的具身毫米波感知

Yanshuo Lu, Yuxuan Hu, Shenghai Yuan, Xinyu Zhou, Kuangji Zuo, Bofan Lyu, XiChen Yuan, Jianfei Yang

发表机构 * MARS Lab(MARS实验室) NTU(南洋理工大学) IOT Lab(物联网实验室)

AI总结 提出EM-Fall框架,将毫米波感知与移动人形机器人结合,通过主动调整视角实现跨房间遮挡下的跌倒检测,并设计轻量时序模型处理宠物干扰和多径效应,在8个真实环境中验证了鲁棒性。

详情
AI中文摘要

跌倒是老年人受伤和住院的主要原因之一,因此可靠的跌倒感知成为住宅环境中安全监测的重要能力。然而,现有的跌倒检测系统通常依赖于可穿戴设备或固定传感装置,可能存在用户依从性低、空间覆盖有限或在遮挡和光照不良条件下性能下降的问题。在这项工作中,我们提出了EM-Fall,一种部署在移动人形机器人上的具身跌倒检测框架。该系统将毫米波(mmWave)感知与机器人移动性相结合,使机器人能够主动调整其传感视角,并在跨房间和遮挡情况下保持目标可观测性。为了解决复杂住宅环境中的干扰,包括宠物运动和多径伪影,我们设计了一个以人为中心的感知流水线,结合轻量级时序建模,以捕捉跌倒事件前、中、后的运动演变。我们在八个真实室内环境中对四位参与者进行了系统评估,并构建了一个家庭毫米波跌倒检测数据集。实验结果表明,具身移动感知范式提高了监测连续性,并在多种环境条件下保持了鲁棒的跌倒检测性能。所提出的框架为家庭环境中的机器人辅助安全监测提供了一种实用解决方案。

英文摘要

Falls are one of the leading causes of injury and hospitalization among elderly individuals, making reliable fall awareness an essential capability for safety monitoring in residential environments. However, existing fall detection systems often rely on wearable devices or fixed sensing installations, which may suffer from low user compliance, limited spatial coverage, or degraded performance under occlusion and poor lighting conditions. In this work, we propose EM-Fall, an embodied fall detection framework deployed on a mobile humanoid robot. The system integrates millimeter-wave (mmWave) sensing with robotic mobility, allowing the robot to actively adjust its sensing viewpoint and maintain target observability across rooms and under occlusion. To address interference in complex residential environments, including pet motion and multipath artifacts, we design a human-centered perception pipeline combined with lightweight temporal modeling to capture motion evolution before, during, and after fall events. We evaluate the proposed system across eight real indoor environments with four participants and construct an in-home mmWave fall detection dataset. Experimental results show that the embodied mobile sensing paradigm improves monitoring continuity and maintains robust fall detection performance under diverse environmental conditions. The proposed framework provides a practical solution for robot-assisted safety monitoring in home environments.

2606.09836 2026-06-10 cs.HC cs.RO 交叉投稿

Equanimity in HRI: Applying Calm Technology Principles to Human-Robot Interaction

人机交互中的平和心态:将平静技术原则应用于人机交互

Barbara Sienkiewicz, Bipin Indurkhya

发表机构 * Cognitive Science Department, Jagiellonian University(雅盖隆大学认知科学系)

AI总结 本文探索将平静技术整合到人机交互中,为家庭辅助机器人设计提供指南,以促进平和、非侵入性的交互,并强调负责任机器人学与伦理考量。

Comments Conference pre-print. https://doi.org/10.1007/978-981-96-3525-2_41

详情
AI中文摘要

本文探讨如何将平静技术整合到人机交互中,特别关注家庭环境。它提供了全面的指南,用于设计优先考虑并增强人类对平和心态需求的辅助机器人,确保交互是平静、非侵入性和和谐的。本文审视了技术在当代生活中的广泛影响及其对认知能力的影响,强调了未来技术发展中负责任机器人学和伦理考量的必要性。通过将平静技术原则应用于家用机器人,本文提供了在家庭辅助机器人中应使用的具体示例和特征。目标是促进人类与机器人之间平衡、不引人注目的交互,特别是在家庭环境中,因为它是每个人生活中最私密的空间,为该领域的应用和进一步研究铺平道路。

英文摘要

This paper explores how Calm Technology can be integrated into Human-Robot Interaction (HRI), with a particular focus on the household environment. It offers comprehensive guidelines for designing assistive robots that prioritize and enhance the human need for equanimity, ensuring interactions are calm, non-intrusive, and harmonious. The paper examines the widespread influence of technology in contemporary life and its impact on cognitive capabilities, underscoring the need for responsible robotics and ethical considerations in future technological developments. By adapting Calm Technology principles to domestic robots, the article provides concrete examples and features that should be employed in household assistive robotics. The goal is to foster a balanced, unobtrusive interaction between humans and robots, especially in the home environment, as it is the most private environment in everyone's life, paving the way for applications and further research in the field.

2605.06234 2026-06-10 cs.RO cs.HC 版本更新

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

RobotEQ:从被动智能到主动智能的具身AI过渡

Kuofei Fang, Xinyi Che, Haomin Ouyang, Shufan Zhang, Xuehao Wang, Qi Liu, Liyi Liu, Chenqi Zhang, Wenxi Cai, Wenyu Dai, Jinyang Wu, Fan Zhang, Haoyu Chen, Bin He, Zheng Lian

发表机构 * State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University(自主智能无人系统国家重点实验室,同济大学) Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学) CMVS, University of Oulu(奥卢大学CMVS)

AI总结 提出RobotEQ基准,评估模型在具身场景中理解并遵守社会规范的能力,实验表明现有模型在主动智能上仍有不足,利用RAG技术可提升性能。

详情
AI中文摘要

具身AI是学术界和工业界的一个突出研究课题。当前研究集中于根据明确的用户指令完成任务。然而,为了让机器人融入人类社会,它们必须理解哪些行为是允许的、哪些是禁止的,即使没有明确指令。我们将用户引导的AI称为被动智能,无引导的AI称为主动智能。本文介绍了RobotEQ,第一个主动智能基准,旨在评估现有模型在具身场景中理解并遵守社会规范的能力。首先,我们构建了RobotEQ-Data数据集,包含1,894张以自我为中心的图像,涵盖10个代表性具身类别和56个子类别。通过大量人工标注,我们提供了4,944个动作判断问题和1,157个空间定位问题,指定了不同场景下合适的机器人动作。此外,我们建立了RobotEQ-Bench来评估最先进模型在该任务上的性能。实验结果表明,当前模型在实现可靠的主动智能方面仍有不足,特别是在空间定位上。同时,利用RAG技术结合外部社会规范知识库可以普遍提升性能。这项工作有助于推动机器人从用户引导的被动操作向主动社会合规过渡。

英文摘要

Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to the user-guided AI as passive intelligence and the unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, aiming to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,894 egocentric images, spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 4,944 action judgment questions and 1,157 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results demonstrate that current models still fall short in achieving reliable active intelligence, particularly in spatial grounding. Meanwhile, leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.

6. 具身智能与视觉语言动作模型 7 篇

2606.10340 2026-06-10 cs.RO 新提交

OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

OMG: 面向通用人形机器人的全模态运动生成

Siqiao Huang, Kun-Ying Lee, Dongming Qiao, Guanqi He, Zhenyu Wang, Yitang Li, Shaoting Zhu, Hang Zhao

发表机构 * Tsinghua University(清华大学)

AI总结 提出OMG框架,通过精心策划的数据流程和扩散模型,实现基于语言、音频和参考动作的全模态全身控制,展示了最先进的性能和可扩展性。

Comments Project Page: https://tsinghua-mars-lab.github.io/OMG/

详情
AI中文摘要

近年来,人形机器人全身控制取得了显著进展,但现有方法仍局限于需要大量奖励工程的少数技能策略,或难以扩展到新输入模态的运动跟踪器。我们认为,通用人形机器人的关键在于构建一个可扩展的大脑——一个能够处理多种条件模态的模块,位于反应式运动跟踪小脑之上,模仿生物运动系统的层次结构。实现这一愿景面临两个挑战:获取大量高质量数据以实现通用控制,以及使生成器具备处理组合式、可扩展的多模态输入的能力。我们提出了OMG,通过精心策划的数据整理、过滤和标注流程,以及基于扩散的运动生成骨干网络(可条件于语言、音频和人类参考运动),解决了这些挑战。大量实验验证了OMG作为全模态全身控制器的性能,展示了最先进的结果、模型扩展行为以及对新分布和模态的高效适应,标志着向人形机器人基础模型迈出了具体一步。

英文摘要

Humanoid whole-body control has made significant progress in recent years, yet existing approaches remain limited to few-skill policies with heavy reward engineering, or motion trackers that are difficult to extend to new input modalities. We argue that the key to general-purpose humanoid control is to build a scalable brain, a module capable of reasoning with diverse conditioning modalities, atop a reactive motion tracking cerebellum, mirroring the hierarchical structure of biological motor systems. Two challenges arise in realizing this vision: acquiring a vast amount of high-quality data to achieve general purpose control, and equipping the generator with the capability to condition on compositional, extensible multi-modal inputs. We present OMG, which addresses these challenges with a meticulous data curation, filtering and labeling pipeline, as well as a diffusion-based motion generation backbone that conditions on language, audio, and human reference motions. Extensive experiments validate OMG as an omni-modal whole-body controller exhibiting state-of-the-art performance, model scaling behavior and efficient adaptation to new distributions and modalities, marking a concrete step toward foundation models for humanoid robots.

2606.10568 2026-06-10 cs.RO 新提交

VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

VeriSpace: 面向视觉-语言-动作模型的空间基础动作验证

Guiyu Zhao, Longteng Guo, Junyou Zhu, Jun Fu, Yanghong Mei, Bin Cao, Jie Jiang, Xingjian He, Jing Liu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出VeriSpace,一种3D感知的动作验证器,通过双路径3D注入场景编码和空间基础动作推理,在测试时选择候选动作,提升VLA模型的可靠性。

Comments Submitted to ACM MM

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出强大潜力,但其测试时的可靠性仍受限于一次性动作预测,即使微小的动作误差也可能导致抓取失败、碰撞或任务进展错误。一种自然的替代方案是为VLA系统配备测试时验证,允许在执行前提出并评估多个候选动作。然而,可靠的动作验证具有挑战性,因为它不仅需要区分候选动作之间的细微几何差异,还需要评估动作是否朝着任务目标有意义地推进。我们提出VeriSpace,一种用于VLA系统测试时动作选择的3D感知动作验证器。VeriSpace通过两个关键组件评估候选动作:双路径3D注入场景编码,构建同时保留视觉语义和显式3D几何的场景表示;以及空间基础动作推理,通过推理任务相关的空间关系、几何有效性和预期的目标进展来评估每个动作。这些组件共同实现了对细微但结果关键的候选动作更可靠的区分,同时与现有VLA策略完全兼容。在公共基准和真实机器人操作任务上的实验表明,VeriSpace在分布内和分布外设置中均持续提高了底层VLA策略和先前基于验证的方法的决策可靠性,带来了显著的性能提升。

英文摘要

Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.
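The test-time verification loop described in this abstract (propose several candidate actions, score each, execute the best) can be sketched in a few lines. This is only an illustrative sketch, not VeriSpace itself: `select_action` and `make_toy_verifier` are hypothetical names, and the toy verifier simply prefers 2D actions closer to an assumed goal point, standing in for the learned 3D-aware verifier.

```python
import math


def select_action(candidates, verifier):
    """Return the candidate action that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(candidates, key=verifier)


def make_toy_verifier(goal_xy):
    """Hypothetical stand-in verifier: actions closer to the goal score higher."""
    def score(action_xy):
        return -math.dist(action_xy, goal_xy)
    return score


# Three candidate end-effector targets proposed by some policy (made-up values).
verifier = make_toy_verifier((0.5, 0.5))
candidates = [(0.0, 0.0), (0.4, 0.5), (1.0, 1.0)]
best = select_action(candidates, verifier)
```

Because selection only needs a scoring function, the underlying policy that produces the candidates stays unchanged, which matches the compatibility claim in the abstract.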

2605.29662 2026-06-10 cs.CV cs.RO 交叉投稿

SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

SAFE-Pruner: 语义注意力引导的未来感知令牌剪枝用于高效视觉-语言-动作操控

Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学)

AI总结 针对视觉-语言-动作模型推理加速中现有剪枝方法忽略深层视觉信息的问题,提出SAFE-Pruner框架,通过引入未来层注意力线索和语义注意力一致性实现前瞻性令牌剪枝,在仿真和真实实验中取得最高1.89倍加速且成功率下降小于1.7%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型的实时推理对于机器人控制至关重要。虽然视觉令牌剪枝在加速推理方面显示出巨大潜力,但现有方法主要基于浅层线索进行剪枝决策,并存在丢弃深层所需视觉信息的风险。为解决此问题,我们提出SAFE-Pruner,一种即插即用的剪枝框架,将未来层的注意力线索融入剪枝决策。具体而言,我们识别出语义注意力一致性,即VLA模型在执行步骤中倾向于将其注意力概率质量集中在同一语义实体上。基于这一观察,我们设计了一种前瞻性策略来预测深层令牌的显著性,从而防止关键令牌过早移除并实现更稳定的加速。我们进一步引入自适应子任务划分策略来检测注意力突变,从而提高预测准确性和剪枝可靠性。在仿真和真实环境中的大量实验表明,我们的方法实现了高达1.89倍的加速,成功率下降最小(低于1.7%),同时比最先进的方法高出1.9%。

英文摘要

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.
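One way to picture future-aware pruning is to rank visual tokens by a blend of the current shallow-layer attention and a forecast of deep-layer saliency. The sketch below assumes the forecast is simply the deep-layer attention cached from the previous execution step, which is one possible reading of the semantic attention consistency observation and not necessarily how SAFE-Pruner computes it. The function name `select_tokens` and the `mix` weight are hypothetical.

```python
def select_tokens(shallow_scores, prev_deep_scores, keep, mix=0.5):
    """Return the sorted indices of the visual tokens to keep.

    shallow_scores:   attention mass per token at a shallow layer (current step).
    prev_deep_scores: attention mass per token at a deep layer, cached from the
                      previous step and used here as the saliency forecast.
    mix:              weight of the forecast; mix=0 is plain shallow-layer pruning.
    """
    if len(shallow_scores) != len(prev_deep_scores):
        raise ValueError("score lists must have the same length")
    combined = [
        (1.0 - mix) * s + mix * d
        for s, d in zip(shallow_scores, prev_deep_scores)
    ]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])


shallow = [0.4, 0.3, 0.2, 0.1]   # token 3 looks unimportant early on
deep = [0.0, 0.0, 0.1, 0.9]      # but deep layers attended to it last step
shallow_only = select_tokens(shallow, deep, keep=2, mix=0.0)
future_aware = select_tokens(shallow, deep, keep=2, mix=0.5)
```

In this toy case the shallow-only ranking drops token 3, while the blended ranking keeps it, which is the premature-removal failure the abstract describes.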

2508.13446 2026-06-10 cs.RO 版本更新

CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

CAST: 反事实标签提升视觉-语言-动作模型中的指令跟随能力

Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine

发表机构 * University of California Berkeley(加州大学伯克利分校) Princeton University(普林斯顿大学)

AI总结 针对VLA模型难以遵循细粒度指令的问题,提出利用视觉语言模型生成反事实标签增强数据集,提升语言基础多样性,实验表明该方法在导航和操作任务中显著提升指令跟随成功率。

详情
AI中文摘要

通用机器人应能理解并遵循用户指令。尽管当前视觉-语言-动作(VLA)模型为将开放词汇语言指令映射到机器人动作提供了强大架构,但它们难以遵循细粒度命令。原因之一是现有机器人数据集缺乏语义多样性和语言基础,特别是对于相似观测缺乏细粒度任务多样性。为解决此问题,我们提出一种新方法,利用视觉语言模型创建反事实标签来增强现有机器人数据集。通过用这些标签增强现有数据集,我们增加了机器人数据集语言基础的多样性和粒度,最终提升了VLA的语言跟随能力。我们通过在3个不同室内外环境中进行视觉语言导航实验,评估了所得模型遵循语言指令的能力,范围从简单的以物体为中心的指令到复杂的指代任务。实验表明,反事实重标记(无需额外数据收集)显著提升了VLA策略的指令跟随能力,超越了最先进方法,并且与在未增强数据上训练的VLA相比,成功率翻倍。我们还评估了该方法在操作VLA上的表现,发现在有干扰物的任务中性能有类似提升。

英文摘要

Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.

2512.06628 2026-06-10 cs.RO cs.CV 版本更新

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

MIND-V:基于强化学习物理对齐的长期机器人操作分层世界模型

Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li

发表机构 * Tsinghua University(清华大学) X Square Robot(X Square机器人) Sun Yat-sen University(中山大学) HKUST(香港科技大学)

AI总结 提出MIND-V分层世界模型,通过语义推理、行为语义桥接和运动视频生成,结合强化学习物理对齐,实现长期机器人操作视频的物理合理合成。

详情
AI中文摘要

可扩展的具身智能受到多样化、长期机器人操作数据稀缺的限制。现有视频世界模型仅能合成简单动作的短视频,且常依赖手动定义轨迹。为此,我们提出MIND-V,一种认知分层世界模型,旨在合成物理合理且逻辑连贯的长期机器人操作视频。受认知科学启发,MIND-V通过三个核心组件桥接高层推理与像素级合成:语义推理中心(SRH)利用预训练视觉语言模型进行任务规划;行为语义桥(BSB)将抽象指令转换为域不变表示;运动视频生成器(MVG)用于条件视频渲染。MIND-V采用分阶段视觉未来展开(Staged Visual Future Rollouts)这一测试时优化策略以增强长期鲁棒性。为强制遵循物理定律,我们引入GRPO强化学习后训练阶段,由新颖的物理预见一致性(PFC)奖励引导。PFC利用V-JEPA2世界模型作为物理裁判,在潜在特征空间中惩罚不合理动态。实验证实MIND-V在长期模拟中的SOTA性能及其对策略学习的重要价值,为具身数据合成引入了可扩展且完全自主的框架。

英文摘要

Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.

2606.05645 2026-06-10 cs.RO 版本更新

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Discrete-WAM:面向世界-策略学习的统一离散视觉-动作标记编辑

Ziyang Yao, Haochen Liu, Yuncheng Jiang, Zeyu Zhu, Zibin Guo, Jingru Wang, Tianle Liu, Jianwei Cui, Kuiyuan Yang, Hongwei Xie, Jingwei Zhao, Guang Chen, Hangjun Ye

发表机构 * Xiaomi EV(小米电动车)

AI总结 提出Discrete-WAM,通过将未来视觉状态和自车动作对齐为离散标记,构建统一离散扩散框架,实现世界建模、世界-动作策略和分层决策策略的联合学习,支持可控生成和反事实推理,提升自动驾驶决策可靠性。

详情
AI中文摘要

自动驾驶需要推理自车动作如何塑造未来世界的演变,而不仅仅是将观测映射到动作。然而,大多数端到端方法依赖于直接的状态到动作模仿,而现有世界模型往往与下游策略生成对齐较弱。我们提出Discrete-WAM,一种统一的离散视觉-动作世界-策略框架,将视觉观测、未来状态、高层决策和自车动作表示在共享的标记空间中。基于这种离散对齐,Discrete-WAM通过多任务、多阶段预训练联合训练世界建模、世界-策略建模和策略建模,使动作条件的未来预测能够直接支持策略生成。在下游规划中,Discrete-WAM进一步将策略生成分解为分层决策预测和并行动作标记编辑,其中决策标记提供高层规划骨架,基于置信度的调度则高效地细化稠密的未来动作。在大规模自动驾驶基准上的实验表明,Discrete-WAM在取得强劲规划性能的同时,支持可控的未来生成、反事实评估、基于惊奇度的世界模型分析以及高效的并行策略解码。这些结果表明,离散表示对齐、统一的世界-策略训练和分层标记编辑为物理AI提供了一种有前景的设计范式。

英文摘要

Autonomous driving requires reasoning about how ego actions shape future world evolution, rather than merely mapping observations to actions. However, most end-to-end methods rely on direct state-to-action imitation, while existing world models often remain weakly aligned with downstream policy generation. We introduce Discrete-WAM, a unified discrete vision-action world-policy framework that represents visual observations, future states, high-level decisions, and ego actions within a shared token space. Built on this discrete alignment, Discrete-WAM jointly trains world modeling, world-policy modeling, and policy modeling through multi-task and multi-stage pretraining, allowing action-conditioned future prediction to directly support policy generation. For downstream planning, Discrete-WAM further decomposes policy generation into hierarchical decision prediction and parallel action-token editing, where the decision token provides a high-level planning skeleton and confidence-based scheduling refines dense future actions efficiently. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves strong planning performance while supporting controllable future generation, counterfactual evaluation, surprise-based world-model analysis, and efficient parallel policy decoding. These results suggest that discrete representation alignment, unified world-policy training, and hierarchical token editing provide a promising design paradigm for physical AI.

2510.14836 2026-06-10 cs.CV cs.RO 版本更新

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

QDepth-VLA:量化深度预测作为视觉-语言-动作模型的辅助监督

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Zhongke Huiling Robot Technology Co.(北京中科慧灵机器人技术有限公司)

AI总结 提出QDepth-VLA框架,通过辅助深度预测任务增强VLA模型的空间感知与推理能力,在仿真和真实任务中提升操作性能。

详情
AI中文摘要

空间感知和推理对于视觉-语言-动作(VLA)模型完成精细操作任务至关重要。然而,现有方法往往缺乏理解和推理精确控制所需的基本3D结构的能力。为解决这一局限,我们提出QDepth-VLA,一种通过辅助深度预测任务增强VLA模型的通用框架。设计了一个专门的深度专家,用于预测从VQ-VAE编码器获得的深度图的量化潜在令牌,使模型能够学习捕捉关键几何线索的深度感知表示。在仿真基准和真实世界任务上的实验结果表明,QDepth-VLA在操作任务上展现出强大的空间推理能力和竞争性能。

英文摘要

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
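The "quantized latent tokens" that the depth expert predicts come from vector quantization: each continuous latent vector is replaced by the index of its nearest codebook entry. The sketch below shows only that lookup step, with a made-up three-entry codebook. The function name `quantize` and all values are assumptions for illustration; the real VQ-VAE encoder and codebook are learned.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector.

    Distances are squared Euclidean. The returned indices are the discrete
    tokens that an auxiliary head could be trained to predict.
    """
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


# Hypothetical 2-D codebook and three encoder outputs for depth-map patches.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]
depth_tokens = quantize(latents, codebook)
```

Predicting token indices turns depth supervision into a classification problem over the codebook instead of per-pixel regression.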

7. 多机器人与群体系统 3 篇

2606.10986 2026-06-10 cs.RO cs.SY eess.SY 新提交

Multi-UAV Active Sensing with Information Gain-based Planning and Belief Fusion

基于信息增益规划与信念融合的多无人机主动感知

S. Habibi, L. Marques

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出多无人机主动感知框架,利用信息增益路径规划与概率信念融合实现二元地形映射,在合成和真实农业图像上验证,相比随机游走和扫描覆盖降低熵与误差。

详情
AI中文摘要

无人机越来越多地用于空间分布环境中的主动感知和信息收集。然而,其性能受到有限飞行时间、感知不确定性以及空间覆盖与观测精度之间权衡的制约。本文提出了一个多无人机主动感知框架的实际验证,用于概率二元地形映射,以精准农业作为应用案例。环境表示为概率信念图,其中空间依赖性通过因子图建模。无人机决策由基于信息增益的信息路径规划(IGbIPP)引导,并与随机游走和扫描覆盖路径规划基线在合成地形和真实无人机农业图像上进行比较。研究还评估了空间相关权重和几种用于多无人机信息共享的概率信念融合规则。结果表明,IGbIPP比基线更有效地降低了熵和映射误差,而更宽的视场提高了实际覆盖和地图精度。结果进一步表明,简单的相等或偏置空间权重比自适应权重更稳健,并且贝叶斯、对数几率与Dempster-Shafer融合实现了最佳协同映射性能。这些发现强调了不确定性驱动规划、感知几何、空间建模和概率融合对于实际无人机主动感知的重要性。

英文摘要

Unmanned aerial vehicles (UAVs) are increasingly used for active sensing and information gathering in spatially distributed environments. Their performance, however, is constrained by limited flight time, sensing uncertainty, and the trade-off between spatial coverage and observation accuracy. This paper presents a real-world validation of a multi-UAV active sensing framework for probabilistic binary terrain mapping, with precision agriculture used as the application case. The environment is represented as a probabilistic belief map, where spatial dependencies are modeled through a factor-graph formulation. UAV decision making is guided by Information Gain based Informative Path Planning (IGbIPP), and the approach is compared with Random Walk and Sweep coverage path planning baselines using both synthetic terrains and real UAV-derived agricultural imagery. The study also evaluates spatial correlation weights and several probabilistic belief-fusion rules for multi-UAV information sharing. Results show that IGbIPP reduces entropy and mapping error more effectively than the baselines, while a wider field of view improves real-world coverage and map accuracy. The results further show that simple equal or biased spatial weights can be more robust than adaptive weights, and that Bayesian, log-odds, and Dempster-Shafer fusion achieve the best cooperative mapping performance. These findings highlight the importance of uncertainty-driven planning, sensing geometry, spatial modeling, and probabilistic fusion for real-world UAV-based active sensing.
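Two quantities named in this abstract, log-odds belief fusion and information gain, have compact textbook forms for a single binary map cell. The sketch below treats cells independently and ignores the factor-graph spatial dependencies the paper models, so it is a simplified illustration only. The function names and the sensor parameters `hit_rate` and `false_alarm` are assumptions.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(l):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-l))


def fuse_log_odds(beliefs, prior=0.5):
    """Fuse independent per-UAV beliefs about one binary cell.

    Each belief is P(cell = 1 | that UAV's observations), all starting from
    the same prior. The prior's log-odds is counted once, not once per UAV.
    """
    l0 = logit(prior)
    return sigmoid(l0 + sum(logit(b) - l0 for b in beliefs))


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) cell belief."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information_gain(p, hit_rate=0.9, false_alarm=0.1):
    """Expected entropy reduction from one noisy binary observation of a cell."""
    p_z1 = hit_rate * p + false_alarm * (1.0 - p)  # P(sensor reports 1)
    p_z0 = 1.0 - p_z1
    post_z1 = hit_rate * p / p_z1 if p_z1 > 0 else p
    post_z0 = (1.0 - hit_rate) * p / p_z0 if p_z0 > 0 else p
    expected_posterior = (
        p_z1 * binary_entropy(post_z1) + p_z0 * binary_entropy(post_z0)
    )
    return binary_entropy(p) - expected_posterior


fused = fuse_log_odds([0.8, 0.8])          # two UAVs agree, confidence rises
gain_unknown = expected_information_gain(0.5)
gain_confident = expected_information_gain(0.95)
```

An information-gain planner would prefer viewpoints covering cells like the first one (belief near 0.5), since observing an already confident cell is expected to reduce entropy much less.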

2606.11088 2026-06-10 cs.RO 新提交

A Distributed Multi-UGV Exploration Framework With Loop-Aware Planning and Descriptor-Aided Localization in Resource-Limited Environments

资源受限环境下的分布式多UGV探索框架:环回感知规划与描述符辅助定位

Zhiwei Li, Haiou Liu, Xijun Zhao, Ji Li, Yingze Wang, Boyang Wang

发表机构 * School of Mechanical Engineering, Beijing Institute of Technology(北京理工大学机械与车辆学院) China North Artificial Intelligence & Innovation Research Institute, Collective Intelligence & Collaboration Laboratory (CIC)(中国北方人工智能与创新研究院集体智能与协作实验室) Zhengzhou Intelligent Technology Research Institute, Beijing Institute of Technology(北京理工大学郑州智能科技研究院)

AI总结 提出一种完全分布式的多无人地面车辆(UGV)探索框架,通过轻量级LiDAR全局描述符实现跨UGV环回检测,并结合环回感知分层规划,在资源受限环境中减少探索时间和行驶距离。

详情
Journal ref
IEEE Transactions on Industrial Electronics, 2026
AI中文摘要

在未知、无GPS且带宽受限的环境中,多无人地面车辆(UGV)在没有先验地图的情况下进行鲁棒且高效的协同探索仍然具有挑战性,因为定位漂移会降低地图一致性并导致重复覆盖。本文提出了一种完全分布式的探索框架,该框架将描述符辅助的跨UGV环回闭合与环回感知分层规划相结合,同时实现自主定位和探索。我们开发了一种轻量级的LiDAR全局描述符,具有距离图像预对齐功能,可在较大的偏航和横向变化下实现鲁棒的跨UGV地点识别,并使用验证的环回闭合来维护全局一致的轨迹和稀疏拓扑表示。我们进一步引入了一种不确定性感知的跨UGV环回闭合选择模块,该模块在姿态不确定性下对候选环回闭合进行评分,并将高实用性的环回闭合保留为规划锚点,用于全局任务分配和局部路径优化。仿真和真实UGV实验表明,环回闭合模块实现了89.9%/95.5%的AR@1/AR@1%,分布式优化减少了绝对轨迹误差,系统显著降低了双向通信量,并且与mTSP基线相比,整体框架将探索时间和行驶距离分别减少了15%和14%。

英文摘要

Robust and efficient cooperative exploration with multiple unmanned ground vehicles (UGVs) in unknown, GPS-denied, and bandwidth-limited environments without prior maps remains challenging, as localization drift degrades map consistency and induces redundant coverage. This paper presents a fully distributed exploration framework that couples descriptor-aided inter-UGV loop closure with loop-aware hierarchical planning while enabling autonomous localization and exploration. We develop a lightweight LiDAR global descriptor with range-image prealignment to enable robust cross-UGV place recognition under large yaw and lateral variations, and use verified loop closures to maintain globally consistent trajectories and a sparse topological representation. We further introduce an uncertainty-aware cross-UGV loop-closure selection module that scores candidate loop closures under pose uncertainty and retains high-utility loop closures as planning anchors for global task allocation and local route refinement. Simulations and real-UGV experiments show that the loop-closure module achieves AR@1/AR@1% of 89.9%/95.5%, distributed optimization reduces absolute trajectory error, the system substantially reduces two-way communication volume, and the overall framework reduces exploration time and travel distance by 15% and 14%, respectively, compared with an mTSP baseline.

2606.09919 2026-06-10 cs.LG cs.AI cs.MA cs.RO 交叉投稿

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

Co-GLANCE: 异构机器人团队的不确定性感知主动感知

Michal P. Podolinsky, Neel P. Bhatt, Pranay Samineni, Rohan Siva, Christian Ellis, Ufuk Topcu

AI总结 提出Co-GLANCE系统,通过蒸馏视觉语言模型实现实时遮挡分割与机器人分配,结合共形预测与选择性弃权提供统计保证的不确定性量化,驱动主动感知,在真实场景中遮挡分割和分配准确率分别提升25%和36%,推理延迟降低350倍。

Comments Code, videos, and dataset available at https://co-glance.github.io/

详情
AI中文摘要

感知不确定性是异构机器人团队在非结构化户外环境中运行的核心挑战,因为单一视角无法提供可靠的场景理解。由遮挡等来源引起的感知不确定性,根据场景结构在不同机器人视角下表现不同。检测和解决感知不确定性的来源需要基于场景的上下文推理和具备能力感知的机器人分配。虽然视觉语言模型为两者提供了强大的语义先验,但它们对于机载推理在计算上过于昂贵,且缺乏校准的不确定性量化。我们介绍了Co-GLANCE,一个用于异构机器人团队不确定性解决的实时机载感知与决策系统。Co-GLANCE将视觉语言模型的语义推理能力蒸馏为用于遮挡分割和机器人分配的端到端模型,消除了对基于云推理的需求。为了量化感知不确定性,Co-GLANCE结合了共形预测与选择性弃权,为分割、机器人分配和检测输出提供统计有效的覆盖保证。这些校准的不确定性估计直接触发主动感知,派遣最合适的机器人获取信息丰富的视角并解决不确定性。在真实世界场景中,Co-GLANCE在遮挡分割和机器人分配准确率上分别比基于云的视觉语言模型基线高出25%和36%,同时将每帧推理延迟降低350倍。我们还发布了一个空地数据集以供未来研究。代码、视频和数据集可在以下网址获取:https://co-glance.github.io/ 。

英文摘要

Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency 350x. We also release an air-ground dataset for future research. Code, videos, and dataset available at https://co-glance.github.io/ .
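The combination of conformal prediction with selective abstention can be illustrated with standard split conformal classification. The sketch below is a generic version under stated assumptions, not the Co-GLANCE implementation: the nonconformity score is taken to be one minus the probability of the true label, and the system abstains whenever the prediction set is not a single label. All names and numbers are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Threshold on nonconformity scores targeting 1 - alpha marginal coverage.

    cal_scores[i] = 1 - (probability the model gave to the true label) on a
    held-out calibration example, assumed exchangeable with test data.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]


def decide(probs, qhat):
    """Return the label if the set is a singleton, otherwise abstain (None)."""
    labels = prediction_set(probs, qhat)
    return labels[0] if len(labels) == 1 else None


cal_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
qhat = conformal_threshold(cal_scores, alpha=0.2)
confident = decide({"occluded": 0.9, "clear": 0.1}, qhat)
uncertain = decide({"occluded": 0.5, "clear": 0.5}, qhat)
```

In a pipeline like the one the abstract describes, an abstention (`None`) would be the signal that triggers active perception, for example dispatching another robot for a better viewpoint.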

8. 无人车、无人机与移动机器人 10 篇

2606.10688 2026-06-10 cs.RO 新提交

Self-Supervised Relevance Modelling in Autonomous Driving via Counterfactual Analysis

自动驾驶中基于反事实分析的自监督相关性建模

Luca Lusvarghi, Javier Gozalvez, Pablo Urbano Hidalgo

发表机构 * Networked Systems Lab, Universidad Miguel Hernandez de Elche(埃尔切米格尔·埃尔南德斯大学网络系统实验室)

AI总结 提出一种基于反事实分析的自监督方法,用于量化自动驾驶中物体的相关性,实现毫秒级实时估计,并生成相关性热图以辅助感知与规划。

AI中文摘要

自动驾驶依赖于计算密集型的感知管线,以持续检测和跟踪周围环境中的物体。虽然某些物体对于规划安全有效的操作至关重要,但其他物体可能不相关,并且对自动驾驶车辆的驾驶决策没有影响。关注相关物体可以更有效地利用可用计算资源,减少处理延迟,并限制感知噪声的下游传播。在这项工作中,我们提出了一种基于反事实分析的新型自监督方法,以开发相关性模型——一种基于AI的工具,用于量化物体对自动驾驶车辆的相关性。为了展示所提出方法的潜力,我们在选定城市场景中生成的合成因果数据集上训练了相关性模型。结果表明,该相关性模型能够以毫秒级延迟准确估计物体的相关性,从而在高密度场景中实现实时相关性估计。我们还展示了该相关性模型可用于构建相关性热图,为自动驾驶车辆的驾驶策略提供有价值的见解,并可用于主动通知感知和规划任务。我们公开发布了相关性模型和因果数据集。

英文摘要

Autonomous driving relies on computationally intensive perception pipelines to continuously detect and track objects in the surrounding environment. While some objects are key to planning safe and effective maneuvers, others may not be relevant and have no impact on the autonomous vehicle's driving decisions. Focusing on relevant objects allows more efficient use of available computational resources, reduces processing latencies, and limits the downstream propagation of perception noise. In this work, we propose a novel self-supervised approach based on counterfactual analysis to develop a relevance model, an AI-based tool that quantifies the relevance of objects for an autonomous vehicle. To demonstrate the potential of the proposed approach, we train a relevance model on a synthetic causal dataset generated in a selected urban scenario. Results show that the relevance model is able to accurately estimate the objects' relevance with millisecond-level latency, enabling real-time relevance estimation even in high-density scenarios. We also show that the relevance model can be used to build relevance heatmaps that offer valuable insights into the autonomous vehicle's driving policy and can be used to proactively inform perception and planning tasks. We openly release both the relevance model and the causal dataset.
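
The counterfactual idea in this abstract (an object is relevant if removing it changes the driving decision) can be sketched with a toy example. The planner below is a hypothetical stand-in, not the driving policy or dataset from the paper; the function names, the lane half-width, and the 50 m horizon are assumptions.

```python
def plan_speed(ego_speed, objects, lane_half_width=1.5, horizon_m=50.0):
    """Toy driving policy: slow down for the nearest in-lane object ahead.

    objects: list of (longitudinal_m, lateral_m) positions relative to the ego vehicle.
    """
    ahead = [x for x, y in objects
             if 0.0 < x <= horizon_m and abs(y) <= lane_half_width]
    if not ahead:
        return ego_speed
    return ego_speed * min(ahead) / horizon_m

def relevance(ego_speed, objects, index):
    """Counterfactual relevance: change in the plan when one object is removed."""
    factual = plan_speed(ego_speed, objects)
    counterfactual = plan_speed(ego_speed, objects[:index] + objects[index + 1:])
    return abs(counterfactual - factual)

# A lead vehicle, a vehicle two lanes over, and a vehicle hidden behind the lead.
scene = [(20.0, 0.0), (30.0, 6.0), (40.0, 0.5)]
labels = [relevance(10.0, scene, i) for i in range(len(scene))]
```

Because the labels come from re-running the policy itself, no human annotation is needed, which is the sense in which such a scheme is self-supervised.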

2606.10732 2026-06-10 cs.RO 新提交

Vehicle Prediction Model for Enhanced MPC Path Tracking in Formula Student Driverless

面向大学生无人驾驶方程式赛车增强MPC路径跟踪的车辆预测模型

Sebastian Baader, Tamara Bergerhoff, Pascal Meißner, Frank Deinzer

发表机构 * Center for Artificial Intelligence and Robotics (CAIRO)(人工智能与机器人中心(CAIRO);维尔茨堡-施韦因富特应用技术大学) TUAS Würzburg-Schweinfurt

AI总结 提出一种结合离线贝叶斯线性回归与在线稀疏高斯过程回归的实时车辆预测模型,将预测精度提升高达57%,并在实际赛车MPC路径跟踪控制器中验证有效性。

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

AI中文摘要

自动驾驶赛车,如大学生无人驾驶方程式赛车,在接近其物理操控极限下运行。由此产生的高度非线性车辆行为增加了路径跟踪的复杂性,尤其是在狭窄赛道上。模型预测控制(MPC)通常用于解决此问题,其性能与底层预测模型的准确性密切相关。本文提出一种新颖的、实时能力强的自动驾驶赛车预测模型,该模型通过结合过去运行和当前驾驶情况的信息来适应变化的条件。我们的模型分为三个连续的子模型:名义运动学自行车模型、离线贝叶斯线性回归(BLR)模型和在线稀疏高斯过程回归(SGPR)模型。所提出的方法能够在不显著增加计算成本的情况下有效整合所有可用数据,确保从运行开始就具有高预测精度和定量不确定性评估。与现有方法相比,预测精度提高了高达57%。此外,我们成功地在基于MPC的路径跟踪控制器中,在真实的大学生方程式赛车上展示了该模型的实际适用性。

英文摘要

Autonomous race cars, such as in Formula Student Driverless, operate close to their physical handling limits. The resulting highly nonlinear vehicle behavior increases the path tracking complexity, especially on narrow tracks. Model Predictive Control (MPC) is commonly used to address this issue, a method whose performance is closely tied to the accuracy of the underlying prediction model. This paper presents a novel, real-time capable prediction model for autonomous race cars that adjusts to changing conditions by combining information from past runs and the current driving situation. Our model is divided into three consecutive submodels: a nominal Kinematic Bicycle Model, an offline Bayesian Linear Regression (BLR) model, and an online Sparse Gaussian Process Regression (SGPR) model. The proposed approach enables efficient integration of all available data without significantly increasing computational cost, ensuring high prediction accuracy and a quantitative uncertainty assessment right from the start of the run. Compared to existing approaches, an improvement in prediction accuracy of up to 57% was achieved. Further, we successfully demonstrated the practical applicability of the model within an MPC-based path tracking controller on a real Formula Student race car.
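
As a rough illustration of the nominal stage named in this abstract, the sketch below implements one Euler step of a rear-axle kinematic bicycle model and adds learned corrections on top. The additive composition, the state layout `(x, y, yaw, v)`, and the `corrections` hook are assumptions; the abstract only says the three submodels are consecutive, so the BLR and SGPR stages are represented here by arbitrary callables.

```python
import math

def bicycle_step(state, accel, steer, wheelbase, dt):
    """One Euler step of a rear-axle kinematic bicycle model.

    state: (x, y, yaw, v) in metres, radians, and metres per second.
    """
    x, y, yaw, v = state
    x_next = x + v * math.cos(yaw) * dt
    y_next = y + v * math.sin(yaw) * dt
    yaw_next = yaw + v / wheelbase * math.tan(steer) * dt
    v_next = v + accel * dt
    return (x_next, y_next, yaw_next, v_next)

def predict(state, accel, steer, wheelbase, dt, corrections=()):
    """Nominal prediction plus additive residual corrections (assumed structure).

    Each correction maps (state, accel, steer) to a 4-tuple residual.
    """
    nxt = bicycle_step(state, accel, steer, wheelbase, dt)
    for correct in corrections:
        residual = correct(state, accel, steer)
        nxt = tuple(n + r for n, r in zip(nxt, residual))
    return nxt

# Hypothetical residual standing in for a learned model: slight speed loss.
drag = lambda state, accel, steer: (0.0, 0.0, 0.0, -0.5)
corrected = predict((0.0, 0.0, 0.0, 10.0), 0.0, 0.0, 1.5, 0.1, corrections=[drag])
```

Keeping the physics model as the base and learning only the residual is a common way to retain sensible predictions before any data has been collected.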

2606.10733 2026-06-10 cs.RO 新提交

Pushing the Performance Limits in Autonomous Racing: Continuous Stability-Aware Adaptive Velocity Planning in Formula Student Driverless

推动自动驾驶赛车的性能极限:大学生方程式无人驾驶中的连续稳定性感知自适应速度规划

Tamara Bergerhoff, Sebastian Baader, Pascal Meißner, Frank Deinzer

发表机构 * Center for Artificial Intelligence and Robotics (CAIRO)(人工智能与机器人中心(CAIRO);维尔茨堡-施韦因富特应用技术大学) TUAS Würzburg-Schweinfurt

AI总结 提出一种连续稳定性感知自适应速度规划方法,通过推断连续缩放因子生成摩擦图,实现实时最优目标速度计算,在真实赛车上测试圈速提升35%。

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

AI中文摘要

在自动驾驶赛车中,尤其是在大学生方程式无人驾驶等比赛中,精确规划赛车的目标速度对于实现有竞争力的圈速和稳定的驾驶行为至关重要。特别是在高速行驶时,速度规划是一项重大挑战,因为它必须实时进行,同时考虑赛道布局、环境影响、机械公差以及由此产生的控制不准确性。在本文中,我们提出了一种新颖的速度规划方法,能够动态适应这些变化的条件。该方法不是估计物理轮胎-路面摩擦系数,而是从车辆稳定性中间接推断出一个连续缩放因子。该因子不仅反映了有效的轮胎-路面相互作用,还捕捉了控制不准确性的影响。由此,我们生成一个连续的摩擦图,作为稳健、自适应的基础,用于计算考虑车辆和环境限制的最优目标速度。我们提出的方法在一辆真实的大学生方程式赛车上进行了评估,结果显示,十圈内圈速提升了35%,且与非自适应方法相比平均提升了8%。

英文摘要

In autonomous racing, especially in competitions such as Formula Student Driverless, precise planning of the target velocity of a race car is crucial for competitive lap times and stable driving behavior. Especially at high speeds, Velocity Planning (VP) is a significant challenge as it has to be performed in real time, taking into account track layouts, environmental influences, mechanical tolerances, and the resulting control inaccuracies. In this paper, we present a novel approach to VP that dynamically adapts to such changing conditions. Instead of estimating the physical Tire-Road Friction Coefficient (TRFC), a continuous scaling factor is inferred indirectly from vehicle stability. This factor not only reflects the effective tire-road interaction but also captures effects of control inaccuracies. From this, we generate a continuous friction map, which serves as a robust, adaptive basis for computing the optimal target speed, accounting for both vehicle and environmental limits. Our proposed approach was evaluated on a real Formula Student race car, showing a lap time improvement of 35% over ten laps and an average increase of 8% compared to a non-adaptive approach.

2606.10856 2026-06-10 cs.RO 新提交

An Exposure-Time-Aligned Primary-Path Architecture for Autonomous-Driving ECUs

一种曝光时间对齐的主路径架构用于自动驾驶ECU

Toru Saito, Yuki Hagura, Tatsuya Konishi, Satoru Mizusawa, Takumi Yajima

发表机构 * National Institute of Advanced Industrial Science and Technology, Japan(日本国家先进工业科学与技术研究院)

AI总结 针对生产车辆从模块化多NN流水线向端到端自动驾驶过渡的需求,提出主路径、曝光时间对齐和共路径共存三项设计原则,在双SoC平台上实现平均296ms的延迟。

AI中文摘要

虽然端到端(E2E)自动驾驶已成为主导研究方向,但在一个非平凡的过渡期内,量产车辆仍然依赖模块化的多NN流水线。本文的主题是设计一种架构,在此阶段支持模块化流水线和E2E路径并行,并嵌入一条用于分阶段迁移的路径。移植到量产SoC上,平等主义的后期融合计算效率低下,且没有自然单元用于分阶段的E2E替代。作为替代方案,我们提出三项设计原则:(i)主路径,明确选择一条主感知链,并优先于非关键路径将其封装在单个SoC对内;(ii)曝光时间对齐,将主传感器的曝光时间τ_exp作为标签沿链传播,并在匹配的τ_exp上事件驱动融合节点,而非固定周期;(iii)共路径共存,基于(i)和(ii),让E2E输出路径与模块化流水线在同一τ_exp周期内并行运行。在双SoC量产AD-ECU上,实现从相机快门到规划器输出的平均延迟为296毫秒,在350毫秒的设计预算内。在(iii)下,模块化流水线在生产启动时为主路径,E2E路径作为影子在实车上运行,随着评估证据的积累,E2E范围逐步扩大。

英文摘要

While end-to-end (E2E) autonomous driving has become the dominant research direction, production vehicles continue to rely on modular multi-NN pipelines for a non-trivial transitional period. The subject of this paper is the design of an architecture that, during this phase, supports a modular pipeline and an E2E path side by side and embeds a path for staged migration. Transplanted to a production SoC, egalitarian late fusion is compute-inefficient and offers no natural unit for staged E2E substitution. As an alternative, we propose three design principles: (i) Primary-Path, which explicitly selects a primary perception chain and prioritizes its enclosure within a single SoC pair over the non-critical paths; (ii) Exposure-Time-Aligned, which propagates the primary sensor's exposure time $τ_{\rm exp}$ as a tag along the chain and event-drives the fusion node on matched $τ_{\rm exp}$ rather than a fixed cycle; and (iii) Co-Path Coexistence, which, building on (i) and (ii), lets an E2E output path co-run with the modular pipeline within the same $τ_{\rm exp}$ cycle. On a Dual-SoC production AD-ECU, the implementation closes camera-shutter to planner-output latency at a mean of 296 ms within the 350 ms design budget. Under (iii), the modular pipeline is primary at production launch and the E2E path runs as shadow on real vehicles, and the E2E scope is expanded as evaluation evidence accumulates.
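
Principle (ii) in this abstract, firing the fusion node on matched exposure-time tags instead of on a fixed cycle, might look roughly like the sketch below. The class name, the source names, the integer tags, and the `max_pending` eviction rule are hypothetical; the abstract does not describe the actual ECU software.

```python
class ExposureAlignedFusion:
    """Event-driven fusion node keyed by the primary sensor's exposure-time tag."""

    def __init__(self, required_sources, on_fuse, max_pending=8):
        self.required = set(required_sources)
        self.on_fuse = on_fuse          # callback(tau_exp, {source: payload})
        self.max_pending = max_pending  # bound on incomplete exposures kept
        self.pending = {}               # tau_exp -> {source: payload}

    def push(self, source, tau_exp, payload):
        slot = self.pending.setdefault(tau_exp, {})
        slot[source] = payload
        if self.required.issubset(slot):
            # Every input tagged with this exposure has arrived: fuse now.
            self.on_fuse(tau_exp, self.pending.pop(tau_exp))
        # Evict the oldest incomplete exposures so stale data cannot pile up.
        while len(self.pending) > self.max_pending:
            del self.pending[min(self.pending)]

fused = []
node = ExposureAlignedFusion({"camera", "lidar"}, lambda t, d: fused.append((t, d)))
node.push("camera", 100, "img@100")
node.push("camera", 133, "img@133")
node.push("lidar", 100, "pts@100")   # completes exposure 100 and triggers fusion
```

Fusion latency then tracks the slowest input for a given exposure instead of the phase of a periodic timer.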

2606.10857 2026-06-10 cs.RO cs.LG 新提交

Embodiment-conditioned Generalist Control for Multirotor Aerial Robots

基于具身条件的多旋翼空中机器人通用控制

Orestis Konstantaropoulos, Welf Rehberg, Mihir Kulkarni, Kostas Alexis

发表机构 * Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway(挪威科技大学工程控制论系)

AI总结 提出一种通用位置控制策略,通过物理具身描述符(质量与惯性归一化控制分配矩阵)实现单一网络权重控制任意多旋翼构型,采用PPO训练,训练仅需五分钟,并可零样本迁移至真实世界。

AI中文摘要

我们提出了一种通用位置控制策略,能够使用单一网络权重控制具有特定旋翼数量(例如六旋翼或四旋翼)的任意多旋翼构型。该策略基于一个物理驱动的具身描述符:一个质量和惯性归一化的控制分配矩阵,该矩阵捕捉了质量归一化的电机推力如何在机体坐标系中产生线性和角加速度。为了训练该策略,我们从任意多旋翼构型的广泛分布中采样,包括非平面和非对称系统,并使用近端策略优化(PPO)优化单个紧凑网络。训练仅需在RTX 3090 GPU上使用基于NVIDIA Warp的自定义动力学模拟器进行五分钟。通过大量仿真实验,我们展示了具身条件化使得通用控制能够在任意形态下鲁棒工作。我们还在三种不同的六旋翼系统上展示了该通用策略的零样本真实世界迁移,包括一个平面机器人、一个部分对称的非平面系统,以及一个随机非对称非平面构型。

英文摘要

We present a generalist position control policy capable of controlling arbitrary multirotor configurations of a certain rotor count (e.g., hexarotors or quadrotors) with a single set of network weights. The policy is conditioned on a physics-grounded embodiment descriptor: a mass- and inertia-normalized control allocation matrix that captures how mass-normalized motor thrusts generate linear and angular accelerations in the body frame. To train the policy, we sample from a broad distribution of arbitrary multirotor configurations, including non-planar and asymmetric systems, and optimize a single, compact network using Proximal Policy Optimization. Training requires only five minutes on an RTX 3090 GPU using a custom NVIDIA Warp-based dynamics simulator. Through extensive simulation experiments, we show that embodiment conditioning enables robust generalist control across arbitrary morphologies. We demonstrate zero-shot real-world transfer of this generalist policy on three diverse hexarotor systems, including a planar robot, a partially symmetric non-planar system, and a random asymmetric, non-planar configuration.
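
One plausible reading of the embodiment descriptor in this abstract is a matrix whose columns give the body-frame linear and angular acceleration produced by one unit of mass-normalized thrust on each rotor. The sketch below builds such a matrix under stated assumptions: a diagonal inertia tensor, a fixed drag-torque-to-thrust ratio `torque_coeff`, and a rotor tuple layout of `(position, thrust_axis, spin)`. None of these details are confirmed by the abstract.

```python
def cross(a, b):
    """3D cross product of two tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalized_allocation(rotors, mass, inertia_diag, torque_coeff):
    """6 x n matrix mapping mass-normalized thrusts u_i = f_i / m to
    body-frame linear (rows 0-2) and angular (rows 3-5) acceleration.

    rotors: list of (position, thrust_axis, spin), spin in {+1, -1}.
    inertia_diag: principal moments of inertia (diagonal inertia assumed).
    """
    columns = []
    for position, axis, spin in rotors:
        arm = cross(position, axis)
        torque = [arm[k] + spin * torque_coeff * axis[k] for k in range(3)]
        linear = list(axis)                        # f_i * axis / m = u_i * axis
        angular = [mass * torque[k] / inertia_diag[k] for k in range(3)]
        columns.append(linear + angular)
    return [[col[row] for col in columns] for row in range(6)]

def accelerations(matrix, u):
    """Apply the allocation matrix to a vector of mass-normalized thrusts."""
    return [sum(a * ui for a, ui in zip(row, u)) for row in matrix]

# Hypothetical planar quadrotor with 0.1 m arms and alternating spin directions.
d = 0.1
quad = [((d, d, 0.0), (0.0, 0.0, 1.0), +1), ((-d, d, 0.0), (0.0, 0.0, 1.0), -1),
        ((-d, -d, 0.0), (0.0, 0.0, 1.0), +1), ((d, -d, 0.0), (0.0, 0.0, 1.0), -1)]
B = normalized_allocation(quad, mass=1.0, inertia_diag=(0.01, 0.01, 0.02), torque_coeff=0.016)
hover = accelerations(B, [2.5, 2.5, 2.5, 2.5])
```

Normalizing by mass and inertia means two vehicles of different size but identical acceleration response would present the same descriptor to the policy.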

2606.10971 2026-06-10 cs.RO cs.SY eess.SY 新提交

Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection

利用加加速度增强模型与仅基于IMU的干扰抑制实现自主农业机器人的弹性导航

Batu Candan, Mohammed Atallah, Simone Servadio, Saeed Arabi

发表机构 * Iowa State University(爱荷华州立大学) Salin247

AI总结 针对农业机器人传感器中断和振动问题,提出加加速度增强EKF与多调谐因子自适应方法,动态调整测量协方差,显著降低3D位置RMSE。

AI中文摘要

自主农业机器人导航的精确状态估计常因传感器中断(GNSS/激光雷达/视觉)和越野环境固有的高频振动而受损。本文提出一种基于加加速度增强扩展卡尔曼滤波器(EKF)与多调谐因子(MTF)自适应方法集成的鲁棒导航算法。与假设恒定测量噪声的标准EKF方法不同,我们的方法实时动态调整测量协方差矩阵,使系统能够应对突然干扰和传感器异常。我们使用Salin247自主机器人的真实数据评估该算法。结果表明,与基线EKF模型相比,加加速度增强结合MTF自适应显著降低了3D位置均方根误差(RMSE),提供了卓越的航位推算能力。

英文摘要

Precise state estimation for navigation of autonomous agricultural robots is often compromised by sensor outages (GNSS/LiDAR/Visual) and high-frequency vibrations inherent in off-road environments. This paper proposes a robust navigation algorithm based on a jerk-augmented Extended Kalman Filter (EKF) integrated with a Multiple Tuning Factor (MTF) adaptation method. Unlike standard EKF approaches that assume constant measurement noise, our method dynamically adjusts the measurement covariance matrix in real time, allowing the system to cope with sudden disturbances and sensor outliers. We evaluate the algorithm using real-world data from a Salin247 autonomous robot. Results demonstrate that jerk augmentation combined with MTF adaptation significantly reduces 3D position Root Mean Square Error (RMSE) compared to baseline EKF models, providing superior dead-reckoning capabilities.
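
To make the contrast with a constant-noise filter concrete, here is a minimal scalar sketch of a Kalman measurement update that inflates the measurement variance when the innovation is implausibly large. It uses a generic innovation-based scaling rule as an assumption; it is not the paper's MTF method, which adapts a full covariance matrix with multiple factors.

```python
def adaptive_update(x, P, z, R, H=1.0):
    """Scalar Kalman measurement update with innovation-based inflation of R.

    Returns the updated state, the updated variance, and the factor applied to R.
    """
    innovation = z - H * x
    S_nominal = H * P * H + R                       # variance the filter expects
    factor = max(1.0, innovation ** 2 / S_nominal)  # > 1 only for outlying innovations
    R_adapted = factor * R
    K = P * H / (H * P * H + R_adapted)             # gain shrinks as R is inflated
    x_new = x + K * innovation
    P_new = (1.0 - K * H) * P
    return x_new, P_new, factor

# A consistent measurement versus an outlier, from the same prior.
consistent = adaptive_update(0.0, 1.0, 0.5, 1.0)
outlier = adaptive_update(0.0, 1.0, 10.0, 1.0)
```

With these inputs the outlier moves the estimate less than the consistent measurement does, which is the behaviour a fixed `R` cannot provide.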

2606.10974 2026-06-10 cs.RO 新提交

Language-Driven Cost Optimization for Autonomous Driving

语言驱动的自动驾驶成本优化

Diego Martinez-Baselga, Khaled Mustafa, Javier Alonso-Mora

发表机构 * TU Delft(代尔夫特理工大学)

AI总结 提出语言驱动框架,利用大语言模型解释场景和用户查询,生成风险感知MPPI控制器的参数,并通过人机交互验证和反馈迭代优化自动驾驶行为。

Comments Paper accepted at IEEE Intelligent Transportation Systems Conference (ITSC) 2026

AI中文摘要

自动驾驶车辆的驾驶行为通常由其运动规划器的成本函数控制,该函数编码了速度跟踪、平滑性、车道保持和碰撞避免等目标。然而,调整构成该成本函数的参数是一项需要技术专长的挑战性任务,限制了车辆适应不断变化的交通场景或最终用户偏好的能力。本文提出了一种语言驱动的自适应成本设计框架,用于自动驾驶。大语言模型(LLM)解释结构化场景描述和自然语言用户查询,生成应用于风险感知模型预测路径积分(MPPI)控制器的参数。该系统包含一个人在环验证阶段,在该阶段中,拟议的行为变化以非技术语言描述,并在部署前确认。用户还可以在部署前后提供反馈,从而实现车辆运动行为的迭代优化。该框架在真实驾驶场景中通过多个查询进行评估,以评估其有效性。仿真结果表明,该方法成功诱导了与预期要求一致的行为变化,且方式直观,从而弥合了智能车辆控制系统与最终用户之间的差距。

英文摘要

The driving behavior of autonomous vehicles is typically governed by the cost function of their motion planner, which encodes objectives such as speed tracking, smoothness, lane keeping, and collision avoidance. However, tuning the parameters that shape this cost function is a challenging task that requires technical expertise, limiting the vehicle's ability to adapt to evolving traffic scenarios or end-user preferences. This work presents a language-driven framework for adaptive cost design in autonomous driving. A Large Language Model (LLM) interprets structured scenario descriptions and natural language user queries to generate the parameters applied to a risk-aware Model Predictive Path Integral (MPPI) controller. The system incorporates a human-in-the-loop validation stage in which the proposed behavioral changes are described in non-technical language and confirmed prior to deployment. Users may additionally provide feedback either before or after deployment, enabling iterative refinement of the vehicle's motion behavior. The framework is evaluated across multiple queries in realistic driving scenarios to assess its effectiveness. Simulation results demonstrate that the method successfully induces behavioral changes that align with the intended requirements in an intuitive manner, thereby bridging the gap between intelligent vehicle control systems and end users.
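
The link between cost parameters and behavior described in this abstract can be sketched with the standard MPPI weighting step. The cost-term names and the two weight sets below are hypothetical examples of what a language model might output; the risk-aware formulation used in the paper is not reproduced here.

```python
import math

def trajectory_cost(features, weights):
    """Weighted sum of cost terms; `weights` are the tunable parameters."""
    return sum(weights[name] * value for name, value in features.items())

def mppi_weights(costs, temperature):
    """Softmin importance weights MPPI uses to average sampled controls."""
    best = min(costs)  # subtracting the minimum keeps exp() well scaled
    unnorm = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mppi_update(nominal, perturbations, costs, temperature):
    """Shift the nominal control sequence by the cost-weighted perturbations."""
    w = mppi_weights(costs, temperature)
    return [u + sum(wk * eps[t] for wk, eps in zip(w, perturbations))
            for t, u in enumerate(nominal)]

# Two sampled rollouts: one fast but close to obstacles, one slow but clear.
fast_close = {"speed_tracking": 0.1, "collision": 0.5}
slow_clear = {"speed_tracking": 1.0, "collision": 0.0}
sporty = {"speed_tracking": 5.0, "collision": 1.0}     # hypothetical parameter set
cautious = {"speed_tracking": 1.0, "collision": 10.0}  # hypothetical parameter set
w_sporty = mppi_weights([trajectory_cost(f, sporty) for f in (fast_close, slow_clear)], 1.0)
w_cautious = mppi_weights([trajectory_cost(f, cautious) for f in (fast_close, slow_clear)], 1.0)
```

Changing only the weight dictionary flips which rollout dominates the average, so the controller itself needs no retuning when the requested driving style changes.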

2606.11019 2026-06-10 cs.RO cs.AI 新提交

Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving

扩散强制规划器:基于时间依赖引导的历史退火规划用于自动驾驶

Zehan Zhang, Neng Zhang, Yaoyi Li, Jia Cai, Zhiling Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Yinwang Intelligent Technology Co., Ltd(引望智能技术有限公司) Hefei Institutes of Physical Science, Chinese Academy of Sciences(中国科学院合肥物质科学研究院)

AI总结 提出扩散强制规划器(DFP),通过历史引导控制实现异构联合扩散过程,结合退火历史的条件引导,解决运动规划中的时间不一致问题,在nuPlan上取得竞争性能。

Comments CVPR2026

AI中文摘要

基于学习的运动规划器尽管近期取得进展,但常常遭受时间不一致性问题。跨帧的小扰动可能累积成不稳定的轨迹,降低闭环驾驶的舒适性和安全性。几种方法尝试将历史作为静态条件信号注入以稳定输出,却导致规划器复制历史模式而非适应环境上下文。为解决这一限制,我们提出扩散强制规划器(DFP),一种由历史引导控制驱动的基于扩散的规划框架。具体地,DFP将完整轨迹分解为历史段、当前段和未来段,并为每个段分配独立的噪声水平。模型联合去噪历史段和未来段,强制执行异构联合扩散过程。在推理时,使用无分类器引导(CFG)以可控方式利用退火历史引导未来采样。在nuPlan上的闭环评估和全面消融实验表明,DFP在复杂驾驶场景中实现了竞争性能,同时生成连续、稳定且可控的运动规划。

英文摘要

Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current and future segments, and assigns independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.
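
The classifier-free guidance step mentioned in this abstract combines a history-conditioned and an unconditioned noise prediction. The sketch below shows that combination with a guidance weight that varies over denoising steps. The linear schedule and both function names are placeholders: the abstract does not state the form of DFP's time-dependent guidance.

```python
def guidance_scale(step, num_steps, w_start, w_end):
    """Linearly interpolate the guidance weight across denoising steps (assumed schedule)."""
    if num_steps <= 1:
        return w_end
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction
    towards (w = 1) or beyond (w > 1) the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Strong history guidance early in sampling, weaker guidance at the end.
w_first = guidance_scale(0, 10, w_start=3.0, w_end=1.0)
w_last = guidance_scale(9, 10, w_start=3.0, w_end=1.0)
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], w_first)
```

A weight of 0 ignores the history entirely and a weight of 1 reproduces the conditional prediction, so the schedule directly controls how strongly past motion constrains the future plan.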

2203.03018 2026-06-10 cs.RO cs.SY eess.SY 版本更新

RAPTOR: Rapid Aerial Pickup and Transport of Objects by Robots

RAPTOR: 机器人快速空中抓取与运输物体

Aurel Appius, Erik Bauer, Marc Blöchlinger, Aashi Kalra, Robin Oberson, Arman Raayatsanati, Pascal Strauch, Sarath Suresh, Marco von Salis, Robert K. Katzschmann

发表机构 * Soft Robotics Lab, ETH Zurich, Switzerland(软机器人实验室,苏黎世联邦理工学院,瑞士)

AI总结 提出一种结合软材料Fin Ray夹爪和Fast DDS中间件的四旋翼平台RAPTOR,实现高速飞行中对不同几何形状物体的灵活抓取,平均抓取成功率83%,有效载荷达先前工作的四倍。

Comments 7 pages, 10 figures, accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022. Video: https://youtu.be/KHkBlBABsC8 Project page: https://srl-ethz.github.io/RAPTOR

AI中文摘要

通过机器人进行快速空中抓取可以推动许多利用物体快速动态抓取和放置的应用。传统用于空中机械臂的刚性夹爪需要高精度和特定物体几何形状才能成功抓取。我们提出RAPTOR,一种四旋翼平台结合定制Fin Ray夹爪,利用软材料的特性增加夹爪与物体之间的接触面,从而实现对不同几何形状物体的更灵活抓取。为了减少通信延迟,我们提出一种基于Fast DDS(数据分发服务)的新型轻量级中间件解决方案,作为ROS(机器人操作系统)的替代方案。我们展示了RAPTOR在真实环境中以平均1 m/s的速度抓取四种不同几何形状物体时,平均抓取成功率达到83%。在高速设置下,RAPTOR的有效载荷是先前工作的四倍。我们的结果突显了空中无人机在自动化仓库和其他需要速度、敏捷性和鲁棒性且在难以到达区域操作的操作应用中的潜力。

英文摘要

Rapid aerial grasping through robots can lead to many applications that utilize fast and dynamic picking and placing of objects. Rigid grippers traditionally used in aerial manipulators require high precision and specific object geometries for successful grasping. We propose RAPTOR, a quadcopter platform combined with a custom Fin Ray gripper to enable more flexible grasping of objects with different geometries, leveraging the properties of soft materials to increase the contact surface between the gripper and the objects. To reduce the communication latency, we present a new lightweight middleware solution based on Fast DDS (Data Distribution Service) as an alternative to ROS (Robot Operating System). We show that RAPTOR achieves an average of 83% grasping efficacy in a real-world setting for four different object geometries while moving at an average velocity of 1 m/s during grasping. In a high-velocity setting, RAPTOR supports up to four times the payload compared to previous works. Our results highlight the potential of aerial drones in automated warehouses and other manipulation applications where speed, swiftness, and robustness are essential while operating in hard-to-reach places.

2606.03963 2026-06-10 cs.RO cs.AI 版本更新

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

AgenticRL: 面向视觉条件无人机导航的自优化智能体强化学习

Roohan Ahmed Khan, Yasheerah Yaqoot, Amir Atef Habel, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AgenticRL框架,利用多模态GPT智能体自动设计奖励函数、通过闭环自改进优化策略,在多种无人机导航任务中提升性能并实现高成功率。

AI中文摘要

深度强化学习在使自主机器人学习复杂导航任务方面显示出巨大潜力。然而,其实际应用仍然严重依赖于人工设计的奖励函数和重复的手动微调,这既耗时又无法保证在目标任务中取得高成功率。本文提出了AgenticRL,一种智能体引导的强化学习框架,用于提高无人机导航任务中奖励设计、策略优化和实际部署的自主性。AgenticRL使用多模态生成预训练变换器(GPT)智能体来解释任务信息和视觉场景观察,生成特定于任务的奖励函数,使用近端策略优化(PPO)算法训练策略,然后通过诊断包评估训练后的策略作为批评者,生成反馈。基于该反馈,智能体识别失败模式并在闭环自改进过程中优化奖励函数。为了在推理期间进一步利用多模态GPT智能体,AgenticRL使用真实世界图像和自然语言任务信息自动识别活动场景并选择适当的训练策略执行。该框架在多种导航任务上进行了评估,包括穿越门、避障、穿越墙障并着陆、轨迹跟踪和运动行为学习。实验结果表明,与初始奖励相比,闭环优化过程将策略行为提升了71%。我们还展示了所提出框架的仿真到现实迁移,实现了91%的真实世界成功率和94%的仿真到现实准确率。

英文摘要

Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human-designed reward functions and repeated manual fine-tuning, which is time-consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real-world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real-world images and natural-language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior by 71% compared with the initial rewards. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real-world success rate of 91% and a sim-to-real accuracy of 94%.

9. 软体机器人与硬件设计 3 篇

2407.05886 2026-06-10 cs.RO 版本更新

Rod models in continuum and soft robot control: a review

连续体和软体机器人控制中的杆模型:综述

Carlo Alessi, Camilla Agabiti, Daniele Caradonna, Cecilia Laschi, Federico Renda, Egidio Falotico

发表机构 * Istituto Italiano di Tecnologia(意大利技术研究院) The BioRobotics Institute(生物机器人研究所) Department of Excellence in Robotics and AI(机器人与人工智能卓越部门)

AI总结 本文综述了杆模型在连续体和软体机器人建模与控制中的应用,涵盖数学基础、机器人建模及控制策略,并讨论了其优势、局限和未来方向。

AI中文摘要

连续体和软体机器人可以变革在受限或非结构化环境中需要柔顺交互的自动化任务,包括医疗、农业、海洋和太空应用。然而,其复杂的力学特性给建模和控制带来了重大挑战。低维连续介质力学模型,如杆理论,能够有效捕捉细长体在接触丰富场景中的大变形,同时平衡精度和计算效率。本文对连续体和软体机器人的杆模型进行了纵向综述,涵盖其数学基础、机器人建模和控制应用。我们回顾了软体机器人中采用的主要杆理论,并引入了一种基于变形的杆模型分类方法。此外,我们调查了近期基于模型和基于学习的利用杆模型的控制策略,强调了它们在操作和物理交互任务中的作用。最后,我们讨论了基于杆的方法的优势、局限性、研究空白和新兴方向。本文旨在为开发连续体和软体机器人的模型和控制策略提供参考。

英文摘要

Continuum and soft robots can transform automation tasks requiring compliant interaction in constrained or unstructured environments, including healthcare, agriculture, marine, and space applications. However, their complex mechanics introduce significant challenges in modeling and control. Low-dimensional continuum mechanical models, such as rod theories, effectively capture the large deformations of slender bodies in contact-rich scenarios while balancing accuracy and computational efficiency. This paper presents a vertical survey of rod models for continuum and soft robots, spanning their mathematical foundations, robot modeling, and control applications. We review the main rod theories adopted in soft robotics and introduce a deformation-based classification of rod models for continuum and soft robots. Furthermore, we survey recent model-based and learning-based control strategies leveraging rod models, highlighting their role in manipulation and physical interaction tasks. Finally, we discuss advantages, limitations, research gaps, and emerging directions of rod-based approaches. This paper aims to serve as a reference for developing models and control strategies for continuum and soft robots.

2602.21331 2026-06-10 cs.RO 版本更新

CableRobotGraphSim: A Graph Neural Network for Modeling Partially Observable Cable-Driven Robot Dynamics

CableRobotGraphSim:一种用于建模部分可观测缆索驱动机器人动力学的图神经网络

Nelson Chen, William R. Johnson, Rebecca Kramer-Bottiglio, Kostas Bekris, Mridul Aanjaneya

发表机构 * Rutgers University(罗格斯大学) Yale University(耶鲁大学)

AI总结 提出CableRobotGraphSim,一种图神经网络模型,通过将缆索驱动机器人表示为图(刚体为节点,缆索和接触为边),仅利用部分可观测输入即可快速准确匹配其他仿真和真实机器人,并采用仿真-真实联合训练提升鲁棒性,最后集成MPPI控制器实现闭环导航。

AI中文摘要

通用仿真器加速了机器人的发展。然而,基于第一性原理的传统仿真器通常需要全状态可观测性或依赖参数搜索进行系统辨识。本文提出CableRobotGraphSim,一种用于缆索驱动机器人的新型图神经网络(GNN)模型,旨在解决先前仿真方案的不足。通过将缆索驱动机器人表示为图,其中刚体作为节点,缆索和接触作为边,该模型能够快速准确地匹配其他仿真模型和真实机器人的特性,同时仅接收部分可观测输入。伴随GNN模型的是一个仿真-真实联合训练过程,该过程促进了对噪声真实数据的泛化能力和鲁棒性。该模型进一步与模型预测路径积分(MPPI)控制器集成,用于闭环导航,展示了模型的速度和准确性。

英文摘要

General-purpose simulators have accelerated the development of robots. Traditional simulators based on first principles, however, typically require full-state observability or depend on parameter search for system identification. This work presents CableRobotGraphSim, a novel Graph Neural Network (GNN) model for cable-driven robots that aims to address shortcomings of prior simulation solutions. By representing cable-driven robots as graphs, with the rigid bodies as nodes and the cables and contacts as edges, this model can quickly and accurately match the properties of other simulation models and real robots, while ingesting only partially observable inputs. Accompanying the GNN model is a sim-and-real co-training procedure that promotes generalization and robustness to noisy real data. This model is further integrated with a Model Predictive Path Integral (MPPI) controller for closed-loop navigation, which showcases the model's speed and accuracy.

2605.12804 2026-06-10 cs.RO 版本更新

BiPneu: Design and Control of a Bipolar-Pressure Pneumatic System for Soft Robots

BiPneu:用于软体机器人的双极气压气动系统的设计与控制

Yu Mei, Xinyu Zhou, Vedant Naik, Alan Gao, Xiaobo Tan

发表机构 * Department of Electrical and Computer Engineering, Michigan State University(电气与计算机工程系,密歇根州立大学)

AI总结 提出一种可扩展、高性价比的多通道双极气压气动系统BiPneu,并设计基于混合电-气动模型的双模式滑模控制器(DM-SMC),实现宽范围、精确、快速的压力调节,在软体机器人应用中显著优于MPC和PID控制器。

Comments Full Version of BiPneu, including the supplementary materials

Journal ref IEEE/ASME Transactions on Mechatronics, 2026
AI中文摘要

正负压力调节对于软体机器人执行器至关重要,可实现大运动范围和多种驱动模式。然而,由于不对称的充放气动力学、阀门非线性以及切换引起的流量扰动,在两种压力极性下实现高性能调节仍然具有挑战性。本文提出BiPneu,一种可扩展且经济高效的多通道双极气压气动系统,用于软体机器人,能够实现宽范围、精确和快速的压力调节,同时与高级软件生态系统无缝兼容。基于混合电-气动模型,提出了一种带有滞后监督模式选择的双模式滑模控制器(DM-SMC)。广泛的仿真和实验表明,与先进模型预测控制器和良好调谐的PID控制器相比,DM-SMC在跟踪阶跃和正弦压力参考方面具有优越性能。实验结果显示,多步测试中平均绝对误差为1.44 kPa,正弦跟踪中为4.23 kPa,相对于PID控制分别降低了11.9%和35.6%,同时改善了控制力度、阀门切换速率和瞬态响应。DM-SMC的鲁棒性在具有压力依赖体积的波纹管执行器上得到进一步验证。最后,通过两个软体机器人示例——使用软体并联执行器快速控球和基于实时有限元方法(FEM)的软体波纹管执行器遥操作——展示了BiPneu的能力。

英文摘要

Positive-negative pressure regulation is critical to soft robotic actuators, enabling large motion ranges and versatile actuation modes. However, achieving high-performance regulation across both pressure polarities remains challenging due to asymmetric inflation-deflation dynamics, valve nonlinearities, and switching-induced flow disturbances. This paper presents BiPneu, a scalable and cost-efficient multi-channel bipolar-pressure pneumatic system for soft robots that enables wide-range, accurate, and responsive pressure regulation while providing seamless compatibility with high-level software ecosystems. A dual-mode sliding-mode controller (DM-SMC) with hysteresis-supervised mode selection is proposed based on a hybrid electro-pneumatic model. Extensive simulation and experiments demonstrate the superior performance of DM-SMC in tracking step and sinusoidal pressure references compared with both advanced model predictive controllers and well-tuned PID controllers. Experimental results show average absolute errors of 1.44 kPa in multi-step tests and 4.23 kPa in sinusoidal tracking, corresponding to reductions of 11.9% and 35.6% relative to PID control, along with improved control effort, valve switching rate, and transient response. Robustness of DM-SMC is further verified on a bellows actuator with pressure-dependent volume. Finally, BiPneu's capability is demonstrated via two soft robotic examples: quick ball-maneuvering with a soft parallel manipulator and real-time finite element method (FEM)-based teleoperation of a soft bellows actuator.

10. 仿真、数据集与评测 6 篇

2606.10229 2026-06-10 cs.RO cs.LG 新提交

What Demonstration Curation Metrics Do to Your Policy

演示筛选指标对策略的影响

Aarav Bedi

AI总结 研究演示筛选指标在检测缺陷演示后,是否提升基于行为克隆的策略性能。发现指标检测缺陷的能力与策略性能严重脱钩,并揭示演示时长作为混淆变量的影响。

Comments 6 pages, 1 figure, 2 tables

AI中文摘要

我们研究了检测缺陷训练演示的筛选指标是否也能改善基于筛选数据训练的行为克隆策略。在一个接触密集的LIBERO抓取放置基准任务中,通过引入受控结构缺陷(搬运阶段早期释放夹爪),我们发现这两个量是严重解耦的。具有最高缺陷检测AUROC(0.804)的指标产生了最差的筛选策略(任务成功率13.3%),而AUROC显著较低(0.638)的指标产生的策略几乎与在真实干净数据上训练的Oracle策略相匹配(90.0% vs. 93.3%)。我们进一步表明,我们评估的七个指标中有五个利用演示时长作为缺陷标签的琐碎代理,这种混淆因素将报告的AUROC膨胀到接近完美的值,并且在控制演示时长后消失。在所有条件下,受污染的基线仅在3.3%的测试中成功,而两种最佳的筛选方法将差距缩小到Oracle上限93.3%的3个百分点以内。我们的结果认为,筛选方法应根据其产生的策略来评估,而不是根据其标记的缺陷,并且任何筛选基准在报告检测准确性之前必须控制演示时长。我们发布了测试平台、所有指标实现和评估流程。

英文摘要

We study whether demonstration-curation metrics that detect defective training episodes also improve the downstream behavior-cloning policy that trains on the curated data. On a contact-rich LIBERO pick-and-place benchmark with a controlled structural defect (early gripper release during the carry phase), we find that the two quantities are sharply decoupled. The metric with the highest defect-detection AUROC (0.804) produces the worst curated policy (13.3% task success), while a metric with a substantially lower AUROC (0.638) produces a policy that nearly matches the oracle trained on ground-truth clean data (90.0% vs. 93.3%). We further show that five of the seven metrics we evaluate exploit episode length as a trivial proxy for the defect label, a confound that inflates reported AUROCs to near-perfect values and disappears once episode length is controlled. Across all conditions, the contaminated baseline succeeds on only 3.3% of rollouts, and the two best curation methods close this to within 3 percentage points of the 93.3% oracle ceiling. Our results argue that curation methods should be evaluated by the policy they produce, not the defects they flag, and that any curation benchmark must control for episode length before reporting detection accuracy. We release the testbed, all metric implementations, and the evaluation pipeline.
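
The episode-length confound described in this abstract can be checked with a small sketch: compute the AUROC of a curation score over all episodes, then recompute it using only pairs of episodes of similar length. Binning by length is one simple control and is an assumption here; the abstract does not say how the authors control for length, and the data below is made up for illustration.

```python
def auroc(scores, labels):
    """Probability that a defective episode (label 1) outscores a clean one (label 0)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def length_controlled_auroc(scores, labels, lengths, bin_width):
    """AUROC restricted to defective/clean pairs within the same length bin.

    Returns None when no bin contains both classes.
    """
    bins = {}
    for s, l, n in zip(scores, labels, lengths):
        bins.setdefault(n // bin_width, []).append((s, l))
    wins = 0.0
    pairs = 0
    for members in bins.values():
        pos = [s for s, l in members if l]
        neg = [s for s, l in members if not l]
        wins += sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
        pairs += len(pos) * len(neg)
    return wins / pairs if pairs else None

# Hypothetical metric that simply scores shorter episodes as more suspicious.
lengths = [50, 51, 52, 100, 101, 102]
labels = [1, 1, 0, 0, 0, 1]
scores = [-n for n in lengths]
raw = auroc(scores, labels)
controlled = length_controlled_auroc(scores, labels, lengths, bin_width=10)
```

On this toy data the length-only metric looks better than chance overall but drops to chance once episodes are compared within length bins, which is the pattern the abstract warns about.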

2606.10366 2026-06-10 cs.RO cs.AI 新提交

A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

提升VLA评估中仿真与真实相关性的实用指南

Shuo Wang, Hanyuan Xu, Yingdong Hu, Fanqi Lin, Yang Gao

发表机构 * Tsinghua University(清华大学) Shanghai Qi Zhi Institute(上海期智研究院)

AI总结 本文系统研究仿真与真实环境在VLA策略评估中的相关性,提出统一框架来测量和提升仿真作为真实评估代理的有效性。

Comments 20 pages

AI中文摘要

仿真已成为评估和改进视觉-语言-动作(VLA)策略的重要工具,为昂贵的真实机器人评估提供了可扩展、可重复且可控的替代方案。最近的仿真基准在真实感和多样性方面取得了实质性进展,但这些平台尚未被广泛用作可靠的真实策略评估代理。在这项工作中,我们通过仿真与真实相关性的视角研究这一问题。我们在多个仿真平台、VLA策略、任务和扰动因素上进行了系统研究,测量模拟评估在策略排名一致性、性能相关性和扰动方面失败模式上是否保留真实结论。这一分析使我们能够表征现有模拟器的局限性,并确定哪种模拟信号更符合真实部署。我们进一步研究了用户应如何利用仿真进行策略改进,包括何时基于模拟器的微调是有益的,以及后训练数据量如何影响仿真与真实的对齐。总体而言,我们的工作提供了一个统一的框架,用于测量、解释和提升仿真对VLA策略的有用性,为模拟器设计者和在策略开发流程中使用仿真的实践者提供指导。

英文摘要

Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.
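
Policy ranking consistency, one of the quantities named in this abstract, is commonly summarized with a rank correlation between simulated and real success rates. The sketch below computes Spearman's coefficient in plain Python. The choice of Spearman and the example success rates are assumptions; the paper may use a different statistic.

```python
import math

def ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(sim, real):
    """Spearman rank correlation between simulated and real success rates."""
    return pearson(ranks(sim), ranks(real))

# Hypothetical success rates for four policies evaluated in both settings.
sim_success = [0.9, 0.7, 0.4, 0.2]
real_success = [0.8, 0.5, 0.3, 0.1]
rho = spearman(sim_success, real_success)
```

A coefficient near 1 means the simulator orders policies the same way real-robot trials do, even when the absolute success rates differ.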

2606.10382 2026-06-10 cs.RO 新提交

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

UMI-Bench 1.0:基于UMI数据的桌面机器人操作开放可复现真实世界基准

Shi Jin, Yuntian Wang, Yuhui Duan, Di Wu, Gaoqi Dong, Xiaohang Liu, Xiaotong Li, Hongfei Jia, Zehao Zhang, Tianyu Wang, Zhongjie Jia, Yuanqi Yao, Chenjia Bai, Zhaxizhuoma, Siao Liu, Nieqing Cao, Jin Wang, Chao Yu, Yan Ding

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出UMI-Bench 1.0,首个专为UMI风格操作策略设计的真实机器人基准,通过统一协议实现数据收集、场景重置、策略执行、结果记录和任务因素分析,提供可复现的评估平台。

详情
AI中文摘要

真实机器人评估对于理解学习到的操作策略能否在精心策划的演示之外可靠运行至关重要。这一需求对于通用操作接口(UMI)风格策略尤为迫切,其性能取决于腕部视角观测、动作表示、数据收集和物理部署之间的耦合。现有的真实世界基准已取得重要进展,但它们并非围绕这种UMI数据到部署的设置而设计。我们提出UMI-Bench 1.0,一个本地优先的真实机器人基准,用于标准化评估UMI风格的操作策略。据我们所知,这是首个专门用于基于UMI的操作模型真实世界评估的基准。UMI-Bench将数据收集、场景重置、策略执行、结果记录和任务因素分析统一在一个协议中。通过使整个评估过程可复现和可审计,UMI-Bench为衡量UMI训练策略如何泛化到真实物理操作提供了一个实用的测试平台。

英文摘要

Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

2505.01458 2026-06-10 cs.RO cs.AI 版本更新

A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

具身智能时代基于物理模拟器的机器人导航与操作综述

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

发表机构 * Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电气与电子工程学院) Department of Informatics, Universität Hamburg(汉堡大学信息学系)

AI总结 本文综述了物理模拟器在缩小具身智能中导航与操作的模拟到现实差距方面的关键特性、任务支持及硬件需求,并提供了基准数据集、指标、平台和方法资源。

Comments Under Review

详情
AI中文摘要

导航和操作是具身智能的核心能力,但直接在现实世界中训练智能体执行这些任务成本高、耗时且不安全。因此,模拟到现实的迁移已成为关键方法,然而模拟到现实的差距仍然存在。本综述通过分析先前综述中关注有限的属性,考察了物理模拟器如何解决这一差距。我们还分析了它们在导航和操作任务中的特性,以及它们的硬件需求。此外,我们提供了包含基准数据集、指标、模拟平台和方法的资源,以帮助研究人员在考虑硬件约束的同时选择合适的工具。

英文摘要

Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing properties that have received limited attention in prior surveys. We also analyze their features for navigation and manipulation tasks, as well as their hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and methods to help researchers select suitable tools while accounting for hardware constraints.

2602.23499 2026-06-10 cs.RO cs.AI 版本更新

TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving

TaCarla: 端到端自动驾驶的综合基准数据集

Tugrul Gorgulu, Atakan Dag, M. Esat Kalfaoglu, Halil Ibrahim Kuru, Baris Can Cam, Halil Ibrahim Ozturk, Ozsel Kilinc

发表机构 * Tuğrul Gorgülü *†(土耳其巴伊塞蒂大学) Atakan Dağ †(土耳其巴伊塞蒂大学) M. Esat Kalfaoğlu ‡(土耳其巴伊塞蒂大学) Halil İbrahim Kuru †(土耳其巴伊塞蒂大学) Barış Can Cam †(土耳其巴伊塞蒂大学) Halil İbrahim Öztürk †(土耳其巴伊塞蒂大学) Özsel Kılınç §(土耳其巴伊塞蒂大学)

AI总结 针对现有自动驾驶数据集不完整、行为多样性不足及闭环评估缺失等问题,基于CARLA Leaderboard 2.0挑战场景收集超过285万帧的多任务数据集,支持规划、检测、预测及视觉语言动作模型,并提供数值稀有度评分。

Comments Accepted at the Third Workshop on Simulation for Autonomous Driving (SAD), CVPR 2026

详情
AI中文摘要

收集高质量数据集是一项需要细致关注细节的关键任务,因为忽略某些方面可能导致整个数据集无法使用。自动驾驶挑战仍然是一个重要的研究领域,需要进一步探索以提升车辆的感知和规划性能。然而,现有数据集往往不完整。例如,包含感知信息的数据集通常缺乏规划数据,而规划数据集通常由大量驾驶序列组成,其中自车主要向前行驶,行为多样性有限。此外,许多真实数据集难以评估其模型,特别是对于规划任务,因为它们缺乏合适的闭环评估设置。CARLA Leaderboard 2.0挑战提供了多样化的场景来解决自动驾驶中的长尾问题,已成为在开环和闭环评估设置下开发感知和规划模型的有价值替代平台。然而,在该平台上收集的现有数据集存在一定局限性。一些数据集似乎主要针对有限的传感器配置,具有特定的传感器配置。为了支持端到端自动驾驶研究,我们使用CARLA仿真环境为多样化的Leaderboard 2.0挑战场景收集了一个包含超过285万帧的新数据集。我们的数据集不仅设计用于规划任务,还支持动态目标检测、车道分隔线检测、中心线检测、交通灯识别、预测任务和视觉语言动作模型。此外,我们通过使用数据集训练各种模型来展示其多功能性。同时,我们还提供了数值稀有度评分,以理解当前状态在数据集中出现的稀有程度。

英文摘要

Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets make it difficult to evaluate models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations. Some datasets appear to be tailored primarily to a limited set of particular sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks, and vision-language-action models. Furthermore, we demonstrate its versatility by training various models using our dataset. We also provide numerical rarity scores to quantify how rarely the current state occurs in the dataset.

2606.04746 2026-06-10 cs.RO 版本更新

CADENCE: Predicting Realized MAPF Execution Time Beyond Sum of Costs

CADENCE:预测实际MAPF执行时间超越成本总和

Abhishek S, Badrikanath Praharaj, Sreeram MV

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CADENCE框架,通过分析原始运动负担和交互感知协调特征,发现原始运动负担能显著提高多智能体路径规划执行时间的预测精度,超越传统成本总和指标。

Comments 7 pages, 4 figures, 3 tables. Accepted at the workshop "Multi-Agent Robotic Systems: Real-World Collaboration and Interaction" at the IEEE International Conference on Robotics and Automation (ICRA 2026)

详情
AI中文摘要

多智能体路径规划(MAPF)算法越来越多地用于规划工业仓库和机器人共享工作空间中机器人团队的运动,但标准MAPF算法评估指标(如成本总和(SoC)、完工时间和规划器运行时间)可能掩盖规划选择如何转化为实际执行性能。我们提出了CADENCE(面向网络化连续执行的协调与动作驱动估计),在一个固定的7×7工作单元上使用七台差分驱动机器人进行硬件研究,探究在执行前可用的哪些特征最能预测最终的挂钟完成时间。我们比较了SoC、总规划行程成本、原始运动负担(计划所需的基本运动量,如完工时间、转弯、连续移动和启停转换)以及交互感知协调结构(计划引起的机器人间协调量,如依赖链接、交互机器人对、依赖深度和拥挤暴露)。为了测试这一点,我们生成了跨越15个场景(5个空场景、5个中等随机场景和5个瓶颈场景)的120个计划,并每个计划执行四次,产生了480次试验的硬件语料库。使用场景留出岭回归模型和试验级混合效应模型,我们发现仅SoC提供信息但不完整,而原始运动负担提供了最强的改进,相对于仅SoC模型,将留出误差在MAE上降低了约48.6%-59.8%,在RMSE上降低了44.2%-61.4%。交互感知协调特征增加了较小且不太均匀的增益,在混合效应分析中最为明显。在两种模型和不确定性检查中,原始运动负担是除SoC之外最可靠的附加信号,表明大部分执行时间差距在机器人开始移动之前就已经在离线计划中可见。

英文摘要

Multi-Agent Path Finding (MAPF) algorithms are increasingly used to plan motion for robot teams in industrial warehouses and robotic shared workspaces, but standard MAPF algorithm evaluation metrics, such as Sum of Costs (SoC), makespan, and planner runtime, can obscure how planner choices translate into realistic execution performance. We present CADENCE (Coordination and Action-Driven Estimation for Networked Continuous Execution), a hardware study of this evaluation gap on a fixed 7 by 7 workcell with seven differential-drive robots, asking which features available before execution can best predict final wall-clock completion time. We compare SoC, total planned travel cost, primitive motion burden (how much basic motion the plan requires, such as makespan, turns, consecutive moves, and start-stop transitions), and interaction-aware coordination structure (how much inter-robot coordination the plan induces, such as dependency links, interacting robot pairs, dependency depth, and crowding exposure). To test this, we generate 120 plans across 15 scenarios (5 Empty, 5 Medium Random, and 5 Bottleneck) and execute each plan four times, yielding a 480-trial hardware corpus. Using both a scenario-held-out ridge model and a trial-level mixed-effects model, we find that SoC alone is informative but incomplete, while primitive motion burden gives the strongest improvement, reducing held-out error by about 48.6%-59.8% in MAE and 44.2%-61.4% in RMSE relative to SoC-only models. Interaction-aware coordination features add smaller, less uniform gains, most clearly in the mixed-effects analysis. Across both models and uncertainty checks, primitive motion burden is the most reliable additional signal beyond SoC, suggesting that much of the execution time gap is already visible in the offline plan before any robot starts moving.
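The scenario-held-out comparison described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the trial data are synthetic, `turns` stands in for the whole primitive-motion-burden feature group, and the ridge penalty `lam` is an arbitrary choice.

```python
from math import sqrt

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(X, y, lam):
    """Ridge regression with an unpenalized intercept: [bias, w1, w2, ...]."""
    rows = [[1.0] + list(x) for x in X]
    d = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(1, d):
        A[i][i] += lam
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    return solve(A, b)

def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def held_out_errors(trials, features, lam=1.0):
    """Leave-one-scenario-out MAE and RMSE of predicted completion time."""
    errs = []
    for held in sorted({t["scenario"] for t in trials}):
        train = [t for t in trials if t["scenario"] != held]
        test = [t for t in trials if t["scenario"] == held]
        w = ridge_fit([[t[f] for f in features] for t in train],
                      [t["time"] for t in train], lam)
        errs += [predict(w, [t[f] for f in features]) - t["time"] for t in test]
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = sqrt(sum(e * e for e in errs) / len(errs))
    return mae, rmse

# Synthetic trials: execution time depends on SoC and on the number of turns.
trials = []
for s in range(3):
    for p in range(4):
        soc, turns = 10 + 3 * s + p, (7 * p + 5 * s) % 6
        trials.append({"scenario": s, "soc": soc, "turns": turns,
                       "time": 2.0 * soc + 5.0 * turns})

mae_soc, rmse_soc = held_out_errors(trials, ["soc"])
mae_full, rmse_full = held_out_errors(trials, ["soc", "turns"])
mae_reduction = 1.0 - mae_full / mae_soc  # relative improvement over SoC-only
```

Holding out whole scenarios, rather than random trials, keeps repeated executions of the same plan from leaking between the training and test sets.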

11. 安全、鲁棒性与可信机器人 5 篇

2606.10371 2026-06-10 cs.RO cs.AI 新提交

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

测试时对抗接管:针对机器人扩散策略的实时劫持接口

Zi Yin, Peilin Chai, Siyuan Huang, Zhanhao Hu

发表机构 * Tsinghua University(清华大学) Independent Researcher(独立研究员) Johns Hopkins University(约翰霍普金斯大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出测试时对抗接管(TAKO)方法,通过可微扩散推理学习可重复使用的通用补丁,在测试时切换补丁以劫持机器人策略,实现远程操控,在多种任务和模型上达到100%接管成功率。

详情
AI中文摘要

基于扩散的动作生成已成为具身AI的基础组件,但其对视觉条件的依赖使得部署的视觉运动策略容易受到对抗性操纵。大多数先前的攻击侧重于破坏:它们扰动观测流以降低任务成功率或引发异常行为。我们研究了一种更强的威胁,即测试时对抗接管(TAKO),其中攻击者获得对冻结机器人策略的实时转向接口,并将其转变为远程操控仪器。TAKO通过可微扩散推理学习一个小的可重用通用补丁词汇表;在测试时,攻击者在摄像头流中切换这些补丁以组合攻击者选择的轨迹。这种方法之所以有效,是因为扰动作用于视觉条件路径,其中诱导的偏差可以通过迭代生成推理持续存在。我们进一步表明,自然的目标基线——目标策略匹配——会失败,因为受害者策略无法可靠地在分布外目标偏移上监督自身。在四个任务(2D操作、模拟空中递送、模拟地面导航和物理世界地面导航)、两个视觉编码器(ResNet-18和EfficientNet-B0 + Transformer)以及三个生成推理族(DDPM、DDIM和流匹配)中,人类操作员在每个评估设置中均实现了100%的接管成功率,满足攻击者定义的目标。项目页面:https://tako-attack.github.io。

英文摘要

Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.

2606.10501 2026-06-10 cs.RO 新提交

Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults

揭示视觉-语言-动作模型在关节级物理故障下的脆弱性

Minsoo Jo, Taeju Kwon, Junha Chun, Youngjoon Jeong, Taesup Kim

发表机构 * Graduate School of Data Science, Seoul National University(首尔大学数据科学研究生院)

AI总结 本研究揭示VLA模型在机器人关节级物理故障(如执行器退化、摩擦增加)下性能显著下降,并提出轻量级残差校准框架J-PARC,通过推断关节故障状态并自适应修正动作,提升鲁棒性。

详情
AI中文摘要

在真实机器人系统中部署视觉-语言-动作(VLA)模型不仅需要对语义和感知变化具有鲁棒性,还需要对改变动作物理实现方式的实体侧故障具有鲁棒性。真实机器人可能经历由执行器退化、硬件故障、安全限制、碰撞损坏或磨损引起的摩擦导致的关节级变化。这些故障至关重要,因为它们改变了策略的动作到运动接口,破坏了指令动作、实现运动与后续观测之间的学习闭环关系。在这项工作中,我们研究了真实的关节级物理故障,并表明当预测动作通过受扰动的机器人身体执行时,VLA模型是脆弱的。我们的分析揭示了关节依赖效应,受影响关节的任务成功率呈现异质性退化。我们还表明,性能下降不能仅归因于物理不可行性,因为可行的故障(如增加的关节摩擦)仍能显著降低成功率并引发闭环执行不匹配。受这些发现的启发,我们提出了关节级物理故障感知残差校准器(J-PARC),这是一个构建在冻结VLA策略之上的轻量级残差校准框架。J-PARC从最近的关节动力学中推断出潜在的关节故障状态,并在此状态下调节共享的残差校准器,从而实现对故障关节的自适应动作修正。实验表明,J-PARC在关节级故障下提高了鲁棒性,同时保持了无故障环境下的性能。

英文摘要

Deploying Vision-Language-Action (VLA) models in real robotic systems requires robustness not only to semantic and perceptual variations, but also to embodiment-side faults that change how actions are physically realized. Real robots can experience joint-level changes caused by actuator degradation, hardware faults, safety limits, collision damage, or wear-induced friction. These faults are critical because they alter the action-to-motion interface of a policy, disrupting the learned closed-loop relationship between commanded actions, realized motion, and subsequent observations. In this work, we study realistic joint-level physical faults and show that VLA models are vulnerable when predicted actions are executed through a perturbed robot body. Our analysis reveals joint-dependent effects, with heterogeneous degradation in task success across affected joints. We also show that performance drops cannot be attributed solely to physical infeasibility, since feasible faults such as increased joint friction can still substantially reduce success rates and induce closed-loop execution mismatch. Motivated by these findings, we propose Joint-level Physical-fault Aware Residual Calibrator (J-PARC), a lightweight residual calibration framework built on top of a frozen VLA policy. J-PARC infers a latent joint-fault regime from recent joint dynamics and conditions a shared residual calibrator on this regime, enabling adaptive action correction across faulty joints. Experiments show that J-PARC improves robustness under joint-level faults while preserving fault-free environment performance.

2606.10228 2026-06-10 cs.LG cs.AI cs.RO 交叉投稿

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

SHAPO: 面向安全探索的锐度感知策略优化

Kaustubh Mani, Yann Pequignot, Vincent Mai, Liam Paull

AI总结 提出SHAPO算法,通过锐度感知策略更新隐式重加权梯度,放大罕见不安全动作的影响,抑制安全动作的贡献,从而在欠探索区域实现保守行为,提升安全性与任务性能。

Comments ICLR 2026

详情
AI中文摘要

安全探索是在安全关键领域部署强化学习(RL)智能体的先决条件。在本文中,我们通过认知不确定性的视角来探讨安全探索,其中智能体对参数扰动的敏感性作为高不确定性区域的实际代理。我们提出了锐度感知策略优化(SHAPO),一种锐度感知的策略更新规则,该规则在扰动参数处评估梯度,使得策略更新相对于智能体的认知不确定性变得悲观。分析表明,这种调整隐式地重新加权了策略梯度,放大了罕见不安全动作的影响,同时抑制了已安全动作的贡献,从而在欠探索区域将学习偏向于保守行为。在多个连续控制任务中,我们的方法在安全性和任务性能上均持续优于现有基线,显著扩展了它们的帕累托前沿。

英文摘要

Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically we show that this adjustment implicitly reweighs policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
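The core mechanic named in the abstract, evaluating the gradient at perturbed parameters before updating, can be sketched on a toy objective. The normalized-gradient perturbation below follows the generic sharpness-aware minimization recipe and is an assumption; the abstract does not state how SHAPO chooses its perturbation. The names `sam_step`, `rho`, and the quadratic loss are illustrative.

```python
from math import sqrt

def sam_step(params, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware descent step on a loss with gradient grad_fn.

    1. Compute the gradient at the current parameters.
    2. Move a distance rho along the normalized gradient (locally worst case).
    3. Evaluate the gradient there and apply it to the ORIGINAL parameters.
    """
    g = grad_fn(params)
    norm = sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(params)  # stationary point: nothing to perturb or update
    perturbed = [p + rho * x / norm for p, x in zip(params, g)]
    g_sharp = grad_fn(perturbed)
    return [p - lr * x for p, x in zip(params, g_sharp)]

# Toy loss L(p) = sum(p_i^2) with gradient 2p; a stand-in for a policy loss.
quadratic_grad = lambda p: [2.0 * x for x in p]
new_params = sam_step([3.0, 4.0], quadratic_grad)
```

In a policy-gradient setting the loss would be the negated surrogate objective, so the perturbation step probes parameters under which the actor performs worse, which is what makes the resulting update pessimistic.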

2601.18765 2026-06-10 cs.RO 版本更新

Goal-oriented Communication for Fast and Robust Robotic Fault Detection and Recovery

面向快速鲁棒机器人故障检测与恢复的目标导向通信

Shutong Chen, Adnan Aijaz, Yansha Deng

发表机构 * Department of Engineering, King’s College London(伦敦国王学院工程系) Bristol Research and Innovation Laboratory, Toshiba Europe Ltd.(东芝欧洲有限公司布里斯托尔研究与创新实验室)

AI总结 提出目标导向通信框架,通过联合设计通信-计算-控制回路,利用3D场景图检测故障,并微调小语言模型结合知识蒸馏生成恢复动作,实现故障检测与恢复时间降低82.6%,任务成功率提升76%。

Comments Submitted to IEEE for potential publication

详情
AI中文摘要

自主机器人系统广泛部署于智能工厂,并在动态、不确定及有人参与的环境中运行,需要低延迟且鲁棒的故障检测与恢复(FDR)。然而,现有FDR框架存在各种局限性,例如通信和计算的显著延迟,以及机器人运动/轨迹生成的不可靠性,这主要是因为通信-计算-控制(3C)回路的设计未考虑下游FDR目标。为了解决这个问题,我们提出了一种新颖的目标导向通信(GoC)框架,该框架联合设计3C回路,专门用于快速鲁棒的机器人FDR,目标是最小化FDR时间同时最大化机器人任务(例如工件分拣)成功率。对于故障检测,我们的GoC框架创新性地通过我们设计的表示提取器定义并提取3D场景图(3D-SG)作为语义表示,并通过监测3D-SG中的空间关系变化来检测故障。对于故障恢复,我们通过低秩适配(LoRA)微调一个小语言模型(SLM),并通过知识蒸馏增强其推理和泛化能力,以生成机器人的恢复动作。我们还设计了一个轻量级的目标导向数字孪生重建模块,在需要精细机器人控制时,仅使用任务相关的物体轮廓进行数字孪生重建,以优化SLM生成的恢复动作。大量仿真表明,与依赖视觉语言模型进行故障检测和大型语言模型进行故障恢复的最先进框架相比,我们的GoC框架将FDR时间降低了高达82.6%,并将任务成功率提高了高达76%。

英文摘要

Autonomous robotic systems are widely deployed in smart factories and operate in dynamic, uncertain, and human-involved environments that require low-latency and robust fault detection and recovery (FDR). However, existing FDR frameworks exhibit various limitations, such as significant delays in communication and computation, and unreliability in robot motion/trajectory generation, mainly because the communication-computation-control (3C) loop is designed without considering the downstream FDR goal. To address this, we propose a novel Goal-oriented Communication (GoC) framework that jointly designs the 3C loop tailored for fast and robust robotic FDR, with the goal of minimising the FDR time while maximising the robotic task (e.g., workpiece sorting) success rate. For fault detection, our GoC framework innovatively defines and extracts the 3D scene graph (3D-SG) as the semantic representation via our designed representation extractor, and detects faults by monitoring spatial relationship changes in the 3D-SG. For fault recovery, we fine-tune a small language model (SLM) via Low-Rank Adaptation (LoRA) and enhance its reasoning and generalization capabilities via knowledge distillation to generate recovery motions for robots. We also design a lightweight goal-oriented digital twin reconstruction module to refine the recovery motions generated by the SLM when fine-grained robotic control is required, using only task-relevant object contours for digital twin reconstruction. Extensive simulations demonstrate that our GoC framework reduces the FDR time by up to 82.6% and improves the task success rate by up to 76%, compared to the state-of-the-art frameworks that rely on vision language models for fault detection and large language models for fault recovery.
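Detecting faults by monitoring spatial relationship changes in a scene graph can be illustrated with a small sketch. Everything here is hypothetical: the single "above" relation, the thresholds, and the object names are chosen for illustration, and the paper's 3D-SG extractor is not reproduced.

```python
from math import hypot

def scene_graph(objects, xy_tol=0.05, z_gap=0.02):
    """Build a set of (subject, relation, object) edges from 3D centroids.

    objects maps a name to (x, y, z) in metres. Only one relation is
    modelled: a is "above" b when they are horizontally aligned within
    xy_tol and a sits more than z_gap higher than b.
    """
    edges = set()
    for a, (ax, ay, az) in objects.items():
        for b, (bx, by, bz) in objects.items():
            if a != b and hypot(ax - bx, ay - by) <= xy_tol and az - bz > z_gap:
                edges.add((a, "above", b))
    return edges

def detect_faults(expected, observed):
    """A fault is any relation that disappeared or appeared unexpectedly."""
    return {"missing": expected - observed, "unexpected": observed - expected}

# Nominal state: the workpiece rests in the tray.
expected = scene_graph({"workpiece": (0.50, 0.20, 0.10),
                        "tray": (0.50, 0.20, 0.00)})
# Observed state: the workpiece slipped off and lies beside the tray.
observed = scene_graph({"workpiece": (0.80, 0.20, 0.00),
                        "tray": (0.50, 0.20, 0.00)})
faults = detect_faults(expected, observed)
```

Comparing edge sets rather than raw images means only a handful of symbols need to be transmitted per frame, which fits the low-latency goal stated in the abstract.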

2407.20242 2026-06-10 cs.CY cs.AI cs.RO 版本更新

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

BadRobot: 在物理世界中越狱具身LLM智能体

Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang

发表机构 * Huazhong University of Science and Technology(华中科技大学) Beihang University(北航) Griffith University(格里菲斯大学)

AI总结 提出BadRobot攻击范式,利用LLM在机器人系统中的操纵、语言输出与物理动作的错位以及世界知识缺陷三个漏洞,通过语音交互使具身LLM执行有害行为,并在基准测试中验证了有效性。

Comments Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io

详情
Journal ref
International Conference on Learning Representations (ICLR) 2025
AI中文摘要

具身AI代表将AI集成到物理实体中的系统。大型语言模型(LLM)展现出强大的语言理解能力,通过促进复杂的任务规划,已被广泛用于具身AI。然而,一个关键的安全问题仍被忽视:这些具身LLM是否会实施有害行为?为此,我们引入了BadRobot,一种新颖的攻击范式,旨在通过典型的基于语音的用户-系统交互,使具身LLM违反安全和伦理约束。具体来说,我们利用了三个漏洞来实现这种攻击:(i) 机器人系统中LLM的操纵,(ii) 语言输出与物理动作之间的错位,以及(iii) 世界知识缺陷导致的意外危险行为。此外,我们构建了一个包含各种恶意物理动作查询的基准,以评估BadRobot的攻击性能。基于该基准,针对现有突出的具身LLM框架(例如Voxposer、Code as Policies和ProgPrompt)的大量实验证明了我们BadRobot的有效性。我们的代码可在以下网址获取:https://github.com/Rookie143/BadRobot。

英文摘要

Embodied AI represents systems where AI is integrated into physical entities. Large Language Model (LLM), which exhibits powerful language understanding abilities, has been extensively employed in embodied AI by facilitating sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by world knowledge's flaws. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of our BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.

12. 其他/综合机器人 9 篇

2606.10208 2026-06-10 cs.RO cs.AI 新提交

Exploration of Foundation Model-Based Robots in Patient and Elderly Care

基于基础模型的机器人在患者和老年人护理中的探索

Zhiwen Qiu, Wei Liu, Yuexing Hao

AI总结 本文综述了基于基础模型的护理机器人在设计特征、用户体验和护理效果方面的现状,指出当前系统多用于语音交互,多模态和物理自主性有限,并呼吁向护理特定评估标准和负责任自主性发展。

详情
AI中文摘要

随着全球人口老龄化,对老年人和患者护理的需求迅速增长。基础模型越来越多地被集成到机器人和交互代理中,有望实现更灵活的沟通和个性化辅助。然而,护理环境需要可靠且与工作流程兼容的系统,并具备可问责的人类监督,目前尚不清楚当前具身系统能否将技术进步转化为临床影响。本综述从三个方面综合了基于基础模型的护理机器人:设计特征、用户体验以及护理相关结果的证据。当前系统最常将基础模型用作以语音为中心的社会辅助具身中的对话和推理层,而多模态基础和物理自主性仍然有限。实证评估报告了积极的可用性和参与度益处,但交互流程中仍存在可靠性故障,如幻觉和对话中断。护理影响的证据主要集中在认知参与和参与等近期结果上,而经过验证的临床或护理相关变化的证据有限。我们认为,未来的研究应转向护理特定的评估标准、可问责的自主性以及融入护理工作流程,以支持更具响应性和负责任的护理技术。

英文摘要

Demand for older-adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrated into robots and interactive agents, with the promise of more flexible communication and personalized assistance. However, care settings require reliable and workflow-compatible systems with accountable human oversight, and it remains unclear whether current embodied systems can translate technical advances into clinical impact. This Perspective synthesizes foundation model-based care robots across three areas: design features, user experience, and evidence for care-related outcomes. Current systems most commonly use foundation models as conversational and reasoning layers within voice-centered socially assistive embodiments, while multimodal grounding and physical autonomy remain limited. Empirical evaluations report positive usability and engagement benefits, but reliability failures persist across the interaction pipeline such as hallucinations and conversational breakdowns. Evidence for care impact remains concentrated in proximal outcomes such as cognitive engagement and participation, with limited evidence for validated clinical or care-related changes. We argue that future research should transition toward care-specific evaluation standards, accountable autonomy, and integration into care workflows to support more responsive and responsible care technologies.

2606.10289 2026-06-10 cs.RO cs.NA math.NA 新提交

Improved Representation of Matrix Lie Group Operations through Tensor Notation

通过张量符号改进矩阵李群运算的表示

Clark Taylor

AI总结 本文引入张量和爱因斯坦求和符号来简化矩阵李群在李导数计算中的表示,提高估计框架中梯度计算的清晰度。

Comments 12 pages, 4 figures + graphical abstract, 1 algorithm, 4 tables

详情
AI中文摘要

近期的几篇论文展示了在估计问题中使用李群的实用性,提高了准确性和一致性。本文介绍了一种描述矩阵李群运算的新工具:张量和爱因斯坦求和符号。虽然张量和爱因斯坦符号在其他研究领域广为人知,但应用这种数学符号来表示和计算矩阵李导数却是新颖的。更重要的是,这种新符号极大地澄清了在(基于梯度的)估计框架中处理矩阵李群所需的导数和运算。因此,本文的主要贡献不是一种新能力,而是一种用于处理矩阵李群的更清晰的数学符号。

英文摘要

Several recent papers have demonstrated the utility of using Lie groups within estimation problems, yielding improved accuracy and consistency. This paper introduces a new tool for describing operations with matrix Lie groups: tensors and the Einstein summation notation. While tensors and Einstein notation are well-known in other research fields, applying this mathematical notation to represent and compute matrix Lie derivatives is novel. More importantly, this new notation greatly clarifies the derivatives and operations necessary to work with matrix Lie Groups in (gradient-based) estimation frameworks. Therefore, the main contribution of this paper is not a new capability, but a more perspicuous mathematical notation for working with matrix Lie groups.
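As a small illustration of what index notation buys, consider the hat operator on so(3). This standard identity is offered as an example of the style of notation and is not taken from the paper; the symbols $\omega$, $v$, and the Levi-Civita tensor $\epsilon_{ijk}$ are the usual ones, with summation implied over repeated indices.

```latex
% Hat operator of \omega \in \mathbb{R}^3 written with the Levi-Civita tensor:
(\omega^{\wedge})_{ij} = -\epsilon_{ijk}\,\omega_k ,
\qquad
(\omega^{\wedge} v)_i = \epsilon_{ijk}\,\omega_j\, v_k = (\omega \times v)_i .

% Because the map is linear in \omega, its derivative is a constant
% third-order tensor, which matrix notation cannot express directly:
\frac{\partial (\omega^{\wedge})_{ij}}{\partial \omega_k} = -\epsilon_{ijk} .
```

The derivative of a matrix with respect to a vector has three indices, so writing it with explicit indices avoids the vectorization and Kronecker-product bookkeeping that matrix notation would otherwise require.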

2606.10746 2026-06-10 cs.RO 新提交

ros2probe: Non-intrusive, Kernel-selective Observability for Robot Operating System 2 Middleware

ros2probe: 面向机器人操作系统2中间件的非侵入式、内核选择性可观测性

Jisang Yu, Sanghoon Lee, Yeonwoo Choi, Kyung-Joon Park

发表机构 * DGIST(大邱庆北科学技术院)

AI总结 针对ROS 2观测工具因加入DDS域而产生的探针效应(膨胀发现平面、增加反序列化开销、导致丢包偏差),提出ros2probe,通过被动捕获发现包重构通信状态,并利用内核过滤仅提取用户指定主题的包,消除探针效应,保持发现图误差在0.5%以内,无丢包,CPU和内存开销降低最高28倍。

Comments 13 pages, 8 figures, 7 tables

详情
AI中文摘要

机器人操作系统2(ROS 2)是机器人的事实标准中间件框架,它将每个机器人作为节点图运行,节点通过数据分发服务(DDS)——一种发布/订阅底层——进行通信。实时观察这种节点间通信对机器人开发至关重要,但需要付出代价。工具只能通过作为订阅者加入DDS域来接收数据,而发现过程会将其与发布者匹配,因此观察将工具折叠到其所测量的系统中并扰动该系统。我们将这种协议固有的扰动定义为观察者的探针效应。它会膨胀发现平面,增加观察者的反序列化成本,使其报告的丢包与订阅者实际接收的丢包偏离,并在接近饱和时取代订阅者的消息。唯一的逃避方法是被动捕获所有线路流量,但这会丢弃ROS 2消息语义,并且其规模与总流量成正比,而非被观察的流量。我们提出ros2probe,一种非侵入式观察框架,消除了探针效应。它从域中的发现数据包重构完整的ROS 2通信状态,且无带宽成本,然后驱动一个内核级过滤器,仅限用户请求的主题,以最小成本提取这些数据包,并观察真实订阅者接收的内容。其接口和记录与标准ROS 2工具匹配。在三个硬件平台(笔记本电脑、Jetson和树莓派)、两种DDS实现和七种机器人操作工作负载上,ros2probe将发现图保持在未观察系统的0.5%以内,而加入域的工具将发现膨胀高达2.6倍,并在饱和时丢弃订阅者38.5%的消息,而ros2probe无丢包。其丢包报告召回率为1.0,将观察者的CPU和内存开销分别降低高达7倍和28倍,并在现有工具会使系统过载的嵌入式机器人上保持实用性。

英文摘要

Robot Operating System 2 (ROS 2), the de facto standard middleware framework for robots, runs each robot as a graph of nodes communicating over the Data Distribution Service (DDS), a publish/subscribe substrate. Observing this inter-node communication in real time is essential to robot development, yet it has a price. A tool can receive data only by joining the DDS domain as a subscriber that discovery has matched to the publisher, so observing folds the tool into the system it measures and perturbs it. We define this protocol-inherent perturbation as the observer's probe effect. It inflates the discovery plane, adds deserialization cost on the observer, makes the loss it reports diverge from what the subscriber actually received, and near saturation displaces the subscriber's messages. The only escape, capturing all wire traffic passively, discards ROS 2 message semantics and scales with total traffic, not what is observed. We present ros2probe, a non-intrusive observation framework that removes the probe effect. It reconstructs the full ROS 2 communication state from the domain's discovery packets at no bandwidth cost, then drives an in-kernel filter restricted to the topics the user asks for, lifting only those packets at minimal cost and observing what the real subscriber receives. Its interfaces and recordings match the standard ROS 2 tools. Across three hardware platforms (laptop, Jetson, and Raspberry Pi), two DDS implementations, and seven robot-operation workloads, ros2probe holds the discovery graph within 0.5% of an unobserved system, whereas domain-joining tools inflate discovery up to 2.6× and drop 38.5% of the subscriber's messages at saturation while ros2probe drops none. It reports loss with a recall of 1.0, cuts observer CPU and memory by up to 7× and 28×, and stays practical on the embedded robots where existing tools overload the system.

2606.11037 2026-06-10 cs.RO 新提交

Generation of Diverse and Functional Robot Designs using Superquadrics Parametrisation and Quality-Diversity

使用超二次曲面参数化和质量-多样性生成多样化且功能性的机器人设计

Leni Le Goff, Simon Smith, Emma Hart

发表机构 * Edinburgh Napier University(爱丁堡纳皮尔大学)

AI总结 提出基于超二次曲面(SQs)的机器人身体表示,结合质量-多样性算法MAP-Elites,以增强形态多样性并避免过早收敛,在测试中取得最高QD分数。

Comments Accepted at PPSN 2026

详情
AI中文摘要

机器人的生成设计需要导航一个巨大的搜索空间,涵盖物理配置和行为参数。进化算法(EAs)已显示出有希望的结果,但常常过早收敛到一小部分次优设计。大多数EAs未能保持种群中足够的多样性,从而无法发现不同的功能性机器人。为了应对过早收敛,我们引入了一种基于超二次曲面(SQs)的机器人身体表示。SQs是3D几何形状的可解释、紧凑且计算高效的数学表示,可以针对特定设计空间进行调整。为了鼓励形态多样性,我们将这种表示与质量-多样性(QD)算法(MAP-Elites)相结合。我们比较了SQs和组合模式生成网络表示作为形态生成器,将它们与标准EAs和MAP-Elites结合。在两个测试环境中,我们发现使用SQs生成形态并结合MAP-Elites算法在两个环境中都达到了最高的QD分数,最大化了生成机器人的设计和功能多样性。研究结果强调了使用紧凑且可解释的几何表示来探索复杂设计空间的好处,并表明将SQs与显式多样性机制结合可以提高生成设计的质量和数量。

英文摘要

Generative design of robots requires navigating a vast search-space, encompassing physical configurations and behavioural parameters. Evolutionary Algorithms (EAs) have shown promising results, but often converge prematurely to a small set of sub-optimal designs. Most EAs fail to maintain sufficient diversity in the population that would allow the discovery of distinct functional robots. To counter premature convergence, we introduce a superquadrics-based representation (SQs) for robot bodies. SQs are interpretable, compact and computationally efficient mathematical representations of 3D geometrical shapes that can be tuned to specific design-spaces. To encourage morphological diversity, we combine this representation with a quality-diversity (QD) algorithm (MAP-Elites). We compare SQs and Compositional Pattern Producing Networks representations as generators of morphologies, combining them with standard EAs and MAP-Elites. In two test environments, we find that using SQs to generate morphology in conjunction with the MAP-Elites algorithm reaches the highest QD-score across both environments, maximising diversity of design and functionality of generated robots. The findings highlight the benefits of using a compact and interpretable geometric representation for exploring a complex design-space and suggest that combining SQs with an explicit diversity mechanism increases the quality and number of designs generated.
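The two ingredients combined in the abstract can be sketched together: the superquadric inside-outside function as a compact shape representation, and a MAP-Elites archive that keeps the best solution per behaviour cell. This is a minimal sketch under stated assumptions: the two-gene genome, the toy fitness, and the use of the genome itself as the descriptor are placeholders, not the paper's setup.

```python
import random

def superquadric_f(x, y, z, a=(1.0, 1.0, 1.0), e1=1.0, e2=1.0):
    """Inside-outside function: < 1 inside, = 1 on the surface, > 1 outside.

    a holds the three semi-axes; e1 and e2 control squareness (1, 1 gives
    an ellipsoid, values near 0 approach a box).
    """
    xy = abs(x / a[0]) ** (2 / e2) + abs(y / a[1]) ** (2 / e2)
    return xy ** (e2 / e1) + abs(z / a[2]) ** (2 / e1)

def map_elites(evaluate, random_solution, mutate,
               cells=10, iterations=500, n_init=20, seed=0):
    """Return an archive mapping grid cell -> (fitness, solution)."""
    rng = random.Random(seed)
    archive = {}
    for i in range(iterations):
        if i < n_init or not archive:
            candidate = random_solution(rng)
        else:
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)
        fitness, descriptor = evaluate(candidate)  # descriptor entries in [0, 1]
        cell = tuple(min(int(d * cells), cells - 1) for d in descriptor)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, candidate)  # keep only the elite per cell
    return archive

# Toy problem: a genome is two normalized shape exponents.
def toy_random(rng):
    return [rng.random(), rng.random()]

def toy_mutate(parent, rng):
    return [min(1.0, max(0.0, g + rng.gauss(0.0, 0.1))) for g in parent]

def toy_evaluate(genome):
    # Placeholder fitness; a real study would run a physics simulation here.
    fitness = 1.0 - sum((g - 0.5) ** 2 for g in genome)
    return fitness, genome

archive = map_elites(toy_evaluate, toy_random, toy_mutate)
qd_score = sum(fitness for fitness, _ in archive.values())
```

The QD-score sums elite fitness over all filled cells, so it rewards both covering more of the descriptor space and improving the solution in each cell.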

2606.09203 2026-06-10 cs.RO 版本更新

Deterministic Execution of ROS 2 Applications via Lingua Franca

通过Lingua Franca实现ROS 2应用的确定性执行

Harun Teper, Shaokai Lin, Shulu Li, Edward A. Lee, Jian-Jia Chen

发表机构 * TU Dortmund University(多特蒙德工业大学) University of California, Berkeley(加州大学伯克利分校) RWTH Aachen University(亚琛工业大学)

AI总结 提出框架将未修改的ROS 2应用转换为Lingua Franca程序,利用逻辑时间实现确定性执行,解决ROS 2中回调执行顺序和消息交织的非确定性问题。

详情
AI中文摘要

机器人操作系统2(ROS 2)是一种广泛用于机器人系统的中间件,其特点是发布-订阅(pub-sub)通信机制,计算结构为由ROS 2执行器调度的回调。尽管很流行,但ROS 2中的pub-sub模式本质上是不确定的:即使在单个执行器内,这些回调的运行顺序也是不确定的,分布式部署由于节点间消息的交织和网络延迟进一步增加了不确定性。这种不确定性常常导致并发问题,使得几乎不可能分析安全性并提供保证。我们提出了一个框架,能够将未修改的ROS 2应用程序转换并在Lingua Franca(LF)下运行,LF是一种使用逻辑时间进行确定性执行的协调语言,使得相同的输入总是产生相同的确定性执行顺序。我们首先描述了哪些ROS 2特性可以在逻辑时间下确定性执行。这些特性使得建立自动转换框架成为可能,该框架从ROS 2应用程序中提取信息并直接将其转换为LF程序。然后可以应用LF的丰富特性,如逻辑时间延迟、跨进程的联邦执行和故障处理,使ROS 2应用程序以确定性和时序可预测的方式执行,而无需更改ROS 2代码。我们在一个合成示例和Autoware参考系统上评估了该框架。我们表明,在默认ROS 2中,回调的执行顺序不同,同时端到端延迟在不同执行中也有所变化。相比之下,我们由LF控制的ROS 2系统产生了确定的执行顺序和一致的端到端延迟。

英文摘要

The Robot Operating System 2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze safety and provide guarantees. We present a framework that converts an unmodified ROS 2 application and runs it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same execution order. We first describe which ROS 2 features can be executed deterministically under logical time. These features make it possible to build an automatic conversion framework that extracts information from a ROS 2 application and directly converts it into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to execute the ROS 2 application in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that in default ROS 2 the order in which callbacks are executed differs across executions, and end-to-end latencies vary as well. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.
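The idea that logical time fixes the execution order regardless of arrival order can be shown with a toy scheduler. This is not Lingua Franca's API or its actual semantics (LF orders events by tags and reaction precedence); the alphabetical tie-break and the callback names below are stand-ins chosen for illustration.

```python
import heapq

def run_logical_time(initial_events, reactions):
    """Execute callbacks in (logical_time, name) order.

    initial_events: iterable of (logical_time, callback_name).
    reactions: maps a callback name to a function that, given the current
    logical time, returns a list of (logical_delay, target_name) to schedule.
    Returns the sequence of (logical_time, callback_name) actually executed.
    """
    heap = list(initial_events)
    heapq.heapify(heap)
    order = []
    while heap:
        time, name = heapq.heappop(heap)
        order.append((time, name))
        for delay, target in reactions[name](time):
            heapq.heappush(heap, (time + delay, target))
    return order

# Hypothetical pipeline: sensor -> filter -> control, plus an independent timer.
reactions = {
    "sensor": lambda t: [(5, "filter")],
    "filter": lambda t: [(5, "control")],
    "control": lambda t: [],
    "timer": lambda t: [],
}

order_a = run_logical_time([(0, "sensor"), (5, "timer")], reactions)
order_b = run_logical_time([(5, "timer"), (0, "sensor")], reactions)
# Both runs execute sensor@0, filter@5, timer@5, control@10.
```

Because ordering depends only on the logical timestamp and a fixed tie-break, the physical arrival order of the initial events has no effect on the result, which is the property the abstract contrasts with default ROS 2 executors.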

2606.00097 2026-06-10 cs.RO cs.MA 版本更新

RocketSmith: An Agentic System for High-Powered Rocket Design and Manufacturing

RocketSmith: 一种用于高功率火箭设计与制造的智能体系统

Peter Pak, Jesse Barkley, Rumi Loghmani, Derek Baich, Ananya Pamal, Amir Barati Farimani

发表机构 * Graduate Research Assistant, Mechanical Engineering(机械工程研究生助理) AI Fellow, Mechanical Engineering(人工智能研究员,机械工程) Undergraduate Student, Mechanical Engineering(机械工程本科生) Senior Member, Pittsburgh Prefecture One(高级会员,匹兹堡郡一区) Russell V. Trader Associate Professor, Mechanical Engineering(Russell V. Trader副教授,机械工程)

AI总结 本文提出RocketSmith,一种基于智能体系统的自动化设计、制造与优化框架,通过子智能体与技能实现零样本和人在回路的飞行参数优化,并利用增材制造成功开发并测试了四枚高功率火箭。

详情
AI中文摘要

本文介绍了RocketSmith,一种能够完成高功率火箭开发中设计、制造和优化过程的智能体系统。该系统实现了软件工具的智能自动化,不仅能够验证飞行稳定性等因素,还能生成火箭组件的参数化设计。借助一组子智能体和技能,该系统能够在零样本和人在回路的工作流程中通过迭代优化飞行参数。利用该系统,结合增材制造的独特设计能力,开发了四枚电机和装配配置各不相同的高功率火箭。这些装配组件使用多种FDM打印机制造,经人工评估飞行准备状态后,在一次发射活动中进行了飞行测试。测试中,所有火箭均实现了稳定发射,四枚中有两枚成功回收并具备再次飞行条件。在收集的飞行数据中,实测最高点(apogee)与飞行模拟计算值相比,准确率达到84%。

英文摘要

This work presents RocketSmith, an agentic system capable of carrying out the design, manufacturing, and optimization processes in high-powered rocket development. The system enables the intelligent automation of software tools, not only to validate factors such as flight stability but also to generate the parametric design components for the rocket assembly. A collection of subagents and skills enables optimization of flight parameters via iteration in both zero-shot and human-in-the-loop workflows. With this system, four distinct high-powered rockets with various motor and assembly configurations were developed, utilizing the unique design capabilities of additive manufacturing. These assembly components were fabricated using various FDM printers, manually evaluated for flight readiness, and flight tested at a launch event. In these tests, all rockets achieved a stable launch, and two of the four rockets were successfully recovered in reflyable condition. Within the collected flight data, an 84% accuracy was achieved when comparing measured apogee to that calculated in flight simulations.

2602.16898 2026-06-10 cs.RO cs.AI cs.CV cs.LG 版本更新

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

MALLVI:一种用于集成式通用机器人操作的多智能体框架

Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

发表机构 * Department of Electrical Engineering, Sharif University of Technology(电气工程系,谢里夫理工大学)

AI总结 MALLVI通过多智能体协作实现闭环反馈驱动的机器人操作,提升泛化能力和零样本任务成功率。

Comments Some fundamental changes in text and codebase

详情
AI中文摘要

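利用大语言模型(LLM)进行机器人操作任务规划是一个新兴领域。已有方法依赖专用模型、微调或提示调优,且往往以开环方式运行,缺乏稳健的环境反馈,因此在动态环境中较为脆弱。MALLVI提出了一个多智能体大语言与视觉框架,实现闭环反馈驱动的机器人操作。给定自然语言指令和环境图像,MALLVI为机械臂生成可执行的原子动作。动作执行后,视觉语言模型(VLM)评估环境反馈,并决定是重复该过程还是进入下一步。MALLVI不使用单一模型,而是协调Decomposer、Localizer、Thinker和Reflector等专用智能体,来负责感知、定位、推理和高层规划。可选的Descriptor智能体提供初始状态的视觉记忆。Reflector仅重新激活相关智能体,从而支持有针对性的错误检测与恢复,避免完全重新规划。仿真和真实环境中的实验表明,迭代式闭环多智能体协作提升了泛化能力,并提高了零样本操作任务的成功率。代码见 https://github.com/iman1234ahmadi/MALLVI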

英文摘要

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI .
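
The execute-then-evaluate loop described in this abstract can be illustrated with a small Python sketch. This is not MALLVI's code or API: `run_closed_loop`, the `execute` and `step_succeeded` callables, the `max_attempts` limit, and the stubbed step names are all hypothetical stand-ins for the robot executor and the VLM-based check.

```python
from typing import Callable, Dict, List


def run_closed_loop(steps: List[str],
                    execute: Callable[[str], None],
                    step_succeeded: Callable[[str], bool],
                    max_attempts: int = 3) -> List[str]:
    """Execute each step, repeating it until the checker accepts or attempts run out."""
    log: List[str] = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            execute(step)
            if step_succeeded(step):
                log.append(f"{step}: ok after {attempt} attempt(s)")
                break
        else:
            # No attempt was accepted: stop rather than continue from a bad state.
            log.append(f"{step}: failed after {max_attempts} attempts")
            break
    return log


# Stubs standing in for the robot and the feedback evaluator (assumptions).
attempts: Dict[str, int] = {}


def fake_execute(step: str) -> None:
    """Pretend to run a step by counting how often it was attempted."""
    attempts[step] = attempts.get(step, 0) + 1


def fake_check(step: str) -> bool:
    """Stub verdict: 'grasp block' needs two attempts, every other step one."""
    needed = 2 if step == "grasp block" else 1
    return attempts[step] >= needed


log = run_closed_loop(["move to block", "grasp block", "lift block"],
                      fake_execute, fake_check)
for line in log:
    print(line)
```

Retrying only the failed step, rather than restarting the whole plan, loosely mirrors the targeted recovery the abstract attributes to the Reflector; the real system's agent interactions are not specified here.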

2603.04056 2026-06-10 cs.CV cs.RO 版本更新

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

动态底栖环境中的长期视觉定位:数据集、基于足迹的真值与视觉地点识别基准

Martin Kvisvik Larsen, Oscar Pizarro

发表机构 * Department of Marine Technology(海洋技术系) Norwegian University of Science and Technology(挪威科技大学) Trondheim, Norway(特隆赫姆,挪威)

AI总结 本文提出一个用于底栖环境长期视觉定位的精选数据集和一种基于足迹的真值标注方法,评估了八种最先进的视觉地点识别方法,发现其在该数据集上的 Recall@K 显著低于已有基准。

详情
Journal ref
Frontiers in Robotics and AI Volume 13 (2026) 1821019
AI中文摘要

长期视觉定位有望降低自主水下航行器(AUV)光学底栖监测的成本并提高制图质量。尽管具有这一潜力,底栖环境中的长期视觉定位仍研究不足,主要原因是缺乏可用于基准测试的精选数据集。此外,地理参考精度和图像足迹有限,因此需要精确的几何信息才能获得准确的真值。在本文中,我们针对这些不足,提出了一个用于底栖环境长期视觉定位的精选数据集,以及一种为近垂直(near-nadir)水下影像的视觉定位结果标注真值的新方法。我们的数据集包含来自五个底栖参考站点的地理参考AUV影像,这些站点在长达六年的时间内被重访,数据包括原始和颜色校正的立体影像、相机标定参数以及配准精度达亚分米级的相机位姿。据我们所知,这是首个涵盖多个站点和透光带生境的长期视觉定位精选水下数据集。我们的真值标注方法估计三维海底图像足迹,并将足迹重叠的相机视图关联起来,确保真值关联反映共享的视觉内容。基于该数据集和真值,我们对八种最先进的视觉地点识别(VPR)方法进行了基准测试,发现其在我们数据集上的Recall@K显著低于在已有陆地和水下基准上的结果。最后,我们将基于足迹的真值与传统的基于位置的真值进行比较,结果表明,在地形崎岖且存在高度变化的站点,距离阈值真值可能高估VPR的Recall@K。综合来看,该精选数据集、真值标注方法和VPR基准为推进动态底栖环境中的长期视觉定位奠定了基础。

英文摘要

Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
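
The effect of the ground-truth definition on Recall@K can be shown with a short Python sketch. It assumes the common VPR definition (a query counts as a hit if any true match appears in its top K retrievals), which the abstract does not spell out. The image IDs, retrieval rankings, and both ground-truth sets are invented toy data, not values from the dataset.

```python
from typing import Dict, List, Set


def recall_at_k(retrievals: Dict[str, List[str]],
                ground_truth: Dict[str, Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one true match among the top-k retrievals.

    Queries without any ground-truth match are skipped (an assumed convention).
    """
    hits, total = 0, 0
    for query, ranked in retrievals.items():
        positives = ground_truth.get(query, set())
        if not positives:
            continue
        total += 1
        if positives.intersection(ranked[:k]):
            hits += 1
    return hits / total if total else 0.0


# Toy ranked retrievals per query image (hypothetical IDs).
retrievals = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}

# Distance-threshold ground truth: every reference within some radius counts.
gt_distance = {"q1": {"a", "b"}, "q2": {"d"}}

# Footprint-based ground truth: only references with overlapping footprints count.
gt_footprint = {"q1": {"b"}, "q2": {"f"}}

for k in (1, 2, 3):
    print(k, recall_at_k(retrievals, gt_distance, k),
          recall_at_k(retrievals, gt_footprint, k))
```

In this toy case the distance-based ground truth reports Recall@1 of 1.0 while the footprint-based one reports 0.0, because nearby references that share no visible seafloor are still counted as matches. This only illustrates how an overestimate can arise; the paper's measured gap is not reproduced here.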

2508.00491 2026-06-10 cs.RO cs.AI 版本更新

HannesImitation: Grasping with the Hannes Prosthetic Hand via Imitation Learning

HannesImitation:通过模仿学习控制Hannes假手进行抓取

Carlo Alessi, Federico Vasile, Federico Ceola, Giulia Pasquale, Nicolò Boccardo, Lorenzo Natale

发表机构 * Humanoid Sensing and Perception(人形机器人传感与感知实验室) Istituto Italiano di Tecnologia(意大利技术研究院) Rehab Technologies Lab(康复技术实验室)

AI总结 本文提出HannesImitationPolicy,通过模仿学习控制Hannes假手在非结构化环境中抓取物体,并引入HannesImitationDataset进行训练,实验表明其在非结构化场景中优于基于分割的视觉伺服控制器。

Comments Paper accepted at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

详情
Journal ref
IEEE/RSJ International Conference on Intelligent Robots and Systems, Hangzhou, China, 2025
AI中文摘要

近期假手控制的进展集中于利用摄像头和其他传感输入来提高自主性。这些系统旨在通过自动控制部分自由度来减轻用户的认知负担。在机器人学中,模仿学习已成为一种有前景的方法,可用于学习抓取和复杂操作任务,同时简化数据收集。然而,其在假手控制中的应用在很大程度上仍未被探索。弥合这一差距有望增强灵巧性恢复,并使假肢设备能够在更不受约束的场景中运行,在这些场景中,任务通过演示学习,而非依赖人工标注的序列。为此,我们提出了HannesImitationPolicy,一种基于模仿学习的方法,用于控制Hannes假手,使其能够在非结构化环境中抓取物体。此外,我们引入了HannesImitationDataset,其中包含桌面、货架以及人到假手交接场景中的抓取演示。我们利用这些数据训练单一扩散策略,并将其部署在假手上,以预测抓取所需的手腕朝向和手部闭合。实验评估表明,该方法能够在多种物体和条件下成功抓取。最后,我们表明该策略在非结构化场景中优于基于分割的视觉伺服控制器。补充材料见我们的项目页面:https://hsp-iit.github.io/HannesImitation

英文摘要

Recent advancements in control of prosthetic hands have focused on increasing autonomy through the use of cameras and other sensory inputs. These systems aim to reduce the cognitive load on the user by automatically controlling certain degrees of freedom. In robotics, imitation learning has emerged as a promising approach for learning grasping and complex manipulation tasks while simplifying data collection. Its application to the control of prosthetic hands remains, however, largely unexplored. Bridging this gap could enhance dexterity restoration and enable prosthetic devices to operate in more unconstrained scenarios, where tasks are learned from demonstrations rather than relying on manually annotated sequences. To this end, we present HannesImitationPolicy, an imitation learning-based method to control the Hannes prosthetic hand, enabling object grasping in unstructured environments. Moreover, we introduce the HannesImitationDataset comprising grasping demonstrations in table, shelf, and human-to-prosthesis handover scenarios. We leverage such data to train a single diffusion policy and deploy it on the prosthetic hand to predict the wrist orientation and hand closure for grasping. Experimental evaluation demonstrates successful grasps across diverse objects and conditions. Finally, we show that the policy outperforms a segmentation-based visual servo controller in unstructured scenarios. Additional material is provided on our project page: https://hsp-iit.github.io/HannesImitation