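arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.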

2605.28812 2026-05-28 cs.RO cs.AI cs.LG Version update

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

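Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin

Affiliations: ETH Zürich; UC Berkeley

AI summary: Proposes Center-of-Pressure (CoP), a physics-grounded tactile representation that, combined with sensor calibration based on differentiable dynamics, achieves zero-shot sim-to-real transfer on a multi-fingered hand and outperforms binary-contact and raw-taxel baselines on peg-in-hole insertion and ball balancing.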

Comments Project site: https://mpan31415.github.io/tactile_rep/

Abstract

A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features -- sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.
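As a rough illustration of what a center-of-pressure feature could look like, the sketch below computes the pressure-weighted mean of taxel positions on a single pad. The function name, the flat 2D taxel layout, and the contact threshold are assumptions made for illustration; the abstract does not give the paper's exact formulation or say how the calibrated taxel orientations enter it.

```python
from typing import List, Optional, Tuple


def center_of_pressure(
    positions: List[Tuple[float, float]],
    pressures: List[float],
    min_total: float = 1e-6,
) -> Optional[Tuple[float, float, float]]:
    """Return (x, y, total_pressure), or None when the pad is not in contact.

    positions: hypothetical taxel coordinates on the pad (for example in mm).
    pressures: normal-pressure readings, one per taxel.
    """
    if len(positions) != len(pressures):
        raise ValueError("positions and pressures must have the same length")
    clipped = [max(p, 0.0) for p in pressures]  # drop negative sensor noise
    total = sum(clipped)
    if total < min_total:
        return None  # no meaningful contact on this pad
    x = sum(p * pos[0] for p, pos in zip(clipped, positions)) / total
    y = sum(p * pos[1] for p, pos in zip(clipped, positions)) / total
    return (x, y, total)


# Four taxels at the corners of a 2 mm square, loaded more on the right side.
pads = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
readings = [1.0, 3.0, 1.0, 3.0]
cop = center_of_pressure(pads, readings)
```

Compared with a binary contact flag, a feature of this kind keeps where and how strongly the pad is loaded, while remaining far lower-dimensional than the raw taxel array.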

2605.28736 2026-05-28 cs.RO Version update

Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following

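Xucheng Wang, Zhizhou Yang, Xiaoman Zhang, Sung Eun Kim, Romain Hardy, Pranav Rajpurkar

Affiliations: Harvard Medical School; Massachusetts General Hospital

AI summary: The first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery. The task is suture following, the grab-pull-release motion an assistant performs at every stitch. Across 28 trained models spanning four policies (ACT, Diffusion Policy, SmolVLA, π₀), π₀ is strongest in data efficiency, background robustness, and trajectory smoothness, and reaches a 92% stitch completion rate in a surgeon-robot suturing trial.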

Abstract

This study presents the first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, targeting suture following: the grab-pull-release motion an assistant performs at every stitch. We collect 160 teleoperated demonstrations (32,374 frames) on an open-source robot arm, benchmark four architecturally diverse imitation learning policies (ACT, Diffusion Policy, SmolVLA, $π_0$) across 28 trained models evaluated in 32 configurations along three clinically motivated dimensions: dataset size, camera viewpoint, and background variation. Our results demonstrate that under ideal conditions, the four policies achieve $50$-$75\%$ task success, with depth error as the dominant failure mode across all architectures. Among all policies, $π_0$ achieves the strongest results with a pretrained vision-language backbone, demonstrating superior data efficiency, greater robustness to background variation, and smoother trajectories compatible with surgical workflow. When deployed in a surgeon-robot suturing trial, $π_0$ yields a $92\%$ stitch completion rate. These findings establish collaborative robotic assistance in open surgery as a feasible target for imitation learning and highlight depth perception and end-effector design as key priorities for clinical translation.

2605.28726 2026-05-28 cs.RO cs.LG Version update

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

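Krishnam Gupta

Affiliations: Independent Research

AI summary: Black-box action monitoring shows that vision-language-action (VLA) architectures fail in fundamentally different, predictable ways at the motor-command level, which makes architecture-matched monitor selection essential.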

Comments Accepted at IEEE ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots", Vienna, June 2026. Non-archival workshop. 5 pages, 2 figures, 22 references

Abstract

We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: https://github.com/krishnam94/vla-edge
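To make the two action-level signals named in the abstract concrete, here is a minimal sketch of a direction reversal rate and a peak jerk computed from a sequence of commanded actions. The definitions, function names, and the `eps` dead band are assumptions for illustration; the abstract does not specify how SafeContract defines or thresholds these quantities.

```python
from typing import List, Sequence


def _deltas(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """First finite difference between consecutive rows."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(rows, rows[1:])]


def direction_reversal_rate(actions: Sequence[Sequence[float]], eps: float = 1e-9) -> float:
    """Fraction of consecutive per-dimension steps whose direction flips sign.

    Steps smaller than eps are ignored (an assumed dead band).
    """
    steps = _deltas(actions)
    flips = 0
    pairs = 0
    for prev, cur in zip(steps, steps[1:]):
        for p, c in zip(prev, cur):
            if abs(p) > eps and abs(c) > eps:
                pairs += 1
                if p * c < 0:
                    flips += 1
    return flips / pairs if pairs else 0.0


def max_abs_jerk(actions: Sequence[Sequence[float]], dt: float) -> float:
    """Peak magnitude of the third finite difference, scaled by dt**3."""
    third = _deltas(_deltas(_deltas(actions)))
    return max((abs(x) / dt ** 3 for row in third for x in row), default=0.0)


# One-dimensional toy action sequences: a steady ramp and an oscillation.
smooth = [[0.0], [1.0], [2.0], [3.0], [4.0]]
jittery = [[0.0], [1.0], [0.0], [1.0], [0.0]]
```

Both functions read only the commanded actions, which is what makes this style of monitoring black-box: no access to model weights or internal activations is needed.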

2605.28634 2026-05-28 cs.RO Version update

PrimitiveVLA: Learning Reusable Motion Primitives for Efficient and Generalizable Robotic Manipulation

Yutai Li, Shaohui Peng, Jiaming Guo, Di Huang, Zihao Zhang, Yuxuan Guo, Yunkai Gao, Siming Lan, Ling Li, Xing Hu, Yunji Chen

Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; Jiangsu Key Laboratory of AI for Industries, Institute of AI for Industries, CAS; University of Chinese Academy of Sciences; Cambricon Technologies; Intelligent Software Research Center, Institute of Software, CAS; University of Science and Technology of China

AI summary: Proposes PrimitiveVLA, a framework that moves vision-language-action models from direct instruction-to-control mapping to a primitive-centric disassemble-and-assemble paradigm, using a multimodal canonical representation and an automated pipeline to improve data efficiency and achieve zero-shot generalization.

Abstract

Vision-Language-Action (VLA) models offer a promising paradigm for generalist robotic policies, yet their adaptation is hindered by data inefficiency and poor generalization. We argue that these bottlenecks stem from the prevailing Direct Instruction-to-Control Mapping, which forces models to memorize monolithic trajectories rather than reusable motion patterns, i.e., primitives. We propose PrimitiveVLA, a framework that shifts this paradigm toward a Primitive-Centric Disassemble & Assemble paradigm. Supported by a shared Multimodal Canonical Representation (MCR), PrimitiveVLA unifies two phases: (1) Fine-tuning-phase Disassembly, which uses an automated pipeline to disassemble demonstrations into reusable primitives; and (2) Inference-phase Assembly, which employs a VLM-based planner and an LLM-generated switch module for robust closed-loop execution. By disassembling tasks into reusable primitives, PrimitiveVLA enables VLA models to learn invariant motion patterns instead of task-specific trajectories. Extensive experiments show that our framework improves data efficiency and achieves superior zero-shot generalization across unseen and long-horizon tasks.

2605.28583 2026-05-28 cs.RO cs.AI cs.LG cs.SY eess.SY Version update

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

Kangyu Wu, Peng Cui, Guoxi Chen, Ya Zhang

Affiliations: National Natural Science Foundation (NNSF) of China; National Science and Major Project

AI summary: Proposes SARAD, a framework that combines large language models with deep reinforcement learning and uses retrieval-augmented generation and a collision prediction module to improve the safety and efficiency of autonomous driving.

Comments 7 pages, 4 figures, accepted by IJCNN 2026

Abstract

Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Reinforcement Learning (DRL) suffers from unsafe random exploration and slow convergence, while Large Language Models (LLMs) demonstrate inherent latency in real-time inference operations. To address these limitations, this paper proposes SARAD, a novel safety-aware hybrid framework that synergizes LLMs and DRL for autonomous driving. SARAD substitutes the random exploration of DRL with Retrieval-Augmented Generation (RAG)-enhanced, LLM-guided decisions sourced from a dynamic expert knowledge repository. An attention discriminator is proposed to integrate the prior knowledge of LLMs into DRL policy optimization. A collision predictor module, fine-tuned with historical collision data, is further designed to improve vehicle safety. Extensive experiments show that SARAD achieves significant performance improvements in the Highway-Env simulator, validating the effectiveness of the proposed model in autonomous driving.

2605.28549 2026-05-28 cs.RO cs.LG Version update

Learning a Dynamics Trajectory Manifold for Impact-Aware Compliant Grasping of Fast-Moving Objects

Guorui Pei, Mengshi Zhang, Xi Chen, Jinsong Wu, Jiaming Qi, Peng Zhou

Affiliations: College of Robotics; Taiyuan University of Technology; School of Data Science; City University of Hong Kong (Dongguan); School of Advanced Engineering; Great Bay University; Department of Mechanical Engineering; The Hong Kong Polytechnic University; College of Mechanical and Electrical Engineering; Northeast Forestry University

AI summary: Collects successful grasp trajectories with reinforcement learning in simulation, learns a low-dimensional dynamics trajectory manifold from them, and at run time maps the estimated initial state of the object directly to a reference grasp trajectory. Combined with near-contact compliant control, this yields impact-aware grasping of fast-moving objects.

2605.28448 2026-05-28 cs.RO Version update

A Digital Twin Framework for Virtual Visuo-Haptic Teleoperation of Complex-Shaped Optical Microrobots

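Zongcai Tan, Lan Wei, Dandan Zhang

Affiliations: Department of Bioengineering, Imperial-X AI Initiative, Imperial College London

AI summary: Presents a digital twin framework for virtual visuo-haptic teleoperation of complex-shaped optical microrobots, integrating multi-trap optical manipulation, image-based pose estimation, microrobot motion simulation, and model-based haptic rendering. In experiments, haptic feedback markedly reduces the standard deviations of the contact-force and position-error metrics and raises task success.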

Comments Accepted by 2026 MARSS

Abstract

Optical tweezers (OT) provide piconewton-scale manipulation for delicate biomedical tasks, where visuo-haptic feedback can improve operator awareness by conveying interaction-force cues and trap-stability information. However, visuo-haptic teleoperation frameworks for complex-shaped optical microrobots remain underdeveloped, particularly in multi-trap manipulation scenarios. This paper presents a digital twin framework for virtual visuo-haptic teleoperation of complex-shaped OT-driven microrobots. The framework integrates a digital twin environment, image-based pose and depth estimation, microrobot motion simulation, and model-based haptic rendering within a Robot Operating System (ROS)-connected bimanual teleoperation system. For force modeling, we combine a Multi-Sphere Distributed Manipulation (MSDM) model with optical-force estimation from the Optical Tweezers Toolbox, enabling simulator-driven visuo-haptic feedback. The framework reproduces representative microrobot motion trends and provides haptic force rendering that is numerically consistent with the fitted optical-force model. In simulated cell-delivery tasks, haptic feedback reduced the standard deviations of the contact-force metric and the microrobot-to-trap-center distance metric by 53.2% and 55.2%, respectively, and improved task success from 30% to 80%. These results demonstrate the framework's effectiveness for evaluating visuo-haptic teleoperation strategies for complex-shaped optical microrobots.

2605.28412 2026-05-28 cs.RO cs.LG Version update

Tactile-Proprioceptive Sensor Fusion for Contact Wrench Estimation in Whole-Body Physical Human-Robot Interaction

Junha Min, Junghyeon Ma, Jiwung Kwon, Sunggyu Bae, Joohyung Kim, Kyungseo Park

Affiliations: Department of Robotics and Mechatronics Engineering, DGIST (Daegu Gyeongbuk Institute of Science and Technology); Kinetic Intelligent Machine Lab (KIMLAB), University of Illinois Urbana-Champaign

AI summary: Proposes a tactile-proprioceptive fusion framework that uses tactile cues from pneumatic skin pads as contact indicators, combines them with motor-current-based proprioception, and applies a temporal convolutional network to mitigate friction hysteresis, reconstructing multi-axis contact forces and improving the sensitivity and responsiveness of physical human-robot interaction.

Comments 8 pages, 6 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

Direct physical guidance is a natural means of teaching and interacting with robots, and robotic skins make a key contribution by enabling sensitive contact sensing and localization. This paper presents a tactile-proprioceptive sensor fusion framework for natural physical human-robot interaction. Tactile cues from pneumatic skin pads serve as contact indicators that bypass the ambiguity between frictional residues and applied external forces, enabling highly sensitive contact detection without explicit friction identification. We fuse these cues with motor-current-based proprioception to reconstruct multi-axis contact forces on the robot surface. To maintain accuracy during motion, we employ a temporal convolutional network (TCN) to mitigate friction hysteresis during stick-slip transitions, reducing uncertainty at contact onset and yielding smooth, responsive guidance. We validate the approach on a skin-integrated robot arm: (i) multi-axis forces are reconstructed in stationary contacts, and (ii) simultaneous force estimation and kinesthetic teaching are demonstrated. Results indicate improved sensitivity and responsiveness across diverse contact conditions compared with tactile-only and proprioceptive-only baselines, supporting tactile-proprioceptive fusion as a reliable pathway to safe, intuitive physical human-robot interaction.

2605.28372 2026-05-28 cs.LG cs.RO Version update

Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning

Meraj Mammadov, Pedro Zuidberg Dos Martires, Johannes Andreas Stork

Affiliations: Department of Computer Science; Örebro University

AI summary: Proposes building a shared embedding space with self-supervised contrastive learning to shrink the imitation gap between teacher and student policies and thereby improve student performance.

Comments 6 pages, 5 figures. Accepted as an oral presentation at the RL4IL Workshop at ICRA 2026

Abstract

Imitation learning (IL) from a state-based reinforcement learning (RL) policy is a common approach to overcome the curse of dimensionality in complex and high-dimensional observation spaces prevalent in robotics. This paper addresses the irreducible imitation gap that emerges when teacher and student are learned in isolation, and the teacher policy has the liberty to rely on privileged state information that the student cannot infer from its observations. Instead of improving poor student performance with RL finetuning after IL, which often requires a whole new training setup, we propose a novel algorithm which learns a shared embedding space that hides agent-specific observations and thus trains imitable teacher policies by construction. We train the shared embedding space with self-supervised contrastive learning in parallel to the teacher policy and prevent it from extracting private information by limiting its gradients from updating the encoder networks. We perform evaluations on several example domains and compare to state-of-the-art baselines showing that our algorithm enables higher student performance with substantially reduced imitation gap.

2605.28362 2026-05-28 cs.RO Version update

Accelerating Robot Path Planning via Connectivity-Preserving Region Proposal Network

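Zhanzheng Ma, Cancan Zhao, Shuai Zhang, Bo Ouyang

Affiliations: School of Management, Hefei University of Technology

AI summary: Proposes the Connectivity-Preserving Region Proposal Network (CP-RPN), a segmentation model that predicts compact, topologically connected candidate regions to compress the search space, and pairs a Voronoi diagram with a local A* fallback to achieve low-latency planning with a high success rate.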

Abstract

Mobile robot path planning methods are often constrained by vast search spaces, resulting in latency in sampling-based algorithms. Learning-based approaches frequently suffer from local region fragmentation and global topological inconsistency. To tackle the problem, we present the Connectivity-Preserving Region Proposal Network (CP-RPN), a segmentation-guided model designed to predict compact and topologically connected candidate regions, significantly compressing the search space. Specifically, we design a segmentation model that leverages a Deformable Attention Transformer (DAT) to capture long-range dependencies for global connectivity, with a Deconvolutional decoder to preserve fine-grained spatial details. To guarantee the connectivity of the predicted mask, we design a composite loss function that combines Cross-Entropy loss for pixel-wise supervision, a Connectivity-Aware loss to enhance local coherence, and a Topological Continuity loss based on persistent homology to enforce global connectivity. Building on these high-connectivity corridor-like regions, the Voronoi diagram is used to plan the path, backed by a local A* fallback mechanism to ensure robustness. Experimental results demonstrate that CP-RPN reduces the candidate region size by over 60.13% compared to the MPT baseline and achieves deterministic low-latency planning (avg. 0.11s) with a 99.60% success rate, outperforming traditional sampling-based algorithms in stability.

2605.28352 2026-05-28 cs.RO Version update

Magnet-Based Soft Robotic Skin Using a 3D-Printed Multi-Lattice Structure and CNN-Based Tactile Super-Resolution

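Yunseong Bang, Joowon Park, Suan Sim, Youngjun Ryu, Sukho Park, Kyungseo Park

AI summary: Proposes a magnet-based robotic skin that integrates a multilayer soft lattice, a Hall-effect sensor array, and a CNN-based tactile super-resolution model. Tuning the lattice parameters adjusts mechanical compliance and sensing characteristics jointly, 3D printing enables rapid fabrication, and contact location and normal force are estimated in real time.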

Comments 6 pages, 9 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. Y. Bang and J. Park contributed equally

Abstract

This paper presents a magnet-based robotic skin that integrates a multilayer soft lattice with distributed Hall-effect sensor arrays and a tactile super-resolution model. External contact forces are converted to magnetic field changes by embedded permanent magnets, and the lattice spreads these changes across the sensing domain. This gives each sensor a large, overlapping receptive field and enables a large sensing area with minimal blind spots. Lattice parameters are tunable, enabling joint adjustment of mechanical compliance and transduction characteristics. An implicit modeling workflow and selective laser sintering (SLS) 3D printing support rapid fabrication of conformal, high-complexity structures. A convolutional neural network trained on experimental measurements estimates contact location and normal force in real time. Experiments validate localization accuracy and indicate scalability to larger surfaces, suggesting applicability to whole-body robotic skin and safe human-robot interaction.

2605.28330 2026-05-28 cs.RO Version update

Chance-Constrained MPPI under State and Dynamic Object Prediction Uncertainty and the Evaluation of Collision Risk Calibration

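Benjamin Serfling, Konrad Doll, Kati Radkhah-Lens

Affiliations: Faculty of Engineering and Informatics, University of Applied Sciences Aschaffenburg

AI summary: Addresses the overconfidence or overconservatism that poorly calibrated upstream uncertainty causes in chance-constrained MPPI control. Proposes the DUCCT-MPPI architecture, which jointly integrates localization and dynamic-obstacle prediction uncertainty through the unscented transform and Monte Carlo aggregation, and achieves robust navigation in simulation with a 28% improvement in success rate.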

Comments Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

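Natural Locomotion: Principles and Methods

Mirado Mortel, Luc Jaulin, Lionel Lapierre, Simon Rohou

AI summary: Formulates natural locomotion as an exchange principle for systems whose motion is mediated by environmental constraints or interactions, constructs the Natural Locomotion Manifold (NLM) with a closed/open construction, and demonstrates the principle on ideal nonholonomic no-slip systems.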

Comments Preprint. 20 pages, 7 figures

Abstract

Robotic locomotion can become efficient when mechanisms exploit passive dynamics, compliance, and resonance rather than track prescribed trajectories. This paper formulates natural locomotion as an exchange principle for systems whose motion is mediated by environmental constraints or interactions. A motion is natural when an internal oscillator returns periodically, the body pose drifts, and the mean Propulsion--Oscillator Exchange power (POE power) vanishes over one cycle. The selected family is a Natural Locomotion Manifold (NLM). We develop the conservative realization of this principle for continuous ideal environmental constraints: the constraints do no external work, total mechanical energy is conserved, and zero mean POE power is an internal exchange with the environment-mediated propulsive channel, not external energy input. The method is a closed/open construction. The propulsive channel is first closed to reveal an effective internal oscillator, organized by scalar action-angle structure in one effective degree of freedom or by nonlinear modal sectors in several degrees of freedom. The channel is then reopened, pose is reconstructed, and accepted cycles must preserve internal recurrence and zero mean POE power. We demonstrate the principle on two ideal nonholonomic no-slip systems: a Chaplygin-sleigh / pendulum-driven car and a three-body extension. In the scalar case, POE closure is equivalent to the missing internal return condition, giving a theorem-backed computation of the NLM family. In the multi-degree case, POE closure remains necessary but must be completed by modal identity, internal return, dynamics consistency, same fixed passive architecture, and nonzero displacement. Natural locomotion becomes a design question: which passive architectures support no, one, or several certified NLM families?

2605.28237 2026-05-28 cs.RO cs.CV Version update

POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

Affiliations: Amap CV Lab, Alibaba Group

AI summary: Targets the "final-meters" challenge of real-world POI navigation. Introduces POINav-Bench, the first closed-loop evaluation benchmark for this setting, and a Brain-Action framework paired with 70K real-world signage-entrance pairs for high-fidelity navigation.

Comments 25 pages, 9 figures

Abstract

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 $m^{2}$ in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework where a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

2605.28231 2026-05-28 cs.RO cs.LG Version update

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

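Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

Affiliations: NAVER LABS; NAVER LABS Europe

AI summary: Proposes ProgVLA, a compact vision-language-action model that represents task progress explicitly and uses a two-stage Perceiver resampling scheme to process long multi-modal sequences under limited compute and memory, matching or exceeding much larger models on multi-task manipulation benchmarks.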

Abstract

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.
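One plausible reading of "normalized remaining-horizon targets" is a per-step label giving the fraction of the episode still ahead. The sketch below builds such labels for a single demonstration; the function name and the linear normalization are assumptions for illustration, and the abstract does not say how the paper's critics actually define or use their targets.

```python
from typing import List


def remaining_horizon_targets(episode_length: int) -> List[float]:
    """Label step t with the fraction of the episode still to come.

    The first step gets 1.0 and the final step gets 0.0 (assumed convention).
    """
    if episode_length < 1:
        raise ValueError("episode_length must be at least 1")
    if episode_length == 1:
        return [0.0]
    last = episode_length - 1
    return [(last - t) / last for t in range(episode_length)]


targets = remaining_horizon_targets(5)
```

Normalizing by episode length keeps the labels in a fixed range regardless of how long a demonstration runs, so short and long-horizon tasks could in principle share one progress head.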

2605.28202 2026-05-28 cs.RO Version update

Natural Functional Gradients for Smooth Trajectory Optimization

Kisang Park, Chanwoo Kim, Kyungjae Lee, Sungjoon Choi

Affiliations: Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea; Department of Statistics, Korea University, Seoul, Republic of Korea

AI summary: Proposes a trajectory optimization framework based on natural functional gradients that uses geometry-aware updates in function space and Monte Carlo estimation to produce smoother, more feasible motions when analytic gradients are unavailable.

Abstract

Generating collision-free and smooth motions remains a central challenge in robotic manipulation, particularly in cluttered environments and narrow passages where feasible regions are highly constrained and fragmented. We propose a trajectory optimization framework that performs geometry-aware updates directly in function space using natural functional gradients. The method optimizes a Gaussian-smoothed surrogate objective that regularizes the optimization landscape through smooth trajectory perturbations while preserving trajectory-level structure. Because the updates are defined intrinsically in function space, trajectory regularity can be controlled independently of a particular time discretization. We derive a practical Monte-Carlo estimator of the natural functional gradient that requires only black-box trajectory evaluations, making the method applicable when analytic gradients are unavailable or unreliable due to collision checking and contact-rich simulation. Experiments on constrained robotic manipulation tasks demonstrate that the proposed method improves trajectory feasibility and produces smoother motions than representative planning and trajectory optimization baselines in environments with narrow geometric clearances. Additional results, videos, and implementation details are available at the project page: https://kisangpark.github.io/natural-functional-gradient/
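The general idea of a black-box, smoothness-aware gradient estimate can be sketched with a standard zeroth-order construction: sample temporally smooth perturbations from a Gaussian with kernel covariance K, evaluate the cost at mirrored trajectories, and average. By Stein's lemma this estimates K times the gradient of the Gaussian-smoothed cost, that is, a covariance-preconditioned gradient. This is not claimed to be the paper's estimator; the RBF kernel, the antithetic sampling, and all names and parameter values below are assumptions for illustration.

```python
import math
import random
from typing import Callable, List


def rbf_kernel(times: List[float], length_scale: float, jitter: float = 1e-6) -> List[List[float]]:
    """Squared-exponential covariance over time stamps; jitter keeps it positive definite."""
    n = len(times)
    return [
        [
            math.exp(-0.5 * ((times[i] - times[j]) / length_scale) ** 2)
            + (jitter if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def cholesky(a: List[List[float]]) -> List[List[float]]:
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def smoothed_gradient(
    cost: Callable[[List[float]], float],
    traj: List[float],
    chol: List[List[float]],
    sigma: float,
    n_samples: int,
    rng: random.Random,
) -> List[float]:
    """Antithetic Monte Carlo estimate of K times the smoothed-cost gradient."""
    n = len(traj)
    grad = [0.0] * n
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Correlated perturbation eps = sigma * L z, smooth in time.
        eps = [sigma * sum(chol[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        plus = cost([x + e for x, e in zip(traj, eps)])
        minus = cost([x - e for x, e in zip(traj, eps)])
        scale = (plus - minus) / (2.0 * sigma ** 2 * n_samples)
        for i in range(n):
            grad[i] += scale * eps[i]
    return grad


# Toy usage: a 1-D trajectory with a quadratic stand-in for a black-box cost.
times = [i / 9 for i in range(10)]
chol = cholesky(rbf_kernel(times, length_scale=0.3))
cost = lambda xs: sum(x * x for x in xs)
traj = [1.0 + 0.5 * t for t in times]
grad = smoothed_gradient(cost, traj, chol, sigma=0.1, n_samples=64, rng=random.Random(0))
new_traj = [x - 0.01 * g for x, g in zip(traj, grad)]
```

Because every perturbation is drawn from the smooth kernel, the update direction is itself a combination of smooth functions, so a step does not introduce jagged artifacts even though only cost evaluations are used.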

2605.28186 2026-05-28 cs.RO cs.AI 版本更新

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

可视化运动策略中的潜在相位结构：基于时间特征扩展的多环境研究

Daisuke Yasui, Toshitaka Matuki, Hiroshi Sato

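Affiliations: Mathematics and Computer Science, National Defense Academy of Japan

AI summary: Proposes a framework that reveals clearer, more regular latent motion phase structures in deep reinforcement learning locomotion policies by extending the clustering features to include actions, next states, and next actions, and by introducing a cluster-count selection method that suppresses self-transitions.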

AI中文摘要

深度强化学习（DRL）已被证明在MuJoCo基准测试（如HalfCheetah、Ant和Walker2D）的运动控制任务中表现出高性能。然而，可视化由深度神经网络实现的训练策略函数内部获得的运动结构仍然具有挑战性。从生物力学及相关领域可知，运动控制是通过重复运动相位（如站立相和摆动相）实现的。在本研究中，我们提出一个框架，用于从运动控制策略通过与环境交互生成的轨迹中揭示潜在的相位结构。所提出的方法将聚类特征从仅状态观测扩展到包括动作、下一状态和下一动作的增强特征，并引入一种抑制自转移的聚类数确定方法。将所提出的方法应用于三个环境——Ant-v5、HalfCheetah-v5和Walker2D-v5，我们成功识别出比现有方法具有更清晰和更规则转换规则的相位结构。

英文摘要

Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments -- Ant-v5, HalfCheetah-v5, and Walker2D-v5 -- we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
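The feature extension described in the abstract can be sketched as follows. This Python fragment only shows how a trajectory might be turned into augmented (state, action, next state, next action) vectors and how the self-transition rate of a cluster labelling could be measured; the function names are hypothetical, and the clustering algorithm and the paper's actual cluster-count criterion are not reproduced here.

```python
def augment_features(states, actions):
    """Concatenate (s_t, a_t, s_{t+1}, a_{t+1}) for each step t.

    `states` and `actions` are equal-length lists of lists of floats
    taken from one rollout of a trained policy.
    """
    return [
        states[t] + actions[t] + states[t + 1] + actions[t + 1]
        for t in range(len(states) - 1)
    ]


def self_transition_rate(labels):
    """Fraction of consecutive time steps that stay in the same cluster."""
    if len(labels) < 2:
        return 0.0
    same = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return same / (len(labels) - 1)
```

Under this reading, a candidate number of clusters could be compared by clustering the augmented features for each candidate and inspecting `self_transition_rate` on the resulting label sequence.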

2605.28172 2026-05-28 cs.RO 版本更新

Provably Guaranteed Polytopic Uncertainty Quantification for SLAM

具有可证明保证的多面体不确定性量化用于SLAM

Guangyang Zeng, Yulong Gao, Yuan Shen, Lingpeng Chen, Haoying Li, Guodong Shi, Junfeng Wu

Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen; Department of Electrical and Electronic Engineering, Imperial College London; School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney

AI summary: Proposes polytope-based uncertainty quantification algorithms that, through three modules (forward mapping, backward pose tracking, and pose compounding), provide provable deterministic guarantees for 3D-3D landmark-based SLAM, and incorporates conformal prediction to improve practicality.

Comments 16 pages, 10 figures; accepted by Robotics: Science and Systems 2026

AI中文摘要

在安全关键的机器人应用中，感知中保证且实用的不确定性量化至关重要。许多现有工作要么没有提供正式包含保证，要么依赖限制性建模假设，要么只关注位姿估计而非完整的SLAM流水线。本文提出了用于基于3D-3D路标的SLAM的可证明保证的不确定性量化算法。该算法由三个基本的不确定性量化模块组成：用于建图的前向不确定性量化、用于位姿跟踪的后向不确定性量化以及位姿复合。每个模块生成一个认证的不确定性集；当输入不确定性边界是确定性的时，输出集继承确定性保证，即它们可证明地包含真实位姿和路标。具体来说，我们使用多面体表示不确定性集，从而实现易处理的计算和对位姿不确定性的统一处理。为了提高算法的实际可用性，我们结合了共形预测，从数据中以规定概率校准测量不确定性。仿真和实验表明，所提出的算法既提供了强大的理论保证，又具有实际可用性。代码开源在 https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification。

英文摘要

In safety-critical robotics applications, guaranteed and practical uncertainty quantification (UQ) in perception is vital. Many existing works either offer no formal containment guarantee, rely on restrictive modeling assumptions, or focus only on pose estimation rather than a complete SLAM pipeline. This paper presents provably guaranteed UQ algorithms for 3D-3D landmark-based SLAM. The algorithms consist of three basic UQ modules: forward UQ for mapping, backward UQ for pose tracking, and pose compound. Each module produces a certified uncertainty set; when the input uncertainty bounds are deterministic, the output sets inherit deterministic guarantees, i.e., they provably contain the true poses and landmarks. Specifically, we use polytopes to represent uncertainty sets, enabling tractable computations and a unified treatment of pose uncertainty. To enhance algorithms' practical usability, we incorporate conformal prediction to calibrate measurement uncertainty from data with prescribed probability. Simulations and experiments demonstrate that the proposed algorithms provide both strong theoretical guarantees and practical usability. The code is open-sourced at https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification.

2605.28154 2026-05-28 cs.HC cs.RO 版本更新

Robo-Blocks: Generative Scaffolding in End-User Design and Programming of Social Robots

Robo-Blocks：社交机器人终端用户设计与编程中的生成式支架

Arissa J. Sato, Callie Y. Kim, Nathan Thomas White, Abhinav Maneesh, Yuqing Wang, Hui-Ru Ho, Bilge Mutlu

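Affiliations: Department of Computer Sciences, University of Wisconsin–Madison

AI summary: Through a Research through Design (RtD) process, presents Robo-Blocks, an LLM-based block programming environment that uses generative scaffolding to turn high-level ideas into executable robot behaviors, supports novice programmers, and reveals user personas and usage patterns.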

DOI: 10.1145/3800645.3812997

AI中文摘要

由于需要规划、交互设计和编程方面的专业知识，编程社交机器人对新手机器人程序员来说具有挑战性。虽然大型语言模型（LLM）通过从自然语言描述生成代码具有巨大潜力，但它们可能掩盖编程的关键元素并取代设计者的意图，最终导致过度依赖而非发展编程技能。在本文中，我们通过研究通过设计（RtD）过程，探索基于LLM的社交机器人编程工具如何支持新手机器人程序员。我们设计并原型化了Robo-Blocks，这是一个基于积木的编程环境，利用LLM通过结构化叙述为新手机器人程序员提供生成式支架，将高级想法连接到可执行的机器人行为。通过与新手的部署，我们发现了生成式支架的新兴用户角色和使用模式，并展示了这种支架如何塑造终端用户的设计和编程策略。我们提出了有效使用生成式支架及其融入社交机器人编程实践的设计见解。

英文摘要

Programming social robots is challenging for novice robot programmers due to required expertise in planning, interaction design, and programming. While large language models (LLMs) hold significant promise through code generation from natural-language descriptions, they can obscure critical elements of programming and supplant designer intent, eventually resulting in over-reliance instead of developing programming skills. In this paper, we explore how LLM-based social-robot-programming tools can support novice robot programmers through a Research through Design (RtD) process. We designed and prototyped Robo-Blocks, a block-based programming environment that leverages LLMs to offer novice robot programmers generative scaffolding through structured narratives that connect high-level ideas to executable robot behaviors. Through deployment with novices, we discovered emerging user personas and usage patterns for generative scaffolding and showed how this scaffolding shapes end-user design and programming strategies. We present design insights for the effective use of generative scaffolding and its integration into the practice of social-robot programming.

2605.25770 2026-05-28 cs.RO 版本更新

Implicit Null-space Manifold Generation for Redundant Robotic Systems

冗余机器人系统的隐式零空间流形生成

Taiki Ishigaki, Teresa Vidal-Calleja, Ko Ayusawa, Eiichi Yoshida

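Affiliations: Tokyo University of Science, Japan; University of Technology Sydney, Australia; National Institute of Advanced Industrial Science and Technology, Japan

AI summary: For redundant robotic systems, proposes an implicit scalar field method built on Jacobian-guided exploration that represents the solution manifold as a zero-level set, enabling efficient estimation of the solution space geometry and modeling of continuously varying tasks.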

Comments Corrected author names in references

AI中文摘要

具有冗余自由度的机器人系统可以通过多种配置实现相同的任务结果，从而形成配置空间中的解流形。现有方法通常通过基于雅可比的技术局部利用这种冗余性来计算单个解或轨迹。虽然这些方法在求解计算上有效，但它们不保留解集本身的几何结构表示。在这项工作中，我们采用以表示为中心的方法来估计解空间的几何结构。我们考虑由通用任务定义映射诱导的解流形，并在配置空间上构建一个隐式标量场，其零水平集对应于解流形。为此，我们使用雅可比引导的探索策略在解流形附近生成样本，该策略有效捕获其局部和全局结构。得到的隐式表示定义在配置空间上，并自然诱导出一个连续的距离场，编码到解流形的接近度。在平面三连杆机器人和七自由度Franka机械臂上的实验证明了所提出表示的有效性。此外，该框架能够对具有连续变化的任务族进行解空间的一致建模。

英文摘要

Robotic systems with redundant degrees of freedom can achieve the same task outcome using multiple configurations, resulting in solution sets that form manifolds in the configuration space. Existing approaches typically exploit such redundancy locally through Jacobian-based techniques to compute individual solutions or trajectories. While effective for solution computation, these methods do not retain a representation of the geometry of the solution set itself. In this work, we adopt a representation-centric approach to estimate the geometric structure of the solution space. We consider solution manifolds induced by general task-defining maps and construct an implicit scalar field over the configuration space, whose zero-level set corresponds to the solution manifold. To this end, we generate samples in the neighborhood of the solution manifold using a Jacobian-guided exploration strategy, which efficiently captures its local and global structure. The resulting implicit representation is defined over the configuration space and naturally induces a continuous distance field that encodes proximity to the solution manifold. Experiments on a planar three-link robot and a seven-degree-of-freedom Franka manipulator demonstrate the effectiveness of the proposed representation. Furthermore, the framework enables consistent modeling of solution spaces across families of tasks with continuous variation.

2605.25010 2026-05-28 cs.RO cs.AI 版本更新

Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

经典与神经采样算法在机器人导航中的性能比较

Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui

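Affiliations: Dept. of Economics, Oran2 Mohamed BenAhmed University

AI summary: Compares RRT*, Neural RRT*, and Neural Informed RRT* in environments with convex and concave obstacles, finding that neural-guided planners produce shorter (up to 14%) and smoother (55–75%) paths, with Neural Informed RRT* achieving the best overall performance.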

Journal ref: Presented at The 3rd Edition of National Conference on Applications of Artificial Intelligence A2I' 26. 2026

AI中文摘要

将人工智能（AI）集成到基于采样的运动规划中为提高自主导航效率提供了新的可能性。本文在包含不同障碍物密度的凸凹障碍物环境中实现并评估了三种算法，即RRT*、Neural RRT*和Neural Informed RRT*。结果表明，与传统RRT*算法相比，神经引导规划器提高了路径质量，生成了最多短14%的路径和55-75%更平滑的轨迹。在评估的方法中，Neural Informed RRT*在路径长度和轨迹平滑度方面实现了最佳整体性能。这些结果证明了AI引导采样策略在提高机器人和无人机导航的可靠性和轨迹效率方面的有效性，尽管计算时间略有增加。总体而言，该研究凸显了人工智能在实时机器人路径规划应用中日益增长的重要性。

英文摘要

Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation efficiency. In this paper, three algorithms, namely RRT*, Neural RRT*, and Neural Informed RRT*, are implemented and evaluated on environments containing convex and concave obstacles with different obstacle densities. The obtained results indicate that neural-guided planners improve path quality, producing up to 14% shorter paths and 55–75% smoother trajectories compared with the conventional RRT* algorithm. Among the evaluated methods, Neural Informed RRT* achieves the best overall performance in terms of path length and trajectory smoothness. These results demonstrate the effectiveness of AI-guided sampling strategies for improving reliability and trajectory efficiency in robotic and UAV navigation, despite a slight increase in computation time. Overall, the study highlights the growing importance of artificial intelligence in real-time robotic path planning applications.

2605.28136 2026-05-28 cs.CV cs.RO 版本更新

SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

SAM增强的道路数据集分割：自动驾驶中关键类别的平衡

Toomas Tahves, Mauro Bellone, Junyi Gu, Raivo Sell

Affiliations: Department of Mechanical and Industrial Engineering, Tallinn University of Technology; FinEst Centre for Smart Cities, Tallinn University of Technology; Department of Computer Science and Engineering, Universitas Mercatorum; Department of Computer Science and Engineering, Chalmers University of Technology; University of Gothenburg

AI summary: Proposes a SAM-based annotation pipeline that converts the bounding boxes of the ZOD dataset into dense pixel-level semantic masks, evaluates different architectures under class imbalance, and shows effective transfer across sensor configurations via bidirectional transfer learning.

AI中文摘要

密集语义分割对于自动驾驶至关重要，然而许多多模态数据集缺乏像素级标注。Zenseact开放数据集（ZOD）提供丰富的多传感器数据，但仅有边界框标签，限制了其在分割研究中的应用。我们的主要贡献是一个基于Segment Anything Model（SAM）的标注流水线，通过将边界框转换为语义掩码，为ZOD生成密集的像素级标注。在这项初步研究中，我们处理了超过10万帧，并手动筛选出一个2300帧的子集（接受率36%），以建立可靠的基线。利用这些标注，我们评估了基于Transformer的CLFT和基于CNN的DeepLabV3+架构在不同天气条件下的性能，其中CLFT-Hybrid达到了48.1%的mIoU。为了解决极端类别不平衡问题（行人、骑行者、标志牌像素占比不足1%），我们探索了针对稀有类别的专门模型。我们还在Iseauto自动驾驶平台上验证了该流水线，达到了77.5%的mIoU，并展示了通过双向迁移学习，SAM导出的表示能够有效地跨传感器配置迁移。所有代码和标注均已发布，以支持可重复研究。

英文摘要

Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.

2605.28110 2026-05-28 cs.RO 版本更新

STR Robot: Design of an Autonomous Mobile Robot from Simulation to Reality

STR机器人：从仿真到现实的自主移动机器人设计

Vinh Nguyen, Gia-Uy Le, Tien-Dat Nguyen, Tri-Tin Nguyen, Vinh-Hao Nguyen

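Affiliations: Faculty of Electrical and Electronic Engineering, Ho Chi Minh City University of Technology, VNU-HCM

AI summary: Presents a simulation-to-reality implementation of an autonomous mobile robot on an existing mechanical platform, focusing on the onboard control, self-localization, and autonomous navigation systems, and validates its feasibility through simulation and experiments.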

2605.28097 2026-05-28 cs.RO 版本更新

ICAN-Deploy: Identity-Stable Canary Deployment for Safety-Critical Embodied Agents

ICAN-Deploy：面向安全关键具身智能体的身份稳定金丝雀部署

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

Affiliations: Harbin Institute of Technology; Heriot-Watt University, Malaysia Campus; Fraunhofer Institute for Applied Information Technology; Soochow University

AI summary: Proposes ICAN-Deploy, a middleware that separates capability names from capability versions so that the identity hash stays invariant during canary deployment of safety-critical embodied agents, avoiding re-certification.

Comments 14 pages, 6 figures, 4 tables

AI中文摘要

金丝雀部署将一小部分流量路由到新软件版本，监控指标，并在出现回归时回滚。主流控制器（Argo Rollouts、Spinnaker、Flagger）在金丝雀窗口期间会改变部署系统的加密身份。这种漂移对于无状态微服务是无害的，但对于安全关键的具身智能体，它打破了“你认证的智能体仍然是你拥有的智能体”这一声明，迫使每次金丝雀部署都要重新认证。我们提出了ICAN-Deploy（身份稳定的金丝雀部署），这是一种中间件构造，其状态机通过分离能力名称（冻结、哈希化）和能力版本（可变运行时状态），在金丝雀窗口期间保持身份哈希不变。我们在LLM驱动的机器人的运行时治理层中实现了ICAN-Deploy，并通过封闭式证明、AST lint和TLA+模型检查验证了不变性，然后在MuJoCo中的Franka Panda手臂上通过N=100个真实金丝雀周期进行了验证（零漂移；入口延迟95% BCa CI [1.52, 2.01] ms）。一个将版本折叠到清单中的功能标志稻草人在相同工作负载下失败。在身份创建时一次性认证的系统，可以在同一认证下，在版本和名称范围内，交付任意能力演化。

英文摘要

Canary deployment routes a fraction of traffic to a new software version, monitors metrics, and rolls back on regression. Mainstream controllers (Argo Rollouts, Spinnaker, Flagger) change the deployed system's cryptographic identity during the canary window. The drift is harmless for stateless microservices but breaks the claim that "the agent you certified is still the agent you have" for safety-critical embodied agents, forcing re-certification per canary. We present ICAN-Deploy (Identity-stable CANary Deployment), a middleware construction whose state machine holds the identity hash invariant across the canary window by separating capability names (frozen, hashed) from capability versions (mutable runtime state). We implement ICAN-Deploy inside a runtime governance layer for LLM-driven robots and verify invariance by closed-form proof, AST lint, and TLA+ model-checking, then corroborate over N=100 real canary cycles on a Franka Panda arm in MuJoCo (zero drift; entry latency 95% BCa CI [1.52, 2.01] ms). A feature-flagged strawman that folds versions into the manifest falsifies on the same workload. A system certified once at identity-creation time can then ship arbitrary capability evolution under that same certification, within the version-and-name envelope.
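The name/version separation described in the abstract can be illustrated with a minimal Python sketch. It assumes the identity is a SHA-256 digest over the sorted capability names only, and contrasts it with a strawman that hashes names together with versions; the manifest layout, the function names, and the choice of SHA-256 are assumptions made here for illustration and are not taken from the ICAN-Deploy implementation.

```python
import hashlib
import json


def identity_hash(capability_names):
    """Hash only the (frozen) capability names, in a canonical order."""
    canonical = json.dumps(sorted(capability_names), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def strawman_hash(manifest):
    """Strawman: fold versions into the hashed manifest as well."""
    canonical = json.dumps(sorted(manifest.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical manifests: capability name -> version (mutable runtime state).
    before = {"grasp": "1.0.0", "navigate": "2.1.0"}
    during_canary = {"grasp": "1.1.0-canary", "navigate": "2.1.0"}

    # Iterating a dict yields its keys, so only the names are hashed here.
    print(identity_hash(before) == identity_hash(during_canary))
    print(strawman_hash(before) == strawman_hash(during_canary))
```

In this toy setting a version bump during the canary window leaves `identity_hash` unchanged while `strawman_hash` drifts, which mirrors the contrast the abstract draws between the proposed construction and the version-folding strawman.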

2605.28092 2026-05-28 cs.RO 版本更新

An Operator-Based Approach to STL

一种基于算子的STL方法

Panagiotis Rousseas, Dimos V. Dimarogonas

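Affiliations: Department of Decision and Control Systems, School of Electrical Engineering and Computer Science, Royal Institute of Technology (KTH)

AI summary: Proposes a new framework for STL based on reachability value-function operators, handling complex multiply nested formulas by directly developing operator nesting rules and enabling online control synthesis.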

2605.28087 2026-05-28 cs.RO 版本更新

Whose Is This?: Context-Aware Object Ownership Inference with Uncertainty-Guided Questioning

这是谁的？：基于不确定性引导提问的上下文感知物体所有权推断

Saki Hashimoto, Akira Taniguchi, Shoichi Hasegawa, Yoshinobu Hagiwara, Tadahiro Taniguchi

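Affiliations: Kyutech

AI summary: Proposes COIN, a context-aware ownership inference framework that combines a large language model with conformal prediction and uses uncertainty-guided interactive questioning to achieve high-accuracy object ownership estimation in a simulated home environment.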

Comments Under review in Advanced Robotics. Project page is https://emergentsystemlabstudent.github.io/COIN/

AI中文摘要

服务机器人必须推断物体所有权才能正确解释诸如“把我的杯子拿来”之类的指令。然而，所有权是一个无法直接观察的潜在属性，现有方法通常依赖有限线索（如近期使用），在临时共享等场景中不可靠。我们提出一种具有不确定性引导交互的上下文感知所有权推断框架（COIN）。该方法使用大语言模型（LLM）整合用户背景信息和物体使用历史来估计所有权分数。为处理不确定性，我们应用共形预测构建一组可能的拥有者，并在预测不确定时选择性生成用户查询。在模拟家庭环境中的实验表明，所提方法始终优于基线方法，子集准确率达到0.988，平均Jaccard指数达到0.991。该方法在临时使用和共享所有权场景中也保持高性能。结果表明，结合上下文推理与不确定性感知交互提高了估计准确性和鲁棒性。项目页面见https://emergentsystemlabstudent.github.io/COIN/。

英文摘要

Service robots must infer object ownership to correctly interpret instructions such as "bring me my cup." However, ownership is a latent attribute that cannot be directly observed, and existing methods often rely on limited cues such as recent usage, making them unreliable in scenarios such as temporary sharing. We propose a framework for context-aware ownership inference with uncertainty-guided interaction (COIN). The method integrates user background information and object usage history using a large language model (LLM) to estimate ownership scores. To handle uncertainty, we apply conformal prediction to construct a set of plausible owners and selectively generate user queries when the prediction is uncertain. Experiments in a simulated home environment show that the proposed method consistently outperforms baseline approaches, achieving a Subset Accuracy of 0.988 and a Mean Jaccard index of 0.991. The method also maintains high performance in scenarios involving temporary use and shared ownership. The results demonstrate that combining contextual reasoning with uncertainty-aware interaction improves both estimation accuracy and robustness. The project page is available at https://emergentsystemlabstudent.github.io/COIN/.
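The conformal step described in the abstract can be sketched with standard split conformal prediction. In the Python fragment below, ownership scores are assumed to lie in [0, 1], the nonconformity score is assumed to be one minus the score of the true owner, and a question is triggered whenever the prediction set is not a single owner; these choices and all function names are assumptions for illustration, and the paper's exact scoring and query policy may differ.

```python
import math


def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every owner.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(owner_scores, qhat):
    """Keep every candidate owner whose nonconformity (1 - score) is <= qhat."""
    return {owner for owner, s in owner_scores.items() if 1.0 - s <= qhat}


def should_ask_user(pred_set):
    """Assumed trigger: ask when the set does not pin down one owner."""
    return len(pred_set) != 1
```

A robot using this sketch would compute `qhat` once from held-out labelled examples, then build `prediction_set` for each new object from the LLM-derived scores and only query the user when `should_ask_user` returns true.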

2605.28048 2026-05-28 cs.RO 版本更新

SAFEVPR: Patch-Based Conformal Verification for Safe Cross-Condition Sequence Visual Place Recognition

SAFEVPR: 基于补丁的共形验证用于安全跨条件序列视觉地点识别

Ha Sier, Jiaqiang Zhang, Zhuo Zou, Xianjia Yu, Tomi Westerlund

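Affiliations: Turku Intelligent Embedded and Robotic Systems (TIERS) Lab; University of Turku; School of Information Science and Technology

AI summary: Proposes SAFEVPR, a training-free verification-and-calibration pipeline that uses a mutual-nearest-neighbour patch-matching score and Mondrian conformal LTT calibration for sequence VPR under cross-condition deployment; it is empirically valid at the target FDR in all 23 cross-condition setups.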

AI中文摘要

基于序列的视觉地点识别（VPR）用于SLAM和机器人重定位必须决定检索到的top-1候选是否安全可接受。共形预测是这种接受/拒绝决策的自然框架，但其有限样本保证依赖于校准数据和部署（测试）数据之间的可交换性，这在跨条件部署下被违反。我们引入了SAFEVPR，一种无需训练的验证与校准流程，用于安全的跨条件序列VPR。SAFEVPR将标准的骨干余弦相似度替换为从冻结的DINOv2 ViT特征计算出的互近邻（MNN）补丁匹配分数，并将平坦的Learn-Then-Test校准替换为Mondrian共形LTT，为不同分数区间拟合独立的Bonferroni校正阈值。在可交换性下，这些阈值将提供有限样本的假发现率（FDR）控制；在条件偏移下，我们评估每个部署的经验有效性。在来自Oxford RobotCar、NCLT和St Lucia数据集的23个跨条件设置中，使用三个冻结的VPR骨干，SAFEVPR在目标FDR alpha=0.10下，在23/23的设置中经验有效，平均接受FDR为0.014，平均真阳性率（TPR）为0.75。结果表明，仅凭原始区分度不足以实现共形有效性：AnyLoc-VLAD和Super-Point+LightGlue达到了可比的ROC曲线下面积（AUROC），但在相同校准下失败的设置更多。在无纹理重复场景中，SAFEVPR安全地弃权，而不是接受不可靠的匹配。代码可在https://github.com/Hasar12139/SafeVPR获取。

英文摘要

Sequence-based visual place recognition (VPR) for SLAM and robot relocalization must decide whether the retrieved top-1 candidate is safe to accept. Conformal prediction is a natural framework for this accept/reject decision, but its finite-sample guarantees rely on exchangeability between calibration and deployment (test) data, which is violated under cross-condition deployment. We introduce SAFEVPR, a non-trainable verification-and-calibration pipeline for safe cross-condition sequence VPR. SAFEVPR replaces the standard backbone cosine similarity with a mutual-nearest-neighbour (MNN) patch-matching score computed from frozen DINOv2 ViT features, and replaces flat Learn-Then-Test calibration with Mondrian conformal LTT, fitting separate Bonferroni-corrected thresholds across score bins. Under exchangeability, these thresholds would provide finite-sample false-discovery-rate (FDR) control; under condition shift, we evaluate empirical validity per deployment. Across 23 cross-condition setups from Oxford RobotCar, NCLT, and St Lucia datasets, using three frozen VPR backbones, SAFEVPR is empirically valid on 23/23 setups at target FDR alpha = 0.10, achieving mean accepted FDR 0.014 and mean true-positive rate (TPR) 0.75. The results show that raw discrimination alone is not sufficient for conformal validity: AnyLoc-VLAD and Super-Point+LightGlue reach comparable area under the receiver operating characteristic curve (AUROC) but fail more setups under the same calibration. On textureless repetitive scenery, SAFEVPR safely abstains rather than accepting unreliable matches. Code is available at https://github.com/Hasar12139/SafeVPR.

2605.28033 2026-05-28 cs.RO 版本更新

How Should We Teach Robots? A Comparison of Kinesthetic, Joystick, and Gesture-Based Teaching

我们应如何教机器人？动觉、摇杆和手势教学的比较

Petr Vanc, Jan Kristof Behrens, Václav Hlaváč, Karla Stepanova

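Affiliations: Czech Institute of Informatics, Robotics and Cybernetics (CIIRC CTU)

AI summary: Compares three demonstration modalities (kinesthetic guidance, joystick teleoperation, and gesture-based teaching) in a user study, evaluating success rate, workload, and common errors on manipulation tasks.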

Comments 7 pages, 3 figures, 3 tables, presented at Cognition and Artificial Life (CAL/KUZ) 2026 conference at Chateau Trest

2605.27972 2026-05-28 cs.RO 版本更新

Colosseum V2：视觉语言动作模型的泛化能力基准测试

Jeremy Morgan, Prajwal Vijay, Hyeonho Oh, Jincen Song, Ashvin Arora, Alina Du, Gaurav Sukhatme, Jesse Thomason, Ishika Singh

Affiliations: Department of Computer Science, University of Southern California; Department of Electrical Engineering, Indian Institute of Technology Madras; Fu Foundation School of Engineering and Applied Science, Columbia University

AI summary: Presents Colosseum V2, a large-scale simulation benchmark with 28 tasks and two robot morphologies that systematically evaluates the generalization of VLA models under distribution shift and exposes the gap between high-level understanding and robust behavior.

AI中文摘要

视觉-语言-动作（VLA）模型在大规模视觉和语言预训练的推动下，在机器人操作中展现出有前景的泛化能力。然而，这种进展可能具有误导性。尽管VLA具有零样本感知和语言能力，但它们的整体任务性能在分布偏移下常常下降，揭示了这些系统将高层次理解转化为鲁棒行为方面的差距。为了系统地研究这一差距，我们引入了Colosseum V2，这是一个大规模仿真基准，用于评估机器人学习中VLA在不同条件下的泛化能力。该基准包含28个任务，涵盖13个任务类别和两种机器人形态，覆盖了广泛的操作原语和长时域行为。基于ManiSkill仿真器构建，Colosseum V2支持快速、GPU并行化的评估，并支持大规模域内和域外测试。我们评估了包括Action Chunking Transformers (ACT)和Pi0.5在内的最先进方法，揭示了它们在基础性能和泛化方面的局限性。我们展示了仿真与真实世界指标之间的强相关性，支持了该基准的生态效度。通过在统一基准中标准化任务、指标和评估协议，Colosseum V2实现了可重复和公平的比较，降低了评估开销，并加速了向通用机器人策略的进展。

英文摘要

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.

2605.27724 2026-05-28 cs.RO cs.AI 版本更新

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

HumanoidMimicGen: 通过全身规划生成行走操作数据

Kevin Lin, Ajay Mandlekar, Caelan Reed Garrett, Nikita Chernyadev, Yu Fang, Runyu Ding, Yuqi Xie, Justin Tran, Linxi Fan, Yuke Zhu

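Affiliations: NVIDIA; The University of Texas at Austin

AI summary: Proposes HumanoidMimicGen, which uses whole-body planning to automatically generate humanoid loco-manipulation demonstration data; whole-body policies co-trained with the generated data outperform those trained only on real-world data by 20%.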

Comments website: https://humanoidmimicgen.github.io/

AI中文摘要

模仿学习是训练人形机器人行走和操作的一种有前景的方法，但它需要大量演示，而这些演示通过遥操作收集耗时且困难。现有的数据生成算法可以自动合成操作器的演示，但它们在类人机器人上效果不佳，因为其高维复合动作空间涉及手臂、腿和躯干。我们提出HumanoidMimicGen，一种生成人形机器人腿部行走操作数据的方法。我们的方法将少量源演示中的接触丰富的全身技能适应到新状态，并泛化到物体姿态的变化。通过将这些单臂和双臂技能与全身运动规划和操作规划交替进行，该方法在多样化的场景和布局中生成稳定、无碰撞的数据。为了评估我们的方法，我们引入了一个新的模拟行走操作基准，包含九个测试人形机器人行走操作能力的多样化任务。在那里，我们证明HumanoidMimicGen自动生成用于模仿学习的大规模数据集，并能够系统研究数据生成和策略学习决策如何影响模型性能。我们表明，与仅使用真实世界数据训练的策略相比，与HumanoidMimicGen生成的数据联合训练的全身视觉运动策略性能提升20%。

英文摘要

Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demonstrations, which are time-intensive and difficult to collect via teleoperation. Existing data-generation algorithms can automatically synthesize demonstrations for manipulators, but they are ineffective on humanoids because their high-dimensional composite action spaces involve arms, legs, and torsos. We present HumanoidMimicGen, a method for generating humanoid legged loco-manipulation data. Our method adapts contact-rich whole-body skills from a handful of source demonstrations to new states, generalizing across changes in object pose. By interleaving these single- and dual-arm skills with whole-body locomotion and manipulation planning, the method generates stable, collision-free data across diverse scenes and layouts. To evaluate our approach, we introduce a new simulated loco-manipulation benchmark containing nine diverse tasks that test humanoid loco-manipulation capabilities. There, we demonstrate that HumanoidMimicGen automatically generates large datasets for imitation learning and enables a systematic study of how data generation and policy learning decisions impact model performance. We show that whole-body visuomotor policies co-trained with data generated by HumanoidMimicGen outperform those trained only on real-world data by 20%.

2605.27699 2026-05-28 cs.RO 版本更新

AURA: Asymptotically Optimal Uncertainty-Robust Replanning Algorithm for Kinodynamic Systems

AURA: 动力学系统渐近最优的鲁棒重规划算法

Seyedali Golestaneh, Zhuoyun Zhong, Donghyung Lee, Constantinos Chamzas

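Affiliations: Department of Robotics Engineering, Worcester Polytechnic Institute (WPI)

AI summary: Proposes AURA, a meta-planning framework that replans online and optimizes control inputs to achieve asymptotically optimal trajectory planning and improved tracking accuracy under motion uncertainty.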

AI中文摘要

基于采样的运动规划器为动力学运动规划提供了一种实用且可扩展的方法，尤其适用于高维、欠驱动或非完整系统。然而，这些规划器通常离线使用，要求在执行开始前完成轨迹计算。此外，在存在运动不确定性的情况下，规划轨迹可能无法被准确跟踪，导致偏离名义解。本文在一个统一框架AURA中解决了这些局限性，该框架是一个渐近最优的元规划器框架，在执行过程中同时提高路径质量和跟踪性能。除了主执行线程外，该框架包含一个重规划方法，在执行过程中持续探索状态空间并优化轨迹，以及一个优化过程，用于优化未来控制输入以减少跟踪误差。这些组件共同使AURA能够在线利用渐近最优规划，同时在不确定性下提高执行精度。所提出的方法在多个系统的仿真和真实环境中进行了评估，与基线方法相比，在轨迹质量、跟踪精度和整体性能方面表现出一致的改进。

英文摘要

Sampling-based motion planners offer a practical and scalable approach to kinodynamic motion planning, notably for high-dimensional, underactuated, or non-holonomic systems. However, these planners are typically used offline, requiring execution to begin only after the trajectory has been computed. In addition, the planned trajectory may not be accurately tracked in the presence of motion uncertainty, leading to deviations from the nominal solution. In this work, these limitations were addressed within a unified framework, AURA, an asymptotically-optimal meta-planner framework that improves both path quality and tracking performance during execution. In addition to the main execution thread, this framework comprises a replanning method that continuously explores the state space and refines the trajectory during execution, and an optimization process that refines future control inputs to reduce tracking error. Together, these components enable AURA to leverage asymptotically optimal planning online while improving execution accuracy under uncertainty. The proposed approach is evaluated in both simulation and real-world environments across multiple systems, demonstrating consistent improvements in trajectory quality, tracking accuracy, and overall performance compared with baseline methods.

2605.27697 2026-05-28 cs.RO cs.AI cs.LG 版本更新

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

仿真引导的扩散方法用于去中心化多机器人运动规划

Jinhao Liang, Sven Koenig, Ferdinando Fioretto

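Affiliations: University of Virginia; University of California, Irvine

AI summary: Proposes SID, a decentralized framework built on constraint-aware diffusion models that simulates neighbors' future trajectories and plans each robot's own trajectory under the resulting safety constraints, achieving efficient coordination in dense scenarios.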

AI中文摘要

去中心化多机器人运动规划要求每个机器人仅根据局部观测生成无碰撞轨迹，无需全局感知或可靠通信。然而，大多数现有规划器（无论是经典方法还是基于学习的方法）都是从局部观测的静态快照生成轨迹，这限制了它们预测相邻机器人未来行为的能力。随着机器人数量增加和环境变得更加拥挤，这一限制变得至关重要。为了克服这一挑战，本文引入了仿真引导的扩散（SID），这是一种基于约束感知扩散模型（CADM）的去中心化框架。SID首先使用CADM从当前观测状态仿真相邻机器人的未来轨迹，然后利用这些仿真提供的安全约束，使用相同的CADM规划每个机器人自身的轨迹。关键的是，对邻居的精确仿真使得一种最小通信方案成为可能，该方案仅在高度拥挤的场景中必要时触发协调。在多种环境中的实验表明，SID在规划有效性和约束满足方面始终优于基线方法，并且可扩展到108个机器人和160个障碍物的场景。

英文摘要

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot's own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

2605.27661 2026-05-28 cs.RO 版本更新

Design of a Real-time Asynchronous Monocular Odometry for Planetary Exploration

面向行星探测的实时异步单目里程计设计

Benat Inigo, Florian Steidle, Wolfgang Stuerzl

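Affiliations: Institute of Robotics and Mechatronics; German Aerospace Center (DLR); University of Zaragoza

AI summary: To address constrained compute, complex environments, and high-dynamic-range lighting in planetary exploration, proposes a real-time asynchronous event-camera monocular odometry based on an error-state Kalman filter (ESKF) that uses the asynchronous event stream and the RATE feature tracker for continuous camera motion estimation.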

2605.27644 2026-05-28 cs.RO cs.AI cs.LG 版本更新

SCALE-COMM：用于多智能体强化学习通信的共享、对比对齐潜在嵌入

Mahmoud Abouelyazid, Eman Hammad

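AI summary: Proposes SCALE-COMM, a self-supervised framework that learns compact, stable latent communication representations and decouples communication learning from policy optimization, improving the stability and sample efficiency of multi-agent coordination.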

Comments IEEE IV 2026

AI中文摘要

涌现通信使得部分可观测的自主移动机器人（AMR）能够在去中心化多智能体强化学习（MARL）环境中有效协调。然而，现有方法常常面临通信协议不稳定、消息语义无根基以及通信学习与策略优化之间的干扰，导致协调性能随时间下降。我们提出SCALE-COMM（用于通信的共享、对比对齐潜在嵌入），一种自监督框架，用于学习紧凑、稳定且与策略相关的通信表示。SCALE-COMM通过训练低维潜在消息来解耦通信学习与策略优化，这些消息捕获与任务相关的规划和交通信息，同时跨智能体和时间强制执行一致性。在标准MARL基准测试和一个现实的仓库协调任务中，SCALE-COMM在表示质量和任务性能方面均持续优于现有通信框架。学习到的通信空间在策略微调下展现出改进的稳定性、样本效率和吞吐量，证明了表示驱动的通信对于可扩展多智能体协调的有效性。

英文摘要

Emergent communication enables partially observant Autonomous Mobile Robots (AMRs) to coordinate effectively in decentralized multi-agent reinforcement learning (MARL) settings. However, existing approaches often struggle with unstable communication protocols, ungrounded message semantics, and interference between communication learning and policy optimization, leading to degraded coordination over time. We propose SCALE-COMM (Shared, Contrastively-Aligned Latent Embeddings for COMMunication), a self-supervised framework for learning compact, stable, and policy-relevant communication representations. SCALE-COMM decouples communication learning from policy optimization by training low-dimensional latent messages that capture task-relevant planning and traffic information, while enforcing consistency across agents and time. Across standard MARL benchmarks and a realistic warehouse coordination task, SCALE-COMM consistently outperforms existing communication frameworks in both representation quality and task performance. The learned communication space yields improved stability, sample efficiency, and throughput under policy fine-tuning, demonstrating the effectiveness of representation-driven communication for scalable multi-agent coordination.

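2605.27491 2026-05-28 cs.RO Updated version

GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

Boxiang Qiu, Liliang Chen, Yue Liao, Nan Wang, Lintao Wang, Jiayi Luo, Wenzhi Zhao, Shengcong Chen, Di Chen, Ye Li, Chen Gao, Shuicheng Yan, Si Liu, Maoqing Yao, Guanghui Ren

Affiliations: AgiBot; BUAA (Beihang University); LV-NUS Lab; TJU (Tianjin University)

AI summary: Proposes GE-Sim 2.0, a closed-loop video world simulator built on action-conditioned video generation. It is re-trained on thousands of hours of real robot data and adds three modules (a state expert, a world judge, and an acceleration framework) to achieve high-fidelity action following and broad trajectory coverage. With 2B parameters it surpasses dedicated models and general video generators on the WorldArena leaderboard, and policies trained on its generated rollouts and rewards are shown to be effective in the real world.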

Abstract

We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed-loop video world simulator for robotic manipulation. Building on the action-conditioned video generation framework of Genie Envisioner, GE-Sim 2.0 is re-trained on thousands of hours of real-world robot data spanning teleoperation, contact-rich interaction, and on-robot policy deployment, substantially improving action-following fidelity and trajectory coverage. On top of this foundation, three new modules close the loop from video simulation to policy learning: a state expert that decodes proprioceptive state from video latents to support next-chunk prediction by downstream VLA policies; a world judge that scores generated rollouts against task instructions, yielding machine-verifiable success signals and rewards in place of manual inspection; and an acceleration framework that delivers a 25-frame rollout in 2.3 seconds on a single H100, with up to 4× frame skipping at inference for long-horizon evaluation. GE-Sim 2.0 tops the public WorldArena leaderboard at only 2B parameters, outperforming both dedicated robotic world models and closed-source general video generators, and policies trained against its rollouts and rewards translate into measurable real-world gains, establishing GE-Sim 2.0 as a practical platform for scalable evaluation and closed-loop learning of manipulation policies.

2605.27461 2026-05-28 cs.RO Updated version

A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons

Brian Zhu, Philipp Schmitt, Philine Meister, Lukas Gensler, Momen Khalil, Emmanuele Poggi, Johannes Hechtl, Carsten Braunroth, Kai Wurm, Gokul Narayanan, Eugen Solowjow, Georg von Wichert, Andre Scholz, Felix Albrecht, Maxmillian Metzner

Affiliations: Siemens Corporation

AI summary: Deploys a pretrained Pi0.5 policy on an industrial packaging task at a Siemens factory, fine-tunes it iteratively, and collects 2535 on-site episodes. Summarizes the recurring failure modes of VLA pipeline deployment and the lessons for improving the workflow.

Abstract

Vision-Language-Action (VLA) policies have shown promising manipulation capabilities, yet their practical impact is often limited by the reliability demands of real-world deployment. We present a deployment study of an industrial packaging task at Siemens Factory (GWE, Erlangen, Germany), where a robot must pick a transparent accessory bag from a cluttered pile, insert it into the remaining cavity of a cardboard package, and ensure that the bag and its contents remain below the closing plane. Our goal is to understand the practical effort required to adapt a pretrained Pi0.5 policy to a single factory-floor task through iterative fine-tuning and deployment-driven refinement. The pipeline consists of repeated loops of data collection, curation, fine-tuning, evaluation, and targeted recovery data collection. We have accumulated 2535 episodes (10 hours) from the on-site factory settings. In this paper, we contribute an empirical account of a factory-floor VLA deployment, highlighting recurring failure modes and lessons that inform how to improve the deployment workflow.

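2605.27418 2026-05-28 cs.MA cs.RO Updated version

Differentiable Model Predictive Safety for Heterogeneous Mobility at Urban Intersections

Wenzhe Song, Hao Zhang

Affiliations: School of Business; Department of Mechanical Engineering; Stevens Institute of Technology; Carnegie Mellon University

AI summary: Proposes the differentiable model predictive safety (DMPS) framework, which embeds the foresight of model predictive control in a data-driven, end-to-end reinforcement learning architecture and uses a differentiable safety critic for precise online safety corrections, bringing the collision rate below 5.6% in high-density mixed-traffic simulations.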

Comments 6 pages. Published in IEEE IARCE 2025

DOI: 10.1109/IARCE68366.2025.11485680
Journal ref: 2025 IEEE 5th International Conference on Industrial Automation, Robotics and Control Engineering (IARCE), Chongqing, China, 2025, pp. 1-6

Abstract

The imminent integration of autonomous vehicles and mobile robots in urban settings presents a critical safety challenge for future intelligent transportation systems. This paper addresses the complex problem of coordinating heterogeneous agents with disparate dynamics at unregulated intersections. We introduce a novel framework, differentiable model predictive safety (DMPS), which embeds the foresight of model-predictive control into a data-driven, end-to-end reinforcement learning architecture. DMPS agents learn a latent dynamics model to predict future trajectories contingent on their actions. A learned, differentiable safety critic then evaluates the risk of these trajectories. Crucially, by leveraging backpropagation through the entire unrolled predictive model, agents can efficiently compute the gradient of future safety with respect to their current action, enabling a minimal and precise online safety correction. Integrated into a multi-agent training scheme, DMPS virtually eliminates collisions to less than 5.6% in high-density, mixed vehicle-robot traffic simulations, demonstrating state-of-the-art safety without compromising energy and traffic efficiency.
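To make the idea of differentiating predicted safety with respect to the current action concrete, here is a toy sketch. It is not the paper's method: the learned latent dynamics model is replaced by constant-velocity 1-D motion, the learned safety critic by a Gaussian bump around an obstacle, and backpropagation by a hand-derived chain rule. All function names and parameter values are assumptions chosen for illustration.

```python
import math


def rollout(x0, action, dt, horizon):
    """Toy stand-in for a learned dynamics model: constant-velocity 1-D motion."""
    return [x0 + action * dt * t for t in range(1, horizon + 1)]


def risk(action, x0=0.0, x_obs=1.0, dt=0.1, horizon=10, sigma=0.3):
    """Toy differentiable safety critic: Gaussian bump at the obstacle, summed over the rollout."""
    return sum(
        math.exp(-((x - x_obs) ** 2) / (2 * sigma ** 2))
        for x in rollout(x0, action, dt, horizon)
    )


def risk_gradient(action, x0=0.0, x_obs=1.0, dt=0.1, horizon=10, sigma=0.3):
    """d(risk)/d(action), applying the chain rule through every rollout step."""
    grad = 0.0
    for t, x in enumerate(rollout(x0, action, dt, horizon), start=1):
        bump = math.exp(-((x - x_obs) ** 2) / (2 * sigma ** 2))
        d_bump_dx = bump * (-(x - x_obs) / sigma ** 2)
        dx_daction = dt * t  # each predicted state depends linearly on the action
        grad += d_bump_dx * dx_daction
    return grad


def correct_action(action, step=0.05):
    """One gradient step that nudges the action toward lower predicted risk."""
    return action - step * risk_gradient(action)
```

In this toy setting, an agent heading toward the obstacle gets a positive risk gradient, so the correction slows it down slightly rather than overriding the action entirely.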

2605.27365 2026-05-28 cs.CV cs.AI cs.LG cs.RO Updated version

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

Affiliations: The Hong Kong Polytechnic University; Princeton University; Nanjing University; University of Illinois Urbana-Champaign

AI summary: Proposes Parallel Box Decoding (PBD), which decodes bounding boxes and points as atomic units in a single step, and pairs it with the large-scale LocateAnything-Data dataset for efficient, unified grounding and detection, substantially raising decoding throughput while maintaining high accuracy.

Comments fix github link

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.

2605.19257 2026-05-28 cs.RO Updated version

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

Eunsoo Im, Gyeonggwan Lee, Junghun Suh

Affiliations: KakaoMobility, South Korea

AI summary: Proposes PRISM-SLAM, a framework that integrates vision foundation model priors into a Bayesian factor graph and uses a Plücker ray-distance factor with a dynamic scene uncertainty gating mechanism to achieve real-time monocular metric SLAM without scale drift.

Abstract

Monocular SLAM historically suffers from scale ambiguity and tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide remarkable zero-shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistencies. We propose PRISM-SLAM, a real-time framework that rigorously integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Plücker Ray-Distance Factor to anchor monocular observations in absolute space within a globally consistent metric coordinate system, mathematically resolving scale drift by making the metric scale Fisher-identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft-gating approach probabilistically down-weights dynamic distractors without incurring the heavy computational overhead associated with traditional semantic segmentation masks. By employing a multi-process architecture that asynchronously processes VFM inference and geometric tracking, PRISM-SLAM provides verified metric output at 30 FPS using solely RGB input, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric $SE(3)$ Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned $Sim(3)$ error. This demonstrates that our system can produce deployment-ready metric trajectories by delivering robust metric SLAM solutions without any post-hoc scale correction. Project page: https://prismslam-cmd.github.io/prismslam_pr/
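The abstract names a Plücker Ray-Distance Factor without giving its exact form. As background only, the sketch below shows the standard distance from a 3-D point to a line expressed in Plücker coordinates (unit direction d and moment m = p × d, for any point p on the line), which is ‖x × d − m‖. The helper names are illustrative, and the residual actually used in the paper may differ.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))


def plucker_from_point_dir(point, direction):
    """Plücker coordinates (d, m) of the line through `point` along `direction`."""
    length = norm(direction)
    d = tuple(c / length for c in direction)  # normalize so distances are metric
    m = cross(point, d)                       # moment vector
    return d, m


def point_to_line_distance(x, line):
    """Distance from point x to the line: ||x × d − m|| for unit d."""
    d, m = line
    xd = cross(x, d)
    return norm(tuple(a - b for a, b in zip(xd, m)))
```

Because the moment vector carries the line's offset from the origin, this distance is expressed in the same metric units as the point coordinates, which is what makes such a term usable as a scale-sensitive residual.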

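2605.17929 2026-05-28 cs.RO Updated version

Relational Semantic Reasoning over 3D Scene Graphs for Open-World Interactive Object Search

Imen Mahdi, Matteo Cassinelli, Fabien Despinoy, Tim Welschehold, Abhinav Valada

Affiliations: University of Freiburg; Toyota Motor Europe

AI summary: Proposes SCOUT, which searches 3D scene graphs directly using relational exploration heuristics distilled from LLMs, enabling efficient open-world interactive object search that matches LLM performance at low computational cost.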

Abstract

Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods either rely on vision-language embeddings similarity, which does not reliably capture task-relevant relational semantics, or large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.

2603.01766 2026-05-28 cs.RO Updated version

Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models

Haoyun Liu, Jianzhuang Zhao, Xinyuan Chang, Tianle Shi, Chuanzhang Meng, Jiayuan Tan, Feng Xiong, Tong Lin, Dongjie Huo, Mu Xu, SongLin Dong, Zhiheng Ma, Yihong Gong, Sheng Zhong

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Faculty of Computility Microelectronics, Shenzhen University of Advanced Technology; Guangdong Provincial Key Laboratory of Computility Microelectronics; Amap, Alibaba Group; Shenzhen University; Xi'an Jiaotong University; Beijing University of Chemical Technology

AI summary: Addresses the mismatch between the discrete action waypoints predicted by vision-language-action models and the continuity of physical motion. Proposes Neural Implicit Action Fields (NIAF), which recast the action representation from discrete waypoints to continuous functions, synthesizing continuous action manifolds at arbitrary temporal resolution and supporting analytical differentiation and explicit velocity supervision to improve control smoothness and physical plausibility.

Comments Accepted at ICML 2026

Abstract

Despite the rapid progress of vision-language-action (VLA) models, the prevailing practice of predicting action chunks as discrete waypoints remains structurally misaligned with the intrinsic continuity of physical motion. This discretization arises naturally from fixed-rate robot data collection and the token-by-token prediction paradigm of large language models, but ties actions to rigid sampling rates, does not naturally support analytically consistent higher-order derivatives, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), which reformulates chunk-level action representation from discrete waypoints to continuous action functions. Using a vision-language model as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes continuous-time action manifolds with arbitrary temporal resolution. This formulation enables analytical differentiation, allowing explicit supervision of velocity and regularization of higher-order derivative signals to promote mathematical consistency, physical plausibility, and control smoothness. Our approach achieves strong results on CALVIN and LIBERO across diverse backbones. Real-world experiments further confirm that NIAF supports stable impedance control, bridging policy-side action generation and execution-side smooth control.
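As a rough illustration of why a continuous action function helps, the sketch below represents a one-dimensional action chunk as a truncated Fourier series, which can be queried at any time and differentiated in closed form. The Fourier basis, the coefficient layout, and the function names are assumptions made for this example. The abstract does not specify NIAF's actual parameterization beyond a spectrally modulated learnable motion prior.

```python
import math


def action(t, coeffs, period=1.0):
    """Evaluate the action at continuous time t.

    coeffs = (offset, [(a_1, b_1), (a_2, b_2), ...]) where harmonic k
    contributes a_k * sin(w_k t) + b_k * cos(w_k t), w_k = 2*pi*k / period.
    """
    offset, harmonics = coeffs
    value = offset
    for k, (a_k, b_k) in enumerate(harmonics, start=1):
        w = 2 * math.pi * k / period
        value += a_k * math.sin(w * t) + b_k * math.cos(w * t)
    return value


def velocity(t, coeffs, period=1.0):
    """Closed-form time derivative of `action`, with no finite differencing."""
    _, harmonics = coeffs
    value = 0.0
    for k, (a_k, b_k) in enumerate(harmonics, start=1):
        w = 2 * math.pi * k / period
        value += w * (a_k * math.cos(w * t) - b_k * math.sin(w * t))
    return value
```

The same coefficients can be sampled at 10 Hz or 1 kHz without retraining, and the analytic velocity could be compared against measured velocities as an additional supervision signal.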

2403.11852 2026-05-28 cs.RO cs.AI Updated version

Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency

Amin Tabrizian, Zhitong Huang, Arsyi Aziz, Peng Wei

Affiliations: Department of Computer Science, George Washington University, Washington, D.C.; Connected and Automated Vehicle Program Manager, Traffic Operations Division, Virginia Department of Transportation; Department of Mechanical & Aerospace Engineering, George Washington University, Washington, D.C.

AI summary: Addresses delayed state observations caused by stochastic V2I communication latency. Proposes the DAROM framework, which models the problem as a random delay MDP, restores the Markov property with a delay-aware encoder, and adds a physics-based safety controller for robust control.

Abstract

Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world autonomous driving. In highway on-ramp merging, a roadside unit (RSU) can sense nearby traffic, perform edge perception, and transmit state estimates to the ego vehicle over vehicle-to-infrastructure (V2I) links. With recent advancements in intelligent transportation infrastructure and edge computing, such RSU-assisted perception is increasingly realistic and already deployed in modern connected roadway systems. However, edge processing time and wireless transmission can introduce stochastic V2I communication delays, violating the Markov assumption and substantially degrading control performance. In this work, we propose DAROM, a Delay-Aware Reinforcement Learning framework for On-ramp Merging that is robust to stochastic delays. We model the problem as a random delay Markov decision process (RDMDP) and develop a unified RL agent for joint longitudinal and lateral control. To recover a Markovian representation under delayed observations, we introduce a Delay-Aware Encoder that conditions on delayed observations, masked action histories, and observed delay magnitude to infer the current latent state. We further integrate a physics-based safety controller to reduce collision risk during merging. Experiments in the Simulation of Urban MObility (SUMO) simulator using real-world traffic data from the Next Generation Simulation (NGSIM) dataset demonstrate that DAROM consistently outperforms standard RL baselines across traffic densities. In particular, the gated recurrent unit (GRU)-based encoder achieves over 99% success in high-density traffic with random V2I delays of up to 2.0 seconds.
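The encoder is described as conditioning on delayed observations, masked action histories, and the observed delay magnitude. One plausible way to assemble such an input is sketched below. The masking rule (keep only the actions taken since the delayed observation was sensed), the zero padding, and the normalization of the delay are assumptions for illustration, not details from the paper.

```python
def build_encoder_input(delayed_obs, action_history, delay_steps, max_delay):
    """Assemble a flat input vector for a hypothetical delay-aware encoder.

    delayed_obs    : state estimate that was sensed `delay_steps` steps ago
    action_history : past scalar actions, oldest first
    delay_steps    : observed delay, in control steps (0 <= delay_steps <= max_delay)
    max_delay      : fixed history window length (must be > 0)
    """
    # Take the most recent `max_delay` actions and left-pad with zeros.
    recent = list(action_history)[-max_delay:]
    recent = [0.0] * (max_delay - len(recent)) + recent
    # Mask out actions that were already reflected in the delayed observation.
    first_kept = max_delay - delay_steps
    masked = [a if i >= first_kept else 0.0 for i, a in enumerate(recent)]
    return list(delayed_obs) + masked + [delay_steps / max_delay]
```

For example, with a two-step delay and a four-step window, only the last two actions survive the mask, and the final entry of the vector reports the delay as 0.5.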

2602.03668 2026-05-28 cs.RO cs.CV Updated version

MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

Jung Min Lee, Dohyeok Lee, Seokhun Ju, Taehyun Cho, Jin Woo Koo, Li Zhao, Sangwoo Hong, Jungwoo Lee

Affiliations: Seoul National University, Seoul, South Korea; Konkuk University, Seoul, South Korea; Microsoft Research Asia, Beijing, China; HodooAI Labs, Seoul, South Korea

AI summary: Proposes MVP-LAM, which learns latent actions from multi-view videos with a cross-viewpoint reconstruction objective so that they are highly informative about ground-truth actions, improving action prediction and downstream manipulation performance.

Abstract

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jmsnu.github.io.

2512.14340 2026-05-28 cs.RO Updated version

Field evaluation and optimization of a lightweight autonomous lidar-based UAV system based on a rigorous experimental setup in boreal forest environments

Aleksi Karhunen, Teemu Hakala, Väinö Karjalainen, Eija Honkavaara

Affiliations: Finnish Geospatial Research Institute in National Land Survey of Finland

AI summary: Proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems, demonstrated with a lightweight lidar-based quadrotor over 93 real-world flights in boreal forests. The optimized system achieved success rates of 12/15 and 15/15 at 1 m/s and 2 m/s in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest.

Comments This work has been submitted to the IEEE for possible publication

DOI: 10.1109/TFR.2026.3691711

Abstract

Interest in utilizing autonomous uncrewed aerial vehicles (UAVs) for under-canopy forest remote sensing has increased in recent years, resulting in the publication of numerous autonomous flight algorithms in the scientific literature. To support the selection and development of such algorithms, a reliable comparison of existing approaches based on published studies is essential. However, reliable comparisons are currently challenging due to widely varying experimental setups and incomplete reporting practices. This study proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems to fill this gap. The proposed setup emphasizes quantitative reporting of forest complexity, visual representation of test environments, execution of multiple repeated flights, and reporting of flight success rates alongside qualitative flight results. In addition, flights at multiple target speeds are encouraged, with reporting of realized flight speed, mission completion time, and point-to-point flight distance. The proposed setup is demonstrated using a lightweight lidar-based quadrotor employing state-of-the-art open-source algorithms, evaluated through extensive experiments in two natural boreal forest environments. Based on a systematic evaluation of the original system, several improvements were introduced. The same experimental protocol was then repeated with the optimized system, resulting in a total of 93 real-world flights. The optimized system achieved success rates of 12/15 and 15/15 at target flight speeds of 1 m/s and 2 m/s, respectively, in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest. Adoption of the proposed experimental setup would facilitate the literature-based comparison of autonomous under-canopy flight systems and support systematic performance improvement of future UAV-based forest robotics solutions.

2307.06240 2026-05-28 cs.LG cs.AI cs.RO cs.SY eess.SY Updated version

DSSE: a drone swarm search environment

Manuel Castanares, Luis F. S. Carrete, Enrico F. Damiani, Leonardo D. M. de Abreu, José Fernando B. Brancalion, Fabrício J. Barth

Affiliations: Insper; Embraer

AI summary: A multi-agent reinforcement learning environment built on PettingZoo in which drones search for targets using a dynamic probability input.

Comments 7 pages

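2512.12649 2026-05-28 cs.RO cs.SY eess.SY Updated version

A Study of Memory in Model-Free Reinforcement Learning Based on POPGym Arcade

Zekang Wang, Zhe He, Borong Zhang, Edan Toledo, Steven Morad

Affiliations: Faculty of Science and Technology, University of Macau; Centre for AI, University College London

AI summary: Studies memory mechanisms in deep reinforcement learning by introducing analysis tools and the POPGym Arcade environment suite, finding that value functions assign credit to irrelevant history and showing how out-of-distribution scenarios contaminate memory.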

Comments Appear at ICML 2026 as a Spotlight paper

2508.21046 2026-05-28 cs.CV cs.RO Updated version

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen

AI summary: Proposes the CogVLA framework, which uses instruction-driven routing and sparsification to reach success rates of 97.4% on the LIBERO benchmark and 70.0% on real-world robot tasks, with a 2.5-fold reduction in training cost and a 2.8-fold reduction in inference latency.

Comments Accepted to NeurIPS 2025, Project Page: https://jiutian-vl.github.io/CogVLA-page

Abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
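Both routing stages are built on FiLM (feature-wise linear modulation), in which a conditioning vector produces a per-channel scale and shift for another feature vector. The sketch below shows only that basic operation on plain lists. The linear projections, their weights, and the function names are illustrative assumptions, and the aggregation and pruning logic that CogVLA adds on top is not reproduced here.

```python
def linear(weights, bias, x):
    """Dense layer on plain lists: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features, instruction, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(instruction) * features + beta(instruction).

    `features` is one visual token; `instruction` is the conditioning
    embedding. gamma and beta each have one entry per feature channel.
    """
    gamma = linear(w_gamma, b_gamma, instruction)
    beta = linear(w_beta, b_beta, instruction)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]
```

Because the scale and shift depend only on the instruction, the same visual token can be amplified for one instruction and suppressed for another, which is the kind of signal a downstream routing or pruning step can act on.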

2509.14075 2026-05-28 cs.RO cs.SY eess.SY Updated version

RCM Constraint-Consistent Dynamic Control in Surgical Robots

Yu Li, Hamid Sadeghian, Zewen Yang, Valentin Le Mesle, Sami Haddadin

Affiliations: Munich Institute of Robotics and Machine Intelligence, Technical University of Munich, Germany; Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

AI summary: Models the remote center of motion (RCM) as a rheonomic holonomic constraint and integrates it into a projection-based inverse-dynamics controller, achieving constraint-consistent control at the torque level with lower RCM residuals and smoother torque profiles.

Comments Accepted at ICRA 2026

Abstract

Robotic-assisted minimally invasive surgery (RAMIS) requires accurate enforcement of the remote center of motion (RCM) constraint to ensure safe tool motion through a trocar. Existing virtual RCM controllers are commonly formulated either at the kinematic level or as task-space objectives, which makes torque-level enforcement under trocar motion and physical interaction difficult to formulate consistently. This paper models the RCM as a rheonomic holonomic constraint and incorporates it into a projection-based inverse-dynamics controller with explicit constrained/free-motion torque decomposition. The resulting formulation unifies kinematic RCM enforcement and task-space tracking at the torque level, while preserving a constraint-consistent structure for residual regulation and null-space compliance. The proposed controller is validated in simulation and on a RAMIS training platform against representative projection-based and constrained-dynamics baselines. Across spiral tracking, varying insertion depth, moving trocar conditions, and human interaction, the method achieves lower RCM residuals and smoother torque profiles while maintaining accurate tool-tip tracking. These results support the use of constraint-consistent torque control for reliable virtual RCM enforcement in surgical robotics. The project page is available at https://rcmpc-cube.github.io

2509.13177 2026-05-28 cs.RO Updated version

ROOM: A Physics-Based Continuum Robot Simulator for Photorealistic Medical Datasets Generation

Salvatore Esposito, Matías Mattamala, Daniel Rebain, Francis Xiatian Zhang, Kevin Dhaliwal, Mohsen Khadem, Subramanian Ramamoorthy

Affiliations: University of Edinburgh, UK; University of British Columbia, Canada

AI summary: Proposes the ROOM simulation framework, which uses patient CT scans to generate multi-modal bronchoscopy training data, and validates it on pose estimation and depth estimation tasks.

Journal ref: International Conference on Robotics and Automation 2026

Abstract

Continuum robots are advancing bronchoscopy procedures by accessing complex lung airways and enabling targeted interventions. However, their development is limited by the lack of realistic training and test environments: Real data is difficult to collect due to ethical constraints and patient safety concerns, and developing autonomy algorithms requires realistic imaging and physical feedback. We present ROOM (Realistic Optical Observation in Medicine), a comprehensive simulation framework designed for generating photorealistic bronchoscopy training data. By leveraging patient CT scans, our pipeline renders multi-modal sensor data including RGB images with realistic noise and light specularities, metric depth maps, surface normals, optical flow and point clouds at medically relevant scales. We validate the data generated by ROOM in two canonical tasks for medical robotics: multi-view pose estimation and monocular depth estimation, demonstrating diverse challenges that state-of-the-art methods must overcome to transfer to these medical settings. Furthermore, we show that the data produced by ROOM can be used to fine-tune existing depth estimation models to overcome these challenges, also enabling other downstream applications such as navigation. We expect that ROOM will enable large-scale data generation across diverse patient anatomies and procedural scenarios that are challenging to capture in clinical settings. Code and data: https://github.com/iamsalvatore/room.

2311.02304 2026-05-28 cs.RO (updated version)

Imitating and Finetuning Model Predictive Control for Robust and Symmetric Quadrupedal Locomotion

Donghoon Youm, Hyunyoung Jung, Hyeongjun Kim, Jemin Hwangbo, Hae-Won Park, Sehoon Ha

Affiliations: Korea Advanced Institute of Science and Technology; Georgia Institute of Technology

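AI summary: Proposes the Imitating and Finetuning Model Predictive Control (IFM) framework, which combines model predictive control with imitation learning and reinforcement learning to improve the locomotion performance, gait symmetry, and energy efficiency of quadrupedal robots on challenging terrains.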

DOI: 10.1109/LRA.2023.3320827
Journal ref: IEEE Robotics and Automation Letters (Volume: 8, Issue: 11, November 2023)

Abstract

Control of legged robots is a challenging problem that has been investigated by different approaches, such as model-based control and learning algorithms. This work proposes a novel Imitating and Finetuning Model Predictive Control (IFM) framework to take the strengths of both approaches. Our framework first develops a conventional model predictive controller (MPC) using Differential Dynamic Programming and Raibert heuristic, which serves as an expert policy. Then we train a clone of the MPC using imitation learning to make the controller learnable. Finally, we leverage deep reinforcement learning with limited exploration for further finetuning the policy on more challenging terrains. By conducting comprehensive simulation and hardware experiments, we demonstrate that the proposed IFM framework can significantly improve the performance of the given MPC controller on rough, slippery, and conveyor terrains that require careful coordination of footsteps. We also showcase that IFM can efficiently produce more symmetric, periodic, and energy-efficient gaits compared to Vanilla RL with a minimal burden of reward shaping.

2409.13058 2026-05-28 cs.HC cs.RO (updated version)

Mixed Reality Tele-Ultrasound over 750 km: A Feasibility Study

Ryan Yeung, David Black, Patrick B. Chen, Victoria Lessoway, Janice Reid, Sergio Rangel-Suarez, Silvia D. Chang, Septimiu E. Salcudean

Affiliations: School of Biomedical Engineering, The University of British Columbia; Department of Electrical and Computer Engineering, The University of British Columbia; The University of British Columbia; Department of Radiology, The University of British Columbia

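AI summary: This study presents and evaluates human teleoperation, a mixed reality and haptics-based tele-ultrasound system in which novice followers perform abdominal ultrasound examinations under remote expert control. Over a distance of 754 km, 92% of the acquired images were of sufficient quality for interpretation.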

Comments 8 pages, 11 figures

DOI: 10.1109/Telepresence66096.2025.11521508

Abstract

To address the lack of access to ultrasound in remote communities, previous work introduced human teleoperation, a mixed reality and haptics-based tele-ultrasound system. In this approach, a novice takes the role of a cognitive robot controlled remotely by an expert through mixed reality. In this manuscript we summarize new developments to this system and describe a feasibility study assessing its use for long-distance remote abdominal ultrasound examinations. To provide simple but effective haptic feedback, we used an ellipsoid model of the patient with its parameters calibrated using our system's position and force sensors. We tested the system in Skidegate, Haida Gwaii, Canada, with the experts positioned 754 km away in Vancouver, Canada. We performed 11 total scans with 10 novices and 2 sonographers. The sonographers were tasked with acquiring 5 target images in the epigastric region. The image acquisition quality was assessed by 2 radiologists. We collected alignment data and the novices completed task load and usability questionnaires. Both the novices and sonographers provided written and verbal feedback to inform future design iterations. 92% of the acquired images had sufficient quality for interpretation by both radiologists. The mean task load reported by the novices was below reference values reported in literature and the usability was unanimously positive. No correlation was found between image quality and the follower's alignment error with the virtual transducer. Overall, we show that human teleoperation enables sonographers to perform remote abdominal ultrasound imaging with high performance, even across large distances and with novice followers. Future work will compare human teleoperation to conventional, robotic and tele-mentored ultrasound.

2506.05012 2026-05-28 cs.RO physics.comp-ph physics.flu-dyn (updated version)

Realizing Robotic Swimming with Unified Fluid-Robot Multiphysics

Jeong Hun Lee, Junzhe Hu, Sofia Kwok, Carmel Majidi, Zachary Manchester

Affiliations: Carnegie Mellon University; Massachusetts Institute of Technology

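AI summary: Proposes a differentiable, unified fluid-robot multiphysics simulation framework. The coupled manipulator and incompressible Navier-Stokes equations are derived jointly from the principle of least action, and discrete variational mechanics together with the implicit function theorem give stable, accurate joint simulation and gradient computation. The framework is used to realize undulating swimming gaits and to optimize a highly dynamic C-start escape maneuver for a bioinspired eel robot, with sim-to-real transfer validated on hardware.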

Comments 9 pages, 10 figures, accepted to Robotics: Science and Systems 2026

Abstract

Matching the swimming efficiency and agility of fish has remained an elusive goal in underwater robotics. Such locomotion capabilities rely on complex vortex interactions between the robot's body and the surrounding fluid. However, simulating these dynamics, which are governed by coupled ordinary and partial differential equations, is significantly more difficult than the multi-body dynamics of classical rigid robotic systems. We present a differentiable framework for simulating strongly coupled fluid-robot multiphysics as a unified optimization problem. The coupled manipulator and incompressible Navier-Stokes equations are derived together from a single Lagrangian using the principle of least action. We employ discrete variational mechanics to derive a stable, well-conditioned, and physically accurate scheme for jointly simulating articulated bodies and the surrounding fluid. We leverage the implicit function theorem to compute derivatives of the fully coupled dynamics. Using this simulator and its gradients, we realize undulating swimming gaits and optimize a highly dynamic C-start escape maneuver for a bioinspired eel robot. We validate both gaits on physical hardware, demonstrating successful sim-to-real transfer. Simulation code, hardware data, and schematics for the eel robot can be found here: https://unified-fluid-robot-multiphysics.github.io/

2504.20736 2026-05-28 cs.RO cs.CV (updated version)

A Survey on Event-based Optical Marker Systems

Nafiseh Jabbari Tofighi, Maxime Robic, Fabio Morbidi, Pascal Vasseur

Affiliations: MIS laboratory, University of Picardie Jules Verne; DART Lab, Politecnico di Milano

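AI summary: This paper surveys event-based optical marker systems (EBOMS), analyzes their asynchronous operating principle and robustness, and reviews their applications in areas such as object detection, pose estimation, and optical communication.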

Comments 11 pages, 6 figures, 2 tables