VOLT: Vision-and-Language Trajectory Segmentation for Faster-than-Demonstration Policies

Robert Ramirez Sanchez, Daniel J. Evans, Dylan P. Losey, Siddarth Jain

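Affiliations: Collab, Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061; Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139

AI summary: Proposes VOLT, which segments demonstration trajectories using vision and language cues, selectively downsamples the portions that can be safely accelerated while preserving the slow segments that require fine manipulation, and thereby trains robot policies that run faster than the demonstrations.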

Abstract

Humans often take longer to demonstrate a task than a robot would need to execute it. Rather than learning to replicate the demonstration at the same pace, many industrial and practical applications require robots to perform tasks as quickly as possible. In this paper, we investigate several hypotheses for learning policies that operate faster-than-demonstrations. Our experiments show that the most effective strategy is to downsample recorded demonstrations and train the robot's policy on this accelerated data. However, uniformly downsampling an entire trajectory can be problematic. Some parts of a task can be safely sped up (e.g., unconstrained motion), while others demand slower, more precise motion (e.g., object interactions or fine manipulation). To address this challenge, we introduce VOLT, a vision-and-language trajectory segmentation method that reasons over video demonstrations, and leverages contextual cues to determine when acceleration is appropriate and when careful precision is required. VOLT identifies segments where slow, deliberate motion is necessary, then selectively downsamples the remaining segments. The resulting reformatted trajectories can be used with standard imitation learning approaches, such as diffusion policies. Our results highlight that segmentation quality is critical -- baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. Compared to state-of-the-art alternatives, VOLT allows robots to execute tasks faster while maintaining strong performance.
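The selective downsampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the segmentation is already available as a boolean mask over frames (in VOLT this would come from the vision-and-language reasoning stage), and the function name `selective_downsample` and the per-run striding rule are hypothetical choices.

```python
import numpy as np

def selective_downsample(trajectory, slow_mask, factor):
    """Keep every frame in slow segments; keep every `factor`-th frame elsewhere.

    trajectory: non-empty sequence of states or actions, one per frame.
    slow_mask:  boolean per frame, True where slow, deliberate motion is needed.
    factor:     speed-up factor applied to the remaining (fast) frames.
    """
    trajectory = np.asarray(trajectory)
    slow_mask = np.asarray(slow_mask, dtype=bool)
    keep = slow_mask.copy()
    run_index = 0  # position inside the current contiguous fast run
    for t, is_slow in enumerate(slow_mask):
        if is_slow:
            run_index = 0  # striding restarts after each slow segment
        else:
            keep[t] = (run_index % factor == 0)
            run_index += 1
    keep[-1] = True  # always retain the final frame so the task end state survives
    return trajectory[keep]

frames = np.arange(10)
mask = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=bool)
accelerated = selective_downsample(frames, mask, factor=3)
```

The shortened trajectory can then be fed to a standard imitation learning pipeline unchanged, which is the property the abstract relies on.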

2606.09958 2026-06-10 cs.RO cs.AI New submission

HANDOFF: Task-Space Whole-Body Control for Humanoid Robots via Distillation of Complementary Teachers

Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames

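Affiliations: California Institute of Technology; The Institute for Human & Machine Cognition

AI summary: Proposes HANDOFF, a framework that uses multi-teacher KL distillation with a context-conditioned gating mechanism to fuse three expert policies (whole-body motion tracking, locomotion, and fall recovery) into a mixture-of-experts student policy. The result is whole-body control through a compact, explicit interface that reaches state-of-the-art velocity tracking on the Unitree G1 and extends the manipulation workspace.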

Comments 22 pages, 9 figures, Project page: https://lzyang2000.github.io/HANDOFF/

Abstract

For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
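One way to picture multi-teacher KL distillation under a gating scheme is a gate-weighted sum of per-teacher KL terms. The sketch below assumes diagonal Gaussian action distributions, a teacher-to-student KL direction, and gate weights supplied from outside; none of these details are stated in the abstract, and the names `gaussian_kl` and `gated_distillation_loss` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    mu_p, std_p = np.asarray(mu_p, float), np.asarray(std_p, float)
    mu_q, std_q = np.asarray(mu_q, float), np.asarray(std_q, float)
    var_p, var_q = std_p ** 2, std_q ** 2
    per_dim = np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5
    return np.sum(per_dim, axis=-1)

def gated_distillation_loss(teachers, student, gate):
    """Gate-weighted sum of KL(teacher_k || student).

    teachers: list of (mean, std) pairs, one per specialist.
    student:  (mean, std) of the student policy for the same observation.
    gate:     non-negative weights over teachers (assumed to depend on context).
    """
    gate = np.asarray(gate, dtype=float)
    gate = gate / gate.sum()  # normalise so the weights form a convex combination
    kls = np.array([gaussian_kl(mu, std, student[0], student[1]) for mu, std in teachers])
    return float(np.dot(gate, kls))

student = (np.zeros(2), np.ones(2))
teachers = [
    (np.zeros(2), np.ones(2)),  # e.g. motion-tracking specialist
    (np.ones(2), np.ones(2)),   # e.g. locomotion specialist
    (-np.ones(2), np.ones(2)),  # e.g. fall-recovery specialist
]
loss_tracking = gated_distillation_loss(teachers, student, gate=[1, 0, 0])
loss_locomotion = gated_distillation_loss(teachers, student, gate=[0, 1, 0])
```

With a one-hot gate the student is pulled toward a single specialist; softer gates blend the supervision, which is the intuition behind conditioning the gate on context.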

2605.09595 2026-06-10 cs.NE cs.RO Updated version

Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain

Zhuangyu Han, Abhronil Sengupta

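Affiliations: School of Electrical Engineering and Computer Science

AI summary: Proposes an equilibrium-propagation-based PPO framework that combines a CPG policy with a residual adjustment policy and uses local learning for efficient quadruped locomotion control on uneven terrain, with performance comparable to backpropagation and a 4.3x improvement in GPU memory efficiency.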

Abstract

Reinforcement learning (RL) has enabled robust quadruped locomotion over complex terrain, but most learned controllers are trained offline with backpropagation in massively parallel simulation and deployed as fixed policies, limiting adaptation to terrain variation, payload changes, actuator wear, and other real-world conditions under onboard power constraints. Local learning provides a potential path toward energy-aware on-robot adaptation by replacing global backpropagation graphs with updates driven by local neural states, making the learning rule more compatible with neuromorphic and in-memory computing substrates. This work proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for uneven-terrain quadruped locomotion. The controller combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, while replacing conventional backpropagation-trained policy and value networks with EP-enabled local learning. To train stochastic continuous-control policies with EP, we derive an EP-compatible PPO output-nudging signal and introduce a two-sided ratio clipping mechanism that stabilizes policy updates during relaxation. Experiments on a 12-DoF A1 quadruped show that the proposed controller achieves stable policy convergence in a two-stage uneven terrain locomotion task. Its locomotion performance is comparable to a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability, while improving GPU memory efficiency by 4.3$\times$ compared with backpropagation through time (BPTT). These results suggest that local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.
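For background, the standard PPO clipped surrogate objective that ratio clipping builds on can be written in a few lines. This is the textbook PPO objective, not the EP-compatible output-nudging signal or the two-sided clipping mechanism derived in the paper; the function name and the default `eps` are illustrative.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate (to be maximised), averaged over samples.

    ratio:     pi_new(a|s) / pi_old(a|s) for each sample.
    advantage: advantage estimate for each sample.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))
```

For a positive advantage the objective stops growing once the ratio exceeds 1 + eps; for a negative advantage it stops improving once the ratio falls below 1 - eps.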

2606.10039 2026-06-10 cs.RO New submission

Robotic Nonprehensile Object Transportation with a Hanging Tray

Adam Heins, Angela P. Schoellig

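AI summary: Addresses the robotic waiter's problem with a tray suspended by ropes that moves as a three-dimensional pendulum, so that actuating only a 3-DOF mobile base reduces sliding and sloshing; experiments validate the approach, which is also integrated into an interactive demonstration.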

Comments 8 pages, 11 figures. IEEE/ASME International Conference on Advanced Intelligent Mechatronics, 2026

Abstract

We consider the nonprehensile object transportation task known as the waiter's problem, in which a robot must move an object balanced on a tray from one location to another. In contrast to prior works on the robotic waiter's problem, which make the robot tilt a tray rigidly held by its end effector (EE), we use a tray suspended from the EE by ropes, such that it behaves like a three-dimensional pendulum. Some prior works have actuated the robot so that the EE simulates the behavior of a pendulum, because pendular motion reduces the shear forces acting on the transported objects, minimizing the sliding of rigid objects and sloshing in containers of liquid. In contrast, our use of a real hanging tray allows us to obtain the benefits of pendular motion while only actuating a 3 degree-of-freedom (DOF) mobile base, rather than requiring a full 6-DOF manipulator arm. Our experiments in simulation and on real hardware show that the hanging tray substantially reduces both sliding and sloshing compared to a static, rigidly-grasped tray. Furthermore, we integrate the hanging tray into an interactive robot waiter demonstration, which uses computer vision to identify people with a raised hand and visual servoing to steer toward them and allow them to access the tray.
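The claim that pendular motion reduces shear forces can be checked with a small planar calculation. The sketch below is a simplified model under stated assumptions: a point mass on a tray that translates with horizontal acceleration `accel`, a fixed tilt angle with the tray normal leaning toward the direction of acceleration, and no angular acceleration of the tray. The function name is hypothetical.

```python
import math

def required_friction(accel, tilt, g=9.81):
    """Minimum friction coefficient that keeps a point mass from sliding.

    accel: horizontal tray acceleration in m/s^2.
    tilt:  tray tilt in radians; the tray normal is (sin(tilt), cos(tilt)),
           with the first axis along the direction of acceleration.
    """
    # The tray must supply the specific force (accel, g) to the object.
    tangential = accel * math.cos(tilt) - g * math.sin(tilt)  # shear component
    normal = accel * math.sin(tilt) + g * math.cos(tilt)      # pressing component
    return abs(tangential) / normal

flat_tray = required_friction(2.0, 0.0)
pendular_tray = required_friction(2.0, math.atan2(2.0, 9.81))
```

A flat tray needs a friction coefficient of accel / g, while a tray whose normal aligns with the combined acceleration needs essentially none, which is the effect an ideal pendulum produces passively.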

2606.10244 2026-06-10 cs.RO cs.AI New submission

YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

Takehiko Ohkawa, Jumpei Arima, Yuki Noguchi, Masatoshi Tateno, Makoto Sugiura, Takuya Okubo, Kengo Ikeuchi, Yuma Shin, Hiroki Nishizawa, Naoaki Kanazawa, Yuki Wakayama, Daiki Fukunaga, Koshi Makihara, Tomohiro Motoda, Floris Erich, Yukiyasu Domae, Tatsuya Matsushima, Yohishiro Okumatsu, Kei Ota

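AI summary: Proposes YUBI, a finger-aligned gripper whose yielding, finger-driven actuation mapping enables intuitive, ergonomic data collection for bimanual dexterous manipulation; builds a dataset of 8,434 hours, 1.20M episodes, and 119 tasks, and shows a single policy transferring across multiple robots.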

Comments Project page: https://yubi.airoa.io/

Abstract

We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.

2606.10614 2026-06-10 cs.RO cs.CV cs.LG New submission

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Beomjun Kim, Seong Hyeon Park, Seunghoon Sim, Seungjun Moon, Sanghyeok Lee, Jinwoo Shin

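Affiliations: KAIST

AI summary: Proposes Dexterous Point Policy, a framework that learns dexterous manipulation policies from human videos through a unified 3D keypoint representation, without robot demonstrations, reaching 75% success on real-robot tasks.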

Abstract

Robotic foundation models pre-trained on human demonstration videos have shown promise, but a significant embodiment gap remains when the resulting policies are deployed on real robots. A common remedy is to fine-tune these models on robot-specific demonstrations. However, robot data collection can be prohibitively expensive and time-consuming, which is particularly acute in dexterous manipulation, e.g., teleoperating a multi-fingered hand for even a single atomic task can take days. To address this, we introduce Dexterous Point Policy, a framework that learns dexterous manipulation policies directly from human videos and requires no robot demonstrations. Our core insight is that a unified 3D keypoint representation can bridge human and robot embodiments when used for both observations and actions. Specifically, we extract 3D keypoints of task-relevant objects and human hands from raw videos, and train an autoregressive transformer over these keypoints. We observe that at the keypoint level, specifically the wrist and fingertips, human and robot behaviors closely align, enabling direct policy transfer. On a suite of real-robot tasks spanning pick-and-place and tool use, Dexterous Point Policy attains 75.0% success, whereas a state-of-the-art VLA baseline reaches only 1.0%. Furthermore, our method generalizes strongly to unseen scenarios, including multi-object environments and novel object categories.

2606.10743 2026-06-10 cs.RO New submission

TacForeSight: A Force-Guided Tactile World Model for Contact-Rich Manipulation

Yujie Zang, Yuhang Zheng, Xian Nie, Yupeng Zheng, Shuai Tian, Songen Gu, Chen Gao, Zining Wang, Shuicheng Yan, Wenchao Ding

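Affiliations: TARS Robotics; National University of Singapore; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences; Fudan University

AI summary: Proposes TacForeSight, a framework that predicts tactile latent dynamics with a force-conditioned tactile world model and combines it with a predictive tactile-conditioned policy for proactive contact reasoning in high-frequency manipulation, outperforming existing methods under dynamic contact disturbances.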

Abstract

Contact-rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact-aware control by incorporating tactile or force feedback, but they rarely model the asymmetric spatiotemporal roles of global force and local tactile sensing. To address this, we propose TacForeSight, a lightweight force-conditioned tactile foresight framework for real-time manipulation. The core component is TacForceWM, a tactile world model that predicts short-horizon tactile latent dynamics from dual-finger tactile observations conditioned on high-frequency wrist force and torque signals. Another key component, the Predictive Tactile-Conditioned Policy, leverages the predicted latents as anticipatory contact priors, models the current-to-future tactile evolution via cross-attention, and adaptively fuses visuo-tactile features through a tactile-guided gating module. By forecasting purely within a compact latent space, TacForeSight enables proactive contact reasoning with efficient real-time inference suitable for high-frequency manipulation control. Real-robot experiments on five representative tasks and three in-process perturbation settings show that TacForeSight consistently outperforms existing baselines, particularly under dynamic contact disturbances. All models and datasets will be made publicly available on the project website at https://tacforesight.github.io/ProjectPage.

2505.08213 2026-06-10 cs.RO Updated version

HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands

Huang Junda, Honghao Guo, Hao Wu, Zhengyang Liu, Marcelo H Ang, Jianshu Zhou

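Affiliations: The Chinese University of Hong Kong; National University of Singapore

AI summary: Proposes HandCept, the first visual-inertial proprioception framework, which uses zero-shot learning and a latency-free extended Kalman filter to fuse a wrist-mounted RGB-D camera with 9-axis IMUs, achieving drift-free joint angle estimation errors of 2° to 4° and outperforming visual-only and inertial-only methods.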

Comments 8 pages, 7 figures, conference

Abstract

As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, the first visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods for dexterous hands. HandCept addresses the difficulty of achieving accurate and robust joint angle estimation in dynamic environments where both visual and inertial measurements are prone to noise and drift. It leverages a zero-shot learning approach using a wrist-mounted RGB-D camera and 9-axis IMUs, fused in real time via a latency-free Extended Kalman Filter (EKF). Our results show that HandCept achieves joint angle estimation errors generally between $2^{\circ}$ and $4^{\circ}$ without observable drift, outperforming visual-only and inertial-only methods. Furthermore, we validate the stability and uniformity of the IMU system, demonstrating that a common base frame across IMUs simplifies system calibration. To support sim-to-real transfer, we also open-source our high-fidelity rendering pipeline, which is essential for training without real-world ground truth. This work offers a robust, generalizable solution for proprioception in dexterous hands, with significant implications for robotic manipulation and human-robot interaction. https://github.com/huangjund/blenderYCB
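The idea of fusing a drifting inertial estimate with a noisy but drift-free visual estimate can be illustrated with a scalar Kalman filter for a single joint angle. This is a deliberately simplified linear sketch, not the paper's EKF: the state is one angle, the IMU contributes an angular rate, the camera contributes a direct angle measurement, and the noise parameters `q` and `r` are hypothetical.

```python
def fuse_joint_angle(angle, var, gyro_rate, dt, q, vision_angle, r):
    """One predict/update cycle of a scalar Kalman filter for a joint angle.

    angle, var:   current estimate and its variance.
    gyro_rate:    angular rate from the IMU (units per second).
    dt:           time step in seconds.
    q:            process noise added per step (models gyro drift).
    vision_angle: joint angle estimated from the camera.
    r:            variance of the vision measurement.
    """
    # Predict: integrate the inertial rate; uncertainty grows.
    angle_pred = angle + gyro_rate * dt
    var_pred = var + q
    # Update: pull the prediction toward the vision measurement.
    gain = var_pred / (var_pred + r)
    angle_new = angle_pred + gain * (vision_angle - angle_pred)
    var_new = (1.0 - gain) * var_pred
    return angle_new, var_new

estimate, variance = fuse_joint_angle(0.0, 1.0, 0.0, 0.01, 0.0, 10.0, 1.0)
```

When the two sources are equally uncertain the filter splits the difference; as the vision variance `r` grows, the estimate leans on the integrated IMU rate instead.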

2601.06997 2026-06-10 cs.RO cs.CV Updated version

ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction

Yuetao Li, Zhizhou Jia, Yu Zhang, Qun Hao, Shaohui Zhang

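Affiliations: School of Optics and Photonics, Beijing Institute of Technology; School of Optoelectronic Engineering, Changchun University of Science and Technology

AI summary: Proposes ObjSplat, a framework that uses Gaussian surfels as a unified representation, with geometry-aware viewpoint evaluation and a next-best-path planner, for efficient, high-fidelity active object reconstruction.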

Comments Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat , Project Page: https://li-yuetao.github.io/ObjSplat-page/

DOI: 10.1109/TASE.2026.3700105

Abstract

Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/ .
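The difference between greedy next-best-view selection and multi-step lookahead can be shown on a toy viewpoint graph. The sketch below is an exhaustive depth-limited search under assumed inputs (a fixed gain per viewpoint and a fixed cost per edge); the paper's planner builds its graph dynamically and evaluates gain from the reconstruction, which is not modelled here. All names are hypothetical.

```python
def next_best_path(graph, gain, start, horizon, cost_weight=1.0):
    """Return (score, path) maximising information gain minus movement cost.

    graph:   {node: {neighbor: move_cost}} adjacency map.
    gain:    {node: information gain collected on first visit}.
    horizon: maximum number of moves to look ahead.
    """
    best = (float("-inf"), [start])

    def extend(path, score):
        nonlocal best
        if len(path) > 1 and score > best[0]:
            best = (score, list(path))
        if len(path) - 1 == horizon:
            return
        for nxt, cost in graph[path[-1]].items():
            if nxt in path:  # a viewpoint yields its gain only once
                continue
            path.append(nxt)
            extend(path, score + gain[nxt] - cost_weight * cost)
            path.pop()

    extend([start], 0.0)
    return best

views = {"A": {"B": 1, "C": 1}, "B": {"A": 1, "D": 1}, "C": {"A": 1}, "D": {"B": 1}}
info = {"A": 0, "B": 1, "C": 2, "D": 5}
greedy = next_best_path(views, info, "A", horizon=1)
lookahead = next_best_path(views, info, "A", horizon=2)
```

With a horizon of one move the planner heads to C, the best immediate view; with two moves it passes through the less informative B to reach D, which is the kind of trajectory a greedy strategy misses.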

2602.07413 2026-06-10 cs.RO Updated version

Going with the Flow: Koopman Behavioral Models as Pseudo Planners for Visuo-Motor Dexterity

Yunhai Han, Jiaqi Fu, Linhao Bai, Ziyu Xiao, Zhaodong Yang, Yogita Choudhary, Krishna Jha, Chuizheng Kong, Shreyas Kousik, Harish Ravichandar

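Affiliations: Georgia Institute of Technology

AI summary: Proposes Unified Behavioral Models (UBMs), which model dexterous skills as coupled dynamical systems to ensure temporal coherence; a Koopman-operator-based instantiation yields a linear latent space and gains reactivity and adaptability through online replanning, matching or exceeding existing methods on simulated and real-world tasks.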

Comments Website: https://k-ubm.github.io/

Abstract

Contemporary visuo-motor dexterity models often rely on expressive policy classes with diffusion and transformer backbones to achieve strong performance. However, these architectures require significant data and computational resources, and remain far from reliable, particularly for multi-fingered dexterity. Importantly, they model skills as reactive mappings and rely on fixed-horizon action chunking, creating a rigid trade-off between temporal coherence and reactivity. To address these issues, we first introduce Unified Behavioral Models (UBMs), a framework to represent dexterous skills as coupled dynamical systems that capture how visual features of the environment (visual flow) and proprioceptive states of the robot (action flow) co-evolve. As such, UBMs ensure temporal coherence by construction rather than heuristic averaging. Unlike world models that attempt to predict the impact of arbitrary robot actions on the environment, UBMs target behavioral dynamics that encode how demonstrated robot behavior is related to desired impacts on the environment. A UBM can be viewed as a pseudo planner: given an initial condition, it computes the desired robot behavior over the entire skill horizon, while simultaneously "imagining" the resulting flow of visual features. To operationalize UBMs, we propose Koopman-UBM, a first instantiation of UBMs as a structured latent linear system. K-UBM is computationally efficient, enabling reactivity and adaptation via an online replanning strategy: the model acts as its own runtime monitor, automatically triggering replanning when predicted and observed visual flow diverge beyond a threshold. Across seven simulated tasks and four real-world tasks, our approach matches or exceeds the performance of state-of-the-art baselines, while offering considerably faster inference, smooth execution, robustness to occlusions, and flexible replanning.
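A latent linear system with a divergence-triggered replanning check can be sketched in a few lines. The example assumes the latent states are already given (the paper learns a structured latent space from visual and proprioceptive features), fits the linear operator by ordinary least squares, and uses a plain Euclidean norm for the divergence test; the function names and the threshold are hypothetical.

```python
import numpy as np

def fit_koopman(latents):
    """Least-squares operator K with z[t+1] ~= K @ z[t]; latents has shape (T, d)."""
    past, future = latents[:-1], latents[1:]
    # Solve past @ K.T ~= future in the least-squares sense.
    k_transposed, *_ = np.linalg.lstsq(past, future, rcond=None)
    return k_transposed.T

def rollout(K, z0, steps):
    """Roll the linear system forward over the whole horizon from one initial state."""
    trajectory = [np.asarray(z0, dtype=float)]
    for _ in range(steps):
        trajectory.append(K @ trajectory[-1])
    return np.stack(trajectory)

def needs_replan(predicted, observed, threshold):
    """Runtime monitor: replan when prediction and observation diverge too far."""
    return bool(np.linalg.norm(predicted - observed) > threshold)

theta = 0.1
K_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
demo = rollout(K_true, [1.0, 0.0], steps=20)
K_fit = fit_koopman(demo)
plan = rollout(K_fit, demo[0], steps=20)
```

Because the whole plan comes from repeated multiplication by one matrix, replanning amounts to restarting `rollout` from the latest observed latent state, which is what keeps the online strategy cheap.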

2603.20850 2026-06-10 cs.CV cs.RO Updated version

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Xinyu Zhang, Ziyi Kou, Chuan Qin, Mia Huang, Ergys Ristani, Ankit Kumar, Lele Chen, Kun He, Abdeslam Boularias, Li Guan

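Affiliations: Meta Reality Labs; Rutgers University

AI summary: Proposes Glove2Hand, a framework that translates multi-modal sensing glove videos into photorealistic bare hands while preserving physical interaction dynamics; introduces a 3D Gaussian hand model and a diffusion-based hand restorer, and creates the HandSense dataset to improve downstream task performance.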

Comments CVPR 2026 Highlight. This version includes the motion retarget process in the appendix

Abstract

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.

2606.10348 2026-06-10 cs.RO New submission

Rethinking Embodied Navigation via Relational Inductive Bias

Weitao An, Chenghao Xu, Xu Yang, Cheng Deng

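Affiliations: School of Electronic Engineering, Xidian University; School of Information Science and Engineering, Hohai University

AI summary: Proposes DB-Nav, a framework that reshapes the search space with dual relational biases (an activation bias and an inhibition bias) and modulates frontier exploration through a relational activation-inhibition exploration graph, significantly improving object-navigation success rate and path efficiency.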

Abstract

Object navigation requires an agent to locate a target in an unknown environment through visual observations. Existing methods typically rely on open-vocabulary detectors or vision-language models (VLMs) to answer where to search, but often overlook what not to trust - which semantic cues are unreliable. Open-vocabulary perception is prone to systematic misleading evidence: false positives, outdated static priors, and repeated failed exploration due to lack of embodied verification, which contaminates mapping and decision-making. Such errors are rooted in structured object relations in real-world scenes. To address this, we propose DB-Nav, a framework that reshapes the search space via dual relational biases. It factorizes target-centric relations into an Activation Bias (propagates contextual evidence) and an Inhibition Bias (suppresses unreliable regions via perceptual confusion and action-level falsification). These biases are unified into a Relational Activation-Inhibition Exploration Graph that modulates frontier exploration values using online observations and failed accesses. Experiments on ObjectNav benchmarks show that DB-Nav significantly outperforms existing methods in success rate (SR) and Success weighted by Path Length (SPL), offering a lightweight, interpretable, and robust navigation framework without costly online VLM reasoning.

2606.10442 2026-06-10 cs.RO New submission

Information-Preserving Continuous Occupancy Mapping with Variance-Weighted Submap Joining

Zhuhua Bai, Yingyu Wang, Liang Zhao, Shoudong Huang

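Affiliations: University of Technology Sydney; University of Edinburgh

AI summary: Proposes the first continuous probabilistic submap joining framework, which compresses observations into sufficient statistics through an information-preserving sparse Bayesian formulation and jointly optimizes submap poses and a global occupancy field, yielding accurate pose estimates and a globally consistent map.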

Comments 12 pages, 7 figures

Abstract

Large-scale SLAM remains challenging due to accumulated trajectory drift and the increasing computational cost of maintaining global consistency. Submap joining alleviates these issues by constructing locally consistent submaps and subsequently fusing them into a global map. However, existing occupancy-based submap joining methods operate on discrete grids, resulting in non-smooth gradients during optimization and neglecting the uncertainty associated with occupancy estimates. We propose the first continuous probabilistic submap joining framework that jointly optimizes submap poses and a global occupancy field in the latent log-odds space. The framework employs an information-preserving sparse Bayesian formulation that compresses raw occupancy observations into sufficient-statistic log-odds tuples while retaining the posterior information of the original observations. This yields closed-form predictive mean and variance estimates for occupancy mapping, which directly enable a submap joining formulation with analytical Jacobians, leading to more accurate submap joining and yielding a closed-form optimal global map upon pose convergence. Experiments on both simulated and large-scale real-world datasets demonstrate that the proposed method achieves higher pose accuracy and improved global consistency than state-of-the-art grid-based submap joining approaches, while producing more compact map representations and better-calibrated uncertainty estimates than existing continuous occupancy mapping methods.
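To see why predictive variances matter when overlapping submaps are merged, consider generic inverse-variance weighting of log-odds estimates at a single query point. This is a textbook fusion rule used here for illustration only; the abstract does not spell out the paper's weighting, and the function names are hypothetical.

```python
import numpy as np

def fuse_log_odds(means, variances):
    """Inverse-variance fusion of independent log-odds estimates at one point.

    means:     predictive log-odds means from overlapping submaps.
    variances: the matching predictive variances (all positive).
    """
    means = np.asarray(means, dtype=float)
    precision = 1.0 / np.asarray(variances, dtype=float)
    fused_var = 1.0 / precision.sum()
    fused_mean = fused_var * np.sum(precision * means)
    return float(fused_mean), float(fused_var)

def occupancy_probability(log_odds):
    """Map a log-odds value back to an occupancy probability."""
    return 1.0 / (1.0 + np.exp(-log_odds))

balanced = fuse_log_odds([2.0, 0.0], [1.0, 1.0])
confident_first = fuse_log_odds([2.0, 0.0], [1.0, 1e6])
```

Equally confident submaps are averaged, while a submap with very large variance barely moves the fused value, so poorly observed regions cannot overwrite well observed ones.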

2606.10577 2026-06-10 cs.RO New submission

AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

Yijian Li, Changze Li, Hantian Shi, Jiaying Luo, Jiyuan Cai, Ming Yang, Tong Qin

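Affiliations: Shanghai Jiao Tong University; Huawei Technologies Ltd

AI summary: Proposes AgenticNav, which exposes action, depth, and memory to the VLM as callable tools for zero-shot navigation in continuous environments, achieving state-of-the-art performance on the R2R-CE benchmark.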

Abstract

Zero-shot vision-and-language navigation in continuous environments (VLN-CE) has recently become feasible with large vision-language models (VLMs). However, existing methods typically rely on learned waypoint predictors to propose navigable actions. This severely limits the model's action space and fails to leverage depth inputs effectively. Moreover, memory is commonly handled by accumulating long textual or visual histories with substantial irrelevant context, or by retrieving cross-episode experiences, which weakens the zero-shot setting. In this paper, we rethink zero-shot VLN-CE as an agentic interface between the VLM and the environment, and present AgenticNav, a lightweight navigation harness that exposes action, depth, and memory as callable tools. Instead of choosing from predicted waypoints, the action tool allows the VLM to directly select a target pixel in RGB observations, converting it into executable motion. Depth is exposed through an on-demand pixel-depth tool, enabling the VLM to request precise metric distances only where they matter. For memory, AgenticNav provides a compact map image summarizing the historical trajectory, paired with a recall tool that allows the VLM to selectively revisit past visual observations without overwhelming the prompt context. On the R2R-CE benchmark, AgenticNav establishes new state-of-the-art (SOTA) performance among zero-shot methods given the same VLM backbone. Real-world validation further highlights its zero-shot generalization compared to prior methods. Ablations show that our action tool design outperforms traditional waypoint predictors, and that depth tool and agentic memory further contribute to navigation performance.
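Converting a selected pixel into a motion command can be sketched with a pinhole camera model. The example assumes a forward-facing, level camera, a depth value measured along the optical axis, and known intrinsics `fx` and `cx`; the function name and the (turn, distance) command format are hypothetical and not taken from the paper.

```python
import math

def pixel_to_motion(u, depth, fx, cx):
    """Turn a selected image column and its metric depth into a planar motion.

    u:      horizontal pixel coordinate of the selected target.
    depth:  distance along the optical axis at that pixel, in metres.
    fx, cx: focal length and principal point (pixels) of the pinhole model.
    Returns (turn, distance): turn in radians, positive to the left,
    and straight-line ground distance in metres.
    """
    x_right = (u - cx) * depth / fx  # lateral offset, positive to the right
    turn = math.atan2(-x_right, depth)
    distance = math.hypot(x_right, depth)
    return turn, distance

straight = pixel_to_motion(u=320.0, depth=2.0, fx=320.0, cx=320.0)
right_edge = pixel_to_motion(u=640.0, depth=2.0, fx=320.0, cx=320.0)
```

Only the horizontal pixel coordinate enters the planar heading here; a fuller version would also use the vertical coordinate to reject pixels that lie above the ground plane.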


2606.10832 2026-06-10 cs.RO 新提交

GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation

GUIDE: 目标初始化的定向理解用于端到端视觉导航

Liang Wang, Jin Jin, KanZhong Yao, YiBin Wu, Fangqiang Ding, Jin Wang, Jun Wu, Zhe Sun, Qiuguo Zhu

Affiliations: Institute of Cyber-Systems and Control, Zhejiang University; Institute of Artificial Intelligence (TeleAI), China Telecom; Oxford Robotics Institute, University of Oxford; Center for Robotics, University of Bonn; Department of Mechanical Engineering, Massachusetts Institute of Technology

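AI summary: Proposes GUIDE, a framework whose spatial anchor predictor extracts egomotion representations from multi-frequency proprioceptive history and, combined with depth streams that perceive local geometry, enables end-to-end quadruped navigation without subsequent goal updates.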

Comments https://guide-navigation.github.io/


AI中文摘要

基于学习的足式机器人视觉导航通常依赖于层次状态估计的连续目标更新，以提供持久的定向参考。这种依赖引入了额外的感知和计算开销，偏离了完全端到端的移动自主性。此外，在部分可观测性下，策略容易学习短视行为，容易陷入死角和复杂结构布局。为了解决这些限制，我们研究了一种目标初始化的导航设置，其中目标仅在情节开始时提供一次，要求机器人基于内在空间记忆运行，无需来自外部模块的后续目标更新。在这项工作中，我们提出了GUIDE，一个完全端到端的强化学习框架，旨在培养内部定向意识。具体来说，GUIDE包含一个空间锚点预测器，利用多频率本体感受历史来提取自运动表示，从而为导航维持持久的长期空间上下文。同时，它利用原始深度流感知局部环境几何。我们在仿真和真实场景中对四足机器人进行了评估。实验表明，GUIDE学习了可靠的自运动和定向意识，使得完全端到端部署的策略能够在没有后续目标引导或先验地图的情况下，安全地穿越密集杂乱和结构化迷宫。

英文摘要

Learning-based visual navigation for legged robots typically relies on continuous goal updates from hierarchical state estimation to provide a persistent directional reference. This reliance incurs additional sensory and computational overhead and deviates from fully end-to-end mobile autonomy. Furthermore, under partial observability, policies are prone to learn myopic behaviors, easily becoming trapped in dead ends and complex structural layouts. To address these limitations, we investigate a goal-initialized navigation setting, where the target is provided only once at the beginning of an episode, requiring the robot to operate based on intrinsic spatial memory without subsequent goal updates from external modules. In this work, we propose GUIDE, a fully end-to-end reinforcement learning framework designed to cultivate internal directional awareness. Specifically, GUIDE incorporates a spatial anchor predictor that leverages multi-frequency proprioceptive history to extract egomotion representations, thereby maintaining a persistent long-horizon spatial context for navigation. Concurrently, it utilizes raw depth streams to perceive local environmental geometry. We evaluate the proposed framework across both simulation and real-world scenarios on a quadruped robot. Experiments show that GUIDE learns reliable egomotion and directional awareness, enabling a fully end-to-end deployed policy to safely navigate through dense clutter and structured mazes without subsequent goal guidance or prior maps.


2606.10903 2026-06-10 cs.RO 新提交

AgniNav: Configuration-Driven Cross-Embodiment Local Planning for Robot Navigation

AgniNav：配置驱动的跨具身局部规划机器人导航

Tianhao Zang, Siwei Cheng, Haidong Huang, Shanze Wang, Wei Zhang

Affiliations: Eastern Institute of Technology, Ningbo, China; University of Nottingham, Nottingham, UK; University of Science and Technology of China, Hefei, China

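AI summary: Proposes AgniNav, a framework that uses a configurable four-parameter safety envelope to transfer monocular visual navigation across wheeled, quadruped, and humanoid robots with zero retraining, jointly conditioning perception and planning.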


AI中文摘要

单目局部导航对轻量级机器人具有吸引力，但现有的基于视觉的策略通常将感知耦合到特定机体、相机高度和足迹，使得从轮式底盘到腿式平台的迁移依赖于重新训练或主动深度硬件。本文介绍了AgniNav，一个配置驱动的局部导航框架，在碰撞包络层面标准化跨具身迁移。每个机器人由一个可测量的四参数安全包络指定：碰撞相关高度、前长、后长和半宽。高度参数条件化一个图像到扫描网络，从单目彩色图像预测一维、碰撞相关的伪激光扫描，而剩余的足迹参数配置一个维度感知的局部规划器用于碰撞检测。训练使用从配对的彩色-深度数据生成的高度条件化列最小扫描标签，允许同一图像监督不同的安全包络，无需收集机器人特定数据。据我们所知，AgniNav是第一个单目局部导航框架，它联合调节感知和规划于共享的碰撞包络配置，实现跨轮式、四足和人形平台的零重训练部署。在Turtlebot2、Unitree Go2和Accelerated Evolution K1上的真实机器人实验分别实现了39/40、18/20和18/20的成功率，碰撞次数分别为0/40、1/20和2/20，同时在Jetson Orin上以30 Hz运行。

英文摘要

Monocular local navigation is attractive for lightweight robots, but existing vision-based policies often couple perception to a specific body, camera height, and footprint, making transfer from wheeled bases to legged platforms dependent on retraining or active depth hardware. This paper introduces AgniNav, a configuration-driven local navigation framework that standardizes cross-embodiment transfer at the collision-envelope level. Each robot is specified by a measurable four-parameter safety envelope: collision-relevant height, front length, rear length, and half width. The height parameter conditions an image-to-scan network to predict a one-dimensional, collision-relevant pseudo-laserscan from a monocular color image, while the remaining footprint parameters configure a dimension-aware local planner for collision checking. Training uses height-conditioned column-minimum scan labels generated from paired color-depth data, allowing the same image to supervise different safety envelopes without collecting robot-specific data. To the best of our knowledge, AgniNav is the first monocular local-navigation framework that jointly conditions perception and planning on a shared collision-envelope configuration for zero-retraining deployment across wheeled, quadruped, and humanoid platforms. Real-robot experiments on a Turtlebot2, Unitree Go2, and Accelerated Evolution K1 achieve 39/40, 18/20, and 18/20 successes with 0/40, 1/20, and 2/20 collisions, respectively, while running at 30 Hz on Jetson Orin.
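
The height-conditioned column-minimum labels mentioned in the abstract can be illustrated with a short sketch. This is a simplified reading of the idea, not the authors' labeling code: the function name `column_min_scan`, the level pinhole camera, the floor margin, and the use of z-depth instead of true range per angular bin are all assumptions.

```python
def column_min_scan(depth, fy, cy, cam_height_m, envelope_height_m,
                    floor_margin_m=0.05, max_range_m=10.0):
    """Build a 1-D pseudo-scan label from a depth image (hypothetical sketch).

    depth: list of rows of z-depths in metres; None or <= 0 marks invalid pixels.
    Assumes a level pinhole camera mounted cam_height_m above the ground, so a
    pixel in row v at depth z lies at height cam_height_m - (v - cy) * z / fy.
    Each column keeps the nearest depth whose height falls inside the
    collision-relevant band (floor_margin_m, envelope_height_m].
    """
    n_rows, n_cols = len(depth), len(depth[0])
    scan = [max_range_m] * n_cols
    for v in range(n_rows):
        for u in range(n_cols):
            z = depth[v][u]
            if z is None or z <= 0.0:
                continue  # skip invalid depth readings
            height = cam_height_m - (v - cy) * z / fy
            if floor_margin_m < height <= envelope_height_m:
                scan[u] = min(scan[u], z)
    return scan

# The same depth image yields different labels for different envelope heights.
image = [[1.0, 1.0],
         [2.0, 0.0],
         [0.4, 3.0]]
low_robot = column_min_scan(image, fy=1.0, cy=1.0, cam_height_m=0.5, envelope_height_m=0.6)
tall_robot = column_min_scan(image, fy=1.0, cy=1.0, cam_height_m=0.5, envelope_height_m=2.0)
```

In this toy image the obstacle at height 1.5 m in the second column is ignored for the low envelope but becomes the nearest hit for the tall one, which is the sense in which one image can supervise several safety envelopes.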


2606.10927 2026-06-10 cs.RO 新提交

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

AllDayNav: 通过真实世界强化学习实现终身导航

Hang Yin, Yinan Liang, Jiazhao Zhang, Jiahang Liu, Minghan Li, Zhizheng Zhang, He Wang

Affiliations: Tsinghua University; Galbot Robotics; Peking University; Beijing Academy of Artificial Intelligence

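AI summary: Proposes AllDayNav, a framework that uses a self-evolving multimodal memory and reinforcement learning to implicitly encode scene dynamics, achieving success rates approaching 100% in cross-room, cross-episode, and cross-task scenarios and surpassing map-based, VLM, and RL baselines.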

Comments Project Page: https://bagh2178.github.io/AllDayNav/


AI中文摘要

在动态环境中进行终身具身导航需要机器人从碎片化观察中形成持久的场景理解，这对于依赖显式地图或场景图且难以泛化到结构化设置之外的现有方法仍然困难。我们提出AllDayNav，一个终身自学习导航框架，通过强化学习将场景动态隐式编码到大模型的十亿级参数中，并由一个自进化的多模态记忆驱动，该记忆维护和更新视觉关键帧、语义描述和时间上下文，同时自主生成开放词汇指令、图像目标和结构化奖励。在合成和真实环境中的跨房间、跨回合和跨任务场景实验表明，AllDayNav实现了接近100%的成功率，并在路径效率和鲁棒性上持续超越基于地图、VLM和RL的强基线，证明了隐式、记忆驱动的强化学习作为可靠终身导航的可扩展替代方案。

英文摘要

Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle to generalize beyond structured settings. We propose AllDayNav, a lifelong self-learning navigation framework that implicitly encodes scene dynamics into the billion-scale parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains and updates visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards. Experiments in both synthetic and real-world environments across cross-room, cross-episode, and cross-task scenarios show that AllDayNav achieves success rates approaching 100% and consistently surpasses strong map-based, VLM, and RL baselines in path efficiency and robustness, demonstrating implicit, memory-driven reinforcement learning as a scalable alternative to explicit mapping for reliable lifelong navigation.


2606.10019 2026-06-10 cs.CV cs.AI cs.RO 交叉投稿

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

广义CVO：基于二阶黎曼优化的快速无对应局部点云配准

Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits

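AI summary: Proposes a correspondence-free local point cloud registration method based on geometric surface structure and reproducing kernel Hilbert space embeddings; second-order on-manifold optimization yields up to a 10x speedup, and the method significantly reduces drift and improves robustness in LiDAR and RGB-D tracking and object registration.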

Comments 16 pages, 12 figures


AI中文摘要

我们提出了一种快速且无需对应关系的局部点云配准方法，该方法利用了几何表面结构和再生核希尔伯特空间（RKHS）嵌入。该方法将点云表示为具有逐点各向异性核的连续函数，这些核编码了局部几何信息。这种公式化在沿表面法线方向改善对齐的同时，放松了沿切线方向的对齐。为了解决由此产生的配准问题，我们提出了一种具有近似黎曼海森矩阵的二阶流形优化方案，与先前基于无对应RKHS方法中使用的一阶求解器相比，实现了高达10倍的加速。我们展示了在多种室内外数据集上改进的帧到帧LiDAR和RGB-D跟踪精度。在驾驶领域的LiDAR跟踪配准任务中，我们在具有挑战性的特征稀疏环境下实现了平移和旋转漂移均减少超过55%。在物体配准基准测试中，我们展示了相比基于ICP的方法更强的鲁棒性，并且在优化全局初始化时（尤其是在中等错位情况下）获得了进一步的提升。

英文摘要

We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of more than 55% in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.


2605.25371 2026-06-10 cs.RO 版本更新

FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand

FOUND-IT: 基于基础模型优先、按需粒度的任务驱动3D场景图

Dominic Maggio, Nicolas Gorlo, Kris Hauser, Luca Carlone

Affiliations: Laboratory for Information & Decision Systems, Massachusetts Institute of Technology; Samsung Research America

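AI summary: Proposes the first method to build hierarchical task-driven 3D scene graphs of arbitrary indoor or outdoor environments in real time from an uncalibrated monocular camera, supporting evolving tasks through geometric foundation models and adjustable granularity, and achieving 79% higher accuracy on the ASHiTA SG3D benchmark.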


AI中文摘要

我们提出了首个使用未标定单目相机实时构建任意室内外环境分层任务驱动3D场景图的方法。我们利用几何基础模型来估计场景图的几何属性（例如，物体边界框），但也观察到可通行性信息（场景图的“地点”层）可以通过向现有几何基础模型（如VGGT）添加额外头部直接重建。我们的方法是任务驱动的，即根据任务调整地图中物体和区域的粒度；例如，在操作任务中，我们的方法能够分辨炉子上的小旋钮，而在导航任务中则可以关注大物体（如整个炉子）。然而，与相关工作的重要区别在于，我们考虑了任务列表并非预定义固定，而是随着机器人运行而演变的现实情况。这自然允许处理复杂的移动操作任务，机器人可以在任务展开时动态调整其表示。我们将由此产生的方法称为FOUND-IT。FOUND-IT还包括一种代理方法来查询场景图中的信息。除了在ASHiTA SG3D任务定位基准上实现79%的更高准确率外，我们展示了FOUND-IT在Jetson Thor上实时运行于地面机器人。此外，为了突出我们方法的鲁棒性，我们演示了在YouTube上随意拍摄的房地产公寓游览中构建3D场景图。代码将在发表后提供。

英文摘要

We present the first approach to build hierarchical task-driven 3D scene graphs of arbitrary indoor or outdoor environments using an uncalibrated monocular camera in real-time. We leverage geometric foundation models to estimate geometric attributes of the scene graph (e.g., object bounding boxes), but we also observe that traversability information (the "places" layer of a scene graph) can be directly reconstructed by adding an extra head to existing geometric foundation models, like VGGT. Our approach is task-driven in the sense that we adjust the granularity of the objects and regions in the map depending on the task; for instance, during a manipulation task, our approach is able to resolve small knobs on a stove, while during a navigation task it can focus on large objects (e.g., the entire stove). However, in a major departure from related work, we consider the realistic case where the list of tasks is not predefined and fixed, but evolves as the robot operates. This naturally allows dealing with complex loco-manipulation tasks, where the robot can dynamically adjust its representation as the task unfolds. We dub the resulting approach FOUND-IT. FOUND-IT also includes an agentic approach to query information in the scene graph. In addition to achieving 79% higher accuracy on the ASHiTA SG3D task grounding benchmark, we demonstrate FOUND-IT runs in real-time on a ground robot using a Jetson Thor. Furthermore, to highlight the robustness of our method, we demonstrate constructing 3D scene graphs on casually captured realtor apartment tours from YouTube. Code will be made available upon publication.


2606.10180 2026-06-10 cs.RO cs.AI cs.HC 新提交

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

流控制：通过简单实时输入引导视觉-语言-动作模型

Jonathan C. Kao, Jason Chan, Andy Wang

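AI summary: Proposes flow control, which steers VLA model actions with generic real-time inputs such as a keyboard, requires no retraining, and improves task success rate and completion speed.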

Comments 10 pages, 5 figures


AI中文摘要

我们引入了视觉-语言-动作（VLA）模型的流控制，这是一种简单有效的方法，通过通用输入（如键盘）实时引导VLA动作。该方法可直接使用，无需重新训练或微调VLA。它允许相对粗糙的用户输入引导VLA与用户意图对齐。VLA将这些输入转换为从训练期间学习的VLA专家动作分布中采样的动作样本，从而生成高质量（符合动作专家分布）和高保真度（反映用户意图）的动作。我们证明流控制具有许多理想特性：（1）流控制能准确、响应地通过用户输入引导机器人动作；（2）它对次优用户输入具有鲁棒性；（3）它使用户能够引导VLA实现显著更高的成功率和更快的任务完成；（4）在流控制轨迹上微调VLA可提高自主策略性能。这些结果共同为用户提供了一种简单直观的方式来帮助引导VLA动作，提升任务性能。

英文摘要

We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require retraining or fine-tuning VLAs. It enables relatively crude user inputs to steer a VLA to align with user intent. The VLA transforms these inputs into action samples drawn from the VLA expert action distribution learned during training, so that the generated actions are high quality (conformity to the action expert distribution) and high fidelity (reflecting the user's intent). We demonstrate that flow control has many desirable properties: (1) flow control accurately and responsively steers robot actions with user inputs, (2) it is robust to suboptimal user inputs, (3) it enables users to steer VLAs to achieve significantly higher success rates and faster task completion, and (4) fine-tuning a VLA on flow control trajectories improves the autonomous policy. Together, these results provide a simple and intuitive way for users to help steer VLA actions, increasing task performance.


2606.10276 2026-06-10 cs.RO cs.AI 新提交

Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

基于语言和自我中心人类信号的分层策略用于自然人机交互

Dongjun Lee, Juheon Choi, Dong Kyu Shin, Sinjae Kang, Kimin Lee

Affiliations: KAIST; Seoul National University

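AI summary: Proposes EDITH, a framework that captures the human's first-person view, gaze, and speech through smart glasses and uses a hierarchical policy to combine nonverbal signals with language instructions, enabling more natural human-robot interaction and reducing the user's effort to express intent.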

Comments We provide video demos and code in: https://project-edith.github.io


AI中文摘要

为了实现自然的人机交互，机器人必须理解人类不仅通过语言，还通过手势和注视等非语言信号表达的意图。然而，当前的机器人策略仅依赖语言指令作为传达意图的唯一接口，忽略了非语言信号，将全部沟通负担放在语言上。在这项工作中，我们提出了EDITH，一个机器人框架，通过智能眼镜的连续第一人称视角和注视流捕捉人类的非语言信号，并将其与语言指令一起作为机器人策略的输入。我们的硬件系统实时将人类的第一人称视角、注视和语音传输给机器人，并将语音转录为语言指令。为了处理这些丰富但嘈杂的信号，我们设计了一个分层策略，其中高层策略推断人类的意图并生成一系列子任务，每个子任务表示为一个细粒度指令，配有一个关键帧，将意图锚定在场景中（例如，人类指向目标物体的帧）。然后低层策略执行这些子任务。在我们的人机交互任务实验中，即使意图仅被短暂表达，EDITH也能使机器人根据人类的非语言信号行动，并且与仅使用语言指令相比，显著减少了用户传达意图的努力。请访问我们的项目页面获取源代码和真实机器人演示视频。

英文摘要

For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal signals such as gestures and gaze. However, current robot policies rely on language instructions as the sole interface for conveying intent, leaving nonverbal signals unused and placing the full burden of communication on language. In this work, we present EDITH, a robot framework that captures the human's nonverbal signals through continuous streams of first-person view and gaze from smart glasses, and uses them alongside language instructions as inputs to the robot policy. Our hardware system streams the human's first-person view, gaze, and speech to the robot in real time, transcribing the speech into language instructions. To handle these rich but noisy signals, we design a hierarchical policy in which a high-level policy infers the human's intent and produces a sequence of subtasks, where each subtask is represented as a fine-grained instruction paired with a keyframe that grounds the intent in the scene (e.g., the frame where the human points at the target object). A low-level policy then executes these subtasks. In our experiments on human-robot interactive tasks, EDITH enables the robot to act on the human's nonverbal signals even when intent is expressed only briefly, and significantly reduces user effort to convey intent compared to using language instructions alone. Visit our project page for source code and real-robot demo videos.


2606.11109 2026-06-10 cs.RO 新提交

EM-Fall: Embodied mmWave Sensing for Day-and-Night Fall Detection on Humanoid Robots

EM-Fall: 用于人形机器人昼夜跌倒检测的具身毫米波感知

Yanshuo Lu, Yuxuan Hu, Shenghai Yuan, Xinyu Zhou, Kuangji Zuo, Bofan Lyu, XiChen Yuan, Jianfei Yang

Affiliations: MARS Lab; NTU; IOT Lab

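AI summary: Proposes EM-Fall, a framework that combines mmWave sensing with a mobile humanoid robot, actively adjusting the sensing viewpoint to detect falls across rooms and under occlusion; a lightweight temporal model handles pet interference and multipath effects, and robustness is validated in eight real environments.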


AI中文摘要

跌倒是老年人受伤和住院的主要原因之一，因此可靠的跌倒感知成为住宅环境中安全监测的重要能力。然而，现有的跌倒检测系统通常依赖于可穿戴设备或固定传感装置，可能存在用户依从性低、空间覆盖有限或在遮挡和光照不良条件下性能下降的问题。在这项工作中，我们提出了EM-Fall，一种部署在移动人形机器人上的具身跌倒检测框架。该系统将毫米波（mmWave）感知与机器人移动性相结合，使机器人能够主动调整其传感视角，并在跨房间和遮挡情况下保持目标可观测性。为了解决复杂住宅环境中的干扰，包括宠物运动和多径伪影，我们设计了一个以人为中心的感知流水线，结合轻量级时序建模，以捕捉跌倒事件前、中、后的运动演变。我们在八个真实室内环境中对四位参与者进行了系统评估，并构建了一个家庭毫米波跌倒检测数据集。实验结果表明，具身移动感知范式提高了监测连续性，并在多种环境条件下保持了鲁棒的跌倒检测性能。所提出的框架为家庭环境中的机器人辅助安全监测提供了一种实用解决方案。

英文摘要

Falls are one of the leading causes of injury and hospitalization among elderly individuals, making reliable fall awareness an essential capability for safety monitoring in residential environments. However, existing fall detection systems often rely on wearable devices or fixed sensing installations, which may suffer from low user compliance, limited spatial coverage, or degraded performance under occlusion and poor lighting conditions. In this work, we propose EM-Fall, an embodied fall detection framework deployed on a mobile humanoid robot. The system integrates millimeter-wave (mmWave) sensing with robotic mobility, allowing the robot to actively adjust its sensing viewpoint and maintain target observability across rooms and under occlusion. To address interference in complex residential environments, including pet motion and multipath artifacts, we design a human-centered perception pipeline combined with lightweight temporal modeling to capture motion evolution before, during, and after fall events. We evaluate the proposed system across eight real indoor environments with four participants and construct an in-home mmWave fall detection dataset. Experimental results show that the embodied mobile sensing paradigm improves monitoring continuity and maintains robust fall detection performance under diverse environmental conditions. The proposed framework provides a practical solution for robot-assisted safety monitoring in home environments.


2606.09836 2026-06-10 cs.HC cs.RO 交叉投稿

Equanimity in HRI: Applying Calm Technology Principles to Human-Robot Interaction

人机交互中的平和心态：将平静技术原则应用于人机交互

Barbara Sienkiewicz, Bipin Indurkhya

Affiliations: Cognitive Science Department, Jagiellonian University

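AI summary: Explores integrating Calm Technology into human-robot interaction and provides design guidelines for household assistive robots that promote calm, non-intrusive interaction, emphasizing responsible robotics and ethical considerations.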

Comments Conference pre-print. https://doi.org/10.1007/978-981-96-3525-2_41


DOI: 10.1007/978-981-96-3525-2_41

AI中文摘要

本文探讨如何将平静技术整合到人机交互中，特别关注家庭环境。它提供了全面的指南，用于设计优先考虑并增强人类对平和心态需求的辅助机器人，确保交互是平静、非侵入性和和谐的。本文审视了技术在当代生活中的广泛影响及其对认知能力的影响，强调了未来技术发展中负责任机器人学和伦理考量的必要性。通过将平静技术原则应用于家用机器人，本文提供了在家庭辅助机器人中应使用的具体示例和特征。目标是促进人类与机器人之间平衡、不引人注目的交互，特别是在家庭环境中，因为它是每个人生活中最私密的空间，为该领域的应用和进一步研究铺平道路。

英文摘要

This paper explores how Calm Technology can be integrated into Human-Robot Interaction (HRI), with a particular focus on the household environment. It offers comprehensive guidelines for designing assistive robots that prioritize and enhance the human need for equanimity, ensuring interactions are calm, non-intrusive, and harmonious. The paper examines the widespread influence of technology in contemporary life and its impact on cognitive capabilities, underscoring the need for responsible robotics and ethical considerations in future technological developments. By adapting Calm Technology principles to domestic robots, the article provides concrete examples and features that should be employed in household assistive robotics. The goal is to foster a balanced, unobtrusive interaction between humans and robots, especially in the home environment, as it is the most private environment in everyone's life, paving the way for applications and further research in the field.


2605.06234 2026-06-10 cs.RO cs.HC 版本更新

SAFE-Pruner: 语义注意力引导的未来感知令牌剪枝用于高效视觉-语言-动作操控

Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University

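AI summary: To address existing pruning methods overlooking visual information needed by deep layers when accelerating VLA inference, proposes SAFE-Pruner, a framework that performs forward-looking token pruning using future-layer attention cues and semantic attention consistency, achieving up to 1.89x speedup with less than 1.7% drop in success rate in simulation and real-world experiments.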


AI中文摘要

视觉-语言-动作（VLA）模型的实时推理对于机器人控制至关重要。虽然视觉令牌剪枝在加速推理方面显示出巨大潜力，但现有方法主要基于浅层线索进行剪枝决策，并存在丢弃深层所需视觉信息的风险。为解决此问题，我们提出SAFE-Pruner，一种即插即用的剪枝框架，将未来层的注意力线索融入剪枝决策。具体而言，我们识别出语义注意力一致性，即VLA模型在执行步骤中倾向于将其注意力概率质量集中在同一语义实体上。基于这一观察，我们设计了一种前瞻性策略来预测深层令牌的显著性，从而防止关键令牌过早移除并实现更稳定的加速。我们进一步引入自适应子任务划分策略来检测注意力突变，从而提高预测准确性和剪枝可靠性。在仿真和真实环境中的大量实验表明，我们的方法实现了高达1.89倍的加速，成功率下降最小（低于1.7%），同时比最先进的方法高出1.9%。

英文摘要

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.


2508.13446 2026-06-10 cs.RO 版本更新

CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

CAST: 反事实标签提升视觉-语言-动作模型中的指令跟随能力

Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine

Affiliations: University of California Berkeley; Princeton University

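AI summary: To address VLA models' difficulty following fine-grained instructions, proposes augmenting datasets with counterfactual labels generated by vision-language models to increase the diversity of language grounding; experiments show significantly higher instruction-following success rates in navigation and manipulation tasks.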


AI中文摘要

通用机器人应能理解并遵循用户指令。尽管当前视觉-语言-动作（VLA）模型为将开放词汇语言指令映射到机器人动作提供了强大架构，但它们难以遵循细粒度命令。原因之一是现有机器人数据集缺乏语义多样性和语言基础，特别是对于相似观测缺乏细粒度任务多样性。为解决此问题，我们提出一种新方法，利用视觉语言模型创建反事实标签来增强现有机器人数据集。通过用这些标签增强现有数据集，我们增加了机器人数据集语言基础的多样性和粒度，最终提升了VLA的语言跟随能力。我们通过在3个不同室内外环境中进行视觉语言导航实验，评估了所得模型遵循语言指令的能力，范围从简单的以物体为中心的指令到复杂的指代任务。实验表明，反事实重标记（无需额外数据收集）显著提升了VLA策略的指令跟随能力，超越了最先进方法，并且与在未增强数据上训练的VLA相比，成功率翻倍。我们还评估了该方法在操作VLA上的表现，发现在有干扰物的任务中性能有类似提升。

英文摘要

Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.


2512.06628 2026-06-10 cs.RO cs.CV 版本更新

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

MIND-V：基于强化学习物理对齐的长期机器人操作分层世界模型

Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li

Affiliations: Tsinghua University; X Square Robot; Sun Yat-sen University; HKUST

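AI summary: Proposes MIND-V, a hierarchical world model that combines semantic reasoning, a behavioral semantic bridge, and motor video generation with reinforcement-learning-based physical alignment to synthesize physically plausible long-horizon robotic manipulation videos.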


AI中文摘要

可扩展的具身智能受到多样化、长期机器人操作数据稀缺的限制。现有视频世界模型仅能合成简单动作的短视频，且常依赖手动定义轨迹。为此，我们提出MIND-V，一种认知分层世界模型，旨在合成物理合理且逻辑连贯的长期机器人操作视频。受认知科学启发，MIND-V通过三个核心组件桥接高层推理与像素级合成：语义推理中心（SRH）利用预训练视觉语言模型进行任务规划；行为语义桥（BSB）将抽象指令转换为域不变表示；运动视频生成器（MVG）用于条件视频渲染。MIND-V采用分阶段视觉未来展开（Staged Visual Future Rollouts）这一测试时优化策略以增强长期鲁棒性。为强制遵循物理定律，我们引入GRPO强化学习后训练阶段，由新颖的物理预见一致性（PFC）奖励引导。PFC利用V-JEPA2世界模型作为物理裁判，在潜在特征空间中惩罚不合理动态。实验证实MIND-V在长期模拟中的SOTA性能及其对策略学习的重要价值，为具身数据合成引入了可扩展且完全自主的框架。

英文摘要

Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.


2606.05645 2026-06-10 cs.RO 版本更新

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Discrete-WAM：面向世界-策略学习的统一离散视觉-动作标记编辑

Ziyang Yao, Haochen Liu, Yuncheng Jiang, Zeyu Zhu, Zibin Guo, Jingru Wang, Tianle Liu, Jianwei Cui, Kuiyuan Yang, Hongwei Xie, Jingwei Zhao, Guang Chen, Hangjun Ye

Affiliations: Xiaomi EV

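AI summary: Proposes Discrete-WAM, which aligns future visual states and ego actions as discrete tokens and builds a unified discrete diffusion framework to jointly learn world modeling, world-action policy, and hierarchical decision policy, supporting controllable generation and counterfactual reasoning and improving the reliability of autonomous-driving decisions.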


AI中文摘要

自动驾驶需要对自车动作如何影响周围世界的演变进行推理。然而，大多数端到端方法依赖于直接的状态到动作映射，捕捉相关性而没有显式建模动作条件动力学。相反，连续潜在世界模型通常缺乏用于跨反事实未来进行因果推理的组合结构。我们提出Discrete-WAM，一种统一的潜在视觉-动作世界策略，将未来视觉状态和自车动作表示为对齐的离散标记，实现跨替代未来的组合因果推理。基于这种统一的离散对齐，Discrete-WAM建立了一个具有统一生成任务的共享离散扩散框架，共同制定世界建模、世界-动作策略和分层决策使能策略，支持跨多种驾驶场景的组合泛化。在大型自动驾驶基准上的实验表明，Discrete-WAM在实现竞争性能的同时，支持可控生成和反事实推理，为更可靠的决策提供了一条原则性路径。

英文摘要

Autonomous driving requires reasoning about how ego actions shape future world evolution, rather than merely mapping observations to actions. However, most end-to-end methods rely on direct state-to-action imitation, while existing world models often remain weakly aligned with downstream policy generation. We introduce Discrete-WAM, a unified discrete vision-action world-policy framework that represents visual observations, future states, high-level decisions, and ego actions within a shared token space. Built on this discrete alignment, Discrete-WAM jointly trains world modeling, world-policy modeling, and policy modeling through multi-task and multi-stage pretraining, allowing action-conditioned future prediction to directly support policy generation. For downstream planning, Discrete-WAM further decomposes policy generation into hierarchical decision prediction and parallel action-token editing, where the decision token provides a high-level planning skeleton and confidence-based scheduling refines dense future actions efficiently. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves strong planning performance while supporting controllable future generation, counterfactual evaluation, surprise-based world-model analysis, and efficient parallel policy decoding. These results suggest that discrete representation alignment, unified world-policy training, and hierarchical token editing provide a promising design paradigm for physical AI.


2510.14836 2026-06-10 cs.CV cs.RO 版本更新

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

QDepth-VLA：量化深度预测作为视觉-语言-动作模型的辅助监督

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Beijing Zhongke Huiling Robot Technology Co.

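AI summary: Proposes QDepth-VLA, a framework that strengthens the spatial perception and reasoning of VLA models through an auxiliary depth prediction task, improving manipulation performance in simulation and real-world tasks.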

2606.10986 2026-06-10 cs.RO cs.SY eess.SY 新提交

Multi-UAV Active Sensing with Information Gain-based Planning and Belief Fusion

基于信息增益规划与信念融合的多无人机主动感知

S. Habibi, L. Marques

Affiliations: University of California, Berkeley

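AI summary: Proposes a multi-UAV active sensing framework that uses information-gain-based path planning and probabilistic belief fusion for binary terrain mapping; validated on synthetic and real agricultural imagery, it reduces entropy and error relative to random walk and sweep coverage.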


AI中文摘要

无人机越来越多地用于空间分布环境中的主动感知和信息收集。然而，其性能受到有限飞行时间、感知不确定性以及空间覆盖与观测精度之间权衡的制约。本文提出了一个多无人机主动感知框架的实际验证，用于概率二元地形映射，以精准农业作为应用案例。环境表示为概率信念图，其中空间依赖性通过因子图建模。无人机决策由基于信息增益的信息路径规划（IGbIPP）引导，并与随机游走和扫描覆盖路径规划基线在合成地形和真实无人机农业图像上进行比较。研究还评估了空间相关权重和几种用于多无人机信息共享的概率信念融合规则。结果表明，IGbIPP比基线更有效地降低了熵和映射误差，而更宽的视场提高了实际覆盖和地图精度。结果进一步表明，简单的相等或偏置空间权重比自适应权重更稳健，并且贝叶斯、对数几率与Dempster-Shafer融合实现了最佳协同映射性能。这些发现强调了不确定性驱动规划、感知几何、空间建模和概率融合对于实际无人机主动感知的重要性。

英文摘要

Unmanned aerial vehicles (UAVs) are increasingly used for active sensing and information gathering in spatially distributed environments. Their performance, however, is constrained by limited flight time, sensing uncertainty, and the trade-off between spatial coverage and observation accuracy. This paper presents a real-world validation of a multi-UAV active sensing framework for probabilistic binary terrain mapping, with precision agriculture used as the application case. The environment is represented as a probabilistic belief map, where spatial dependencies are modeled through a factor-graph formulation. UAV decision making is guided by Information Gain based Informative Path Planning (IGbIPP), and the approach is compared with Random Walk and Sweep coverage path planning baselines using both synthetic terrains and real UAV-derived agricultural imagery. The study also evaluates spatial correlation weights and several probabilistic belief-fusion rules for multi-UAV information sharing. Results show that IGbIPP reduces entropy and mapping error more effectively than the baselines, while a wider field of view improves real-world coverage and map accuracy. The results further show that simple equal or biased spatial weights can be more robust than adaptive weights, and that Bayesian, log-odds, and Dempster-Shafer fusion achieve the best cooperative mapping performance. These findings highlight the importance of uncertainty-driven planning, sensing geometry, spatial modeling, and probabilistic fusion for real-world UAV-based active sensing.
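
To make the log-odds fusion and entropy measures named in the abstract concrete, here is a minimal per-cell sketch. It uses the textbook independent-opinion log-odds rule and Shannon entropy for a binary variable; whether the paper's fusion rules take exactly this form is an assumption, and the function names are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def fuse_log_odds(beliefs, prior=0.5):
    """Fuse per-UAV occupancy probabilities for one map cell.

    Assumes the UAVs' observations are conditionally independent, so their
    evidence (log-odds relative to the shared prior) simply adds up.
    """
    total = logit(prior) + sum(logit(p) - logit(prior) for p in beliefs)
    return 1.0 / (1.0 + math.exp(-total))

def binary_entropy(p):
    """Shannon entropy in bits of a binary cell belief."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Two UAVs that each report 0.8 give a fused belief of about 0.94,
# and the cell's entropy drops accordingly.
fused = fuse_log_odds([0.8, 0.8])
entropy_drop = binary_entropy(0.8) - binary_entropy(fused)
```

An information-gain planner in this spirit would prefer viewpoints whose expected `entropy_drop`, summed over the cells in view, is largest.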

2606.11088 2026-06-10 cs.RO New submission

A Distributed Multi-UGV Exploration Framework With Loop-Aware Planning and Descriptor-Aided Localization in Resource-Limited Environments

Zhiwei Li, Haiou Liu, Xijun Zhao, Ji Li, Yingze Wang, Boyang Wang

Affiliations: School of Mechanical Engineering, Beijing Institute of Technology; China North Artificial Intelligence & Innovation Research Institute, Collective Intelligence & Collaboration Laboratory (CIC); Zhengzhou Intelligent Technology Research Institute, Beijing Institute of Technology

AI summary: Proposes a fully distributed exploration framework for multiple unmanned ground vehicles (UGVs). A lightweight LiDAR global descriptor enables cross-UGV loop-closure detection, and loop-aware hierarchical planning reduces exploration time and travel distance in resource-limited environments.

DOI: 10.1109/TIE.2026.3684182
Journal ref: IEEE Transactions on Industrial Electronics, 2026

Abstract

Robust and efficient cooperative exploration with multiple unmanned ground vehicles (UGVs) in unknown, GPS-denied, and bandwidth-limited environments without prior maps remains challenging, as localization drift degrades map consistency and induces redundant coverage. This paper presents a fully distributed exploration framework that couples descriptor-aided inter-UGV loop closure with loop-aware hierarchical planning while enabling autonomous localization and exploration. We develop a lightweight LiDAR global descriptor with range-image prealignment to enable robust cross-UGV place recognition under large yaw and lateral variations, and use verified loop closures to maintain globally consistent trajectories and a sparse topological representation. We further introduce an uncertainty-aware cross-UGV loop-closure selection module that scores candidate loop closures under pose uncertainty and retains high-utility loop closures as planning anchors for global task allocation and local route refinement. Simulations and real-UGV experiments show that the loop-closure module achieves AR@1/AR@1% of 89.9%/95.5%, distributed optimization reduces absolute trajectory error, the system substantially reduces two-way communication volume, and the overall framework reduces exploration time and travel distance by 15% and 14%, respectively, compared with an mTSP baseline.

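2606.09919 2026-06-10 cs.LG cs.AI cs.MA cs.RO Cross-list

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

Michal P. Podolinsky, Neel P. Bhatt, Pranay Samineni, Rohan Siva, Christian Ellis, Ufuk Topcu

AI summary: Proposes Co-GLANCE, which distills a vision-language model for real-time occlusion segmentation and robot allocation, and combines conformal prediction with selective abstention for uncertainty quantification with statistical guarantees that drives active perception. In real-world scenarios it improves occlusion-segmentation and allocation accuracy by 25% and 36%, respectively, and reduces inference latency 350x.

Comments: Code, videos, and dataset available at https://co-glance.github.io/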

Abstract

Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency 350x. We also release an air-ground dataset for future research. Code, videos, and dataset available at https://co-glance.github.io/ .
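To make the idea of conformal prediction with selective abstention concrete, here is a minimal split-conformal sketch for a classifier. The nonconformity score (one minus the probability of the true class), the abstention rule, and all names and numbers are assumptions for illustration; the abstract does not specify how Co-GLANCE defines them.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every label is kept
    return sorted(cal_scores)[k - 1]

def predict_or_abstain(class_probs, threshold):
    """Build the prediction set; commit only when it contains exactly one label."""
    pred_set = [label for label, p in class_probs.items() if 1.0 - p <= threshold]
    return pred_set[0] if len(pred_set) == 1 else None  # None means abstain

# Hypothetical calibration scores: 1 - p(true class) on held-out examples.
cal_scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.30, 0.60]
threshold = conformal_threshold(cal_scores, alpha=0.25)

confident = predict_or_abstain({"occluded": 0.9, "clear": 0.1}, threshold)
uncertain = predict_or_abstain({"occluded": 0.55, "clear": 0.45}, threshold)
```

In a system like the one described, an abstention (`None`) would be the signal that triggers active perception, for example by dispatching another robot to obtain a better viewpoint.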

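2606.10688 2026-06-10 cs.RO New submission

An Exposure-Time-Aligned Primary-Path Architecture for Autonomous-Driving ECUs

Toru Saito, Yuki Hagura, Tatsuya Konishi, Satoru Mizusawa, Takumi Yajima

Affiliations: National Institute of Advanced Industrial Science and Technology, Japan

AI summary: Targets the transition of production vehicles from modular multi-NN pipelines to end-to-end autonomous driving. Proposes three design principles (Primary-Path, Exposure-Time-Aligned, and Co-Path Coexistence) and achieves a mean latency of 296 ms on a dual-SoC platform.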

Abstract

While end-to-end (E2E) autonomous driving has become the dominant research direction, production vehicles continue to rely on modular multi-NN pipelines for a non-trivial transitional period. The subject of this paper is the design of an architecture that, during this phase, supports a modular pipeline and an E2E path side by side and embeds a path for staged migration. Transplanted to a production SoC, egalitarian late fusion is compute-inefficient and offers no natural unit for staged E2E substitution. As an alternative, we propose three design principles: (i) Primary-Path, which explicitly selects a primary perception chain and prioritizes its enclosure within a single SoC pair over the non-critical paths; (ii) Exposure-Time-Aligned, which propagates the primary sensor's exposure time $\tau_{\rm exp}$ as a tag along the chain and event-drives the fusion node on matched $\tau_{\rm exp}$ rather than a fixed cycle; and (iii) Co-Path Coexistence, which, building on (i) and (ii), lets an E2E output path co-run with the modular pipeline within the same $\tau_{\rm exp}$ cycle. On a Dual-SoC production AD-ECU, the implementation closes camera-shutter to planner-output latency at a mean of 296 ms within the 350 ms design budget. Under (iii), the modular pipeline is primary at production launch and the E2E path runs as shadow on real vehicles, and the E2E scope is expanded as evaluation evidence accumulates.
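The Exposure-Time-Aligned principle (fusing on matched exposure-time tags instead of on a fixed cycle) can be sketched as a small event-driven node. The class name, source names, and integer microsecond tags below are hypothetical; the abstract does not describe the actual ECU software interface.

```python
from collections import defaultdict

class ExposureAlignedFusion:
    """Fire fusion as soon as every required input carrying the same tau_exp tag has arrived."""

    def __init__(self, required_sources):
        self.required = set(required_sources)
        self.pending = defaultdict(dict)  # tau_exp tag -> {source: payload}

    def on_message(self, source, tau_exp, payload):
        """Store one tagged message; return (tau_exp, inputs) when the tag is complete."""
        bucket = self.pending[tau_exp]
        bucket[source] = payload
        if self.required.issubset(bucket):
            del self.pending[tau_exp]
            return tau_exp, bucket
        return None  # still waiting for other sources with this tag

fusion = ExposureAlignedFusion(["camera_nn", "radar"])
first = fusion.on_message("camera_nn", 1000, "objects")
late = fusion.on_message("radar", 1033, "tracks_next_frame")
fused = fusion.on_message("radar", 1000, "tracks")
```

Because the node fires on tag completion, inputs from different exposures are never mixed, and no message waits for the next tick of a fixed fusion period.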

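2606.10857 2026-06-10 cs.RO cs.LG New submission

Embodiment-conditioned Generalist Control for Multirotor Aerial Robots

Orestis Konstantaropoulos, Welf Rehberg, Mihir Kulkarni, Kostas Alexis

Affiliations: Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway

AI summary: Proposes a generalist position-control policy in which a single set of network weights controls arbitrary multirotor configurations, conditioned on a physical embodiment descriptor (a mass- and inertia-normalized control allocation matrix). The policy is trained with PPO and transfers zero-shot to the real world after five minutes.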

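Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Roohan Ahmed Khan, Yasheerah Yaqoot, Amir Atef Habel, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

Affiliations: University of Science and Technology of China

AI summary: Proposes AgenticRL, a framework in which a multimodal GPT agent automatically designs reward functions and refines policies through closed-loop self-improvement, improving performance and achieving high success rates across several UAV navigation tasks.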

Abstract

Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human-designed reward functions and repeated manual fine-tuning, which is time-consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real-world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real-world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior compared with initial rewards by 71%. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real-world success rate of 91% and a sim-to-real accuracy of 94%.

2407.05886 2026-06-10 cs.RO Replacement

Rod models in continuum and soft robot control: a review

Carlo Alessi, Camilla Agabiti, Daniele Caradonna, Cecilia Laschi, Federico Renda, Egidio Falotico

Affiliations: Istituto Italiano di Tecnologia; The BioRobotics Institute; Department of Excellence in Robotics and AI

AI summary: Surveys the use of rod models in modeling and controlling continuum and soft robots, covering mathematical foundations, robot modeling, and control strategies, and discusses their advantages, limitations, and future directions.

Abstract

Continuum and soft robots can transform automation tasks requiring compliant interaction in constrained or unstructured environments, including healthcare, agriculture, marine, and space applications. However, their complex mechanics introduce significant challenges in modeling and control. Low-dimensional continuum mechanical models, such as rod theories, effectively capture the large deformations of slender bodies in contact-rich scenarios while balancing accuracy and computational efficiency. This paper presents a vertical survey of rod models for continuum and soft robots, spanning their mathematical foundations, robot modeling, and control applications. We review the main rod theories adopted in soft robotics and introduce a deformation-based classification of rod models for continuum and soft robots. Furthermore, we survey recent model-based and learning-based control strategies leveraging rod models, highlighting their role in manipulation and physical interaction tasks. Finally, we discuss advantages, limitations, research gaps, and emerging directions of rod-based approaches. This paper aims to serve as a reference for developing models and control strategies for continuum and soft robots.

2602.21331 2026-06-10 cs.RO Replacement

CableRobotGraphSim: A Graph Neural Network for Modeling Partially Observable Cable-Driven Robot Dynamics

Nelson Chen, William R. Johnson, Rebecca Kramer-Bottiglio, Kostas Bekris, Mridul Aanjaneya

Affiliations: Rutgers University; Yale University

AI summary: Proposes CableRobotGraphSim, a graph neural network model that represents a cable-driven robot as a graph (rigid bodies as nodes, cables and contacts as edges). Using only partially observable inputs, it quickly and accurately matches other simulators and real robots; sim-and-real co-training improves robustness, and integration with an MPPI controller enables closed-loop navigation.

Abstract

General-purpose simulators have accelerated the development of robots. Traditional simulators based on first principles, however, typically require full-state observability or depend on parameter search for system identification. This work presents CableRobotGraphSim, a novel Graph Neural Network (GNN) model for cable-driven robots that aims to address shortcomings of prior simulation solutions. By representing cable-driven robots as graphs, with the rigid bodies as nodes and the cables and contacts as edges, this model can quickly and accurately match the properties of other simulation models and real robots, while ingesting only partially observable inputs. Accompanying the GNN model is a sim-and-real co-training procedure that promotes generalization and robustness to noisy real data. This model is further integrated with a Model Predictive Path Integral (MPPI) controller for closed-loop navigation, which showcases the model's speed and accuracy.

2605.12804 2026-06-10 cs.RO Replacement

BiPneu: Design and Control of a Bipolar-Pressure Pneumatic System for Soft Robots

Yu Mei, Xinyu Zhou, Vedant Naik, Alan Gao, Xiaobo Tan

Affiliations: Department of Electrical and Computer Engineering, Michigan State University

AI summary: Proposes BiPneu, a scalable, cost-efficient multi-channel bipolar-pressure pneumatic system, together with a dual-mode sliding-mode controller (DM-SMC) built on a hybrid electro-pneumatic model. The system achieves wide-range, accurate, and fast pressure regulation and significantly outperforms MPC and PID controllers in soft-robot applications.

Comments: Full version of BiPneu, including the supplementary materials

DOI: 10.1109/TMECH.2026.3693622
Journal ref: IEEE/ASME Transactions on Mechatronics, 2026

Abstract

Positive-negative pressure regulation is critical to soft robotic actuators, enabling large motion ranges and versatile actuation modes. However, achieving high-performance regulation across both pressure polarities remains challenging due to asymmetric inflation-deflation dynamics, valve nonlinearities, and switching-induced flow disturbances. This paper presents BiPneu, a scalable and cost-efficient multi-channel bipolar-pressure pneumatic system for soft robots that enables wide-range, accurate, and responsive pressure regulation while providing seamless compatibility with high-level software ecosystems. A dual-mode sliding-mode controller (DM-SMC) with hysteresis-supervised mode selection is proposed based on a hybrid electro-pneumatic model. Extensive simulation and experiments demonstrate the superior performance of DM-SMC in tracking step and sinusoidal pressure references compared with both advanced model predictive controllers and well-tuned PID controllers. Experimental results show average absolute errors of 1.44 kPa in multi-step tests and 4.23 kPa in sinusoidal tracking, corresponding to reductions of 11.9% and 35.6% relative to PID control, along with improved control effort, valve switching rate, and transient response. Robustness of DM-SMC is further verified on a bellows actuator with pressure-dependent volume. Finally, BiPneu's capability is demonstrated via two soft robotic examples: quick ball maneuvering with a soft parallel manipulator and real-time finite element method (FEM)-based teleoperation of a soft bellows actuator.

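2606.10229 2026-06-10 cs.RO cs.LG New submission

What Demonstration Curation Metrics Do to Your Policy

Aarav Bedi

AI summary: Studies whether demonstration-curation metrics that detect defective demonstrations also improve behavior-cloning policy performance. Finds that a metric's defect-detection ability is sharply decoupled from policy performance, and exposes episode length as a confounding variable.

Comments: 6 pages, 1 figure, 2 tables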

Abstract

We study whether demonstration-curation metrics that detect defective training episodes also improve the downstream behavior-cloning policy that trains on the curated data. On a contact-rich LIBERO pick-and-place benchmark with a controlled structural defect (early gripper release during the carry phase), we find that the two quantities are sharply decoupled. The metric with the highest defect-detection AUROC (0.804) produces the worst curated policy (13.3% task success), while a metric with a substantially lower AUROC (0.638) produces a policy that nearly matches the oracle trained on ground-truth clean data (90.0% vs. 93.3%). We further show that five of the seven metrics we evaluate exploit episode length as a trivial proxy for the defect label, a confound that inflates reported AUROCs to near-perfect values and disappears once episode length is controlled. Across all conditions, the contaminated baseline succeeds on only 3.3% of rollouts, and the two best curation methods close this to within 3 percentage points of the 93.3% oracle ceiling. Our results argue that curation methods should be evaluated by the policy they produce, not the defects they flag, and that any curation benchmark must control for episode length before reporting detection accuracy. We release the testbed, all metric implementations, and the evaluation pipeline.
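The episode-length confound described above can be illustrated with a toy AUROC calculation. The episode lengths and labels are invented for illustration, and the direction of the length effect (defective episodes being shorter) is an assumption; the abstract states only that length acts as a trivial proxy for the defect label.

```python
def auroc(scores, labels):
    """Probability that a random defective episode outscores a random clean one.

    Ties count as 0.5. Labels: 1 = defective, 0 = clean.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: a "metric" that only looks at episode length.
lengths = [80, 85, 90, 120, 125, 130]
labels = [1, 1, 1, 0, 0, 0]
auroc_uncontrolled = auroc([-l for l in lengths], labels)

# Same proxy on a length-matched subset: the signal disappears.
matched_lengths = [100, 100, 100, 100]
matched_labels = [1, 0, 1, 0]
auroc_controlled = auroc([-l for l in matched_lengths], matched_labels)
```

In this toy setup the length-only score looks perfect until episode length is held fixed, at which point it falls to chance. This is the pattern a curation benchmark would need to rule out before reporting detection accuracy.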

2606.10366 2026-06-10 cs.RO cs.AI New submission

A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

Shuo Wang, Hanyuan Xu, Yingdong Hu, Fanqi Lin, Yang Gao

Affiliations: Tsinghua University; Shanghai Qi Zhi Institute

AI summary: Systematically studies sim-and-real correlation in VLA policy evaluation and proposes a unified framework for measuring and improving how well simulation serves as a proxy for real-world evaluation.

Comments: 20 pages

Abstract

Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.
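One way to quantify policy ranking consistency between simulation and the real world is a rank correlation over per-policy success rates. The sketch below uses Spearman's coefficient as one plausible choice; the abstract does not say which statistic the authors use, and the success rates are invented.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical success rates of four policies in simulation and on a real robot.
sim_success = [0.80, 0.65, 0.40, 0.30]
real_success = [0.70, 0.72, 0.35, 0.20]
rho = spearman(sim_success, real_success)
```

Here the simulator swaps the order of the top two policies but otherwise preserves the ranking, giving a coefficient of 0.8. A value near 1 would suggest the simulator is a usable proxy for ranking policies, even if absolute success rates differ.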

2606.10382 2026-06-10 cs.RO New submission

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

Shi Jin, Yuntian Wang, Yuhui Duan, Di Wu, Gaoqi Dong, Xiaohang Liu, Xiaotong Li, Hongfei Jia, Zehao Zhang, Tianyu Wang, Zhongjie Jia, Yuanqi Yao, Chenjia Bai, Zhaxizhuoma, Siao Liu, Nieqing Cao, Jin Wang, Chao Yu, Yan Ding

Affiliations: University of Science and Technology of China

AI summary: Proposes UMI-Bench 1.0, the first real-robot benchmark designed specifically for UMI-style manipulation policies. A unified protocol covers data collection, scene reset, policy execution, result logging, and task-factor analysis, providing a reproducible evaluation platform.

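Test-time Adversarial Takeover: A Real-Time Hijacking Interface for Robot Diffusion Policies

Zi Yin, Peilin Chai, Siyuan Huang, Zhanhao Hu

Affiliations: Tsinghua University; Independent Researcher; Johns Hopkins University; UC Berkeley

AI summary: Proposes Test-time Adversarial Takeover (TAKO), which learns reusable universal patches through differentiable diffusion inference and switches among them at test time to hijack a robot policy and pilot it remotely, reaching 100% takeover success across multiple tasks and models.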

Abstract

Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.

2606.10501 2026-06-10 cs.RO New submission

Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults

Minsoo Jo, Taeju Kwon, Junha Chun, Youngjoon Jeong, Taesup Kim

Affiliations: Graduate School of Data Science, Seoul National University

AI summary: Shows that VLA models degrade markedly under joint-level physical faults such as actuator degradation and increased friction, and proposes J-PARC, a lightweight residual calibration framework that infers the joint-fault regime and adaptively corrects actions to improve robustness.

Abstract

Deploying Vision-Language-Action (VLA) models in real robotic systems requires robustness not only to semantic and perceptual variations, but also to embodiment-side faults that change how actions are physically realized. Real robots can experience joint-level changes caused by actuator degradation, hardware faults, safety limits, collision damage, or wear-induced friction. These faults are critical because they alter the action-to-motion interface of a policy, disrupting the learned closed-loop relationship between commanded actions, realized motion, and subsequent observations. In this work, we study realistic joint-level physical faults and show that VLA models are vulnerable when predicted actions are executed through a perturbed robot body. Our analysis reveals joint-dependent effects, with heterogeneous degradation in task success across affected joints. We also show that performance drops cannot be attributed solely to physical infeasibility, since feasible faults such as increased joint friction can still substantially reduce success rates and induce closed-loop execution mismatch. Motivated by these findings, we propose Joint-level Physical-fault Aware Residual Calibrator (J-PARC), a lightweight residual calibration framework built on top of a frozen VLA policy. J-PARC infers a latent joint-fault regime from recent joint dynamics and conditions a shared residual calibrator on this regime, enabling adaptive action correction across faulty joints. Experiments show that J-PARC improves robustness under joint-level faults while preserving fault-free environment performance.

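2606.10228 2026-06-10 cs.LG cs.AI cs.RO Cross-list

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

Kaustubh Mani, Yann Pequignot, Vincent Mai, Liam Paull

AI summary: Proposes SHAPO, whose sharpness-aware policy update implicitly reweights gradients, amplifying the influence of rare unsafe actions and damping the contribution of safe ones. This yields conservative behavior in under-explored regions and improves both safety and task performance.

Comments: ICLR 2026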

Abstract

Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically we show that this adjustment implicitly reweighs policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
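The phrase "evaluates gradients at perturbed parameters" can be illustrated with a generic sharpness-aware update on a toy loss. This is a sketch in the style of sharpness-aware minimization, assuming plain gradient descent on a quadratic; it is not SHAPO's actual policy-gradient rule, and the step size, radius `rho`, and loss are all hypothetical.

```python
import math

def sharpness_aware_step(theta, grad_fn, lr=0.1, rho=0.05):
    """One descent step using the gradient taken at worst-case perturbed parameters."""
    g = grad_fn(theta)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(theta)  # stationary point: nothing to perturb or update
    # Move a distance rho along the normalized gradient (first-order worst case).
    perturbed = [t + rho * gi / norm for t, gi in zip(theta, g)]
    g_perturbed = grad_fn(perturbed)
    # Apply the perturbed gradient at the original parameters.
    return [t - lr * gi for t, gi in zip(theta, g_perturbed)]

def quadratic_grad(theta, curvature=(10.0, 1.0)):
    """Gradient of 0.5 * sum(c_i * theta_i^2): a sharp and a flat direction."""
    return [c * t for c, t in zip(curvature, theta)]

new_theta = sharpness_aware_step([1.0, 1.0], quadratic_grad)
```

On this toy loss the perturbed gradient is larger along the high-curvature direction than the plain gradient would be, so the update is more pessimistic exactly where the loss is most sensitive to the parameters.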

2601.18765 2026-06-10 cs.RO Replacement

Goal-oriented Communication for Fast and Robust Robotic Fault Detection and Recovery

Shutong Chen, Adnan Aijaz, Yansha Deng

Affiliations: Department of Engineering, King’s College London; Bristol Research and Innovation Laboratory, Toshiba Europe Ltd.

AI summary: Proposes a goal-oriented communication framework that jointly designs the communication-computation-control loop. It detects faults with a 3D scene graph and generates recovery motions with a small language model fine-tuned and enhanced by knowledge distillation, reducing fault detection and recovery time by up to 82.6% and raising task success rate by up to 76%.

Comments: Submitted to IEEE for potential publication

Abstract

Autonomous robotic systems are widely deployed in smart factories and operate in dynamic, uncertain, and human-involved environments that require low-latency and robust fault detection and recovery (FDR). However, existing FDR frameworks exhibit various limitations, such as significant delays in communication and computation, and unreliability in robot motion/trajectory generation, mainly because the communication-computation-control (3C) loop is designed without considering the downstream FDR goal. To address this, we propose a novel Goal-oriented Communication (GoC) framework that jointly designs the 3C loop tailored for fast and robust robotic FDR, with the goal of minimising the FDR time while maximising the robotic task (e.g., workpiece sorting) success rate. For fault detection, our GoC framework innovatively defines and extracts the 3D scene graph (3D-SG) as the semantic representation via our designed representation extractor, and detects faults by monitoring spatial relationship changes in the 3D-SG. For fault recovery, we fine-tune a small language model (SLM) via Low-Rank Adaptation (LoRA) and enhance its reasoning and generalization capabilities via knowledge distillation to generate recovery motions for robots. We also design a lightweight goal-oriented digital twin reconstruction module to refine the recovery motions generated by the SLM when fine-grained robotic control is required, using only task-relevant object contours for digital twin reconstruction. Extensive simulations demonstrate that our GoC framework reduces the FDR time by up to 82.6% and improves the task success rate by up to 76%, compared to the state-of-the-art frameworks that rely on vision language models for fault detection and large language models for fault recovery.

2407.20242 2026-06-10 cs.CY cs.AI cs.RO Replacement

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang

Affiliations: Huazhong University of Science and Technology; Beihang University; Griffith University

AI summary: Proposes BadRobot, an attack paradigm that exploits three vulnerabilities (manipulation of LLMs within robotic systems, misalignment between linguistic outputs and physical actions, and flaws in world knowledge) to make embodied LLMs perform harmful behaviors through voice interaction, and validates its effectiveness on a benchmark.

Comments: Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io

Journal ref: International Conference on Learning Representations (ICLR) 2025

Abstract

Embodied AI refers to systems in which AI is integrated into physical entities. Large language models (LLMs), which exhibit powerful language understanding abilities, have been extensively employed in embodied AI to facilitate sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm that aims to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by flaws in world knowledge. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against prominent existing embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.

2606.10208 2026-06-10 cs.RO cs.AI New submission

Exploration of Foundation Model-Based Robots in Patient and Elderly Care

Zhiwen Qiu, Wei Liu, Yuexing Hao

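AI summary: This paper reviews the current state of foundation model-based care robots in terms of design features, user experience, and care outcomes; it notes that current systems are used mostly for voice interaction, with limited multimodality and physical autonomy, and calls for care-specific evaluation standards and responsible autonomy.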

Deterministic Execution of ROS 2 Applications via Lingua Franca

Harun Teper, Shaokai Lin, Shulu Li, Edward A. Lee, Jian-Jia Chen

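Affiliations: TU Dortmund University; University of California, Berkeley; RWTH Aachen University

AI summary: Proposes a framework that converts unmodified ROS 2 applications into Lingua Franca programs and uses logical time to achieve deterministic execution, addressing the nondeterminism of callback execution order and message interleaving in ROS 2.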

Abstract

The Robot Operating System 2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze safety and provide guarantees. We present a framework that converts an unmodified ROS 2 application and runs it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same execution order. We first describe which ROS 2 features can be executed deterministically under logical time. These features make it possible to build an automatic conversion framework that extracts information from a ROS 2 application and directly converts it into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to make the ROS 2 application execute in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that, in default ROS 2, the order in which callbacks are executed differs and end-to-end latencies vary across executions. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.
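
The idea of ordering execution by logical time rather than by arrival can be illustrated with a small sketch. This is not the authors' framework and does not use the Lingua Franca or ROS 2 APIs; the tag layout `(logical_time, priority, callback_name)` and the callback names are hypothetical, chosen only to show that processing events in tag order gives the same execution order regardless of how messages happen to interleave.

```python
import heapq
import random

def run_in_tag_order(events):
    """Return callback names in (logical_time, priority, name) order.

    The arrival order of `events` is ignored: every event carries a tag,
    and the tag alone decides when its callback runs.
    """
    heap = list(events)          # copy so the caller's list is untouched
    heapq.heapify(heap)
    order = []
    while heap:
        _logical_time, _priority, name = heapq.heappop(heap)
        order.append(name)
    return order

# Hypothetical tagged events: (logical_time, priority, callback_name)
events = [
    (10, 0, "lidar_cb"),
    (10, 1, "camera_cb"),
    (20, 0, "planner_cb"),
    (5, 0, "timer_cb"),
]

# Simulate a nondeterministic arrival order, as with network interleaving.
shuffled = events[:]
random.shuffle(shuffled)

# Both arrival orders produce the same execution order.
assert run_in_tag_order(shuffled) == run_in_tag_order(events)
print(run_in_tag_order(shuffled))
# ['timer_cb', 'lidar_cb', 'camera_cb', 'planner_cb']
```

A plain first-come-first-served queue would instead run callbacks in whatever order `shuffled` happened to have, which is the kind of run-to-run variation the abstract describes for default ROS 2.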

2606.00097 2026-06-10 cs.RO cs.MA Updated version

RocketSmith: An Agentic System for High-Powered Rocket Design and Manufacturing

Peter Pak, Jesse Barkley, Rumi Loghmani, Derek Baich, Ananya Pamal, Amir Barati Farimani

Affiliations: Graduate Research Assistant, Mechanical Engineering; AI Fellow, Mechanical Engineering; Undergraduate Student, Mechanical Engineering; Senior Member, Pittsburgh Prefecture One; Russell V. Trader Associate Professor, Mechanical Engineering

AI summary: This paper presents RocketSmith, an agentic framework for automated design, manufacturing, and optimization that uses subagents and skills to optimize flight parameters in zero-shot and human-in-the-loop workflows, and that was used with additive manufacturing to develop and flight-test four high-powered rockets.

Abstract

This work presents RocketSmith, an agentic system capable of carrying out the design, manufacturing, and optimization processes in high-powered rocket development. The system enables the intelligent automation of software tools, not only to validate factors such as flight stability but also to generate the parametric design components for the rocket assembly. A collection of subagents and skills enables iterative optimization of flight parameters in both zero-shot and human-in-the-loop workflows. With this system, four distinct high-powered rockets with various motor and assembly configurations were developed utilizing the unique design capabilities of additive manufacturing. These assembly components were fabricated using various FDM printers, manually evaluated for flight readiness, and flight tested at a launch event. In these tests, all rockets achieved a stable launch, and two of the four rockets were successfully recovered in reflyable condition. Within the collected flight data, an 84% accuracy was achieved when comparing measured apogee to that calculated in flight simulations.

2602.16898 2026-06-10 cs.RO cs.AI cs.CV cs.LG Updated version

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

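Affiliations: Department of Electrical Engineering, Sharif University of Technology

AI summary: MALLVI uses multi-agent collaboration to achieve closed-loop, feedback-driven robot manipulation, improving generalization and zero-shot task success rates.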

Comments Some fundamental changes in text and codebase

2603.04056 2026-06-10 cs.CV cs.RO Updated version

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

Martin Kvisvik Larsen, Oscar Pizarro

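Affiliations: Department of Marine Technology; Norwegian University of Science and Technology; Trondheim, Norway

AI summary: This paper presents a curated dataset for long-term visual localization in benthic environments and a footprint-based ground-truthing method, evaluates eight state-of-the-art visual place recognition methods, and finds that their Recall@K on this dataset is significantly lower than on established benchmarks.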

DOI: 10.3389/frobt.2026.1821019
Journal ref: Frontiers in Robotics and AI Volume 13 (2026) 1821019

Abstract

Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
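
To make the Recall@K comparison concrete, here is a minimal sketch. It assumes the common VPR convention that a query counts as a hit when at least one ground-truth match appears among its top K retrievals; the paper's exact definition may differ. The query and database IDs and both ground-truth sets are made up for illustration: `distance_gt` stands in for a distance-threshold ground truth and `footprint_gt` for a footprint-overlap ground truth.

```python
def recall_at_k(rankings, positives, k):
    """Fraction of queries with at least one ground-truth match in the top k.

    rankings:  dict mapping query id -> list of database ids, best match first
    positives: dict mapping query id -> set of ground-truth database ids
    Queries without any ground-truth match are skipped.
    """
    scored = [q for q in rankings if positives.get(q)]
    if not scored:
        return 0.0
    hits = sum(1 for q in scored if set(rankings[q][:k]) & positives[q])
    return hits / len(scored)

# Hypothetical retrieval results for two queries.
rankings = {
    "q0": ["d1", "d2", "d3"],
    "q1": ["d4", "d5", "d6"],
}

# Distance threshold: d1 is close to q0, so it is labeled a match.
distance_gt = {"q0": {"d1", "d2"}, "q1": {"d4"}}
# Footprint overlap: d1 is close to q0 but shares no seafloor content with it.
footprint_gt = {"q0": {"d2"}, "q1": {"d4"}}

print(recall_at_k(rankings, distance_gt, 1))   # 1.0
print(recall_at_k(rankings, footprint_gt, 1))  # 0.5
```

In this toy case the distance-threshold labels credit a retrieval that shares no visual content with the query, so Recall@1 comes out higher than under the footprint-based labels, which mirrors the overestimation effect the abstract reports.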

2508.00491 2026-06-10 cs.RO cs.AI Updated version

HannesImitation: Grasping with the Hannes Prosthetic Hand via Imitation Learning

Carlo Alessi, Federico Vasile, Federico Ceola, Giulia Pasquale, Nicolò Boccardo, Lorenzo Natale

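Affiliations: Humanoid Sensing and Perception; Istituto Italiano di Tecnologia; Rehab Technologies Lab

AI summary: This paper presents HannesImitationPolicy, which uses imitation learning to control the Hannes prosthetic hand for grasping objects in unstructured environments, and introduces HannesImitationDataset for training; experiments show that it outperforms a segmentation-based visual servo controller in unstructured scenarios.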

Comments Paper accepted at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems, Hangzhou, China, 2025

Abstract

Recent advancements in control of prosthetic hands have focused on increasing autonomy through the use of cameras and other sensory inputs. These systems aim to reduce the cognitive load on the user by automatically controlling certain degrees of freedom. In robotics, imitation learning has emerged as a promising approach for learning grasping and complex manipulation tasks while simplifying data collection. Its application to the control of prosthetic hands remains, however, largely unexplored. Bridging this gap could enhance dexterity restoration and enable prosthetic devices to operate in more unconstrained scenarios, where tasks are learned from demonstrations rather than relying on manually annotated sequences. To this end, we present HannesImitationPolicy, an imitation learning-based method to control the Hannes prosthetic hand, enabling object grasping in unstructured environments. Moreover, we introduce the HannesImitationDataset comprising grasping demonstrations in table, shelf, and human-to-prosthesis handover scenarios. We leverage such data to train a single diffusion policy and deploy it on the prosthetic hand to predict the wrist orientation and hand closure for grasping. Experimental evaluation demonstrates successful grasps across diverse objects and conditions. Finally, we show that the policy outperforms a segmentation-based visual servo controller in unstructured scenarios. Additional material is provided on our project page: https://hsp-iit.github.io/HannesImitation

1. Robot Learning, Imitation, and Reinforcement Learning (9 papers)

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation

Task Robustness via Re-Labelling Vision-Action Robot Data

Baseline-Free Policy Optimization for Neural Combinatorial Optimization

On-sky demonstration of reinforcement learning for adaptive optics control

Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies

VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies

2. Motion Planning, Control, and Dynamics (12 papers)

Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

Locomotion analysis of a quadruped interacting with the lunar granular surface

MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds

LieIPM: Lie Group Interior Point Method for Direct Trajectory Optimization of Rigid Bodies

Gradient based Bilevel for Inverse Optimal Control, a Riemannian approach

A Spiking Neural Architecture for Coordinating Arm and Locomotor Control

Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators

Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making

Scalable and General Whole-Body Control for Cross-Humanoid Locomotion

Adaptive Artificial Time-Delay Control with Barrier Lyapunov Constraints for Euler-Lagrange Robots

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain

3. Manipulation, Grasping, and Dexterous Hands (13 papers)

Robotic Nonprehensile Object Transportation with a Hanging Tray

YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization

Bridging Semantics and Physical Execution: A Neuro-Symbolic Framework for Multi-Pair Robotic Assembly

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation

TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation

HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands

ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction

Going with the Flow: Koopman Behavioral Models as Pseudo Planners for Visuo-Motor Dexterity

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

4. Navigation, Localization, and SLAM (8 papers)

Rethinking Embodied Navigation via Relational Inductive Bias

Information-Preserving Continuous Occupancy Mapping with Variance-Weighted Submap Joining

AgenticNav: Zero-Shot Vision-and-Language Navigation as a Tool-Calling Harness

GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation

AgniNav: Configuration-Driven Cross-Embodiment Local Planning for Robot Navigation

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization

FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand

5. Human-Robot Interaction and Collaborative Robots (5 papers)

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction

EM-Fall: Embodied mmWave Sensing for Day-and-Night Fall Detection on Humanoid Robots

Equanimity in HRI: Applying Calm Technology Principles to Human-Robot Interaction

RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI

6. Embodied Intelligence and Vision-Language-Action Models (7 papers)

OMG: Omni-Modal Motion Generation for Generalist Humanoid Control

VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

7. Multi-Robot and Swarm Systems (3 papers)

Multi-UAV Active Sensing with Information Gain-based Planning and Belief Fusion

A Distributed Multi-UGV Exploration Framework With Loop-Aware Planning and Descriptor-Aided Localization in Resource-Limited Environments

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

8. Autonomous Vehicles, UAVs, and Mobile Robots (10 papers)

Self-Supervised Relevance Modelling in Autonomous Driving via Counterfactual Analysis

Vehicle Prediction Model for Enhanced MPC Path Tracking in Formula Student Driverless

Pushing the Performance Limits in Autonomous Racing: Continuous Stability-Aware Adaptive Velocity Planning in Formula Student Driverless

An Exposure-Time-Aligned Primary-Path Architecture for Autonomous-Driving ECUs

Embodiment-conditioned Generalist Control for Multirotor Aerial Robots

Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection

Language-Driven Cost Optimization for Autonomous Driving

Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving

RAPTOR: Rapid Aerial Pickup and Transport of Objects by Robots

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

9. Soft Robotics and Hardware Design (3 papers)

Rod models in continuum and soft robot control: a review

CableRobotGraphSim: A Graph Neural Network for Modeling Partially Observable Cable-Driven Robot Dynamics

BiPneu: Design and Control of a Bipolar-Pressure Pneumatic System for Soft Robots

10. Simulation, Datasets, and Evaluation (6 papers)