arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Robot Learning, Imitation Learning, and Reinforcement Learning (12 papers)

2606.13817 2026-06-15 cs.RO cs.LG New submission

FlowMo-WM: A World Model with Object Momentum and Hidden Ambient Drift

FlowMo-WM:具有物体动量和隐藏环境漂移的世界模型

Yitao Jiang, Luyang Zhao, Muhao Chen, Devin Balkcom

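Affiliations: Dartmouth College; Clemson University; University of Houston

AI summary: Proposes FlowMo-WM, an end-to-end trainable visual world model that factorizes image-action history into a short-history latent state and a long-history context, modeling object motion and ambient drift separately to improve long-horizon prediction accuracy in settings such as aquatic surface vehicles.

AI Chinese abstract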

机器人学习中的世界模型根据视觉观察和动作预测未来状态,使智能体能够推理其控制后果。然而,许多动作条件模型在运动由即时控制主导的场景中评估,而水面航行器和其他真实世界物体在惯性下持续运动,并被水流或风等隐藏环境漂移所位移。我们提出FlowMo-WM,一种端到端可训练的视觉世界模型,无需流场直接监督,从图像-动作历史中推断以物体为中心的运动状态和与隐藏漂移相关的预测性长历史上下文。FlowMo-WM将图像-动作历史分解为短历史潜在状态(训练以总结以物体为中心的运动)和长历史上下文(训练以总结缓慢变化的外生影响)。在潜在展开期间,零上下文残差转移将动作条件基础动力学与上下文相关的漂移效应分离。在具有多样隐藏流、干扰和随机化车辆动力学的模拟水面航行器环境中,FlowMo-WM相比代表性动作条件潜在世界模型提高了长程展开精度。预测时上下文消融实验(在展开过程中将推断的上下文置零或打乱)表明,环境上下文对于隐藏漂移下的稳定预测至关重要,而冻结线性探针则表征了学习因子中编码的信息。

English abstract

World models in robot learning predict future states from visual observations and actions, enabling agents to reason about the consequences of their controls. However, many action-conditioned models are evaluated in settings where motion is dominated by immediate control, whereas aquatic surface vehicles and other real-world objects continue moving under inertia and are displaced by hidden ambient drift, such as water currents or wind. We propose FlowMo-WM, an end-to-end trainable visual world model that infers object-centric motion state and a predictive long-history context associated with hidden drift from image-action histories without direct supervision of flow fields. FlowMo-WM factorizes image-action history into a short-history latent state, trained to summarize object-centric motion, and a longer-history context, trained to summarize slowly varying exogenous influences. A zero-context residual transition separates action-conditioned base dynamics from context-dependent drift effects during latent rollout. In simulated aquatic surface-vehicle environments with diverse hidden flows, disturbances, and randomized vehicle dynamics, FlowMo-WM improves long-horizon rollout accuracy over representative action-conditioned latent world models. Prediction-time context ablations, in which the inferred context is zeroed or shuffled during rollout, show that the ambient context is important for stable prediction under hidden drift, while frozen linear probes characterize information encoded in the learned factors.

2606.13856 2026-06-15 cs.RO New submission

Output-Level Regularization Eliminates the Seed Lottery in Single-GPU VLA Fine-Tuning

输出级正则化消除单GPU VLA微调中的种子彩票

Jeffrin Sam, Dzmitry Tsetserukou

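Affiliations: Skolkovo Institute of Science and Technology (Skoltech)

AI summary: Identifies a seed lottery when fine-tuning VLA-JEPA on a single GPU (one random seed silently drops the success rate by about 29 percentage points), attributes it to output collapse, and shows that output-level regularization (VICReg, Dropout, or a halved learning rate) eliminates every catastrophic seed observed.

Comments: 10 pages, 8 figures, submitted to CoRL 2026

AI Chinese abstract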

在单GPU上微调视觉-语言-动作模型(VLA-JEPA)本应简单:加载预训练检查点、运行训练、部署。但存在一个隐藏风险。使用相同数据和架构、不同随机种子运行同一微调代码十三次,其中十二次产生的机器人成功率为91-94%,而一次运行无声地降至65.2%:29个百分点的差距,无错误消息、无警告,且无法预测哪个种子会失败。我们称此为种子彩票。我们将原因追溯到输出坍塌:动作预测器学会产生几乎相同的输出,无论机器人看到什么。现有的权重级方法(L2、EWC)在结构上对此坍塌视而不见——它们惩罚权重变化,但坍塌发生在权重可自由移动而不影响输出的方向上,我们通过Jacobian零空间形式化了这一差距。在7种方法×最多13个种子×3个LIBERO基准测试中,三种输出级正则化器——VICReg(n=12个种子)、Dropout(n=4)和减半学习率(n=5)——各自消除了所有灾难性种子(0/21次联合坍塌 vs. 基线1/13次;F(12,11)=28.7,p<0.001),而权重级方法(L2、EWC)保留了彩票。最简单的修复是在优化器配置中更改一个数字。

English abstract

Fine-tuning a vision-language-action model (VLA-JEPA) on a single GPU should be simple: load a pretrained checkpoint, run training, deploy. There is a hidden danger. Run the same fine-tuning code thirteen times -- same data, same architecture, different random seed -- and twelve runs produce a robot succeeding 91--94% of the time, while one run silently degrades to 65.2%: a 29 pp gap with no error message, no warning, and no way to predict which seed will fail. We call this the seed lottery. We trace the cause to output collapse: the action predictor quietly learns to produce nearly identical outputs regardless of what the robot sees. Existing weight-level methods (L2, EWC) are structurally blind to this collapse -- they penalize weight changes, but collapse occurs in directions weights can move freely without affecting outputs, a gap we formalize via the Jacobian null-space. Across 7 methods x up to 13 seeds x 3 LIBERO benchmarks, three output-level regularizers -- VICReg (n=12 seeds), Dropout (n=4), and a halved learning rate (n=5) -- each eliminate every catastrophic seed (0/21 combined collapses vs. 1/13 Baseline; F(12,11)=28.7, p<0.001), while weight-level methods (L2, EWC) preserve the lottery. The simplest fix is changing one number in your optimizer config.
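The output collapse described in this abstract can be illustrated with a small sketch. The block below computes a VICReg-style variance hinge over a batch of predicted actions: each output dimension is penalized when its standard deviation across the batch falls below a target. The function name `variance_hinge`, the target `gamma`, and the toy batches are assumptions made for illustration; the abstract does not say how the authors configure their regularizer.

```python
import math


def variance_hinge(outputs, gamma=1.0, eps=1e-4):
    """Mean over output dimensions of max(0, gamma - std across the batch)."""
    n = len(outputs)
    dims = len(outputs[0])
    penalty = 0.0
    for j in range(dims):
        column = [row[j] for row in outputs]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        penalty += max(0.0, gamma - math.sqrt(var + eps))
    return penalty / dims


# Hypothetical action batches: one nearly constant, one that varies with the input.
collapsed = [[0.50, -0.20], [0.51, -0.20], [0.50, -0.21], [0.49, -0.20]]
diverse = [[2.0, -1.5], [-1.0, 2.5], [0.5, -3.0], [-2.5, 1.0]]

print(variance_hinge(collapsed))  # close to gamma: the batch has collapsed
print(variance_hinge(diverse))    # zero: every dimension already varies enough
```

A penalty of this kind depends only on the outputs, so it reacts to collapse even when the weights move in directions that a weight-level penalty such as L2 does not constrain.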

2606.13970 2026-06-15 cs.RO cs.LG New submission

An Attention-based Model for Robust Forecasting with Missing Modality

基于注意力的缺失模态鲁棒预测模型

Zhitian Zhang, Wenjie Zi, Yunduz Rakhmangulova, Saghar Irandoust, Hossein Hajimirsadeghi, Thibaut Durand

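Affiliations: Simon Fraser University; RBC Borealis

AI summary: Proposes a multimodal model based on a conditional variational autoencoder and a transformer that uses attention to learn a unified, fixed-dimensional representation, handles missing modalities during both training and inference, and outperforms prior multimodal fusion approaches on human trajectory prediction and robot manipulation forecasting.

Comments: Work originally done in 2023

AI Chinese abstract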

在缺失模态下的学习是多模态机器人学习中的一个基本挑战,因为现实世界的机器人系统通常运行在传感器数据不完整的环境中。基于注意力的模型在处理多模态数据时具有吸引力,因为它们可以用单一骨干网络处理多种模态。然而,大多数多模态模型假设在训练和推理过程中所有模态都可用,限制了它们在机器人感知和决策中的适用性。在本文中,我们介绍了一种多模态模型,旨在在训练和推理过程中处理缺失模态。该模型被表述为条件变分自编码器(CVAE),并采用基于Transformer的架构,利用注意力机制学习统一的固定维度表示,即使某些模态缺失。我们表明,所提出的模型可以在缺失模态的情况下进行训练,同时逼近所有模态的鲁棒表示。我们在五个多模态数据集上评估了我们的方法,涉及两个机器人学习任务:人类轨迹预测和机器人操作预测。实验结果表明,我们的模型有效地从不完整数据中学习,并且优于先前的多模态融合方法。

English abstract

Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.

2606.14255 2026-06-15 cs.RO New submission

ReactVLA: Fast and Lightweight Reactive Robot Manipulation via Improved Mean Flow Action Generation

ReactVLA: 通过改进的平均流动作生成实现快速轻量级反应式机器人操作

Yanzhao Guo, Wenkai Chen, Jianwei Zhang

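Affiliations: Shanghai Jiao Tong University; Technical Aspects of Multimodal Systems (TAMS), Department of Informatics, Universität Hamburg

AI summary: Proposes ReactVLA, a framework that combines an improved Mean Flow action generator with an Attention Residuals mechanism for lightweight, low-latency real-time robot manipulation, achieving up to a 1.65x improvement in task performance and more than a 4x increase in inference speed on challenging precision manipulation tasks.

AI Chinese abstract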

基于扩散的视觉-语言-动作(VLA)策略在建模表达性和多模态动作分布方面表现出强大的能力。然而,它们对迭代采样的依赖引入了显著的推理延迟,限制了其在反应式闭环机器人操作中的应用。为了解决这一限制,我们提出了ReactVLA,一个用于实时机器人操作的轻量级低延迟VLA框架。ReactVLA结合了两种互补设计:(1)改进的平均流(iMF)动作生成器,将昂贵的多步扩散采样减少到一步到几步的动作生成;(2)注意力残差(AttnRes),一种动态的深度特征路由机制,取代均匀残差累积,以更好地保留任务相关的多模态表示。我们在大规模模拟基准(包括LIBERO和RoboIMI)以及真实世界机器人操作任务上评估了ReactVLA。实验结果表明,ReactVLA始终优于同等规模的VLA基线,包括SmolVLA和$\pi_0$。在具有挑战性的精密操作任务中,与领先的VLA模型相比,ReactVLA在任务性能上实现了高达1.65倍的提升,同时推理速度提高了4倍以上。最后,它将真实世界策略延迟降低到38.6毫秒以下,从而在物理机器人平台上实现快速反应控制。项目网站:https://game-loader.github.io/ReactVLA/

English abstract

Diffusion-based Vision-Language-Action (VLA) policies have demonstrated strong capability in modeling expressive and multimodal action distributions. However, their reliance on iterative sampling introduces substantial inference latency, which limits their applicability to reactive closed-loop robot manipulation. To address this limitation, we propose ReactVLA, a lightweight and low-latency VLA framework for real-time robotic manipulation. ReactVLA combines two complementary designs: (1) an improved Mean Flow (iMF) action generator that reduces expensive multi-step diffusion sampling to one-to-few-step action generation, and (2) Attention Residuals (AttnRes), a dynamic depth-wise feature routing mechanism that replaces uniform residual accumulation to better preserve task-relevant multimodal representations. We evaluate ReactVLA on large-scale simulation benchmarks, including LIBERO and RoboIMI, as well as real-world robotic manipulation tasks. Experimental results show that ReactVLA consistently outperforms similarly sized VLA baselines, including SmolVLA and $\pi_0$. On challenging precision manipulation tasks, ReactVLA achieves up to a 1.65x improvement in task performance while providing more than a 4x increase in inference speed compared with leading VLA models. Finally, it reduces real-world policy latency to below 38.6 ms, enabling fast reactive control on physical robot platforms. Please check out our project website at: https://game-loader.github.io/ReactVLA/.

2606.14375 2026-06-15 cs.RO cs.AI New submission

Elastic Queries Reinforcement Learning: Self-Aware Policy Execution for VLA Models

弹性查询强化学习:VLA模型的自我感知策略执行

Ge Wang, Xinyu Tan, Xiang Li, Man Luo, Chengsi Yao, Shenhao Yan, Jiahao Yang, Fan Feng, Honghao Cai, Xiangyuan Wang, Zhixin Mai, Yiming Zhao, Yatong Han, Zhen Li

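Affiliations: Ising AI; CUHK-Shenzhen (The Chinese University of Hong Kong, Shenzhen); PKU (Peking University)

AI summary: Proposes Elastic Queries Reinforcement Learning (EQRL), in which a lightweight latent-schedule adaptor dynamically adjusts the VLA model's inference steps and action chunk length, uses critic ensemble disagreement to estimate state difficulty, and reduces inference cost while preserving or improving task success.

AI Chinese abstract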

视觉-语言-动作(VLA)模型是机器人操作中强大的动作生成器,但通常以固定的推理和重新规划调度执行。这种刚性忽略了机器人控制的不均匀难度:接触密集或不确定状态可能需要更多计算和更新鲜的反馈,而较容易的状态通常可以用更少的推理步骤和更长的开环执行来处理。我们提出弹性查询强化学习(EQRL),一个使每个VLA策略查询具有弹性的框架。一个轻量级的潜在调度适配器联合选择潜在输入、去噪预算和动作块长度,无需微调底层VLA模型。为了使调度具有难度感知,EQRL在联合潜在调度动作上训练一个评论家,并从评论家集成分歧中推导出状态难度信号。该信号引导计算资源向困难状态倾斜,而学习到的残差允许任务驱动的修正。我们将可变块执行形式化为查询级宏动作强化学习,具有块依赖的折扣和摊销的函数评估次数(NFE)预算。在仿真和真实机器人操作中,EQRL在保持或提高任务成功率的同时,降低了摊销推理成本。

English abstract

Vision-language-action (VLA) models are powerful action generators for robot manipulation, but they are typically executed with fixed inference and replanning schedules. This rigidity ignores the uneven difficulty of robot control: contact-rich or uncertain states may need more computation and fresher feedback, while easier states can often be handled with fewer inference steps and longer open-loop execution. We propose Elastic Queries Reinforcement Learning (EQRL), a framework that makes each VLA policy query elastic. A lightweight latent-schedule adaptor jointly selects the latent input, denoising budget, and action chunk length, without fine-tuning the underlying VLA model. To make scheduling difficulty-aware, EQRL trains a critic over the joint latent-schedule action and derives a state difficulty signal from critic ensemble disagreement. This signal guides compute toward difficult states, while a learned residual allows task-driven correction. We formulate variable chunk execution as query-level macro-action RL with chunk-dependent discounting and an amortized number-of-function-evaluations (NFE) budget. Across simulation and real-robot manipulation, EQRL reduces amortized inference cost while preserving or improving task success.
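The difficulty signal in this abstract, critic ensemble disagreement, can be sketched in a few lines. The block below uses the spread of several critics' value estimates as a difficulty score and maps it to a denoising budget and an action chunk length with a hand-written rule. In the paper this mapping is produced by a learned adaptor with a residual correction, so the `schedule` function, its bounds, and the `scale` parameter are purely hypothetical and only show the intended direction: more compute and shorter chunks when the critics disagree.

```python
import statistics


def difficulty(q_values):
    """Disagreement among ensemble critics, measured as population std."""
    return statistics.pstdev(q_values)


def schedule(q_values, max_chunk=16, min_chunk=2, max_steps=8, min_steps=1, scale=1.0):
    """Map disagreement to (action chunk length, denoising steps)."""
    d = min(1.0, difficulty(q_values) / scale)  # clip difficulty to [0, 1]
    chunk = round(max_chunk - d * (max_chunk - min_chunk))
    steps = round(min_steps + d * (max_steps - min_steps))
    return chunk, steps


easy_state = [1.0, 1.0, 1.0]   # critics agree: long open-loop chunk, few steps
hard_state = [0.0, 2.0, 4.0]   # critics disagree: short chunk, more steps

print(schedule(easy_state))
print(schedule(hard_state))
```

With this rule, agreement among critics yields the longest chunk and the smallest denoising budget, and strong disagreement yields the opposite.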

2606.14665 2026-06-15 cs.RO New submission

EgoGuide: Egocentric Guidance for Efficient Robot-Free Demonstration Collection and Learning

EgoGuide: 以自我为中心引导的高效无机器人演示收集与学习

Yue Xu, Mingtao Nie, Tianle Li, Hong Li, Yibo Luo, Siyuan Huang, Yong-Lu Li

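Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Beijing Institute for General Artificial Intelligence (BIGAI)

AI summary: Proposes EgoGuide, a data collection interface that records synchronized wrist and head/egocentric observations with online visual-geometric quality guidance and, together with a Gated Egocentric Residual Policy, reduces the number of demonstration episodes required and improves data efficiency.

AI Chinese abstract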

目前,从真实世界演示中进行的机器人学习受到数据扩展的限制。通用操作接口(UMI)提供了一种高效的无机器人数据收集接口,然而当前的UMI风格流程通常收集冗余的演示,并且缺乏全局场景上下文。为了提高数据效率,我们提出了EgoGuide,一种收集接口,它记录同步的腕部和头部/自我中心观察,并将其与在线视觉-几何数据质量引导相结合。我们还引入了一种门控自我中心残差策略,用于从视角变化的自我中心相机中进行鲁棒学习,允许头部/自我中心上下文纠正模糊的局部观察,同时保持稳定的腕部视角控制。真实世界实验表明,EgoGuide减少了所需的数据回合数并提高了数据效率。残差策略进一步提高了视觉遮挡下的鲁棒性。项目页面:https://silicx.github.io/EgoGuide

English abstract

Robot learning from real-world demonstrations is currently constrained by data scaling. Universal Manipulation Interface (UMI) provides an efficient robot-free data collection interface, yet current UMI-style pipelines often collect redundant demonstrations and lack global scene context. To improve data efficiency, we present EgoGuide, a collection interface that records synchronized wrist and head/egocentric observations and couples them with online visual-geometric data quality guidance. We also introduce a Gated Egocentric Residual Policy for robust learning from a viewpoint-varying egocentric camera, allowing head/egocentric context to correct ambiguous local observations while preserving stable wrist-view control. Real-world experiments show that EgoGuide reduces the required number of data episodes and improves data efficiency. The residual policy further improves robustness under visual occlusion. Project Page: https://silicx.github.io/EgoGuide

2606.14418 2026-06-15 cs.AI cs.LG cs.RO Cross-list

Causal Object-Centric Models for Planning with Monte Carlo Tree Search

用于蒙特卡洛树搜索规划的因果对象中心模型

Rodion Vakhitov, Leonid Ugadiarov, Alexey Skrynnik, Aleksandr Panov

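Affiliations: MIRAI CogAILab

AI summary: Proposes COMET, an algorithm that pairs an unsupervised object-centric encoder with a transformer world model and uses an action-slot fusion mechanism and object-causal attention for planning, achieving a higher mean normalized score than baselines early in training across several benchmarks.

AI Chinese abstract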

我们提出了COMET(用于高效树搜索的因果对象中心模型),一种基于模型的强化学习算法,在槽结构化的潜在空间中执行蒙特卡洛树搜索。COMET将冻结的无监督对象中心编码器与基于Transformer的世界模型配对,其中通过一种新颖的动作-槽融合机制将动作绑定到对象上,该机制用于槽转移预测。策略和价值头使用对象因果注意力,通过学习到的每槽相关性分数调节令牌交互,使决策集中在任务相关实体上。COMET为MuZero风格的潜在规划增加了显式的对象级归纳偏差。在来自Object-Centric Visual RL基准、ManiSkill、Robosuite和VizDoom的八个视觉和动态多样化的任务中,COMET在训练早期相比对象中心和单一基线实现了更高的平均归一化分数。

English abstract

We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.

2603.03733 2026-06-15 cs.RO Updated version

X-Loco: Towards Generalist Humanoid Locomotion Control via Synergetic Policy Distillation

X-Loco:通过协同策略蒸馏实现通用人形机器人运动控制

Dewei Wang, Xinmiao Wang, Chenyun Zhang, Jiyuan Shi, Yingnan Zhao, Chenjia Bai, Xuelong Li

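Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes X-Loco, a framework that uses synergetic policy distillation with case-adaptive specialist selection to train a vision-based generalist humanoid locomotion policy integrating upright locomotion, whole-body coordination, and fall recovery, operating solely under velocity commands without reference motions.

Comments: Accepted by RSS 2026. Project page: https://x-loco-humanoid.github.io/

AI Chinese abstract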

尽管近期进展在单个类人技能(如直立行走、跌倒恢复和全身协调)上表现出色,但由于多样化的动力学和冲突的控制目标,学习一个掌握所有这些技能的单一策略仍然具有挑战性。为此,我们引入X-Loco,一个用于训练基于视觉的通用人形运动策略的框架。X-Loco训练多个专家策略,并采用协同策略蒸馏与案例自适应专家选择机制,动态利用多个专家策略来指导基于视觉的学生策略。这种设计使学生能够获得广泛的运动技能,从跌倒恢复到地形穿越和全身协调技能。据我们所知,X-Loco是第一个展示基于视觉的人形运动的框架,该框架联合集成了直立行走、全身协调和跌倒恢复,且仅基于速度命令运行,无需依赖参考运动。实验结果表明,X-Loco实现了卓越的性能,通过跌倒恢复和地形穿越等任务得到证明。消融研究进一步强调,我们的框架有效利用了专家知识并提高了学习效率。

English abstract

While recent advances have demonstrated strong performance in individual humanoid skills such as upright locomotion, fall recovery and whole-body coordination, learning a single policy that masters all these skills remains challenging due to the diverse dynamics and conflicting control objectives involved. To address this, we introduce X-Loco, a framework for training a vision-based generalist humanoid locomotion policy. X-Loco trains multiple oracle specialist policies and adopts a synergetic policy distillation with a case-adaptive specialist selection mechanism, which dynamically leverages multiple specialist policies to guide a vision-based student policy. This design enables the student to acquire a broad spectrum of locomotion skills, ranging from fall recovery to terrain traversal and whole-body coordination skills. To the best of our knowledge, X-Loco is the first framework to demonstrate vision-based humanoid locomotion that jointly integrates upright locomotion, whole-body coordination and fall recovery, while operating solely under velocity commands without relying on reference motions. Experimental results show that X-Loco achieves superior performance, demonstrated by tasks such as fall recovery and terrain traversal. Ablation studies further highlight that our framework effectively leverages specialist expertise and enhances learning efficiency.

2606.04718 2026-06-15 cs.RO cs.AI Updated version

CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation

CoRe-MoE: 面向多地形人形机器人步态适应的对比重加权专家混合

Kailun Huang, Zikang Xie, Yanzhe Xie, Panpan Liao, Fanghai Zhang, Yanheng Mai, Wenhao Xu, Yunheng Wang, Renjing Xu, Haohui Huang, Chenguang Yang

Affiliations: Hong Kong University of Science and Technology (Guangzhou); South China Agricultural University; Guangdong University of Technology

AI summary: Proposes CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation and uses contrastive learning to promote expert specialization, enabling stable humanoid walking and running across multiple terrains.

Comments: Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu, Haohui Huang and Chenguang Yang

AI Chinese abstract

人类主要依靠行走和跑步穿越复杂地形,而无需采用不必要复杂的运动模式。类似地,人形机器人应在行走和跑步之间实现平滑过渡,同时保持自然稳定的运动。然而,由于梯度干扰以及地形相关的视觉和动态变化引起的分布偏移,在单一策略中统一步态转换和多地形适应仍然具有挑战性。尽管专家混合(MoE)架构可以缓解多技能干扰,但简单的联合训练往往无法产生清晰的专家专业化,限制了其有效性。为解决这些问题,我们提出了CoRe-MoE,一个两阶段强化学习框架,将步态生成与地形适应解耦。在第一阶段,学习一个稳定的运动策略,以产生具有平滑过渡的自然行走和跑步行为。在第二阶段,引入一个地形感知的MoE分支,并通过对比目标进行训练以塑造门控网络,使其能够捕捉结构化地形表示并促进专家专业化。最终动作通过基础步态策略和地形感知分支的加权融合获得,使策略在适应复杂地形的同时保持稳定的运动模式。大量仿真结果表明,所提方法在成功率、运动稳定性和多地形适应性方面优于基线方法。此外,在Unitree G1人形机器人上的零样本部署验证了我们框架的有效性,实现了在楼梯、斜坡、台阶、障碍物和非结构化户外地形上的稳健行走和跑步,同时在外界干扰下保持精确的落脚点和动态稳定性。

English abstract

Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.

2606.13675 2026-06-15 cs.RO Updated version

Improving Robotic Generalist Policies via Flow Reversal Steering

通过流反转引导改进机器人通用策略

Andy Tang, William Chen, Andrew Wagenmaker, Chelsea Finn, Sergey Levine

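Affiliations: Stanford University; UC Berkeley

AI summary: Proposes Flow Reversal Steering (FRS), which passes suboptimal actions through the flow policy in reverse to find their latent noises and maps them to the generalist policy's action modes, improving zero-shot control, behavioral cloning, and reinforcement learning.

AI Chinese abstract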

通用策略可以从多样化的机器人数据集中学习广泛的技能。为了解决或改进具有挑战性的新任务,我们需要一种方法从策略丰富的行为先验中推断并调用适当的动作,特别是当直接命令策略失败时。我们专注于流匹配通用策略,并提出流反转引导(FRS):一种方法,它采用次优但“合理”的动作,通过逆向流策略传递它们以找到其潜在噪声,并将它们映射到附近的通用策略动作模式。我们在多个模拟和真实世界的操作设置中评估了FRS。首先,FRS可以将来自人类或视觉语言模型的粗略语义引导转化为相应的良好机器人动作,从而改进零样本控制。这些收益可以通过行为克隆进行蒸馏,通过训练一个辅助策略输出噪声,通用策略将其映射到良好动作——在不到一分钟的训练中显示出高达95%的绝对任务成功率提升。最后,FRS通过用语义知识引导强化学习实现策略改进,在标准强化学习无法改进的多个任务上取得了改进。

English abstract

Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but ``reasonable'' actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions -- showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.

2601.19810 2026-06-15 cs.LG cs.AI cs.RO Updated version

Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals

高效探索的无监督学习:通过自我设定目标预训练自适应策略

Octavio Pappalardo

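Affiliations: University College London (UCL)

AI summary: Proposes ULEE, which combines an in-context learner with an adversarial goal-generation strategy to optimize multi-episode exploration and adaptation within an unsupervised meta-learning framework, improving zero-shot and few-shot performance.

Comments: ICLR 2026; v2 adds link to code: https://github.com/Octavio-Pappalardo/ulee-jax

Journal ref: The Fourteenth International Conference on Learning Representations, 2026

AI Chinese abstract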

无监督预训练可以为强化学习智能体提供先验知识,加速下游任务的学习。一个基于人类发展的有前景方向是研究智能体通过设定和追求自身目标来学习。核心挑战在于如何有效地生成、选择并从这些目标中学习。我们的关注点是下游任务的广泛分布,其中零样本解决每个任务是不可行的。当目标任务位于预训练分布之外或智能体未知其身份时,这种设置自然出现。在这项工作中,我们(i)在元学习框架内优化高效的多回合探索和适应,以及(ii)用智能体适应后性能的演化估计来指导训练课程。我们提出了ULEE,一种无监督元学习方法,它将上下文学习器与对抗性目标生成策略相结合,该策略将训练维持在智能体能力的前沿。在XLand-MiniGrid基准测试中,ULEE预训练产生了改进的探索和适应能力,这些能力泛化到新的目标、环境动态和地图结构。得到的策略获得了改进的零样本和少样本性能,并为更长的微调过程提供了强初始化。它优于从头学习、DIAYN预训练和替代课程。代码可在以下网址获取:https://github.com/Octavio-Pappalardo/ulee-jax

English abstract

Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula. Code is available at: https://github.com/Octavio-Pappalardo/ulee-jax

2605.03065 2026-06-15 cs.LG cs.RO Updated version

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

OGPO:生成控制策略的样本高效全微调

Sarvesh Patil, Mitsuhiko Nakamoto, Manan Agarwal, Shashwat Saxena, Jesse Zhang, Giri Anantharaman, Cleah Winston, Chaoyi Pan, Douglas Chen, Nai-Chieh Huang, Zeynep Temel, Oliver Kroemer, Sergey Levine, Abhishek Gupta, Hongkai Dai, Paarth Shah, Max Simchowitz

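Affiliations: University of California, Berkeley

AI summary: Proposes OGPO, an algorithm that uses off-policy critic networks and a modified PPO objective for sample-efficient fine-tuning of generative control policies, reaching state-of-the-art performance on a range of manipulation tasks and fine-tuning poorly initialized behavior cloning policies without expert data.

AI Chinese abstract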

生成控制策略(GCPs),如基于扩散和基于流的控制策略,已成为机器人学习的有效参数化方法。本文介绍了离策略生成策略优化(OGPO),一种用于微调GCPs的样本高效算法,该算法维护离策略评论网络以最大化数据重用,并通过修改的PPO目标将策略梯度传播到策略的完整生成过程,使用评论网络作为终端奖励。OGPO在涵盖多任务设置、高精度插入和灵巧控制的操作任务上达到了最先进的性能。据我们所知,它也是唯一一种能够在在线回放缓冲区中无专家数据的情况下,将初始化不良的行为克隆策略微调到接近完全任务成功的方法,并且只需很少的任务特定超参数调整。通过广泛的实证研究,我们证明了OGPO在策略引导和残差学习方面显著优于替代方法,并确定了其性能背后的关键机制。我们进一步引入了实用的稳定技巧,包括成功缓冲区正则化、双边保守优势和Q方差减少,以减轻基于状态和基于像素的设置中的评论网络过度利用。除了提出OGPO,我们还对GCP微调进行了系统的实证研究,确定了控制成功离策略全策略改进的稳定机制和失败模式。

English abstract

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.

2. Motion Planning, Control, and Dynamics (7 papers)

2606.13842 2026-06-15 cs.RO New submission

Efficient Domain-Adaptive Policy Learning via Kernel Representation with Application to Quadrotor Control under Non-Stationary Disturbances

基于核表示的高效域自适应策略学习及其在非平稳扰动下四旋翼控制中的应用

Hongyu Zhou, Mingtian Tan, Vasileios Tzoumas

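Affiliations: University of Michigan, Ann Arbor

AI summary: Proposes an efficient domain-adaptive policy learning algorithm based on kernel representations that models unknown disturbances with random Fourier features, needs only 50 seconds of offline training, updates parameters in real time through online least-squares estimation, and handles non-stationary disturbances in quadrotor trajectory tracking.

AI Chinese abstract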

我们提出了一种基于核表示的高效域自适应策略学习算法。学习域自适应策略具有挑战性,因为它需要一种环境表示,既能足够表达以在离线训练期间建模复杂的模拟到现实差距,又能在部署期间支持快速在线适应。例如,四旋翼可能遇到时变的非平稳扰动,如突然阵风、载荷变化或在不同飞行状态(有无地面效应)之间的转换。为了解决这些挑战,我们使用基于随机傅里叶特征的可微核近似来建模未知扰动。在离线训练阶段,我们随机采样核系数和带宽参数以生成丰富多样的扰动分布。然后通过可微仿真和解析梯度优化控制策略,该过程在RTX 4090 GPU上仅需50秒训练时间。在硬件部署期间,策略通过在线最小二乘估计更新核系数和带宽,实时适应非平稳环境。我们在高保真数值仿真和Crazyflie硬件实验中评估了该方法,在包括复杂气动效应、风、地面效应和载荷波动等各种扰动下进行四旋翼轨迹跟踪任务。

English abstract

We present an algorithm for efficient domain-adaptive policy learning via kernel representations. Learning domain-adaptive policies is challenging since it requires an environment representation that is both sufficiently expressive to model complex sim-to-real gaps during offline training, and computationally efficient enough to support rapid online adaptation during deployment. For instance, a quadrotor may encounter time-varying, non-stationary disturbances, such as sudden gusts of wind, payload shifts, or transitions between distinct flight regimes with and without ground effects. To address these challenges, we model unknown disturbances using a differentiable kernel approximation based on random Fourier features. During the offline training phase, we randomly sample kernel coefficients and bandwidth parameters to generate a rich diversity of disturbance profiles. We then optimize the control policy via differentiable simulation with analytical gradients, a process that takes only 50 seconds of training time on an RTX 4090 GPU. During hardware deployment, the policy adapts to non-stationary environments in real time by updating both the kernel coefficients and bandwidth through online least-squares estimation. We evaluate our method on quadrotor trajectory tracking tasks across high-fidelity numerical simulations and hardware experiments using Crazyflie, subjected to various disturbances, including complex aerodynamic effects, wind, ground effects, and payload fluctuations.
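The two ingredients named in this abstract, a random Fourier feature model of the disturbance and online least-squares adaptation, can be sketched on a one-dimensional toy problem. The block below samples one disturbance profile from random kernel coefficients and then recovers it from streaming measurements with recursive least squares. It is a simplified illustration under stated assumptions: the bandwidth is held fixed (the abstract says the method also updates the bandwidth online), the input is a scalar, and all names (`make_feature_map`, `RecursiveLeastSquares`, `prior_variance`) are hypothetical.

```python
import math
import random


def make_feature_map(num_features, bandwidth, rng):
    """Random Fourier features approximating a Gaussian kernel of the given bandwidth."""
    freqs = [rng.gauss(0.0, 1.0) / bandwidth for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def phi(x):
        return [scale * math.cos(w * x + b) for w, b in zip(freqs, phases)]

    return phi


class RecursiveLeastSquares:
    """Online least-squares estimate of the kernel coefficients (no forgetting)."""

    def __init__(self, dim, prior_variance=1e3):
        self.w = [0.0] * dim
        self.P = [[prior_variance if i == j else 0.0 for j in range(dim)]
                  for i in range(dim)]

    def predict(self, features):
        return sum(w * f for w, f in zip(self.w, features))

    def update(self, features, measurement):
        p_phi = [sum(p * f for p, f in zip(row, features)) for row in self.P]
        denom = 1.0 + sum(f * v for f, v in zip(features, p_phi))
        gain = [v / denom for v in p_phi]
        error = measurement - self.predict(features)
        self.w = [w + g * error for w, g in zip(self.w, gain)]
        self.P = [[self.P[i][j] - gain[i] * p_phi[j] for j in range(len(gain))]
                  for i in range(len(gain))]


rng = random.Random(0)
phi = make_feature_map(num_features=8, bandwidth=1.0, rng=rng)
true_coeffs = [rng.gauss(0.0, 1.0) for _ in range(8)]  # one sampled disturbance profile


def disturbance(x):
    return sum(c * f for c, f in zip(true_coeffs, phi(x)))


estimator = RecursiveLeastSquares(dim=8)
inputs = [rng.uniform(-3.0, 3.0) for _ in range(200)]
for x in inputs:  # streaming measurements, one cheap update per sample
    estimator.update(phi(x), disturbance(x))

mean_abs_error = sum(abs(estimator.predict(phi(x)) - disturbance(x))
                     for x in inputs) / len(inputs)
print(mean_abs_error)
```

Because the model is linear in the coefficients, each update costs only a few small matrix-vector products, which is what makes this kind of adaptation plausible in real time.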

2606.13915 2026-06-15 cs.RO cs.SY eess.SY New submission

Learning Dynamic Swing-Up of an Inverted Pendulum using Remote Magnetic Actuation

利用远程磁驱动学习倒立摆的动态摆动控制

Viacheslav Sydora, Jasan Zughaibi, Denis von Arx, Quentin Boehler, Michael Muehlebach

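Affiliations: University of Zurich; ETH Zurich; University of Strasbourg

AI summary: Addresses the lack of far-from-equilibrium trajectory tracking for electromagnetic navigation systems by combining trajectory optimization, time-varying LQR, and iterative learning control, demonstrating the first magnetically actuated inverted-pendulum swing-up, succeeding within six iterations, and showing that the learned ILC correction closely matches the torque discrepancy predicted by a high-fidelity magnetic field model.

AI Chinese abstract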

电磁导航系统(eMNS)在微创手术和靶向药物递送中受到广泛关注。尽管大多数文献依赖于这些系统的准静态控制,但近期工作已展示了动态方法的优势。然而,远离平衡态的轨迹跟踪仍未得到充分解决。我们通过使用临床就绪的Navion eMNS首次演示了磁驱动倒立摆的摆动控制,填补了这一空白。尽管倒立摆本身不具有临床相关性,但所提出的方法将扭矩和力作为控制目标,使其适用于其他磁驱动设备,如导管和导丝。我们的方法结合了考虑eMNS内部动力学的轨迹优化、时变线性二次型调节器(LQR)状态反馈和迭代学习控制(ILC),后者利用先前的试验数据和系统动态模型逐步优化前馈指令。尽管单独使用LQR因磁驱动的复杂现象而失败,但ILC在六次迭代内实现了成功摆动。此外,实验后分析表明,学习到的ILC校正与高保真磁场模型校准预测的扭矩偏差高度吻合,表明学习和自适应是处理电磁驱动中不确定性的有前景工具,这些不确定性可能源于患者特定的生理运动模式和磁场模型校准误差。

English abstract

Electromagnetic Navigation Systems (eMNS) have gained considerable attention for minimally invasive surgery and targeted drug delivery. While most of the literature relies on quasi-static control of these systems, recent work has demonstrated the benefits of dynamic approaches. However, trajectory tracking far from equilibrium states remains largely unaddressed. We close this gap by demonstrating the first swing-up of a magnetically actuated inverted pendulum using the clinically-ready Navion eMNS. Although the inverted pendulum is not clinically relevant in itself, the proposed method utilizes torques and forces as control objectives, making it applicable to other magnetically actuated devices such as catheters and guidewires. Our approach combines trajectory optimization that accounts for internal eMNS dynamics with time-varying Linear Quadratic Regulator (LQR) state feedback and Iterative Learning Control (ILC), which leverages previous trial data and the system's dynamic model to progressively refine the feedforward command. While LQR alone fails due to the complex phenomena of magnetic actuation, ILC enables successful swing-up within six iterations. Furthermore, post-experimental analysis reveals that the learned ILC correction closely matches the torque discrepancy predicted by high-fidelity magnetic field model calibration, suggesting learning and adaptation as a promising tool to deal with uncertainties in electromagnetic actuation arising, e.g., from patient-specific physiological motion patterns and field model calibration inaccuracies.
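The role of Iterative Learning Control in this abstract, refining a feedforward command from the tracking error of earlier trials, can be shown on a toy system. The block below applies a basic proportional-type ILC update to a scalar first-order plant whose true input gain differs from the nominal model. It is not the authors' controller (which combines trajectory optimization, time-varying LQR, and a model-based ILC on a pendulum); the plant parameters, `nominal_gain`, and the function names are assumptions chosen so that the update provably contracts the error.

```python
def run_trial(feedforward, a=0.3, b=0.5):
    """Simulate the 'true' plant y[t+1] = a*y[t] + b*u[t], starting at rest."""
    y, outputs = 0.0, []
    for u in feedforward:
        y = a * y + b * u
        outputs.append(y)
    return outputs


def ilc_update(feedforward, errors, learning_gain):
    """Add a scaled copy of the last trial's tracking error to the feedforward."""
    return [u + learning_gain * e for u, e in zip(feedforward, errors)]


horizon = 20
reference = [(t + 1) / horizon for t in range(horizon)]  # ramp to be tracked
nominal_gain = 0.8                  # the model's (inaccurate) input gain; true gain is 0.5
learning_gain = 1.0 / nominal_gain  # invert the nominal model, not the true plant

feedforward = [0.0] * horizon
max_errors = []
for trial in range(7):  # initial trial plus six learning iterations
    outputs = run_trial(feedforward)
    errors = [r - y for r, y in zip(reference, outputs)]
    max_errors.append(max(abs(e) for e in errors))
    feedforward = ilc_update(feedforward, errors, learning_gain)

print(max_errors)
```

For these parameters the trial-to-trial error map has an infinity norm below one, so the peak tracking error shrinks at every iteration even though the model gain is wrong.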

2606.14063 2026-06-15 cs.RO cs.SY eess.SY New submission

Semidefinite Relaxations for Collision-Free Motion Planning

无碰撞运动规划的半定松弛

Bernhard Paus Graesdal, Alexandre Amice, Pablo A. Parrilo, Russ Tedrake

发表机构 * Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology(麻省理工学院电气工程与计算机科学系)

AI总结 研究点机器人通过球形障碍物的无碰撞运动规划,提出半定松弛方法,理论分析其紧性并利用对称性降低计算复杂度,比直接非线性规划快10-100倍。

详情
AI中文摘要

我们研究了无碰撞运动规划的半定松弛。我们关注一个点机器人在 $\mathbb{R}^n$ 中从起点运动到终点,在球形障碍物之间穿行,并受到路径连续性约束和平方导数成本;这一设定概念简单但抓住了无碰撞运动规划的难度。我们将该问题精确地表述为多项式曲线上的非凸问题,并提出了一个自然的半定松弛。我们贡献了两个关键的理论见解;据我们所知,这是对无碰撞运动规划半定松弛的首次理论分析。首先,我们表明求解凸松弛等价于在潜在更高维空间中全局最优地求解一个相关的运动规划问题。这种几何解释给出了紧性的必要和充分条件,以及松弛何时不紧的清晰直觉。其次,我们表明该松弛允许对称性约简,使其比预期的要小得多,正半定锥的大小随多项式次数线性增长,且与环境维度无关。由此产生的松弛比使用 SNOPT 和 IPOPT 求解的直接非线性规划转录快10到100倍,求解时间的方差显著降低,并能可靠地找到原始问题的局部最优路径。我们展示了其作为 RRT 规划器中凸导向函数的有效性,用于具有 $C^4$ 连续轨迹的最小snap(位置的四阶导数)四旋翼规划。

英文摘要

We study semidefinite relaxations for collision-free motion planning. We focus on a point robot moving from start to goal through spherical obstacles in $\mathbb{R}^n$, subject to path continuity constraints and squared derivative costs; a setting that is conceptually simple yet captures the hardness of collision-free motion planning. We formulate this problem exactly as a nonconvex problem over polynomial curves, and present a natural semidefinite relaxation. We contribute two key theoretical insights; to our knowledge this is the first theoretical analysis of semidefinite relaxations for collision-free motion planning. First, we show that solving the convex relaxation is equivalent to solving, to global optimality, a related motion planning problem in a potentially higher-dimensional space. This geometric interpretation yields necessary and sufficient conditions for tightness, and a clear intuition for when the relaxation is loose. Second, we show that the relaxation admits a symmetry reduction that makes it significantly smaller than one might expect, with positive semidefinite cone sizes that scale linearly with the polynomial degree and are independent of the ambient dimension. The resulting relaxation is 10 to 100 times faster than direct nonlinear programming transcriptions solved with SNOPT and IPOPT, exhibits significantly lower variance in solve times, and reliably finds a locally optimal path for the original problem. We demonstrate its effectiveness as a convex steering function in an RRT planner for minimum-snap quadrotor planning with $C^4$ continuous trajectories.
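To make the phrase "semidefinite relaxation" concrete, the textbook lifting of a single obstacle-avoidance constraint for one point $x \in \mathbb{R}^n$ is sketched below. This is a generic illustration, not the paper's formulation, which works over polynomial curves; the symbols $c$ (obstacle centre), $r$ (radius), and $X$ (lifted variable) are introduced here for illustration only.

```latex
% Nonconvex constraint: the point x stays outside a ball of centre c and radius r.
\|x - c\|^2 \ge r^2
\;\Longleftrightarrow\;
x^\top x - 2 c^\top x + c^\top c \ge r^2 .

% Lifting: introduce X in place of x x^T, so the constraint becomes linear in (x, X).
\operatorname{tr}(X) - 2 c^\top x + c^\top c \ge r^2 ,
\qquad
\begin{bmatrix} 1 & x^\top \\ x & X \end{bmatrix} \succeq 0 .

% The semidefinite constraint is equivalent to X \succeq x x^\top.
% The relaxation is tight exactly when X = x x^\top, i.e. the lifted matrix has rank one.
```

Dropping the rank-one requirement is what makes the problem convex, and it is also the only place where the relaxation can become loose.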

2606.14270 2026-06-15 cs.RO cs.AI 新提交

Robust Fall Recovery for Armless Bipedal-Wheeled Robots Via Force-Guided Learning

无臂双轮足机器人的鲁棒摔倒恢复:基于力引导的学习方法

Haidong Hou, Zhangguo Yu, Tao Han, Hengbo Qi, Khaleel Ghazal, Yu Zhang, Yidong Du, Xuechao Chen, Fei Meng

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 针对无臂双轮足机器人无法借助外部支撑恢复站立的问题,提出力引导教师-学生框架FTSR,通过约束强化学习逐步减少外力依赖,实现从摔倒到稳定行走的鲁棒恢复。

Comments 8 pages, 6 figures, accepted by IEEE Robotics and Automation Letters (RA-L)

详情
Journal ref
IEEE Robotics and Automation Letters, 2026
AI中文摘要

摔倒恢复对于自主腿式运动至关重要。现有方法已证明,某些腿式机器人(如人形机器人和四足机器人)能够通过利用手臂或协调多腿产生支撑力,从各种姿态恢复。没有手臂或其他腿提供支撑辅助,双轮足机器人必须完全依赖其腿部的驱动,这使得恢复特别困难。为解决这一问题,我们引入了FTSR(力引导的教师-学生框架与阶段奖励)。力引导方法在模拟训练期间构建一个与机器人实时高度直接相关的外部辅助力,明确地将该力公式化为可优化约束。通过约束强化学习,策略被引导逐步减少力依赖并增加身体高度,尽管没有手臂支撑,仍能发展内部恢复策略。高度渐进式阶段奖励在恢复过程中逐步构建姿态稳定,并过渡到持续运动,与教师-学生架构集成,蒸馏出力效应和恢复动态的特权知识。经过模拟训练,该策略被部署在物理无臂双轮足机器人上并进行了广泛评估。实验证实了在多种挑战性条件下鲁棒可靠的摔倒恢复,展示了强大的环境适应性和运动鲁棒性,同时保持恢复后的完整运动能力。该框架也有效泛化到高自由度人形机器人,证实了其实用泛化性。项目页面见 https://2350575870.github.io/force-guided.github.io/

英文摘要

Fall recovery is critical for autonomous legged locomotion. Existing methods have demonstrated that some legged robots, such as humanoids and quadrupeds, are capable of fall recovery from diverse postures by utilizing arms or coordinating multiple legs to generate support forces. Without arms or other legs to provide supportive assistance, a bipedal-wheeled robot must rely solely on the actuation of its legs, making recovery particularly difficult. To address this, we introduce FTSR (Force-guided Teacher-student framework with Stage-wise Rewards). The force-guided method constructs an external auxiliary force during simulation training that correlates directly with the robot's real-time height, explicitly formulating this force as an optimizable constraint. Through constrained reinforcement learning, the policy is guided toward gradually reducing force dependency and increasing the body height, developing internal recovery strategies despite having no arms for support. Height-progressive stage-wise rewards progressively structure posture stabilization during recovery and the transition to sustained locomotion, and are integrated with a teacher-student architecture that distills privileged knowledge of force effects and recovery dynamics. After simulation training, the policy is deployed on a physical armless bipedal-wheeled robot and extensively evaluated. Experiments confirm robust and reliable fall recovery under diverse challenging conditions, demonstrating strong environmental adaptability and motion robustness, while maintaining full post-recovery motion capability. The framework also generalizes effectively to a high-DOF humanoid, confirming its practical generalizability. The project page is available at https://2350575870.github.io/force-guided.github.io/

2512.22484 2026-06-15 cs.RO math.DG 版本更新

Asymmetric Friction in Geometric Locomotion

几何运动中的非对称摩擦

Ross L. Hatton, Yousef Salaman, Shai Revzen

发表机构 * Robotics program at Oregon State University(俄勒冈州立大学机器人项目) Department of Electrical Engineering and Computer Science at the University of Michigan(密歇根大学电气工程与计算机科学系)

AI总结 本文提出将非对称摩擦引入几何运动模型,用Finsler度量替代Riemannian度量,并扩展子Riemannian方法为子Finsler方法,以表征系统运动能力。

Comments 23 pages, 15 figures

详情
AI中文摘要

运动(locomotion)的几何力学模型揭示了机器人和动物如何利用环境相互作用将内部形状变化转化为在世界中的位移,并将这种关系编码为“运动图”。这类运动图的一个关键类别源于作用在系统各个身体部位上的(可能是各向异性的)线性阻力,通过系统各个身体部位运动的Riemannian度量形式化描述。然后,可以通过对系统整体运动施加子Riemannian约束来生成运动图,在该约束下,给定形状速度所引起的位置速度是使摩擦耗散功率最小的那个。这类系统的运动是“几何的”,因为系统最终达到的位置仅取决于系统经过的形状序列,而不取决于形状变化的速率。在本文中,我们考虑一类更一般的系统,其中阻力不仅可以是各向异性的(前后方向与左右方向的运动具有不同的系数),而且可以是非对称的(向前与向后运动具有不同的系数)。形式上,在摩擦中包含非对称性将身体部位的Riemannian度量替换为Finsler度量。我们证明了构建系统运动图的子Riemannian方法自然地扩展到子Finsler方法,并确定了与子Riemannian系统的约束曲率类似的系统属性,从而能够表征系统的运动能力。

英文摘要

Geometric mechanics models of locomotion have provided insight into how robots and animals use environmental interactions to convert internal shape changes into displacement through the world, encoding this relationship in a "motility map". A key class of such motility maps arises from (possibly anisotropic) linear drag acting on the system's individual body parts, formally described via Riemannian metrics on the motions of the system's individual body parts. The motility map can then be generated by invoking a sub-Riemannian constraint on the aggregate system motion under which the position velocity induced by a given shape velocity is that which minimizes the power dissipated via friction. The locomotion of such systems is "geometric" in the sense that the final position reached by the system depends only on the sequence of shapes that the system passes through, but not on the rate with which the shape changes are made. In this paper, we consider a far more general class of systems in which the drag may be not only anisotropic (with different coefficients for forward/backward and left/right motions), but also asymmetric (with different coefficients for forward and backward motions). Formally, including asymmetry in the friction replaces the Riemannian metrics on the body parts with Finsler metrics. We demonstrate that the sub-Riemannian approach to constructing the system motility map extends naturally to a sub-Finslerian approach and identify system properties analogous to the constraint curvature of sub-Riemannian systems that allow for the characterization of the system motion capabilities.
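The difference between anisotropic and asymmetric drag can be seen in a small numerical sketch. The function below is an illustration under assumed coefficients (`c_fwd`, `c_back`, `c_lat` are hypothetical values, not taken from the paper): it returns the square root of the dissipated power for a body-frame velocity, which behaves like a norm that scales with speed but treats forward and backward motion differently.

```python
import math

def drag_norm(vx, vy, c_fwd=1.0, c_back=4.0, c_lat=2.0):
    """Square root of dissipated power for body-frame velocity (vx, vy).

    Anisotropy: the longitudinal and lateral coefficients differ.
    Asymmetry: the longitudinal coefficient depends on the sign of vx.
    All coefficients are assumed values for illustration.
    """
    c_long = c_fwd if vx >= 0 else c_back
    return math.sqrt(c_long * vx * vx + c_lat * vy * vy)

forward = drag_norm(1.0, 0.0)
backward = drag_norm(-1.0, 0.0)
print("forward cost:", forward, "backward cost:", backward)

# Positive homogeneity: scaling the velocity by k > 0 scales the cost by k.
print(math.isclose(drag_norm(3.0, 1.5), 3.0 * drag_norm(1.0, 0.5)))
```

A Riemannian norm would give the same cost for `v` and `-v`; here only positive scaling is preserved, which is the defining relaxation behind a Finsler-type metric. The piecewise definition is not smooth at `vx = 0`, so it should be read as a caricature of the idea rather than a proper Finsler metric.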

2602.01948 2026-06-15 cs.RO 版本更新

A Unified Control Architecture for Macro-Micro Manipulation using a Active Remote Center of Compliance for Manufacturing Applications

面向制造应用的宏微操作统一控制架构:基于主动远程柔顺中心

Patrick Frank, Christian Friedrich

发表机构 * Institute for Robotics and Intelligent Production Systems University of Applied Sciences Karlsruhe (HKA)(机器人与智能生产系统研究所 卡尔斯鲁厄应用科学大学(HKA))

AI总结 提出一种将宏操作器纳入主动交互控制的新架构,相比现有领先-跟随方法将控制带宽提升2.1倍,相比传统力控制提升12.5倍,并引入替代模型简化控制器设计。

Comments 17 pages, 14 figures, submitted to Robotics and Computer-Integrated Manufacturing (RCIM)

详情
AI中文摘要

宏微操作器将具有大工作空间的宏操作器(如工业机器人)与轻量、高带宽的微操作器相结合。这使得在保持机器人广阔工作空间的同时,能够实现高动态的交互控制。传统上,位置控制分配给宏操作器,而微操作器负责与环境交互,这限制了可实现的交互控制带宽。为解决此问题,我们提出了一种新颖的控制架构,将宏操作器纳入主动交互控制中。与基于领先-跟随方法的最先进架构相比,这导致控制带宽提升了2.1倍,与传统基于机器人的力控制相比提升了12.5倍。此外,我们提出了替代模型,以实现更高效的控制器设计并易于适应硬件变化。我们通过在不同实验(如与物体碰撞、跟随力轨迹和工业装配任务)中与其他控制方案进行比较,验证了我们的方法。

英文摘要

Macro-micro manipulators combine a macro manipulator with a large workspace, such as an industrial robot, with a lightweight, high-bandwidth micro manipulator. This enables highly dynamic interaction control while preserving the wide workspace of the robot. Traditionally, position control is assigned to the macro manipulator, while the micro manipulator handles the interaction with the environment, limiting the achievable interaction control bandwidth. To solve this, we propose a novel control architecture that incorporates the macro manipulator into the active interaction control. This leads to an increase in control bandwidth by a factor of 2.1 compared to the state-of-the-art architecture based on the leader-follower approach, and by a factor of 12.5 compared to traditional robot-based force control. Further, we propose surrogate models for a more efficient controller design and easy adaptation to hardware changes. We validate our approach by comparing it against the other control schemes in different experiments, such as collision with an object, following a force trajectory, and industrial assembly tasks.

2605.25782 2026-06-15 cs.RO 版本更新

ParkourFormer: Integrating Predictive Supervision and Sequence Modeling into Parkour Locomotion

ParkourFormer:将预测监督与序列建模融入跑酷运动

Yanheng Mai, Wenhao Xu, Zirui Huang, Yifei Fu, Shengwei Dong, Xinjue Wang, Kailun Huang, Yanzhe Xie, Renjing Xu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) CLAI-LAB, CL-TECH(CLAI实验室,CL-TECH) South China Agricultural University(华南农业大学) Guangdong University of Technology(广东工业大学)

AI总结 提出基于Transformer的序列建模框架ParkourFormer,通过预测未来本体感受状态并融合时序特征生成动作,实现人形机器人在多地形跑酷中的高成功率运动控制。

Comments Project Homepage: https://mronaldo-gif.github.io/parkourformer.github.io/

详情
AI中文摘要

人形机器人跑酷需要运动策略协调全身动力学,以应对楼梯、间隙、斜坡和障碍物等快速变化的地形。现有的强化学习策略大多是反应式的,直接将观测映射到动作,而不显式建模未来身体状态。在敏捷运动任务中,这种建模变得至关重要,因为成功的运动执行强烈依赖于对即将到来的接触过渡和身体动力学的预测。我们提出了ParkourFormer,一个基于Transformer的序列建模框架,将人形机器人运动重新表述为未来条件化的决策问题。当前机器人状态通过交叉注意力查询历史传感器运动轨迹,同时一个轻量级预测头预测短时域的未来本体感受状态。经过监督信号训练的预测未来状态与时间特征融合以生成动作,使策略能够联合推理运动历史和预期的未来动力学。我们在一个包含楼梯、间隙、斜坡、粗糙地形和障碍物穿越的多样化多地形人形机器人跑酷基准上评估了ParkourFormer。在仿真和真实人形机器人上的实验表明,ParkourFormer在极具挑战性的地形上实现了93.85%的平均穿越成功率,相比强MLP、基于MoE的MLP和普通Transformer基线,提升高达47.12%,同时在所有地形类型上保持单一统一策略。这些结果表明,显式未来状态建模显著提高了敏捷全身运动的鲁棒性和泛化能力。

英文摘要

Humanoid parkour requires locomotion policies to coordinate whole-body dynamics across rapidly changing terrains such as stairs, gaps, slopes, and obstacles. Existing reinforcement learning policies are largely reactive, mapping observations directly to actions without explicitly modeling future body states. Such modeling becomes critical in agile locomotion tasks where successful motion execution depends strongly on anticipating upcoming contact transitions and body dynamics. We present ParkourFormer, a Transformer-based sequence modeling framework that reformulates humanoid locomotion as a future-conditioned decision-making problem. The current robot state queries historical sensorimotor trajectories through cross-attention, while a lightweight prediction head forecasts short-horizon future proprioceptive states. The predicted future states, trained with supervised signals, are fused with temporal features to generate actions, enabling the policy to jointly reason over motion history and anticipated future dynamics. We evaluate ParkourFormer on a diverse multi-terrain humanoid parkour benchmark including stairs, gaps, slopes, rough terrain, and obstacle traversal. Experiments in simulation and on a real humanoid robot show that ParkourFormer achieves a 93.85% average traversal success rate on highly challenging terrains, with improvements of up to 47.12% over strong MLP, MoE-based MLP, and vanilla Transformer baselines, while maintaining a single unified policy across all terrain types. These results demonstrate that explicit future-state modeling significantly improves robustness and generalization for agile whole-body locomotion.

3. 操作、抓取与灵巧手 11 篇

2606.14089 2026-06-15 cs.RO 新提交

A Modular Dual-Arm Apple Harvesting Robot with Enhanced Field Performance

一种具有增强田间性能的模块化双臂苹果采摘机器人

Keyi Zhu, Kyle Lammers, Chaaran Arunachalam, Kaixiang Zhang, Renfu Lu, Zhaojian Li

发表机构 * Michigan State University(密歇根州立大学) United States Department of Agriculture Agricultural Research Service(美国农业部农业研究局)

AI总结 提出一种模块化双臂苹果采摘机器人,采用垂直堆叠臂实现单树上下区域同时作业,结合基础模型感知、7阶加加速度轨迹生成、线性扫描采摘策略等5项改进,在商业果园中达到80.0%采摘成功率和7.53秒平均单臂周期,91.2%果实达到特级标准。

详情
AI中文摘要

机器人苹果采摘为解决商业果园劳动力短缺提供了有前景的方案,但低吞吐量和在果园环境中的较差性能阻碍了其商业应用。本文提出一种模块化双臂苹果采摘机器人,采用垂直堆叠臂实现单棵树上、下区域同时作业,将平台定位从多树横向重新定位简化为单树停止。与我们之前的水平双臂系统相比,该平台集成了5项进步:(1)基于基础模型的感知管线,结合Grounding-DINO和EfficientViT-SAM,在非结构化户外环境中实现鲁棒的水果定位;(2)7阶加加速度有界轨迹生成与控制屏障函数安全滤波器相结合,实现快速且安全的臂运动;(3)线性扫描采摘策略,带有10厘米接近缓冲区和旋转分离,提高了采摘可靠性;(4)基于时序逻辑的双臂协调策略与视觉-臂异步调度,最大化共享真空源的使用;(5)在2025年收获季节,涵盖不同苹果品种和树形结构的两个商业果园中进行现场验证。在这些田间试验收集的1738个臂循环中,系统实现了80.0%的单次尝试成功率和平均每臂周期7.53秒。水果损伤评估确认,91.2%的机器人采摘水果保持了美国农业部最高等级(特级),碰伤率在2.4%至4.9%之间。随着采摘周期时间的进一步改进和对茂密树叶遮挡的处理,这种新型模块化机器人设计有望用于苹果的商业化采摘。

英文摘要

Robotic apple harvesting offers a promising solution to labor shortages in commercial orchards, but low throughput and poor performance in orchard environments hinder its commercial adoption. This paper presents a modular dual-arm apple harvesting robot that uses vertically stacked arms to enable simultaneous operation in the upper and lower zones of a single tree, simplifying platform positioning from multi-tree lateral repositioning to single-tree stops. Compared to our prior horizontal dual-arm system, the platform integrates 5 advances: (1) a foundation-model-based perception pipeline combining Grounding-DINO and EfficientViT-SAM for robust fruit localization in unstructured outdoor environments; (2) 7th-order jerk-bounded trajectory generation paired with a Control Barrier Function safety filter to achieve fast yet safe arm motions; (3) a linear sweep harvesting strategy with a 10 cm approach buffer and rotational detachment that improves picking reliability; (4) a temporal-logic-based dual-arm coordination policy with vision-arm async scheduling that maximizes usage of a shared vacuum source; and (5) field validation in 2 commercial orchards covering different apple varieties and tree architectures during the 2025 harvest season. Across the 1738 arm cycles collected in these field trials, the system achieved an 80.0% per-attempt success rate and a mean per-arm cycle time of 7.53 s. Fruit damage assessments confirmed that 91.2% of robotically harvested fruit retained the highest USDA grade (Extra Fancy), with bruise rates between 2.4% and 4.9%. With further improvements in the picking cycle time and handling of heavy foliage occlusions, this new modular robot design holds promise for commercial harvesting of apples.
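As a rough illustration of what a 7th-order trajectory buys, the sketch below uses the standard rest-to-rest septic polynomial, whose velocity, acceleration, and jerk all vanish at both endpoints. It is an assumption that this resembles the paper's generator; the abstract does not give the construction, and the function names here are hypothetical. Note that this profile does not enforce a jerk limit by itself: the peak jerk scales with the displacement divided by the cube of the duration, so a bound would be met by choosing the duration.

```python
def septic_profile(tau):
    """Normalized position s(tau) on [0, 1] with s(0) = 0 and s(1) = 1."""
    return 35 * tau**4 - 84 * tau**5 + 70 * tau**6 - 20 * tau**7

def septic_velocity(tau):
    """ds/dtau = 140 * tau^3 * (1 - tau)^3, so the first three derivatives
    of s are zero at both tau = 0 and tau = 1."""
    return 140 * tau**3 * (1 - tau)**3

def interpolate_joint(q_start, q_goal, t, duration):
    """Joint position at time t for a rest-to-rest move lasting `duration` seconds."""
    tau = min(max(t / duration, 0.0), 1.0)
    return q_start + (q_goal - q_start) * septic_profile(tau)

# Sample a hypothetical 2-second move from 0.2 rad to 1.0 rad.
for t in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"t = {t:.1f} s, q = {interpolate_joint(0.2, 1.0, t, 2.0):.4f} rad")
```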

2606.14188 2026-06-15 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC 新提交

Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation

无皱鲁棒性:并行仿真与鲁棒MPC实现可认证的变形体操作

Wei-Chen Li, Jeffrey Fang, Sasanka Polisetti, Yuexi Song, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出CORD-SLS实时控制方法,通过GPU并行可微仿真与接触平滑实现高效梯度规划,结合鲁棒模型预测控制与共形预测校准,在绳索和布料操作中达到毫秒级规划与高安全性。

详情
AI中文摘要

我们提出了CORD-SLS,一种用于安全变形物体操作的实时控制方法,重点关注绳索和布料。其核心是一个带有接触平滑的GPU并行可微仿真器,能够通过间歇性接触实现高效的基于梯度的规划。为了在模型和感知不确定性下鲁棒地满足约束,我们开发了一种实时、GPU并行的输出反馈鲁棒模型预测控制(MPC)算法,该算法利用该仿真器进行规划。我们进一步证明,该仿真器加速了基于模型的强化学习,用于训练神经操作策略。为了提高现实世界的鲁棒性,我们使用共形预测来校准视觉反馈和感知误差界限,用于MPC,从而产生可达管,实现高概率的安全控制。我们在仿真和硬件上对高维、接触丰富的绳索和布料操作任务(包括避障、布线、折叠和平整)评估了CORD-SLS。在各种设置中,CORD-SLS实现了毫秒级规划速度,在安全性、速度和任务成功率方面均优于基线方法。

英文摘要

We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.
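The conformal calibration step mentioned above can be sketched with split conformal prediction: collect perception errors on held-out calibration data, then take a finite-sample-corrected quantile as the error bound handed to the controller. The function name and the sample errors below are hypothetical, and how CORD-SLS actually builds its bounds is not stated in the abstract.

```python
import math

def conformal_bound(calibration_errors, alpha=0.1):
    """Split-conformal bound on a perception error.

    Returns the k-th smallest calibration error with
    k = ceil((n + 1) * (1 - alpha)). If the calibration errors and a new
    error are exchangeable, the new error falls below this bound with
    probability at least 1 - alpha. With too few samples the only valid
    bound is infinite.
    """
    n = len(calibration_errors)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_errors)[k - 1]

# Hypothetical keypoint-tracking errors in millimetres from a calibration run.
errors_mm = [3, 7, 2, 9, 4, 6, 5, 8, 1, 12, 10, 11, 15, 13, 14, 19, 16, 18, 17]
print("75% coverage bound (mm):", conformal_bound(errors_mm, alpha=0.25))
```

The bound widens as `alpha` shrinks, so a controller using it trades conservatism against the probability of constraint violation.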

2606.14250 2026-06-15 cs.RO 新提交

SyLink Hand: A Synergy-Inspired Linkage-Driven Anthropomorphic Hand for Human-Like Dexterity

SyLink Hand:一种受协同作用启发的连杆驱动拟人手,实现类人灵巧性

Hao Wu, Yanzhe Wang, Yu Feng, Yitong Li, Jingxiang Guo, Jian Liu, Jianshu Zhou

发表机构 * National University of Singapore(新加坡国立大学) Zhejiang University(浙江大学)

AI总结 受人类手部协同作用启发,提出SyLink Hand拟人灵巧手,通过生物力学协同原理与连杆驱动机构结合,在紧凑低成本架构中实现外观、运动学和功能的高度拟人化,验证了协同启发连杆设计有效平衡拟人度、机械简单性和功能多样性。

详情
AI中文摘要

设计在功能灵巧性与机械简单性之间取得平衡的拟人机器人手仍然是一个重大挑战。受人类手部协同作用的启发,本文提出了SyLink Hand,一种拟人灵巧手,它将生物力学协同原理与连杆驱动传动机制相结合,在紧凑且成本效益高的架构中实现了外观、运动学和功能的高度拟人化。使用动作捕捉手套对自然手部运动进行生物力学分析,揭示了手部关节之间的强运动学相关性,为简化但功能性的自由度配置提供了基础。在这些协同特性的指导下,采用优化的连杆机构来协调多个关节运动并再现自然手指轨迹。进一步提出了一种新颖的球形四杆连杆机构,以在紧凑的外形下实现掌指关节的屈曲/伸展和外展/内收的解耦。最终原型集成了19个关节,由11个执行器驱动,总质量为520克,制造成本约为400美元。实验评估证明了其类人运动学性能、高承载能力以及多样的抓取和操作技能。这些结果验证了协同启发、基于连杆的设计有效平衡了拟人度、机械简单性和功能多样性,突显了其在需要灵巧性的机器人应用中实际部署的潜力。

英文摘要

Designing anthropomorphic robotic hands that balance functional dexterity with mechanical simplicity remains a significant challenge. Inspired by human hand synergies, this paper presents the SyLink Hand, an anthropomorphic dexterous hand that integrates biomechanical synergy principles with linkage-driven transmission mechanisms to achieve a high degree of anthropomorphism in appearance, kinematics, and functionality within a compact and cost-effective architecture. Biomechanical analysis of natural hand motions using motion capture gloves reveals strong kinematic correlations among hand joints, providing the basis for a simplified yet functional degree-of-freedom (DOF) configuration. Guided by these synergistic characteristics, optimized linkage mechanisms are employed to coordinate multiple joint motions and reproduce natural finger trajectories. A novel spherical four-bar linkage is further proposed to achieve decoupled flexion/extension (Flex/Ext) and abduction/adduction (Abd/Add) at the metacarpophalangeal joint within a compact form factor. The resulting prototype integrates 19 joints driven by 11 actuators, with a total mass of 520g and a manufacturing cost of approximately USD 400. Experimental evaluations demonstrate its human-like kinematic performance, high load-bearing capability, and versatile grasping and manipulation skills. These results validate that the synergy-inspired, linkage-based design effectively balances anthropomorphism, mechanical simplicity, and functional versatility, highlighting its potential for practical deployment in dexterity-demanding robotic applications.

2606.14531 2026-06-15 cs.RO 新提交

AERMANI-PLACE: Language Guided Object Placement with Aerial Manipulators

AERMANI-PLACE: 基于语言引导的空中机械臂物体放置

Sarthak Mishra, Ritama Sanyal, Rishabh Dev Yadav, Wei Pan, Spandan Roy

发表机构 * Robotics Research Center, IIIT Hyderabad(海得拉巴国际信息技术学院机器人研究中心) Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Newcastle University(纽卡斯尔大学)

AI总结 提出AERMANI-PLACE框架,通过自然语言指令和图像编辑模型生成视觉标记,引导空中机械臂完成物体放置,在测试集和真实平台上分别达到87%和72%的平均成功率。

详情
AI中文摘要

物体放置是空中操纵任务的基本组成部分,但现有系统通常需要以度量坐标明确指定期望的放置位置。这种界面不直观,要求用户推理坐标框架和场景几何,使其在实际部署中难以使用。相比之下,人类通常通过语言和指向手势的组合来传达空间目标。受此观察启发,我们提出了AERMANI-PLACE,一个用于空中机械臂语言引导物体放置的框架。给定场景图像和自然语言指令,图像编辑模型生成场景的修改版本,其中包含指示物体应放置位置的视觉标记。然后,使用深度观测将该标记锚定到物理环境中,以恢复度量放置点,之后由空中机械臂生成并执行放置轨迹。我们在包含100个语言引导放置任务的测试集上评估了所提出的方法,并在真实的空中操纵平台上展示了成功执行。实验结果表明,所提出的方法能够可靠地从语言指令中推断放置位置,在测试集上的平均成功率为87%,并有效迁移到真实世界空中操纵,平均成功率为72%。视频:https://youtu.be/SgwwgLBsv0g

英文摘要

Object placement is a fundamental component of aerial manipulation tasks, yet existing systems typically require the desired placement position to be specified explicitly in metric coordinates. Such interfaces are not intuitive and require users to reason about coordinate frames and scene geometry, making them difficult to use in practical deployments. In contrast, humans often communicate spatial goals through a combination of language and pointing gestures. Inspired by this observation, we present AERMANI-PLACE, a framework for language-guided object placement with aerial manipulators. Given a scene image and a natural language instruction, an image editing model generates a modified version of the scene containing a visual marker that indicates where the object should be placed. This marker is then grounded into the physical environment using depth observations to recover a metric place point, after which a placement trajectory is generated and executed by the aerial manipulator. We evaluate the proposed approach on a test set of 100 language-guided placement tasks and demonstrate successful execution on a real aerial manipulation platform. Experimental results show that the proposed method reliably infers placement locations from language instructions with an average success rate of 87% on the test set and transfers effectively to real-world aerial manipulation with an average success rate of 72%. Video: https://youtu.be/SgwwgLBsv0g
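Grounding a 2D marker into a metric point with depth is, in its simplest form, a pinhole back-projection. The sketch below shows that step only, under assumptions: the marker's pixel location is already known, the depth image is aligned with the colour image, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical values. The paper's actual grounding pipeline may differ.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Convert pixel (u, v) with depth in metres into a camera-frame 3D point.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Hypothetical intrinsics for a 640x480 camera and a marker detected at pixel (920, 240).
place_point = backproject(920, 240, 2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print("place point in camera frame (m):", place_point)
```

The resulting point is expressed in the camera frame, so a further rigid transform (camera to vehicle or world) would be needed before planning a placement trajectory.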

2606.14535 2026-06-15 cs.RO 新提交

Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera

空间条件扩散策略:使用单个RGB相机学习精确且鲁棒的操作

Seoyoon Kim, Kanghyun Kim, Dongwoo Ko, Yeong Jin Heo, Min Jun Kim

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院) Neuromeka

AI总结 提出空间条件扩散策略(SCDP),利用末端执行器轨迹作为视觉注意力锚点,通过多尺度特征编码和空间条件模块,在单相机设置下实现精确鲁棒的操作。

Comments 15 pages

详情
AI中文摘要

最近的视觉模仿学习系统广泛采用多相机设置,其中腕部相机已成为事实标准。然而,从单一全局视角进行操作仍然具有挑战性,因为策略需要捕捉细粒度的交互细节并识别任务相关区域,而无需局部腕部视图。为了应对这一挑战,我们提出了空间条件扩散策略(SCDP),一种基于扩散的视觉运动策略,可在单相机设置下实现精确且鲁棒的操作。我们的关键思想是,末端执行器轨迹可以作为反映任务相关区域的视觉注意力锚点。基于这一思想,SCDP由两个关键组件组成:(i)一个视觉编码器,生成多尺度特征图以捕捉更广泛的上下文和细粒度视觉特征,以及(ii)一个空间条件模块,在扩散循环中沿中间末端执行器轨迹采样点状特征。大量的仿真实验表明,SCDP始终优于强大的单视图基线,并实现了与多相机基线相当的性能。真实世界实验进一步证明了其精确操作和对视觉干扰物的鲁棒性,突显了单相机模仿学习的潜力。

英文摘要

Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard. However, manipulation from a single global view remains challenging, as the policy should capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we present Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories can serve as visual attention anchors that reflect task-relevant regions. Building on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps to capture both broader context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, highlighting the potential of single-camera imitation learning.

2606.14561 2026-06-15 cs.RO cs.LG 新提交

ORCA: A Platform for Open-Source Dexterity Research

ORCA: 开源灵巧性研究平台

Francesco Capuano, Maximilian Eberlein, Fabrice Bourquin, Clemens Claudio Christoph

发表机构 * University of Oxford(牛津大学) ETH Zurich(苏黎世联邦理工学院) Orca Dexterity

AI总结 提出ORCA学习栈,统一灵巧手控制、仿真、遥操作和重定向,集成机器人学习框架,实现端到端灵巧操作研究。

Comments 15 pages

详情
AI中文摘要

机器人操作研究越来越关注两指平行夹爪,因其有效性、经济性和易于遥操作。然而,夹爪受限于其外形因素,即使对于简单的重新定向任务,也常常需要双臂设置。拟人手是灵巧机器人学习的更自然平台(更接近人手,能够从人类视频中学习),但它们在学习研究中仍然难以使用:即使存在开放且可访问的手部硬件,用于控制、仿真、遥操作和重定向的软件也分散在零散的代码库中,并且与机器人学习生态系统基本脱节。在这项工作中,我们介绍了ORCA学习栈,这是一个将灵巧性作为第一类机器人学习领域的开源研究栈。我们的ORCA栈将低级控制、仿真、来自一系列消费平台的遥操作以及手部重定向统一在单个接口后面,并原生集成流行的机器人学习框架(如LeRobot),使灵巧手研究人员能够利用与非灵巧机器人学习相同的数据、训练和评估流程。我们展示了一个完整的端到端工作流程,通过使用消费级VR头显进行遥操作收集手内重新定向任务的专家演示,使用LeRobot训练自主策略,并在完全可重现和可观察的设置中评估学习到的策略。我们将整个栈开源,作为灵巧操作研究的共享、可重现基础。

英文摘要

Robotics manipulation research increasingly focuses on two-finger parallel grippers for their effectiveness, affordability, and ease of teleoperation. Grippers are nonetheless limited by their form factor, often requiring bimanual setups even for simple reorientation tasks. Anthropomorphic hands are a more natural platform for dexterous robot learning -- closer to the human hand, and capable of learning from human video -- yet they remain hard to use in learning research: even where open and accessible hand hardware exists, the software for control, simulation, teleoperation, and retargeting is scattered in one-off code bases, and largely disconnected from the robot-learning ecosystem. In this work, we introduce the ORCA learning stack, an open-source research stack for dexterity as a first-class robot learning domain. Our ORCA stack unifies low-level control, simulation, teleoperation from a range of consumer platforms, and hand retargeting, behind a single interface, and integrates natively with popular robot-learning frameworks such as LeRobot, so dexterous hand researchers can leverage the same data, training, and evaluation pipelines used for non-dexterous robot learning. We demonstrate a complete end-to-end workflow, collecting expert demonstrations of an in-hand reorientation task by teleoperation with a consumer-grade VR headset, training an autonomous policy with LeRobot, and evaluating the learned policy in a fully reproducible and observable setup. We open-source the entire stack as a shared, reproducible foundation for dexterous-manipulation research.

2606.14606 2026-06-15 cs.RO cs.SY eess.SY 新提交

Impedance MPC with Disturbance Estimation for Dexterous Hand Control

用于灵巧手控制的阻抗MPC与扰动估计

Yongyan Cao

AI总结 提出一种执行器无关的阻抗模型预测控制框架,通过代数前馈将肌腱传动简化为常系数双积分器,结合编码器增强卡尔曼扰动估计,实现高精度轨迹跟踪与安全接触力控制。

详情
AI中文摘要

灵巧手必须同时跟踪精确的手指轨迹并保持安全、柔顺的接触——这对于任何固定增益控制器来说都是相互矛盾的目标。我们提出了一种执行器无关的灵巧手指阻抗模型预测控制(Impedance MPC)框架,实例化了为物理人机交互(pHRI)建立的恒定$A_d$无偏移架构;通过保留架构假设,其稳定性、递归可行性和输入-状态稳定性保证得以继承。代数前馈将肌腱传动——液压、缆绳、气动、扭绳或串联弹性——简化为常系数双积分器,因此QP代价逆矩阵可离线预计算,一个10步滚动时域二次规划以500 Hz运行,同时强制执行接触力(ISO/TS 15066)、驱动限制和加加速度的硬约束。仅使用编码器的增广卡尔曼扰动状态使任何恒定接触负载下的稳态误差为零。在液压驱动手指上——作为工作示例平台,增加了压力和空化约束——500 Hz卡尔曼MPC在1.5 Nm接触下实现了0.5 mrad RMS、0.1 mrad稳态和6.6 mrad峰值偏差:比经典阻抗分别好183倍、1500倍和23倍。实现的首次运动刚度(随更新率从18变化到323 Nm/rad)得到独立验证。该架构可扩展到16自由度LEAP Hand MuJoCo仿真,在0.7秒内从2.5 N抓取负载扰动中恢复。

英文摘要

Dexterous hands must simultaneously track precise finger trajectories and maintain safe, compliant contact -- objectives in tension for any fixed-gain controller. We present an actuator-agnostic Impedance Model Predictive Control (Impedance MPC) framework for dexterous fingers, instantiating the constant-$A_d$ offset-free architecture established for physical human-robot interaction (pHRI); its stability, recursive-feasibility, and input-to-state-stability guarantees are inherited by preserving the architectural assumptions. An algebraic feedforward reduces the tendon transmission -- hydraulic, cable, pneumatic, twisted-string, or series-elastic -- to a constant-coefficient double integrator, so the QP cost inverse is precomputed offline and a 10-step receding-horizon quadratic program runs at 500 Hz while enforcing hard constraints on contact force (ISO/TS 15066), actuation limits, and jerk. An encoder-only augmented-Kalman disturbance state drives steady-state error to zero under any constant contact load. On a hydraulically actuated finger -- the worked example platform, adding pressure and cavitation constraints -- the 500 Hz Kalman MPC attains 0.5 mrad RMS, 0.1 mrad steady-state, and 6.6 mrad peak deflection under 1.5 Nm contact: 183$\times$, 1500$\times$, and 23$\times$ better than classical impedance. The realized first-move stiffness (18$\to$323 Nm/rad with update rate) is independently verified. The architecture scales to a 16-DOF LEAP Hand MuJoCo simulation, recovering from 2.5 N grasp-load disturbances within 0.7 s.
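The reason a constant-coefficient double integrator is convenient for MPC is that its discrete-time matrices do not change from step to step, so anything derived from them can be computed once. A minimal sketch of that model at the 500 Hz rate quoted above follows. The lumped input disturbance and the function names are illustrative assumptions, not the paper's exact state definition.

```python
import math

DT = 1.0 / 500.0  # 2 ms sample time, matching the 500 Hz update rate

def step(pos, vel, accel_cmd, disturbance=0.0, dt=DT):
    """One step of a discrete double integrator under zero-order hold.

    The update uses the fixed matrices A_d = [[1, dt], [0, 1]] and
    B_d = [dt^2 / 2, dt]. The disturbance is modelled as an unknown
    constant acceleration entering with the input.
    """
    accel = accel_cmd + disturbance
    new_pos = pos + vel * dt + 0.5 * accel * dt * dt
    new_vel = vel + accel * dt
    return new_pos, new_vel

def simulate(steps, accel_cmd, disturbance=0.0):
    """Roll the model forward from rest for a number of steps."""
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        pos, vel = step(pos, vel, accel_cmd, disturbance)
    return pos, vel

pos, vel = simulate(500, accel_cmd=1.0)
print(f"after 1 s at 1 rad/s^2: pos = {pos:.6f} rad, vel = {vel:.6f} rad/s")
```

A disturbance estimator of the kind described in the abstract would append the disturbance as an extra constant state and infer it from encoder readings; that part is not shown here.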

2606.08555 2026-06-15 cs.RO 版本更新

FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation

FAWAM: 面向闭环密集接触操作的力感知世界动作模型

Haotian He, Zeyu Yan, Qipeng Liu, Ning Guo, Wenzhao Lian

发表机构 * School of Mathematical Sciences, Peking University(北京大学数学科学学院) School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院)

AI总结 提出FAWAM,在感知、预测和闭环执行三个层次融入力信息,通过联合预测动作与末端力旋量(wrench)及残差校正模块,提升密集接触操作的成功率。

详情
AI中文摘要

力信号为接触丰富的机器人操作提供了关键的交互线索。然而,现有方法大多将力作为额外的观测模态,未能充分利用其在建模未来交互动态或指导执行时反馈校正中的作用。本文提出FAWAM,一种力感知世界动作模型,在三个层次融入力信息:感知、预测和闭环执行。FAWAM首先编码历史六轴力/力矩信号以调节动作生成,然后联合预测未来动作和末端力旋量(wrench)以显式建模接触演化。它进一步引入残差校正模块,使用预测的力旋量轨迹作为执行时参考,基于实时力反馈在线优化动作。跨多个接触丰富任务的实际实验表明,FAWAM相比纯视觉基线平均成功率提升36.25%,相比现有力感知基线提升21.25%,证明了我们的力感知框架在鲁棒密集接触操作中的有效性。

英文摘要

Force signals provide critical interaction cues for contact-rich robotic manipulation. However, existing methods mostly use force as an additional observation modality, without fully exploiting its role in modeling future interaction dynamics or guiding execution-time feedback correction. In this paper, we propose FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.

2606.12728 2026-06-15 cs.RO cs.CV cs.LG 版本更新

EquiDexFlow: Contact-Grounded SE(3)-Equivariant Dexterous Grasp Generative Flows

EquiDexFlow: 基于接触的SE(3)-等变灵巧抓取生成流

Clinton Enwerem, John S. Baras, Calin Belta

发表机构 * Institute for Systems Research, University of Maryland, College Park(马里兰大学帕克分校系统研究所)

AI总结 提出EquiDexFlow,一种SE(3)-等变流匹配模型,联合预测腕部姿态、关节角度、指尖接触、表面法线和接触力,通过将接触投影到物体表面并将力约束在库仑摩擦锥内,确保物理稳定抓取,在16自由度Allegro手上实现零摩擦违规和最佳综合分数。

Comments 22 pages, 11 figures, 11 tables. Project page with videos, code, and checkpoints: https://equidexflow.github.io

详情
AI中文摘要

大多数学习型灵巧抓取生成器将接触力降级为下游验证步骤,因此运动学上可行的姿态仍可能违反稳定物理抓取的条件。我们通过EquiDexFlow解决这一问题,这是一种SE(3)-等变流匹配模型,从物体点云联合预测腕部姿态、关节角度、指尖接触、表面法线和接触力。我们的架构通过构造将接触投影到物体表面并将力约束在库仑摩擦锥内,因此无需损失惩罚即可满足放置和摩擦合规性。我们证明了端到端SE(3)等变性,并在200次旋转上经验验证,腕部残差低于$0.04^\circ$且关节偏差严格为零。该模型在81个物体的8,100个力闭合抓取上训练,适用于16自由度Allegro手,在所有消融变体中实现了零摩擦违规、最佳综合分数和最低力旋量(wrench)残差。我们通过每指逆运动学将解码的指尖接触重新定位到16自由度LEAP手,我们的硬件可行优化使每个关节距其执行器包络边界至少留有5%的裕度,同时保持力旋量平衡。在物理机器人上,重新定位的EquiDexFlow解码抓取在所有六个测试物体上完成了开环拾取和保持试验,每个非对称物体在标准姿态和$120^\circ$共旋转下均成功。视频、代码和检查点可在 https://equidexflow.github.io 获取。

英文摘要

Most learned dexterous grasp generators relegate contact forces to a downstream verification step, so a kinematically-plausible pose can still violate the conditions for a stable physical grasp. We address this with EquiDexFlow, an SE(3)-equivariant flow-matching model that jointly predicts wrist pose, joint angles, fingertip contacts, surface normals, and contact forces from an object point cloud. Our architecture projects contacts onto the object surface and forces into the Coulomb friction cone by construction, so placement and friction compliance hold without loss penalties. We prove end-to-end SE(3) equivariance and verify it empirically over 200 rotations, with wrist residuals below $0.04^\circ$ and exactly zero joint deviation. Trained on 8,100 force-closure grasps across 81 objects for the 16-DoF Allegro Hand, our model achieves zero friction violations, the best composite score, and the lowest wrench residual among all ablation variants. We retarget decoded fingertip contacts to a 16-DoF LEAP Hand via per-finger inverse kinematics, and our hardware-feasible refinement places every joint at least 5% inside its actuator envelope while preserving wrench balance. On the physical robot, retargeted EquiDexFlow-decoded grasps complete open-loop pick-and-hold trials on all six test objects, with every asymmetric object succeeding at both the canonical pose and a $120^\circ$ co-rotation. Videos, code, and checkpoints are available at https://equidexflow.github.io.
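Keeping a contact force inside the Coulomb friction cone "by construction" means applying a fixed mapping to the raw network output rather than penalizing violations in the loss. One simple mapping is sketched below: it keeps the normal component and shrinks the tangential component until it satisfies the friction limit. This is an illustrative clamp under assumed conventions (unit normal pointing in the allowed pushing direction, hypothetical function name), not necessarily the projection used in EquiDexFlow, and it is not the exact Euclidean projection onto the cone.

```python
import math

def clamp_to_friction_cone(force, normal, mu):
    """Map a 3D force into the cone |f_tangential| <= mu * f_normal.

    `normal` is assumed to be a unit vector. Forces that pull on the
    surface (negative normal component) are mapped to zero.
    """
    f_n = sum(f * n for f, n in zip(force, normal))
    if f_n <= 0.0:
        return (0.0, 0.0, 0.0)
    tangential = tuple(f - f_n * n for f, n in zip(force, normal))
    f_t = math.sqrt(sum(t * t for t in tangential))
    if f_t <= mu * f_n:
        return tuple(float(f) for f in force)
    scale = mu * f_n / f_t  # shrink the tangential part onto the cone boundary
    return tuple(f_n * n + scale * t for n, t in zip(normal, tangential))

clamped = clamp_to_friction_cone((3.0, 0.0, 1.0), (0.0, 0.0, 1.0), mu=0.5)
print("clamped force:", clamped)
```

Because the mapping is applied to every prediction, a friction violation cannot occur at the output, which is consistent with the zero friction violations reported in the abstract.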

2606.12910 2026-06-15 cs.RO cs.AI cs.CV cs.SY eess.SY 版本更新

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

边界框作为目标:通过神经符号规划实现语言条件抓取

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GRASP框架,利用预训练VLM将自然语言查询转化为神经符号目标状态,通过边界框检测实现零样本桌面操作,无需任务特定训练。

Comments Project website: https://allisonandreyev.github.io/grasp.github.io/

详情
AI中文摘要

为了将机器人有效集成到家庭或工业环境中,机器必须实时适应自然语言提示。尽管视觉-语言模型(VLM)已在机器人任务与运动规划(TAMP)中实现零样本泛化,但当前最先进的方法通常计算量“沉重”或需要在数千个演示上进行大量训练。我们提出GRASP(基础推理与符号规划)框架,作为向开放词汇桌面操作迈进的一步。我们的方法利用预训练VLM将自然语言查询转化为神经符号目标状态,通过边界框检测管道在物理世界中接地。与依赖固定颜色列表或硬编码坐标的方法不同,GRASP使机器人能够解释诸如“顶层架子”之类的抽象空间概念,并在无需额外微调的情况下执行任务。我们在三个难度级别的90次真实机器人试验中实现了73.3%的总体成功率,无需任务特定训练。

英文摘要

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

2603.05230 2026-06-15 cs.CV cs.RO 版本更新

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

数字孪生驱动的自动化分拣系统中的纺织品分类与异物识别

Serkan Ergun, Tobias Mitterer, Hubert Zangl

发表机构 * Institute of Smart Systems Technologies(智能系统技术研究所) University of Klagenfurt(克拉根福大学) AAU SAL USE Laboratory(AAU SAL USE实验室) Silicon Austria Labs(奥地利硅实验室)

AI总结 提出一种数字孪生驱动的机器人分拣系统,结合抓取预测、多模态感知和视觉语言模型,实现纺织品分类与异物检测,Qwen模型准确率达87.9%。

Comments 10 pages,single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)

详情
AI中文摘要

对可持续纺织品回收日益增长的需求要求强大的自动化解决方案,能够处理可变形服装并在杂乱环境中检测异物。本文提出了一种数字孪生驱动的机器人分拣系统,集成了抓取预测、多模态感知和语义推理,用于现实世界中的纺织品分类。一个配备RGBD传感、电容式触觉反馈和碰撞感知运动规划的双臂机器人单元,自主地将服装从未分类的篮子中分离,将其转移到检查区域,并使用最先进的视觉语言模型(VLM)进行分类。我们在一个包含223个检查场景的数据集上对来自五个模型家族的九个VLM进行了基准测试,这些场景包括衬衫、袜子、裤子、内衣、异物(包括上述类别之外的服装)和空场景。评估考察了每类准确率、幻觉行为以及在实际硬件约束下的计算性能。结果表明,Qwen模型家族实现了最高的总体准确率(高达87.9%),具有强大的异物检测性能,而较轻的模型如Gemma3为边缘部署提供了有竞争力的速度-准确率权衡。数字孪生结合MoveIt实现了碰撞感知路径规划,并将分割后的检查服装3D点云集成到虚拟环境中,以提高操作可靠性。所提出的系统证明了将语义VLM推理与常规抓取检测和数字孪生技术相结合,在现实工业环境中实现可扩展的自主纺织品分拣的可行性。

英文摘要

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multi-modal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign-object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

4. 导航、定位与SLAM 9 篇

2606.13727 2026-06-15 cs.RO 新提交

Occupancy-Grounded Room Segmentation for Hierarchical 3D Scene Graphs

基于占用空间的房间分割用于分层3D场景图

Carlos Cueto Zumaya, Iacopo Catalano, Jorge Peña-Queralta, Wallace Moreira Bessa

发表机构 * University of Turku(图尔库大学) Centre for Artificial Intelligence, Zürich University of Applied Sciences(苏黎世应用科学大学人工智能中心)

AI总结 提出一种基于占用分解的房间节点锚定方法,构建分层3D场景图,在Matterport3D数据集上相比基线方法恢复了更多房间实例。

详情
AI中文摘要

室内机器人的分层3D场景图(3DSGs)在空间尺度上组织几何和语义信息,其中房间层连接对象级感知和房间级推理。现有系统从不同的空间基板(例如,地点聚类、墙壁平面或分割输出)构建该层,因此房间节点没有在共同的几何标准上进行评估。我们提出了一种基于占用空间的3DSG管道,其中房间节点锚定到从占用分解中跟踪的自由空间区域,为每个房间提供明确的多边形足迹。我们在12个Matterport3D场景上评估该管道,通过将预测的房间多边形与标注的房间实例进行匹配,并与代表性最先进的地点连接基线Hydra进行比较。结果表明,基于占用空间的锚定比地点连接构建恢复了明显更多的房间实例,但代价是精度较低,并且墙壁精确的房间边界对两种方法而言仍然是一个开放问题。代码可在https://github.com/crcz25/OccuSG获取。

英文摘要

Hierarchical 3D scene graphs (3DSGs) for indoor robots organize geometric and semantic information across spatial scales, with a room layer that connects object-level perception to room-scale reasoning. Existing systems construct this layer from different spatial substrates (e.g., place clusters, wall planes, or segmentation outputs), and as a result, room nodes are not evaluated on a common geometric criterion. We present an occupancy-grounded 3DSG pipeline in which room nodes are anchored to tracked free-space regions derived from occupancy decomposition, giving each room an explicit polygonal footprint. We evaluate the pipeline on 12 Matterport3D scenes by matching predicted room polygons to annotated room instances and compare against Hydra, a representative state-of-the-art place-connectivity baseline. The results show that occupancy-grounded anchoring recovers substantially more room instances than place-connectivity construction, at the cost of lower precision, and that wall-accurate room boundaries remain an open problem for both methods. Code is available at https://github.com/crcz25/OccuSG.

2606.13878 2026-06-15 cs.RO 新提交

AnyGoal: Vision-Language Guided Multi-Agent Exploration for Training-Free Lifelong Navigation

AnyGoal: 视觉-语言引导的多智能体探索实现免训练终身导航

MoniJesu James, Marcelino Julio Fernando, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出AnyGoal,一种免训练多机器人架构,利用视觉-语言模型(VLM)驱动前沿探索,通过共享2D高斯贝叶斯价值图(BVM)协调智能体,实现终身证据积累,在GOAT-Bench上达到52.4%子任务成功率,优于模块化方法27.5个百分点。

Comments 17 pages, 3 figures

详情
AI中文摘要

在大规模仿真语料库上训练的端到端导航策略在迁移到分布外场景、类别或目标模态时性能急剧下降。模块化流水线如Modular GOAT受限于封闭集目标检测召回率,而3D快照记忆系统(如3D-Mem)积累密集、视角相关的表示,维护成本高。我们提出AnyGoal,一种免训练的多机器人架构,将视觉-语言模型(VLM)置于基于前沿的探索核心,并通过共享的2D高斯贝叶斯价值图(BVM)协调智能体。BVM维护每个像素关于目标相关性的后验(mu, sigma^2),通过深度锥掩模对VLM分数进行精度加权融合更新,且子任务间从不重置,实现终身证据积累。前沿通过VLM评判softmax与BVM上的贝叶斯UCB项的凸混合进行排序。具有空间分离惩罚和承诺滞后的贪婪分配器在无中央控制器的情况下将前沿分配给各智能体。在完整的GOAT-Bench验证集未见分割(360个片段,2669个子任务)上,我们的双智能体系统达到52.4%的子任务成功率和12.7%的SPL,在严格物理机制下(离散0.25米步长,无瞬移,42度水平视场角)达到当前最优水平,比Modular GOAT(24.9%)提高27.5个百分点。单智能体AnyGoal达到41.9%的子任务成功率,表明增益来自决策架构。四路感知消融实验显示,开放词汇检测器将主要失败模式从探索转向目标验证。

英文摘要

End-to-end navigation policies trained on large simulation corpora degrade sharply when transferred to out-of-distribution scenes, categories, or goal modalities. Modular pipelines such as Modular GOAT are bottlenecked by closed-set object detection recall, while 3D snapshot-memory systems (e.g., 3D-Mem) accumulate dense, view-dependent representations that are heavy to maintain. We present AnyGoal, a training-free multi-robot architecture that places a Vision-Language Model (VLM) at the core of frontier-based exploration and coordinates agents through a shared 2D Gaussian Bayesian Value Map (BVM). The BVM maintains a per-pixel (mu, sigma^2) posterior over goal relevance, updated via precision-weighted fusion of VLM scores through a depth-cone mask, and is never reset between subtasks, yielding lifelong evidence accumulation. Frontiers are ranked by a convex blend of a VLM-as-judge softmax and a Bayesian UCB term on the BVM. A greedy allocator with spatial-separation penalty and commitment hysteresis distributes frontiers across agents without a centralized controller. On the full GOAT-Bench val unseen split (360 episodes, 2,669 subtasks), our dual-agent system achieves 52.4% Subtask SR at 12.7% SPL, which is state of the art under the strict physical regime (discrete 0.25 m steps, no teleportation, 42 deg HFOV) and a +27.5 pp improvement over Modular GOAT (24.9%). Single-agent AnyGoal achieves 41.9% Subtask SR, showing gains arise from the decision architecture. A four-way perception ablation shows that open-vocabulary detectors shift the dominant failure mode from exploration to goal verification.
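The per-pixel (mu, sigma^2) posterior with precision-weighted fusion can be sketched as a standard Gaussian update applied only where a visibility mask is set. This is a hedged illustration, not the AnyGoal implementation: the observation variance `obs_var`, the exploration weight `beta`, and the function names are assumptions that the abstract does not specify.

```python
import numpy as np

def fuse_vlm_score(mu, var, score, obs_var, mask):
    """Precision-weighted Gaussian update of a per-pixel (mu, var) map.
    Only pixels where mask is True (e.g. inside a depth cone) are updated;
    obs_var is an assumed variance for the VLM score."""
    m = mask.astype(float)
    prec = 1.0 / var + m / obs_var              # posterior precision
    new_var = 1.0 / prec
    new_mu = new_var * (mu / var + m * score / obs_var)
    return new_mu, new_var

def bayes_ucb(mu, var, beta=1.0):
    """Upper-confidence value: mean plus beta standard deviations (beta assumed)."""
    return mu + beta * np.sqrt(var)

# Example: a 1x2 map where only the first pixel is observed.
mu0, var0 = np.zeros((1, 2)), np.ones((1, 2))
mu1, var1 = fuse_vlm_score(mu0, var0, score=1.0, obs_var=1.0,
                           mask=np.array([[True, False]]))
```

Because the map is never reset, repeated observations keep shrinking the variance of visited pixels, while unvisited pixels keep a high variance and therefore a high UCB bonus.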

2606.13990 2026-06-15 cs.RO 新提交

SplatlessDF: Continuous Distance Field Mapping with Non-Splatting Gaussians

SplatlessDF: 基于非溅射高斯分布的连续距离场映射

Monisha Mushtary Uttsha, Lan Wu, Teresa Vidal-Calleja

发表机构 * UTS Robotics Institute, Faculty of Engineering and IT, University of Technology Sydney(悉尼科技大学工程与信息技术学院UTS机器人研究所) School of Engineering, University of Western Australia(西澳大学工程学院)

AI总结 提出SplatlessDF框架,利用各向异性高斯元素从空间角度构建连续距离场,支持距离和梯度查询,并可与2D高斯溅射结合实现统一建模,适用于机器人导航。

详情
AI中文摘要

最近的高斯溅射(GS)方法表明,场景可以通过可优化的高斯分布高效表示,以实现高质量的重建和渲染。本文基于这一原理,引入SplatlessDF,一个从空间而非光度角度使用各向异性高斯元素的连续距离场(DF)映射框架。SplatlessDF直接参数化高斯分布并优化以恢复可微DF,使得能够在空间域中查询距离和梯度,用于下游机器人任务如导航。此外,SplatlessDF可与2D高斯溅射(2DGS)耦合,提供一个完全基于高斯原语的统一框架,该框架可以学习连续DF和表面模型,并支持光度渲染。我们考虑两种设置:独立的仅DF公式和与2DGS耦合的联合DF-渲染公式。实验表明,独立公式提供高效准确的距离和梯度查询,而联合公式改善渲染几何并同时建模连续DF。这些结果凸显了GS风格表示不仅在表面建模和渲染方面,而且在适用于机器人导航的映射表示方面的潜力。

英文摘要

Recent Gaussian splatting (GS) methods have shown that scenes can be represented efficiently with optimisable Gaussians for high-quality reconstruction and rendering. In this paper, building on this principle, we introduce SplatlessDF, a continuous distance field (DF) mapping framework that uses anisotropic Gaussian elements from a spatial rather than photometric perspective. SplatlessDF directly parameterises the Gaussians and optimises to recover a differentiable DF, enabling distances and gradients to be queried in the spatial domain for downstream robotic tasks such as navigation. Furthermore, SplatlessDF can be coupled with 2D Gaussian splatting (2DGS), providing a unified framework based solely on Gaussian primitives that can learn continuous DF and surface models and supports photometric rendering. We consider two settings: a standalone DF-only formulation and a joint DF-rendering formulation coupled with 2DGS. Experiments show that the standalone formulation provides efficient and accurate distance and gradient queries, while the joint formulation improves rendering geometry and simultaneously models a continuous DF. These results highlight the potential of GS-style representations not only for surface modelling and rendering but also for mapping representations suited to robotic navigation.

2606.14160 2026-06-15 cs.RO 新提交

GAIT: Legged Robot Proprioceptive State Estimation with Attention over Inertial-Leg Tokens

GAIT: 基于惯性-腿部令牌注意力的足式机器人本体状态估计

Young-Rang Seo, Hajun Kim, Sangmin Kim, Dongyun Kang, Hae-Won Park

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出一种将惯性-腿部(IL)令牌化与注意力机制结合的方法,用于足式机器人本体状态估计,通过自适应加权不同传感器测量值提升估计性能,在未见步态和复杂地形上优于现有方法。

详情
AI中文摘要

本文提出了一种方法,将惯性-腿部(IL)令牌化应用于基于注意力的网络,用于足式机器人的本体状态估计。与现有的将所有传感器测量值拼接成单个扁平向量的学习型状态估计器不同,所提出的架构将惯性测量和每条腿的测量表示为单独的令牌,并使用注意力机制学习每个测量的相对重要性。这种设计允许网络根据当前的接触条件重新加权每个测量,反映了前向运动学测量的可靠性取决于相应脚是否接触的事实。然而,与传统的接触辅助估计器不同,所提出的方法无需依赖显式的接触估计器或基于静止接触假设的显式测量更新即可学习这种行为。为了验证所提出的方法,我们在Unitree Go1机器人上进行了实验,包括模拟中未建模的碎石地形和训练中未见过的步态模式。实验结果表明,所提出的方法在未见步态模式下比现有的学习型状态估计器实现了更好的估计性能,并且也优于基于接触辅助的模型方法。

英文摘要

In this paper, we propose a method that applies Inertial-Leg (IL) tokenization to an attention-based network for proprioceptive state estimation in legged robots. Unlike existing learning-based state estimators that concatenate all sensor measurements into a single flat vector, the proposed architecture represents inertial measurements and leg-wise measurements as individual tokens and uses an attention mechanism to learn the relative importance of each measurement. This design allows the network to reweight each measurement according to the current contact condition, reflecting the fact that the reliability of forward kinematic measurements depends on whether the corresponding foot is in contact. Unlike conventional contact-aided estimators, however, the proposed method learns this behavior without relying on an explicit contact estimator or on explicit measurement updates based on a stationary contact assumption. To validate the proposed method, we conducted experiments on a Unitree Go1 robot, including debris terrain not modeled in simulation and gait patterns not seen during training. Experimental results show that the proposed method achieves better estimation performance than existing learning-based state estimators under unseen gait patterns and also improves performance over contact-aided model-based methods.
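Treating the inertial measurement and each leg as separate tokens lets attention weights act as learned, input-dependent reweighting. The sketch below shows generic single-head scaled dot-product self-attention over five tokens (one inertial, four legs, as on a quadruped); the token dimension, the random weights, and the function name are illustrative assumptions, and the actual GAIT architecture may differ.

```python
import numpy as np

def token_self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    tokens: (num_tokens, d_in); returns (output, attention_weights)."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # each row sums to 1
    return w @ V, w

# Example: 1 inertial token + 4 leg tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = token_self_attention(tokens, Wq, Wk, Wv)
```

Each row of `weights` says how much one token draws on the others, which is the mechanism that would allow a swing-leg token to be down-weighted without an explicit contact estimator.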

2606.14237 2026-06-15 cs.RO 新提交

BIM-Loc: BIM-Integrated Discrepancy-Aware LiDAR-based Indoor Localization

BIM-Loc:集成BIM的差异感知激光雷达室内定位

Yinqiang Zhang, Liang Lu, Yipeng Pan, Maolin Lei, Yuhan Xie, Zhanteng Xie, Xiaowei Luo, Jia Pan

发表机构 * Department of AI & Data Science, University of Hong Kong (HKU)(香港大学人工智能与数据科学系) Department of Architecture and Civil Engineering, City University of Hong Kong (CityU)(香港城市大学建筑与土木工程系) Humanoids and Human Centered Mechatronics Research Line, Italian Institute of Technology (IIT)(意大利技术研究院人形机器人与以人为本机电一体化研究组)

AI总结 提出BIM-Loc,一种直接集成建筑信息模型(BIM)的差异感知激光雷达定位方法,通过多命中射线投射、BIM集成因子位姿图优化和层次贝叶斯推断,实现与BIM坐标系对齐的轨迹估计和在线差异检测,显著提升定位精度与鲁棒性。

Comments 24 pages, 21 figures, accepted by International Journal of Robotics Research (IJRR), to be published

详情
AI中文摘要

准确且鲁棒的定位是服务机器人和巡检机器人的基本要求,尤其是在特征稀疏的室内环境中,传统系统因缺乏明显地标而难以工作。虽然先验地图可以增强鲁棒性,但对于新的或频繁变化的环境,精确且紧凑的、捕捉真实世界细节的地图往往不可用。本文提出BIM-Loc,一种新颖的差异感知激光雷达定位方法,直接集成设计阶段的建筑信息模型(BIM)。BIM-Loc同时估计与BIM坐标系对齐的轨迹,并以在线方式识别真实世界观测与设计BIM之间的差异。我们的核心贡献包括:(1) 一种新颖的多命中射线投射策略,用于高效的BIM-点云数据关联和将3D观测投影到2D纹理空间;(2) 一个集成BIM因子的位姿图优化框架,强制里程计、连续扫描和BIM结构之间的一致性;(3) 一个层次贝叶斯推断模块,增量更新连续的2D表面表示以进行差异检测,并将更新从像素传播到结构级别。在仿真和实际应用中的广泛评估表明,BIM-Loc在定位精度和鲁棒性方面显著优于最先进的基于地图的方法。

英文摘要

Accurate and robust localization is a fundamental requirement for service and inspection robots, particularly in feature-sparse indoor environments where traditional systems struggle due to a lack of distinct landmarks. While prior maps can enhance robustness, precise and compact maps capturing real-world details are often unavailable for new or frequently changing environments. This paper presents BIM-Loc, a novel discrepancy-aware LiDAR-based localization method that directly integrates Building Information Models (BIM) from the design phase. BIM-Loc simultaneously estimates trajectories aligned with the BIM coordinate system and identifies discrepancies between real-world observations and the as-designed BIM in an online fashion. Our core contributions include: (1) a novel multi-hit ray casting strategy for efficient BIM-point data association and projection of 3D observations into 2D texture space; (2) a pose graph optimization framework with BIM-integrated factors that enforces consistency among odometry, sequential scans, and BIM structures; and (3) a hierarchical Bayesian inference module that incrementally updates a continuous 2D surface representation for discrepancy detection, propagating updates from the pixel to the structure level. Extensive evaluations in both simulation and real-world applications demonstrate that BIM-Loc significantly outperforms state-of-the-art map-based methods in localization accuracy and robustness.

2606.14267 2026-06-15 cs.RO 新提交

FloVerse: Floor Plan-Guided Multi-Modal Navigation

FloVerse:基于楼层平面图的多模态导航

Weiqi Huang, Shuangyi Dong, Jiaxin Li, Yifei Guo, Zan Wang, Wei Liang

发表机构 * School of Computer Science & Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院)

AI总结 提出FloVerse任务统一PointNav、ObjectNav和ImageNav,构建FloVerse-1.6K数据集,并设计ThreeDiff两阶段模仿学习策略,利用楼层平面图先验提升导航性能。

Comments Accepted at CVPR 2026

详情
AI中文摘要

楼层平面图包含了紧凑的空间先验信息,使智能体能够更高效地导航未见过的场景。虽然先前的工作已经探索了基于楼层平面图的导航,但主要集中在PointNav和有限的环境上。为了弥补这一差距,我们引入了FloVerse,一个基于楼层平面图的具身导航新任务,统一了PointNav、ObjectNav和ImageNav。为了支持FloVerse,我们构建了FloVerse-1.6K,一个大规模数据集,包含来自HM3D和Gibson 4+的1600个场景及其对应的楼层平面图,共计24万条专家轨迹和1200万帧RGBD图像。我们进一步提出了ThreeDiff,一种两阶段模仿学习策略,包括一个规划器、一个基于扩散的多模态目标推理模块(通过掩码模态建模训练)和一个精炼器(基于深度的轨迹精炼模块,用于安全执行)。大量实验表明:(1) 楼层平面图先验提高了所有目标模态的导航性能;(2) ThreeDiff隐式地从楼层平面图中捕获空间信息。这些结果强调了空间先验的有效性,并验证了我们提出的基于楼层平面图的具身导航统一方法的有效性。

英文摘要

Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan-guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan-guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames. We further propose ThreeDiff, a two-stage imitation learning policy comprising a planner, a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling, and a refiner, a depth-based trajectory-refinement module for safe execution. Extensive experiments demonstrate that (1) floor-plan priors improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly captures spatial information from floor plans. These results underscore the effectiveness of spatial priors and validate our proposed unified approach for floor plan-guided embodied navigation.

2606.14421 2026-06-15 cs.RO cs.HC eess.SP 新提交

ForestBack: Breadcrumb-Based Pedestrian Dead Reckoning for Infrastructure-Free Return Navigation

ForestBack:基于面包屑的步行者航位推算实现无基础设施返回导航

Aueaphum Aueawatthanaphisut, Chanakan Chaipan

发表机构 * University of Tokyo(东京大学)

AI总结 提出ForestBack框架,通过面包屑式步行者航位推算(PDR)在无GPS/基础设施环境中记录路径并生成反向引导,实验显示轨迹RMSE降低15.76%。

Comments 9 pages, 6 figures, 1 table, and 19 equations

详情
AI中文摘要

在GPS受限且外部定位基础设施可能不可用或不可靠的环境中,可靠的返回导航仍然是一个重要挑战。本文提出ForestBack,一种基于面包屑式步行者航位推算(PDR)的无基础设施行人返回导航框架。该系统将用户的行走路线记录为一系列可逆的面包屑节点,并在无需GPS、Wi-Fi、蓝牙信标或预装基础设施的情况下生成反向路径引导。ForestBack集成了基于加速度的步态检测、自适应步长估计、磁力计辅助航向估计、气压高度校正以及双向面包屑路径重建。该系统使用一条包含五个检查点的室内避障路线进行评估,用户围绕一个中心障碍物导航。评估使用了包含36次行走试验和42,474个时间序列样本的数据集,包括IMU信号、磁力计读数、气压变量、转弯事件标签、地面真实轨迹、基线PDR输出、提出的ForestBack输出以及功率相关测量。实验结果表明,与传统PDR相比,ForestBack将平均RMSE从1.129米降低到0.965米,提高了15.76%。平均最终位置误差从1.781米降低到1.388米,而转弯事件检测一致性达到约99.90%。这些结果表明,ForestBack在避障场景中改善了轨迹重建和路径保持的返回引导。发布的数据集和分析笔记本支持可重复性以及未来对基于PDR的无基础设施返回导航系统的基准测试。

英文摘要

Reliable return navigation remains an important challenge in GPS-denied environments where external positioning infrastructure may be unavailable or unreliable. This paper presents ForestBack, an infrastructure-free pedestrian return navigation framework based on breadcrumb-based pedestrian dead reckoning (PDR). The system records a user's walking route as a sequence of reversible breadcrumb nodes and generates reverse-path guidance without requiring GPS, Wi-Fi, Bluetooth beacons, or pre-installed infrastructure. ForestBack integrates acceleration-based step detection, adaptive step-length estimation, magnetometer-assisted heading estimation, barometric-altitude correction, and bidirectional breadcrumb path reconstruction. The system was evaluated using an indoor obstacle-avoidance route with five checkpoints, where the user navigated around a central obstacle. A dataset of 36 walking trials and 42,474 time-series samples was used for evaluation, including IMU signals, magnetometer readings, barometric variables, turn-event labels, ground-truth trajectories, baseline PDR outputs, proposed ForestBack outputs, and power-related measurements. Experimental results show that ForestBack reduced the mean RMSE from 1.129 m to 0.965 m compared with traditional PDR, corresponding to a 15.76% improvement. The mean final-position error was reduced from 1.781 m to 1.388 m, while turn-event detection consistency reached approximately 99.90%. These results indicate that ForestBack improves trajectory reconstruction and route-preserving return guidance in obstacle-avoidance scenarios. The released dataset and analysis notebook support reproducibility and future benchmarking of infrastructure-free PDR-based return navigation systems.
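The breadcrumb idea can be sketched as plain dead reckoning that stores one node per detected step, followed by a reverse walk over those nodes. This is a minimal illustration under stated assumptions, not the ForestBack pipeline: step lengths and headings are taken as given (the paper estimates them from IMU and magnetometer data), headings are measured from the +x axis, and both function names are hypothetical.

```python
import math

def record_breadcrumbs(steps, start=(0.0, 0.0)):
    """Integrate (step_length_m, heading_rad) pairs into breadcrumb nodes."""
    x, y = start
    crumbs = [(x, y)]
    for length, heading in steps:
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        crumbs.append((x, y))
    return crumbs

def return_guidance(crumbs):
    """Walk the breadcrumbs in reverse and return a (distance_m, heading_rad)
    instruction from each node to the one recorded before it."""
    guidance = []
    for (x0, y0), (x1, y1) in zip(crumbs[::-1], crumbs[-2::-1]):
        guidance.append((math.hypot(x1 - x0, y1 - y0),
                         math.atan2(y1 - y0, x1 - x0)))
    return guidance

# Example: one step along +x, then one step along +y, then the way back.
crumbs = record_breadcrumbs([(1.0, 0.0), (1.0, math.pi / 2)])
back = return_guidance(crumbs)
```

Following the stored nodes in reverse retraces the outbound route, which is why the guidance goes around an obstacle instead of pointing straight at the start.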

2405.14154 2026-06-15 cs.RO 版本更新

Cross-Stage Sensorimotor Perception Scheduling and Sparse Map Encoding for Efficient Edge Embodied Navigation

跨阶段感知运动调度与稀疏地图编码用于高效边缘具身导航

Yaotian Liu, Sri Sai Rakesh Nakkilla, Xiangyu Zhou, Yu Cao, Jeff Zhang

发表机构 * Arizona State University(亚利桑那州立大学) University of Minnesota(明尼苏达大学)

AI总结 针对边缘设备上具身导航的延迟和内存瓶颈,提出SKIP调度器(基于安全跳过准则)和SCOUT稀疏编码器,实现1.7倍加速、50.5%内存降低和7.1% SPL提升。

Comments 9 pages, 6 figures

详情
AI中文摘要

具身智能体必须在严格的延迟、内存和能量预算下,在嵌入式硬件上完成从感知到动作的闭环,这使得部署成为一个系统级协同设计问题,而非模型精度问题。我们针对模块化目标导航(ObjectNav)研究了这一挑战,其中我们的性能分析显示语义地图构建主导了每步延迟,而目标预测主导了峰值内存。我们将边缘具身导航部署形式化为一个预算约束的设计空间问题,并引入了两个正交优化旋钮:SKIP,一种自适应感知运动调度器,将安全跳过形式化为有界地图影响准则,并学习一个轻量级预测器,在每个FORWARD步骤中从廉价传感器线索估计该准则,暴露了一个原则性的质量-效率旋钮(基于深度的更新始终保留);以及SCOUT,一种稀疏上下文编码器,将活动地图区域上的子流形稀疏卷积与轻量级密集上下文流相结合。在HM3D上,在服务器和嵌入式平台上,SKIP+SCOUT在选定操作点相比密集基线实现了高达1.7倍的端到端加速、50.5%的峰值内存降低和7.1%的SPL提升,优于朴素的小型感知骨干网络。SKIP可迁移到第二个模块化流水线(PONI),性能几乎无损,并且在深度传感器噪声下保持鲁棒。SKIP+SCOUT共同为边缘物理AI系统揭示了一系列设备感知的帕累托操作点。

英文摘要

Embodied agents must close a perception-to-action loop on embedded hardware under tight latency, memory, and energy budgets, making deployment a system-level co-design problem rather than a model-accuracy problem. We study this challenge for modular Object Goal Navigation (ObjectNav), where our profiling shows semantic mapping dominates per-step latency while goal prediction dominates peak memory. We formulate edge embodied navigation deployment as a budget-constrained design-space problem and introduce two orthogonal optimization knobs: SKIP, an adaptive sensorimotor scheduler that formalizes safe skipping as a bounded map-impact criterion and learns a lightweight predictor to estimate it from cheap sensor cues at each FORWARD step, exposing a principled quality-efficiency knob (depth-based updates are always retained); and SCOUT, a sparse-context encoder that couples submanifold sparse convolutions on active map regions with a lightweight dense context stream. On HM3D across server and embedded platforms, SKIP+SCOUT delivers up to 1.7x end-to-end speedup, 50.5% lower peak memory, and 7.1% higher SPL than the dense baseline at the selected operating point, outperforming naively smaller perception backbones. SKIP transfers to a second modular pipeline (PONI) with near-lossless performance and remains robust under depth-sensor noise. Together, SKIP+SCOUT expose a family of device-aware Pareto operating points for edge physical AI systems.

2512.21201 2026-06-15 cs.RO cs.AI cs.CV 版本更新

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

薛定谔的导航者:为零样本目标导航设想未来轨迹集合

Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学) Shanghai University of International Business and Economics(上海对外经贸大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出一种信念感知框架,在推理时通过轨迹条件化的3D世界模型设想多个未来场景,结合自适应遮挡物感知采样和未来感知价值图,提升零样本目标导航在遮挡严重环境中的隐蔽目标发现和风险感知路径选择。

详情
AI中文摘要

零样本目标导航(ZSON)要求机器人在未见环境中找到目标物体,无需任务特定的微调或预建地图,这是通用服务机器人的关键能力。然而,在模拟中表现良好的方法在杂乱的真实世界场景中往往会退化,这些场景存在严重遮挡和潜在危险,大面积的未观察区域使得单场景推理脆弱且不安全。我们提出薛定谔的导航者,一个信念感知框架,在推理时对多个轨迹条件化的设想3D未来进行推理。给定候选路径,轨迹条件化的3D世界模型预测假设的观察结果,并保持多个合理场景实现的叠加,而不是承诺于单一地图。自适应遮挡物感知采样器将想象引导至不确定性关键区域,而未来感知价值图(FAVM)聚合设想的未来,以实现鲁棒、主动的动作选择。在模拟和物理Go2四足机器人上的实验表明,薛定谔的导航者优于强ZSON基线,在遮挡严重的导航场景中提高了隐蔽目标发现和风险感知路径点选择。这些结果突显了设想3D未来作为在不确定真实世界环境中进行零样本导航的可扩展和通用策略。

英文摘要

Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.

5. 人机交互与协作机器人 5 篇

2606.14083 2026-06-15 cs.RO 新提交

The N2D Haptic Glove: A Multi-Finger Glove for 2D Directional Force Feedback for Contact Rich Manipulation

N2D 触觉手套:用于接触丰富操作的多指二维方向力反馈手套

Yao-Ting Huang, Jake Honma, Omar Hernandez, Logan Li, Kaitlin Calimbahin, Bryce Hackel, Michael C. Yip

发表机构 * University of California San Diego(加州大学圣地亚哥分校)

AI总结 提出 N2D 触觉手套,通过绞盘驱动在指尖提供二维弯曲-伸展力反馈,显著降低遥操作中的接触力误差并提高一致性。

详情
AI中文摘要

人类在操作过程中依赖方向性指尖力来探测和调节接触,但大多数可穿戴触觉手套仅提供振动或单轴力,导致力方向模糊。缺乏方向性提示时,用户必须仅凭视觉推断接触力,常导致过度按压、控制不一致以及机器人遥操作精度下降。我们提出 N2D 触觉手套,一种多指可穿戴设备,利用绞盘驱动传输在指尖提供平面弯曲-伸展力,实现高透明度力反馈。通过台架验证和涉及机器人手臂与手触觉遥操作的用户研究,我们证明与仅视觉和单轴触觉基线相比,平面指尖反馈在精确操作中显著降低接触力误差,提高试验间一致性,并增强轴向探测任务中的整体用户体验。这些发现确立了 N2D 触觉手套和基于方向手指的触觉设备作为接触丰富遥操作、沉浸式虚拟现实模拟以及机器人从演示中学习的有前景模式。N2D 触觉手套的硬件和软件系统将完全开源,网址为 https://ucsdarclab.github.io/n2d-glove/。

英文摘要

Humans rely on directional fingertip forces to probe and regulate contact during manipulation, yet most wearable haptic gloves render only vibration or single-axis force, leaving force direction ambiguous. Without directional cues, users must infer contact force from vision alone, often leading to over-pressing, inconsistent control, and reduced precision in robotic teleoperation. We present the N2D Haptic Glove, a multi-finger wearable device that renders planar flexion-extension fingertip forces using capstan-drive transmissions for high-transparency force feedback. Through benchtop validations and a user study involving haptic teleoperation of a robotic arm and hand, we demonstrate that compared to visual-only and single-axis haptic baselines, planar fingertip feedback significantly reduces contact force error during precise manipulation, improves trial-to-trial consistency, and enhances overall user experience in axial probing tasks. These findings establish the N2D Haptic Glove and directional finger-based haptic devices as a promising modality for contact-rich teleoperation, immersive virtual reality simulations, and robot learning from demonstrations. The N2D Haptic Glove's hardware and software system will be fully open-sourced at https://ucsdarclab.github.io/n2d-glove/.

2606.14218 2026-06-15 cs.RO cs.AI cs.LG 新提交

Universal Manipulation Exoskeleton: Learning Compliant Whole-body Policies with Real-time Torque Feedback

通用操控外骨骼:利用实时扭矩反馈学习全身柔顺策略

Litian Liang, Jingxi Xu, Xinda Qi, Yujun Cai, Houzhu Ding, Luqi Wang, Zhixin Sun, Jyh-Herng Chow, Ming Yang, Mark Cutkosky

发表机构 * Ant Group(蚂蚁集团) Stanford University(斯坦福大学)

AI总结 提出通用操控外骨骼(UME),通过实时触觉扭矩反馈和全身数据采集,使机器人学习主动柔顺策略,在受限空间中完成移动操作、力控翻转等任务。

详情
AI中文摘要

为了使机器人在家庭环境中安全工作,它们需要具备柔顺性,并在接触过程中对扭矩和力反馈做出反应。然而,现有的大多数数据采集管道仍然缺乏捕捉力和扭矩数据以学习主动柔顺策略的能力。在本文中,我们提出了通用操控外骨骼(UME),一种上肢外骨骼,它提供实时触觉扭矩反馈,同时记录整个手臂的配置和关节扭矩信号用于遥操作。凭借透明的扭矩反馈,人类操作员甚至可以在蒙眼的情况下拔出运动学约束的物体。UME成本低、重量轻且便携。配备嵌入式IMU,它支持移动操作的遥操作。通过我们提出的通用重定向算法,UME可以遥操作多种机器人,包括7自由度OpenArm、7自由度Franka和6自由度X-ARM。我们证明,这些能力的组合使得学习双臂、全身和主动柔顺策略成为可能,这些策略在高度受限的空间中有效运行。学习到的鲁棒自主策略在各种任务中实现了高成功率,包括长时程移动操作、力介导的箱子翻转、视觉遮挡的箱子推挤以及空间受限的桌面操作。视频、代码和更多信息可在https://ume-exo.github.io找到。

英文摘要

For robots to work safely in household environments, they need to be compliant and react to torque and force feedback during contact. However, the majority of existing data collection pipelines still lack the ability to capture force and torque data for learning active compliant policies. In this paper, we present Universal Manipulation Exoskeleton (UME), an upper-limb exoskeleton that provides real-time haptic torque feedback while recording whole-arm configurations and joint torque signals for teleoperation. With transparent torque feedback, human operators can even unsheathe kinematically constrained objects while blindfolded. UME is low-cost, lightweight, and portable. Equipped with an embedded IMU, it enables teleoperation for mobile manipulation. With our proposed universal retargeting algorithm, UME can teleoperate a range of robots, including the 7DoF OpenArm, 7DoF Franka, and 6DoF X-ARM. We demonstrate that this combination of capabilities enables learning bimanual, whole-body, and active compliant policies that operate effectively in highly constrained spaces. The learned robust autonomous policies achieve high success rates across a variety of tasks, including long-horizon mobile manipulation, force-mediated box flipping, visually occluded box pushing, and space-constrained tabletop manipulation. Videos, code, and additional information can be found at https://ume-exo.github.io.

2606.14602 2026-06-15 cs.RO 新提交

What Robots Do Matters More Than What They Look Like: Task Context Shapes Trust in Educational HRI

机器人做什么比它们长什么样更重要:任务背景塑造教育人机交互中的信任

Anna-Maria Velentza, Konstantina Nikou, Anne-Gwenn Bosser, Nikolaos Fachantidis

发表机构 * LIRES Robotics Lab, University of Macedonia(马其顿大学LIRES机器人实验室)

AI总结 通过视频实验(N=81)发现,任务类型(教学、指导、索要个人信息)对信任有显著主效应,而机器人外观无显著影响,表明任务背景比物理外观更关键。

Comments Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan

详情
AI中文摘要

社交辅助机器人(SARs)越来越多地部署在教育和信息共享环境中,这得益于大型语言模型的进步,使得流畅的实时交互成为可能。尽管机器人外观的多样性不断增加,但尚不清楚单一机器人外观是否适用于不同的交互任务,或者信任是否主要取决于情境因素。在本研究中,我们考察了机器人外观和任务类型如何共同影响对机器人的信任。通过一项受试者内视频实验(N=81),参与者评估了三种外观不同的机器人在执行三种教育相关任务(教学、程序性指导和个人信息讨论)时的表现。重复测量分析结果显示,任务对信任有强烈的主效应:参与者在指导任务中报告了最高的信任度,在教学活动中信任度中等,而当机器人索要个人信息时信任度显著降低。相比之下,机器人外观没有显著的主效应,外观与任务之间的交互作用也不明显。这些发现表明,人机交互中的信任更多地由任务背景而非物理外观所塑造。通过关注未来的教育工作者作为最终用户,本研究为教育环境中任务感知的机器人部署提供了实证证据,并强调了将机器人角色和行为与交互目标对齐的重要性,而非仅仅依赖拟人化设计。

英文摘要

Socially assistive robots (SARs) are increasingly deployed in educational and information-sharing contexts, supported by advances in large language models that enable fluent real-time interaction. Despite the growing diversity of robot embodiments, it remains unclear whether a single robot appearance is appropriate across different interaction tasks or whether trust depends primarily on contextual factors. In this study, we examine how robot appearance and task type jointly influence trust in robots. Using a within-subjects video-based experiment (N = 81), participants evaluated three robots with distinct appearances while performing three educationally relevant tasks: teaching, procedural instruction, and personal-information discussion. Results from repeated-measures analyses show a strong main effect of task on trust, with participants reporting the highest trust during instructional guidance, moderate trust during teaching activities, and significantly lower trust when robots requested personal information. In contrast, robot appearance showed no significant main effect, and the interaction between appearance and task was marginal. These findings suggest that trust in human-robot interaction is shaped more strongly by task context than by physical embodiment alone. By focusing on future educators as end users, this work contributes empirical evidence toward task-aware robot deployment in educational environments and highlights the importance of aligning robot roles and behaviors with interaction goals rather than relying solely on anthropomorphic design.

2606.14617 2026-06-15 cs.RO cs.SY eess.SY 新提交

Whole-Body Impedance Model Predictive Control for Safe Physical Human-Robot Interaction on Floating-Base Platforms

全身阻抗模型预测控制:浮基平台上的安全人机物理交互

Yongyan Cao

发表机构 * Voryx Robotics

AI总结 提出三层架构的全身阻抗MPC,通过质心MPC规划接触力、优先级WBC层将平衡任务解算为关节力矩、滚动时域(receding-horizon)QP预测并抑制人机交互扰动,实现浮基机器人零稳态误差安全交互。

AI中文摘要

浮基机器人必须在刚性接触约束下保持平衡,同时与人类安全交互。现有的全身控制(WBC)框架将全部关节空间分配给运动,或依赖固定增益阻抗反馈,在持续的人机物理交互(pHRI)力作用下积累稳态误差。本文将作者先前针对固定基座的两层阻抗MPC扩展到浮基平台,采用三层架构:质心MPC在500毫秒时域内规划接触力;优先级驱动的WBC层通过接触一致性零空间投影将平衡任务解算为关节力矩;剩余零空间由滚动时域(receding-horizon)二次规划(QP)控制,该QP使用卡尔曼增强状态预测并抑制pHRI扰动。接触一致性反馈线性化将手臂末端执行器系统简化为在每个接触模式下具有恒定状态矩阵的双积分器,从而允许离线预计算QP代价并实现≥1 kHz运行。一种协方差膨胀协议在接触模式切换时保持扰动估计,保证在有界恒定pHRI负载下零稳态误差;阻抗等价定理表明无限时域极限恢复经典任务空间阻抗定律,其有效质量、阻尼和刚度随姿态和接触配置自适应。在17自由度双足机器人和Unitree G1人形机器人上的仿真验证了该设计。

英文摘要

Floating-base robots must balance under rigid contact constraints while interacting safely with humans. Existing whole-body control (WBC) frameworks allocate the full joint space to locomotion or rely on fixed-gain impedance feedback that accumulates steady-state error under sustained physical human-robot interaction (pHRI) forces. This paper extends the authors' fixed-base two-layer Impedance MPC to floating-base platforms through a three-level architecture: a centroidal MPC plans contact forces over a 500 ms horizon; a priority-driven WBC layer resolves balance into joint torques through contact-consistent null-space projection; and the residual null space is governed by a receding-horizon quadratic program (QP) that predicts and rejects pHRI disturbances using a Kalman-augmented state. A contact-consistent feedback linearization reduces the arm end-effector plant to a double integrator with a constant state matrix within each contact mode, enabling offline precomputation of the QP cost and ≥1 kHz operation. A covariance-inflation protocol preserves the disturbance estimate across contact-mode switches, guaranteeing zero steady-state error under bounded constant pHRI loads, and an Impedance Equivalence Theorem shows the infinite-horizon limit recovers a classical task-space impedance law whose effective mass, damping, and stiffness adapt to posture and contact configuration. Simulations on a 17-DOF biped and the Unitree G1 humanoid validate the design.
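
The link between a constant state matrix and offline precomputation can be illustrated with a minimal sketch. It is not the paper's controller: it assumes a single-axis double integrator, a quadratic cost on position, velocity, and input, and hypothetical function names. Because the discrete-time matrices do not depend on the robot state, the condensed QP Hessian `S^T Q S + R` can be built once before the control loop starts.

```python
def double_integrator(dt):
    """Exact zero-order-hold discretization of position/velocity dynamics."""
    A = [[1.0, dt], [0.0, 1.0]]
    B = [0.5 * dt * dt, dt]
    return A, B


def mat_vec(A, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]


def input_to_state(A, B, horizon):
    """S[k][j] = A^(k-j) B: effect of input u_j on state x_{k+1} (zero if j > k)."""
    powers = [B]
    for _ in range(horizon - 1):
        powers.append(mat_vec(A, powers[-1]))  # powers[i] = A^i B
    return [[powers[k - j] if j <= k else [0.0, 0.0]
             for j in range(horizon)] for k in range(horizon)]


def precompute_hessian(dt, horizon, q_pos, q_vel, r):
    """Condensed QP Hessian S^T Q S + R; depends only on constants, so it can be cached."""
    A, B = double_integrator(dt)
    S = input_to_state(A, B, horizon)
    H = [[0.0] * horizon for _ in range(horizon)]
    for i in range(horizon):
        for j in range(horizon):
            H[i][j] = sum(q_pos * S[k][i][0] * S[k][j][0] +
                          q_vel * S[k][i][1] * S[k][j][1]
                          for k in range(horizon))
        H[i][i] += r  # input penalty on the diagonal
    return H


# Built once offline; a 1 ms step corresponds to the 1 kHz rate quoted above.
H_CACHED = precompute_hessian(dt=0.001, horizon=5, q_pos=100.0, q_vel=1.0, r=0.01)
```

Only the linear term of the QP, which depends on the current state and disturbance estimate, would then need to be recomputed at each step.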

2604.01463 2026-06-15 cs.RO cs.AI cs.HC 版本更新

Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis

基于低负担LLM的偏好学习:通过自然语言反馈为瘫痪用户个性化辅助机器人

Keshav Shankar, Dan Ding, Wei Gao

发表机构 * Electrical and Computer Engineering(电气与计算机工程) Rehabilitation Science and Technology(康复科学与技术)

AI总结 针对严重运动障碍用户,提出一种低负担离线框架,利用大语言模型将非结构化自然语言反馈转化为确定性机器人控制策略,并通过职业治疗框架解码用户需求,显著降低用户负担。

Comments Accepted to IEEE RO-MAN 2026

AI中文摘要

物理辅助机器人需要个性化行为以确保用户安全和舒适。然而,传统的偏好学习方法(如详尽的成对比较)会给严重运动障碍用户带来巨大的身体和认知疲劳。为解决这一问题,我们提出了一种低负担的离线框架,将非结构化自然语言反馈直接转化为确定性的机器人控制策略。为了安全地弥合模糊的人类语言与机器人代码之间的差距,我们的流程使用基于职业治疗实践框架的大语言模型(LLMs)。这种临床推理将主观用户反应解码为明确的生理和心理需求,然后映射到透明的决策树中。在部署前,自动化的“LLM-as-a-Judge”验证代码的结构安全性。我们在一个模拟的餐食准备研究中,对10名瘫痪成年人进行了系统验证。结果表明,与传统的基线方法相比,我们的自然语言方法显著降低了用户的工作负担。此外,职业治疗师确认生成的策略是安全的,并且准确反映了用户偏好。

英文摘要

Physically Assistive Robots require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause substantial physical and cognitive fatigue for users with severe motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework. This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, occupational therapists confirmed the generated policies are safe and accurately reflect user preferences.

6. 具身智能与视觉语言动作模型 6 篇

2606.13886 2026-06-15 cs.RO cs.CV cs.LG 新提交

PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

PhysVLA:迈向面向具身机器人操作的物理接地VLA

Namai Chandra, Shriram Damodaran, Lin Wang

发表机构 * IIT Madras(印度理工学院马德拉斯分校) Nanyang Technological University(南洋理工大学)

AI总结 提出PhysVLA,一种即插即用的推理时框架,通过相位有限状态机和选择性欧拉-拉格朗日门,在不重新训练的情况下为任何冻结的VLA骨干注入物理约束,提升成功率、稳定性和轨迹效率。

Comments 9 pages, 5 figures, supplementary material included

AI中文摘要

视觉-语言-动作(VLA)模型擅长将视觉输入和自然语言指令直接映射到机器人控制策略。然而,由于它们主要通过拟合行为演示数据进行训练,并未明确强制执行刚体动力学或接触约束等基本物理原理。这暴露了一个关键的物理差距:在单步或分块VLA之上施加的标准时间平滑虽然改善了轨迹质量,却以新增短期记忆无法解决的失败为代价。为弥补这一差距,我们提出PhysVLA(Physics-VLA),一种即插即用、推理时的框架,旨在包装任何冻结的VLA骨干,无需重新训练、微调或权重访问,每个控制步骤的开销小于1毫秒。PhysVLA拦截预测的控制动作,仅读取模拟器或系统状态,并应用双层校正:(i)一个相位感知的有限状态机,用于结构化离散任务段(接近、抓取、运输和放置),以及(ii)一个选择性欧拉-拉格朗日门,仅在动力学预言器检测到运动动力学(kinodynamic)不一致时激活。在LIBERO-Spatial上使用7自由度Franka Panda对OpenVLA、OpenVLA-OFT、Force-VLA和Generalist-VLA进行评估,该框架实现了高达17%的绝对成功率提升和高达19%的稳定性提升,且没有任何单项任务出现性能回退,在所有四个骨干上轨迹效率提升高达15%,并在Robosuite Lift跨模拟器扫描中显示出高达10倍的轨迹急动度鲁棒性提升。我们还在真实的Agilex Piper机械臂上通过拾取和放置任务进一步验证了该框架,确认PhysVLA无需重新训练即可迁移到物理硬件,成功率提升高达50%,将物理意识确立为一种可组合、骨干无关的运行时模块。

英文摘要

Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.
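
The two-layer correction described above can be sketched as a phase state machine plus a gate that leaves the action untouched unless a consistency check fails. This is an illustration only: the transition conditions, the acceleration-limit check standing in for the dynamics oracle, and all names and thresholds are assumptions, not the PhysVLA implementation.

```python
PHASES = ("approach", "grasp", "transport", "place")


def next_phase(phase, dist_to_object, gripper_closed, dist_to_goal, near=0.02):
    """Advance through the four task segments; thresholds are hypothetical."""
    if phase == "approach" and dist_to_object < near:
        return "grasp"
    if phase == "grasp" and gripper_closed:
        return "transport"
    if phase == "transport" and dist_to_goal < near:
        return "place"
    return phase


def gate_action(action, prev_action, dt, max_accel):
    """Selective gate on velocity commands.

    Each component passes through unchanged unless the change since the
    previous command implies an acceleration above max_accel, in which case
    only that component is clamped. Returns (action, was_corrected).
    """
    limit = max_accel * dt
    corrected, changed = [], False
    for a, p in zip(action, prev_action):
        delta = a - p
        if abs(delta) > limit:
            corrected.append(p + (limit if delta > 0 else -limit))
            changed = True
        else:
            corrected.append(a)  # untouched: the gate is inactive here
    return corrected, changed
```

A wrapper would call `next_phase` and `gate_action` once per control step on the action returned by the frozen policy, which is why no access to model weights is needed.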

2606.14084 2026-06-15 cs.RO 新提交

Self-Improving VLA Policies: Selected Diffusion Noise for Spurious-Robust Action Smoothing

自我改进的VLA策略:面向抗虚假相关动作平滑的选择性扩散噪声

Duc Minh Nguyen, Bao-Ngoc Dao, Tung M. Luu, Binh Gia Nguyen, Vinh Tong, Anji Liu, Vu N. Duong, Dung D. Le, Daniel Sonntag, Trung Le, Ngan Le, Jan Peter, An Thai Le, Minh Nhat Vu, Mathias Niepert, Khoa D. Doan, Duy M. H. Nguyen, Vien Anh Ngo

发表机构 * Center for AI Research, VinUniversity(VinUniversity人工智能研究中心) VinRobotics KAIST(韩国科学技术院) University of Stuttgart(斯图加特大学) IMPRS-IS(国际马克斯·普朗克智能系统研究学院) National University of Singapore(新加坡国立大学) DFKI(德国人工智能研究中心) University of Oldenburg(奥尔登堡大学) Monash University(莫纳什大学) University of Arkansas(阿肯色大学) TU Darmstadt(达姆施塔特工业大学)

AI总结 提出一种无需训练的选择性扩散噪声方法,通过动态采样噪声向量增强视觉-语言-动作策略的鲁棒性和动作平滑性,在仿真和真实场景中成功率分别提升8%和10%。

AI中文摘要

基于扩散的视觉-语言-动作(VLA)策略在机器人操作中实现了强大的泛化能力,但对虚假视觉相关性和噪声动作生成仍然敏感,导致在扰动下行为脆弱。我们引入了选择性扩散噪声(SDN),这是一种简单的、无需训练的测试时方法,通过利用扩散噪声空间作为可控自由度来提高鲁棒性和成功率。SDN动态采样与参考集最大分离的噪声向量,以减轻对虚假线索的依赖,同时选择产生更一致动作轨迹的候选。这种双重目标即使在物体被掩蔽的观测下也能鼓励稳定行为,并在不修改模型参数的情况下减少动作抖动。我们在两个模拟基准(Google Robot、Widow-X)和两个真实世界机器人数据集上,对多种VLA策略(包括pi_0、Groot-N1.5和Groot-N1.6)评估了SDN。SDN在模拟环境中一致地将成功率提高了8%,在真实环境中提高了10%,同时产生更平滑、更稳定的动作。我们的结果强调,扩散噪声选择可以作为在测试时增强VLA策略的有效且通用机制。

英文摘要

Diffusion-based Vision-Language-Action (VLA) policies enable strong generalization in robotic manipulation, but remain sensitive to spurious visual correlations and noisy action generation, leading to brittle behavior under perturbations. We introduce Selected Diffusion Noise (SDN), a simple, training-free test-time method that improves both robustness and success rate by leveraging the diffusion noise space as a controllable degree of freedom. SDN dynamically samples noise vectors that are maximally separated from a reference set to mitigate reliance on spurious cues, while selecting candidates that yield more coherent action trajectories. This dual objective encourages stable behavior even under object-masked observations and reduces action jitter without modifying model parameters. We evaluate SDN on two simulation benchmarks (Google Robot, Widow-X) and two real-world robotic datasets across multiple VLA policies, including pi_0, Groot-N1.5, and Groot-N1.6. SDN consistently improves success rates by +8% in simulation and +10% in real-world settings, while producing smoother and more stable actions. Our results highlight that diffusion noise selection can serve as an effective and general mechanism for enhancing VLA policies at test time.
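
The dual objective (separation from a reference set, plus trajectory coherence) can be sketched as a scoring rule over candidate noise vectors. The abstract does not give the actual scoring function, so the combination below, the `decode` callable standing in for the policy's denoiser, and every parameter name are assumptions.

```python
import math
import random


def min_distance(vec, reference_set):
    """Distance from a candidate to its nearest reference noise vector."""
    return min(math.dist(vec, ref) for ref in reference_set)


def jitter(trajectory):
    """Sum of squared step-to-step changes in a scalar action sequence."""
    return sum((b - a) ** 2 for a, b in zip(trajectory, trajectory[1:]))


def select_noise(reference_set, decode, dim, num_candidates=16,
                 jitter_weight=1.0, seed=0):
    """Pick the candidate that is far from the references and decodes smoothly.

    `decode` is a hypothetical stand-in mapping a noise vector to an action
    trajectory; in a real policy this would be the diffusion denoiser.
    """
    rng = random.Random(seed)
    best, best_score = None, -math.inf
    for _ in range(num_candidates):
        cand = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = min_distance(cand, reference_set) - jitter_weight * jitter(decode(cand))
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```

Since only the initial noise is chosen, the policy weights stay untouched, consistent with the training-free claim in the abstract.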

2606.14409 2026-06-15 cs.RO cs.AI 新提交

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

Hy-Embodied-0.5-VLA:从视觉-语言-动作模型到真实世界机器人学习栈

He Zhang, Lingzhu Xiang, Haitao Lin, Zeyu Huang, Minghui Wang, Dingyan Zhong, Yubo Dong, Yihao Wu, Yongming Rao, Dongsheng Zhang, Wanjia He, Ling Chen, Kai Huang, Jiahao Chen, Sichang Su, Xumin Yu, Ziyi Wang, Chengwei Zhu, Xiao Teng, Yuchun Guo, Yufeng Zhang, Yuandong Liu, Rui Wang, Zisheng Lu, Han Hu, Zhengyou Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出端到端机器人学习栈HyVLA-0.5,涵盖数据收集、模型设计、预训练与微调、RL后训练及真实部署,各组件协同工作。

AI中文摘要

在本报告中,我们提出Hy-Embodied-0.5-VLA,简称HyVLA-0.5,一个覆盖完整机器人学习栈的端到端系统:数据收集、模型设计、持续预训练和监督微调、RL后训练以及真实世界部署。每个组件在该栈中扮演着独特的角色。

英文摘要

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

2606.14010 2026-06-15 cs.CV cs.LG cs.RO 交叉投稿

RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

RT-VLA:通过知识蒸馏实现实时视觉-语言-动作模型

Xiangyu Huang, Zhenlin Hua, Han Zhou, Shounak Sural, Ragunathan Rajkumar

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出RT-VLA,通过多级监督蒸馏将SimLingo模型的能力压缩至轻量学生模型,在保持竞争性能的同时将推理时间降低44.8倍(纯视觉模式)和7.9倍(视觉+语言模式),实现实时可解释的VLA自动驾驶。

AI中文摘要

视觉-语言-动作(VLA)模型通过联合建模视觉感知、语言推理、可解释性和动作预测,在端到端自动驾驶中展现出强大潜力。然而,其庞大的视觉-语言骨干网络和推理模块引入了显著的推理延迟,从而阻碍了它们在道路网络严苛现实中的部署。我们提出RT-VLA,一种轻量级、蒸馏的VLA模型,通过多级监督蒸馏将最先进的SimLingo模型的驾驶和推理能力迁移到紧凑的学生模型中。RT-VLA保留了基于语言的推理,并通过离线语言分析安全关键驾驶时刻来支持事后解释,而不增加实时控制的延迟。与SimLingo教师模型相比,RT-VLA在保持竞争性的闭环驾驶和语言推理性能的同时,在纯视觉模式下将推理时间减少了44.8倍,在视觉+语言模式下减少了7.9倍。这些结果表明,监督蒸馏是构建实时、可解释的VLA风格自动驾驶模型的实用方法。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.

2606.14048 2026-06-15 cs.CV cs.RO 交叉投稿

WAM4D: Fast 4D World Action Model via Spatial Register Tokens

WAM4D:通过空间寄存器令牌实现快速4D世界动作模型

Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang

发表机构 * Peking University(北京大学) The Hong Kong University of Science and Technology(香港科技大学) Beijing Innovation Center of Humanoid Robotics(北京人形机器人创新中心)

AI总结 提出WAM4D,利用轻量级空间寄存器令牌将预训练几何先验迁移至因果视频-动作变换器,实现高效4D世界动作建模,在RoboTwin 2.0和真实操作任务中提升空间一致性并保持快速推理。

Comments 15 pages, 7 figures, 9 tables

AI中文摘要

世界动作模型(WAMs)最近在联合建模未来观测和可执行机器人动作方面显示出前景。然而,大多数现有的WAMs仍在2D视频或潜在空间中运行,其中视觉上合理的展开缺乏精确操作所需的3D空间约束和被遮挡的接触几何。虽然几何基础模型为从视觉观测恢复密集3D结构和运动提供了强大的先验,但迫使WAMs预测密集4D表示会引入昂贵的几何解码并减慢因果动作生成。为了解决这一权衡,我们提出了WAM4D,一种快速的4D世界动作模型,它使用轻量级空间寄存器令牌作为训练时的未来深度读出,将预训练的几何先验迁移到因果视频-动作变换器中,然后移除寄存器分支以实现轻量级动作推理。为了防止非因果捷径,我们进一步为混合变换器(MoT)WAM骨干设计了因果混合注意力,定义了视频、动作和几何令牌之间的模态特定可见性。在RoboTwin 2.0和具有挑战性的真实世界操作任务上的全面实验表明,WAM4D提高了空间一致性,并在保持高效推理的同时实现了具有竞争力的动作预测。

英文摘要

World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

2606.14153 2026-06-15 cs.CV cs.RO 交叉投稿

Encoder Winners Do Not Reliably Transfer Across VLA Backbone Scale: A Frozen-Backbone Grafting Diagnostic

编码器胜者无法可靠跨VLA骨干网络规模迁移:一种冻结骨干嫁接诊断方法

Qingping Zeng, Fei She

发表机构 * Tsinghua University(清华大学)

AI总结 提出冻结骨干嫁接诊断方法,发现小规模VLA上最优的视觉编码器在大规模骨干上并非最优,编码器选择依赖于骨干网络规模。

Comments 23 pages, 5 figures, 8 tables

AI中文摘要

视觉-语言-动作(VLA)策略通常从其上游VLM发布中继承视觉编码器,但目前尚不清楚在小规模VLA上验证的编码器选择是否能迁移到更大的骨干网络上。我们引入了一种冻结骨干嫁接诊断方法:将已发布VLA的视觉塔替换为候选编码器,采用固定协议(自适应平均池化、LayerNorm和单个可训练的线性投影器),同时冻结语言模型和动作专家。在四个编码器、两个LIBERO套件、两个骨干网络(SmolVLA-450M和$\pi_{0.5}$-3.3B)以及每个单元两到三个随机种子(共40次主要嫁接运行,加上原生、LoRA、池化以及零/打乱图像对照,全部通过离线动作MSE评分)的条件下,小骨干网络的胜者无法可靠地选出大骨干网络的顶级编码器:SigLIP在SmolVLA上两个套件中均表现最佳,而在$\pi_{0.5}$上,DINOv2-small在空间套件中领先,物体套件则是对种子敏感的接近平局带;四个骨干-套件比较中的三个(以及12个种子级单元中的11个)支持依赖于骨干网络的排名。嫁接包装本身并非中性,在两个骨干网络上符号相反(在SmolVLA原生视觉塔上MSE增加45-56%,在$\pi_{0.5}$上降低50-52%),因此所有结论都依赖于固定的嫁接协议。我们将冻结嫁接定位为一种廉价的、针对目标骨干的诊断方法,应在大规模投入某个编码器之前运行,而非闭环部署声明。

英文摘要

Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), with the language model and action expert frozen. Across four encoders, two LIBERO suites, two backbones (SmolVLA-450M and $\pi_{0.5}$-3.3B), and two-to-three seeds per cell (40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE), the small-backbone winner does not reliably select the large-backbone top tier: SigLIP is best on SmolVLA across both suites, while on $\pi_{0.5}$ DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band; three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral with opposite sign across backbones (+45-56% MSE on the SmolVLA native tower, -50-52% on $\pi_{0.5}$), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.
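
The fixed grafting protocol (adaptive average pooling, LayerNorm, one linear projector) can be sketched in plain Python to show the data flow from candidate-encoder tokens to backbone-width tokens. This is an illustrative reading of the protocol, not the authors' code: LayerNorm is shown without learnable affine parameters, pooling is taken over the token axis, and the function names are assumptions.

```python
import math


def adaptive_avg_pool(tokens, out_len):
    """Pool a token sequence to out_len tokens using floor/ceil bin edges."""
    n = len(tokens)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -((-(i + 1) * n) // out_len)  # ceiling division
        window = tokens[start:end]
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no affine terms)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def linear(vec, weight, bias):
    """Project one token; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def graft_adapter(tokens, out_len, weight, bias):
    """Candidate-encoder tokens -> pooled, normalized, projected tokens.

    In the protocol described above, only `weight` and `bias` would be
    trained; the encoder, language model, and action expert stay frozen.
    """
    return [linear(layer_norm(t), weight, bias)
            for t in adaptive_avg_pool(tokens, out_len)]
```

Keeping the adapter this small is what makes the comparison across encoders cheap, although the abstract itself notes that the wrapper is not neutral.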

7. 多机器人与群体系统 2 篇

2606.14252 2026-06-15 cs.RO 新提交

Optimality-Preserving Decomposition for Scalable QAOA in Natural-Language-Guided Multi-Drone Assignment

面向自然语言引导的多无人机分配中可扩展QAOA的最优性保持分解

Junyeop Bang, Byongho Lee, Dohyun An, Hwangnam Kim

发表机构 * Korea University(高丽大学)

AI总结 提出端到端框架,集成微调大语言模型与量子-经典后端,通过约束保持图分割和动态规划合并,实现自然语言引导下多无人机任务分配的可扩展量子优化。

Comments 10 pages, 2 figures, 3 tables, preprint

AI中文摘要

随着多无人机机群的扩展,区域分配迅速演变为一个难以处理的NP-hard组合问题,使经典穷举搜索不堪重负。虽然量子优化有望打破这些经典瓶颈,但将人类意图中的复杂空间任务映射到受限的量子硬件上仍然是一个严峻挑战。为弥合这一差距,我们提出了一个端到端框架,集成了微调的大语言模型前端和高度可扩展的领域特定量子-经典后端。前端利用监督微调和直接偏好优化,将自由形式的自然语言指令转换为结构稳健的二次无约束二元优化约束,且无假阴性。为克服近期量子设备的严格量子比特限制,我们的框架采用了一种新颖的约束保持图分割器和基于压缩分隔符的动态规划合并。通过W态初始化和XY混频器在条件风险价值量子近似优化中结构性地编码约束,流水线保持高度紧凑。实验结果表明,该架构规避了经典扩展墙,在100%的理想化预言机案例和96.3%的实际QAOA采样下恢复了全局最优解,使得自然语言引导的任务分配在以前难以处理的规模上成为可能。

英文摘要

As multi-drone fleets scale, zone assignment rapidly evolves into an intractable NP-hard combinatorial problem that overwhelms classical exhaustive search. While quantum optimization promises to shatter these classical bottlenecks, mapping complex spatial tasks from human intent to restricted quantum hardware remains a severe challenge. To bridge this gap, we present an end-to-end framework integrating a fine-tuned Large Language Model (LLM) front-end with a highly scalable, domain-specific quantum-classical backend. The front-end utilizes Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to translate free-form natural language instructions into structurally robust Quadratic Unconstrained Binary Optimization (QUBO) constraints without false negatives. To overcome the strict qubit limits of near-term quantum devices, our framework features a novel constraint-preserving graph partitioner and a compressed separator-based dynamic programming (DP) merge. By structurally encoding constraints via W-state initialization and XY-mixers in Conditional Value-at-Risk Quantum Approximate Optimization (CVaR-QAOA), the pipeline stays highly compact. Empirical results demonstrate that this architecture circumvents classical scaling walls, recovering the global optimum on 100% of idealized oracle cases and 96.3% under real QAOA sampling, enabling natural-language-guided task allocation at previously intractable scales.
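
To make the CVaR objective concrete, the sketch below evaluates a QUBO energy for sampled bitstrings and averages only the lowest-energy fraction of samples, which is the usual definition of a CVaR objective for sampling-based optimizers. The toy QUBO, the dictionary representation, and the function names are assumptions for illustration; nothing here reproduces the paper's partitioner or circuit.

```python
import math


def qubo_energy(bits, qubo):
    """Energy sum_{i,j} Q[i, j] * x_i * x_j for a dict-based QUBO."""
    return sum(coeff * bits[i] * bits[j] for (i, j), coeff in qubo.items())


def cvar(energies, alpha):
    """Mean of the lowest ceil(alpha * K) sampled energies, 0 < alpha <= 1."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(energies)))
    lowest = sorted(energies)[:k]
    return sum(lowest) / k


# Toy one-hot choice: reward picking either variable, penalize picking both.
TOY_QUBO = {(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0}
samples = [[1, 0], [0, 1], [1, 1], [0, 0]]
energies = [qubo_energy(s, TOY_QUBO) for s in samples]
objective = cvar(energies, alpha=0.5)
```

With `alpha = 1` the objective reduces to the ordinary sample mean; smaller values focus the optimizer on the best samples.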

2605.25025 2026-06-15 cs.RO cs.SY eess.SY 版本更新

Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

动态流场中微群集运动优化的多目标多智能体强化学习方法

Josef Berman, Oren Gal

发表机构 * Hatter Department of Marine Technologies, Leon H. Charney School of Marine Sciences, University of Haifa(哈特尔海洋技术系,列昂·H·夏恩海洋科学学院,海法大学)

AI总结 提出混合CFD与多目标多智能体强化学习框架,通过PCGrad解决梯度冲突,在振荡流中优化微机器人集群的上游推进、能量效率和运动平滑性。

AI中文摘要

在生理真实、时间依赖的流体环境中协调微型机器人集群,仍然是生物医学和环境应用中的未解决挑战。我们提出了一种混合计算流体动力学-多目标多智能体强化学习框架,该框架将高保真不可压缩纳维-斯托克斯求解器与去中心化近端策略优化直接耦合,以在振荡流中学习物理一致的集群控制策略。十六个磁驱动微型机器人在脉动动脉波形中导航,同时优化上游推进、能量效率和运动平滑性,并通过PCGrad梯度手术协调相互冲突的目标。没有PCGrad时,能量效率和平滑度奖励在10000训练步内降至接近零,而进度表现出持续的大幅振荡,证实梯度冲突解决是该领域的一个结构性要求而非可选改进。收敛策略实现了6.5-7.0的进度奖励、0.63-0.65的持续能量效率以及接近最大的平滑度(0.97-0.99),在主目标上比暴力基线有所改进,而两个基线在整个过程中能量效率均为负值。训练揭示了三个涌现行为阶段:在正向流动期间抑制峰值通道速度的集体双层水动力节流编队、利用流动反转进行上游重新定位的周期同步棘轮机制,以及智能体接近成功边界时的个体化最终接近。这些结果表明,时间依赖的流体-智能体相互作用可以直接在多目标强化学习循环中捕获,为生物医学导航、环境监测和工业微流体中的微群集控制提供了基于物理的范式。

英文摘要

Coordinating micro-robotic swarms in realistic, time-dependent fluid environments remains a major challenge for biomedical and environmental applications. We present a hybrid CFD-MO-MARL (Computational Fluid Dynamics-Multi Objective-Multi Agent Reinforcement Learning) framework that couples a high-fidelity incompressible Navier-Stokes solver with decentralized proximal policy optimization to learn swarm control policies in oscillatory flow. Sixteen magnetically actuated micro-robots were simulated to navigate a pulsatile arterial waveform within a 2 mm channel while jointly optimizing upstream progression, energy efficiency, and motion smoothness. Conflicting objectives are resolved using Projected Conflicting Gradient (PCGrad) surgery. Without PCGrad, energy and smoothness rewards collapse during training, demonstrating that gradient conflict resolution is essential for stable multi-objective learning. The converged policy achieves progress rewards of 6.5-7.0, energy efficiency of 0.63-0.65, and smoothness of 0.97-0.99, outperforming brute-force baselines by more than 8 reward units on the primary objective. Training reveals three emergent behaviors not encoded in the reward function: hydrodynamic throttling formations that reduce peak flow velocities, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream movement, and individualized final-approach strategies near the target boundary. These results demonstrate that physically realistic fluid-agent interactions can be integrated directly into multi-objective reinforcement learning, providing a scalable framework for micro-swarm control in biomedical navigation, environmental monitoring, and microfluidic systems.
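
PCGrad itself is a published, general-purpose procedure, and its core step is short enough to sketch: when two objective gradients have a negative dot product, one is projected onto the normal plane of the other before the gradients are summed. The version below works on plain lists and is a minimal illustration; how the paper wires it into decentralized PPO is not described in the abstract and is not reproduced here.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pcgrad(grads, seed=0):
    """Combine per-objective gradients with projected conflicting gradients.

    For each gradient g_i, visit the other gradients in random order and,
    whenever g_i . g_j < 0, remove from g_i its component along g_j.
    The projected gradients are then summed into one update direction.
    """
    rng = random.Random(seed)
    projected = []
    for i, grad in enumerate(grads):
        g = list(grad)
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            gj = grads[j]
            d = dot(g, gj)
            if d < 0:  # conflict; gj is necessarily non-zero here
                scale = d / dot(gj, gj)
                g = [x - scale * y for x, y in zip(g, gj)]
        projected.append(g)
    return [sum(col) for col in zip(*projected)]


# Two conflicting objectives (for example, progress versus energy use).
update = pcgrad([[1.0, 0.0], [-1.0, 1.0]])
```

For the two gradients in the usage line, the first is projected to `[0.5, 0.5]` and the second to `[0.0, 1.0]`, so neither objective's update directly opposes the other.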

8. 无人车、无人机与移动机器人 8 篇

2606.13840 2026-06-15 cs.RO cs.CV 新提交

Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

多智能体具身自动驾驶:从V2X信息交换到共享世界模型

Senkang Hu, Zhengru Fang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang

发表机构 * Lingnan University, Hong Kong(岭南大学(香港))

AI总结 本文综述了从单车智能向多智能体具身系统转变的自动驾驶技术,通过共享世界模型实现感知共享、意图推断和协同规划,并指出了在仿真评估、实时安全保证等方面的研究空白。

AI中文摘要

自动驾驶正从孤立的车辆智能转向多智能体具身系统,这些系统共享感知、推断意图并在不确定性下协调行动。本综述通过共享世界模型(SWMs)的视角审视这一转变:SWMs是跨车辆、基础设施和其他交通参与者维护的预测性跨智能体表征。我们回顾了超过380篇文献,涵盖车联万物(V2X)通信、协同感知、智能体间认知、协同规划、端到端协同驾驶以及用于闭环验证的仿真和数据引擎。核心问题是交换的观测如何成为对齐的状态、意图感知的交互和协调的下游行动。在所调查的文献中,评估仍然集中在仿真、精心设计的基准测试和离线协议上。基于基础模型的协调也缺乏在开放交通中经过验证的实时安全保证。这些空白为多智能体具身自动驾驶(MAEAD)提出了关键研究重点:可验证的共享状态维护、鲁棒的意图和计划对齐,以及在通信、延迟和部署约束下的安全协调行动。

英文摘要

Autonomous driving is shifting from isolated vehicle intelligence toward multi-agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross-agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle-to-everything (V2X) communication, collaborative perception, inter-agent cognition, cooperative planning, end-to-end cooperative driving, and simulation and data engines for closed-loop validation. The organizing question is how exchanged observations become aligned state, intent-aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation-model-based coordination also lacks verified real-time safety guarantees in open traffic. These gaps motivate key research priorities for multi-agent embodied autonomous driving (MAEAD): verifiable shared-state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.

2606.13883 2026-06-15 cs.RO 新提交

Guided Diffusion with Distilled Vision-Language Reliability for Aerial Navigation

基于蒸馏视觉语言可靠性的引导扩散用于空中导航

Ivan Valuev, Iana Zhura, Valerii Serpiva, Didar Seyidov, Dzmitry Tsetserukou

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出一种可靠性感知的扩散规划器,通过蒸馏视觉语言模型生成场景级可靠性热图,引导去噪过程处理不可靠区域,显著降低无人机导航中的障碍物违反率并提高区域可靠性。

AI中文摘要

自主无人机导航通常由将感知、建图和规划分离为不同阶段的流水线解决,这会传播误差、累积延迟,并需要针对特定环境重新调参。端到端生成模型通过将原始观测直接映射到轨迹来消除这些接口,但继承了一个微妙的失败模式:由于在干净数据上训练,它们无法识别观测何时不可靠,并将玻璃、镜子和过曝表面等退化区域视为有效证据进行规划。我们提出了一种用于3D无人机导航的可靠性感知扩散规划器。它以观测和场景级可靠性热图共同作为轨迹生成的条件;该热图标记了感知不可信的区域,由一个轻量级网络生成,该网络在实时规划预算内蒸馏了视觉语言模型的开放词汇推理能力。为了无需重新训练即可泛化到未见环境,我们使用可微的两阶段ESDF代价引导去噪过程,该代价将来自深度的物理障碍和来自高度不可靠区域的虚拟障碍同等对待。在仿真和真实四旋翼飞行器上,我们的规划器比最先进的扩散基线产生了明显更安全的轨迹,将障碍物违反率从40.3%降低到9.6%,并将穿越区域的平均可靠性从0.588提高到0.925。仅消融可靠性项会使平均可靠性从0.898降至0.783,确认了其决定性作用,而蒸馏使框架的运行速度最高可达完整视觉语言模型的2倍。

英文摘要

Autonomous UAV navigation is conventionally solved by pipelines that separate perception, mapping, and planning into distinct stages, which propagates errors, accumulates latency, and requires environment-specific retuning. End-to-end generative models remove these interfaces by mapping raw observations directly to trajectories, but inherit a subtle failure mode: trained on clean data, they cannot recognise when an observation is unreliable, and treat degraded regions such as glass, mirrors, and overexposed surfaces as valid evidence for planning. We present a reliability-aware diffusion planner for 3D UAV navigation. It conditions trajectory generation on the observation together with a scene-level reliability heatmap that marks where perception cannot be trusted, produced by a lightweight network that distils the open-vocabulary reasoning of a vision-language model within the real-time planning budget. To generalise to unseen environments without retraining, we steer the denoising process with a differentiable two-stage ESDF cost that treats physical obstacles from depth and virtual obstacles from highly unreliable regions on equal footing. In simulation and on a real quadrotor, our planner produces markedly safer trajectories than a state-of-the-art diffusion baseline, reducing the obstacle-violation rate from 40.3% to 9.6% and raising the mean reliability of traversed regions from 0.588 to 0.925. Ablating the reliability term alone drops mean reliability from 0.898 to 0.783, confirming it as the decisive component, while distillation runs the framework up to 2 times faster than the full vision-language model.
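
The idea of treating unreliable regions as virtual obstacles can be sketched with a simple clearance cost. The paper uses a differentiable two-stage ESDF cost; the brute-force distance computation, the reliability threshold, and the names below are stand-in assumptions chosen only to show how both obstacle types enter one cost.

```python
import math


def virtual_obstacles(reliability_points, threshold=0.3):
    """Keep the 3D points whose reliability score falls below the threshold."""
    return [point for point, score in reliability_points if score < threshold]


def clearance_cost(trajectory, obstacles, safe_dist=1.0):
    """Quadratic penalty for every waypoint closer than safe_dist to an obstacle."""
    cost = 0.0
    for waypoint in trajectory:
        for obstacle in obstacles:
            d = math.dist(waypoint, obstacle)
            if d < safe_dist:
                cost += (safe_dist - d) ** 2
    return cost


def guidance_cost(trajectory, physical, reliability_points, safe_dist=1.0):
    """Physical and virtual obstacles are concatenated and penalized identically."""
    return clearance_cost(trajectory,
                          physical + virtual_obstacles(reliability_points),
                          safe_dist)
```

In a guided diffusion planner, the gradient of such a cost with respect to the waypoints would be used to nudge each denoising step away from both kinds of obstacle.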

2606.14032 2026-06-15 cs.RO 新提交

From Attacks to Curricula: Learnability-Guided Adversarial Training for Safe Autonomous Driving

从攻击到课程:面向安全自动驾驶的可学习性引导对抗训练

Yuewen Mei, Tong Nie, Jie Sun, Haotian Shi, Wei Ma, Jian Sun

发表机构 * College of Transportation & Key Laboratory of Road and Traffic Engineering of Ministry of Education, Tongji University(同济大学交通运输工程学院 & 道路与交通工程教育部重点实验室) Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University(香港理工大学土木与环境工程学系)

AI总结 提出AlignADV框架,通过偏好对齐生成可解决场景,并利用行为指纹预测策略能力,动态采样课程以提升自动驾驶对抗训练的收敛效率与安全性。

AI中文摘要

闭环对抗训练通过将策略暴露于罕见的安全关键场景来提高自动驾驶安全性。标准流程首先生成对抗场景,然后采样用于策略优化。然而,大多数现有框架仍以攻击为导向:碰撞驱动的生成器常合成无法解决的极端情况,这可能导致学习退化;而启发式采样器忽略驾驶策略的演化能力,导致样本效率低下和收敛延迟。我们提出AlignADV,一个可学习性引导的闭环对抗训练框架,将对抗场景转化为可解决且与能力对齐的课程。首先,我们将对抗场景生成重新表述为偏好对齐问题,并采用直接偏好优化引导生成器朝向关键但可解决的场景。其次,我们引入行为指纹来捕捉演化策略的内在特征,并构建多模态能力预测模型,无需昂贵的闭环模拟即可估计策略性能。通过结合可解决性对齐场景与能力预测,AlignADV开发了动态课程采样机制,优先针对当前策略弱点的场景。在Waymo开放运动数据集上的实验表明,AlignADV提高了收敛效率和最终性能,与基线方法相比,训练步骤减少高达40.6%,同时在正常和对抗交通条件下降低了碰撞率并提高了路线完成率。这些结果强调了从攻击导向的场景生成向可学习性引导的策略改进的转变,为更安全、更高效的自动驾驶训练提供了原则性方向。项目页面:https://meiyuewen.github.io/AlignADV/。

英文摘要

Closed-loop adversarial training improves autonomous driving safety by exposing policies to rare safety-critical scenarios. Standard pipelines first generate adversarial scenarios and then sample them for policy optimization. However, most existing frameworks remain attack-oriented: collision-driven generators often synthesize unsolvable extreme situations, which can degrade learning, while heuristic samplers ignore the evolving capability of the driving policy, causing sample inefficiency and delayed convergence. We propose AlignADV, a learnability-guided closed-loop adversarial training framework that converts adversarial scenarios into resolvable and capability-aligned curricula. First, we reformulate adversarial scenario generation as a preference alignment problem and employ direct preference optimization to guide the generator toward critical yet resolvable scenarios. Second, we introduce behavioral fingerprints to capture the intrinsic characteristics of the evolving policy and construct a multi-modal capability prediction model that estimates policy performance without expensive closed-loop simulations. By combining resolvability-aligned scenarios with capability predictions, AlignADV develops a dynamic curriculum sampling mechanism that prioritizes scenarios targeting the current policy's vulnerabilities. Experiments on the Waymo Open Motion Dataset demonstrate that AlignADV improves convergence efficiency and final performance, reducing training steps by up to 40.6 percent compared with baseline methods while lowering collision rate and improving route completion under both normal and adversarial traffic conditions. These results highlight a shift from attack-oriented scenario generation to learnability-guided policy improvement, offering a principled direction for safer and more efficient autonomous driving training. Project page: https://meiyuewen.github.io/AlignADV/.
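
Direct preference optimization, which the abstract uses to steer the scenario generator, has a compact standard loss for a single preference pair. The sketch below shows that loss only; treating a "critical yet resolvable" scenario as the chosen sample and an unsolvable one as the rejected sample is an assumption about how the pairs might be formed, and the argument names are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO loss -log(sigmoid(margin)) for one preference pair.

    The margin compares how much the trained generator, relative to a frozen
    reference generator, prefers the chosen scenario over the rejected one.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the generator and the reference agree, the loss equals log 2; it falls as the generator shifts probability toward the preferred scenario.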

2606.14216 2026-06-15 cs.RO cs.SY eess.SY 新提交

Short-Horizon Position Accuracy of Single-Track Models: Implications for Motion Planning of Autonomous Vehicles

单轨模型的短时位置精度:对自动驾驶车辆运动规划的启示

Aron J. Aertssen, Lars A. T. H. van Alen, Igo J. M. Besselink, Rudolf G. M. Huisman, René M. J. G. van de Molengraft

发表机构 * Department of Mechanical Engineering, Eindhoven University of Technology(埃因霍温理工大学机械工程系) Safety & Driver Controls Group, Vehicle Development, DAF Trucks N.V.(DAF卡车公司车辆开发部安全与驾驶员控制组)

AI总结 本文通过实车实验对比三种单轨车辆模型的短时位置精度,分析模型复杂度、参数化质量与位置精度的权衡,为模型预测控制中的模型选择提供依据。

Comments Submitted to The International Journal of Automotive Engineering, Official Journal of the Society of Automotive Engineers of Japan, Inc. (JSAE)

AI中文摘要

准确且计算高效的车辆模型对于自动驾驶车辆的运动规划至关重要,其中位置精度直接影响轨迹可行性和安全性。然而,位置精度尚未针对实际测量进行系统评估。因此,本文通过多种驾驶操作下的车辆测量,比较了三种单轨车辆模型的短时位置精度。模型参数通过使用仪器化测试车辆的专用实验进行识别。本文旨在提供对模型复杂度、参数化质量和位置精度之间权衡的洞察,以便在模型预测控制应用中做出明智的模型选择,而非确定单一最佳模型。

英文摘要

Accurate and computationally efficient vehicle models are essential for motion planning of autonomous vehicles, where positional accuracy directly affects trajectory feasibility and safety. However, the positional accuracy has not been systematically evaluated against real measurements. Therefore, this paper compares the short-horizon positional accuracy of three single-track vehicle models against vehicle measurements across various driving maneuvers. Model parameters are identified through dedicated experiments with the instrumented test vehicle. Rather than identifying a single best model, this work aims to provide insight into the trade-offs between model complexity, parameterization quality, and positional accuracy for informed model selection in Model Predictive Control applications.
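
A short-horizon position comparison can be illustrated with the simplest member of the single-track family, the kinematic single-track (bicycle) model. The abstract does not say which three models the paper compares, so this choice, the explicit Euler integration, and the parameter names are assumptions made for illustration.

```python
import math


def rollout(x, y, yaw, speed, steer, l_f, l_r, dt, steps):
    """Integrate the kinematic single-track model at constant speed and steering.

    l_f and l_r are the distances from the center of gravity to the front
    and rear axle. Returns the list of (x, y, yaw) states after each step.
    """
    path = []
    for _ in range(steps):
        # Side-slip angle at the center of gravity.
        beta = math.atan(l_r / (l_f + l_r) * math.tan(steer))
        x += speed * math.cos(yaw + beta) * dt
        y += speed * math.sin(yaw + beta) * dt
        yaw += speed / l_r * math.sin(beta) * dt
        path.append((x, y, yaw))
    return path


def final_position_error(path, measured_xy):
    """Euclidean distance between the predicted and measured end position."""
    x, y, _ = path[-1]
    return math.hypot(x - measured_xy[0], y - measured_xy[1])
```

Evaluating `final_position_error` over many measured maneuvers and several horizon lengths gives the kind of accuracy-versus-complexity comparison the abstract describes.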

2606.14219 2026-06-15 cs.RO cs.AI 新提交

Selective Agentic Recovery for UAV Autonomy with a Persistent Mission Runtime

面向无人机自主性的选择性代理恢复与持久任务运行时

Taewoo Park, Kyeonghyun Yoo, Seunghyun Yoo, Hwangnam Kim

发表机构 * Department of Electrical and Electronic Engineering, Korea University(高丽大学电气与电子工程系)

AI总结 提出持久任务运行时(PMR)框架,通过选择性调用外部代理推理器实现无人机恢复,引入学习型调用认知价值(learned-CVI)门控机制,在Gazebo/PX4基准测试中将硬/模糊场景成功率从5.0%提升至95.0%,同时减少16.7%的远程调用和29.2%的令牌消耗。

Comments 17 pages, 2 figures. Preprint

AI中文摘要

代理AI可以通过在基于航点或设定点的局部执行遇到阻塞路径、重复无进展行为或任务级模糊时提供高层恢复推理来支持无人机自主性。然而,在物理无人机上,远程推理只有在选择性调用时最有用,因为每次调用都会引入延迟、资源成本、后端不确定性以及验证返回决策的需求。本文提出持久任务运行时(PMR),一种无人机恢复框架,它保持任务循环和安全关键执行在本地,同时仅将外部代理推理器用作按需恢复模块。推理器从预定义的恢复技能中选择,每个返回的决策在影响飞行之前经过解析、验证、安全过滤并映射到本地执行器动作。PMR引入了学习型调用认知价值(learned-CVI),一种紧凑的准入门控,用于估计远程代理推理何时可能改善近期任务进展以证明其操作成本合理。在包含八个场景的固定400次运行Gazebo/PX4基准测试中,learned-CVI将硬/模糊场景成功率从仅本地的5.0%提升至95.0%,优于一次性推理和周期性推理基线分别20.0和32.5个百分点,并且相对于手动调整的基于规则的调用基线,减少了16.7%的远程代理调用和29.2%的日志令牌。

英文摘要

Agentic AI can support unmanned aerial vehicle (UAV) autonomy by providing high-level recovery reasoning when local waypoint- or setpoint-based execution encounters blocked passages, repeated no-progress behavior, or mission-level ambiguity. On physical UAVs, however, remote reasoning is most useful when it is invoked selectively, since each call introduces latency, resource cost, backend uncertainty, and a need to validate the returned decision. This paper presents Persistent Mission Runtime (PMR), a UAV recovery framework that keeps the mission loop and safety-critical execution local while using an external agentic reasoner only as an on-demand recovery module. The reasoner selects from predefined recovery skills, and each returned decision is parsed, verified, safety-filtered, and mapped to local executor actions before it can affect flight. PMR introduces learned Cognitive Value of Invocation (learned-CVI), a compact admission gate that estimates when remote agentic reasoning is likely to improve near-term mission progress enough to justify its operational cost. Across a fixed 400-run Gazebo/PX4 benchmark with eight scenarios, learned-CVI raises hard/ambiguous-regime success from 5.0% under local-only autonomy to 95.0%, outperforms one-shot and periodic reasoning baselines by 20.0 and 32.5 percentage points, and reduces remote-agent calls by 16.7% and logged tokens by 29.2% relative to a manually tuned rule-based invocation baseline.
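The two mechanisms described above, an admission gate and a validation step before any decision reaches the flight stack, can be sketched as follows. This is a minimal illustration, not the PMR implementation: the skill names, the JSON reply format, the altitude check, and the idea of feeding the gate a precomputed `predicted_gain` (standing in for the learned-CVI estimate) are all assumptions.

```python
import json

# Hypothetical set of predefined recovery skills.
RECOVERY_SKILLS = {"backtrack", "hover_and_rescan", "reroute"}

def should_invoke(predicted_gain, call_cost, margin=0.0):
    """Admit a remote call only if the estimated progress gain outweighs its cost."""
    return predicted_gain - call_cost > margin

def validate_decision(raw_reply, min_altitude_m=2.0):
    """Parse and check a reasoner reply; return the decision dict, or None if rejected."""
    try:
        decision = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None  # unparseable reply never reaches the executor
    if not isinstance(decision, dict):
        return None
    if decision.get("skill") not in RECOVERY_SKILLS:
        return None  # only predefined skills are allowed
    altitude = decision.get("altitude_m")
    if not isinstance(altitude, (int, float)) or altitude < min_altitude_m:
        return None  # simple stand-in for a safety filter
    return decision
```

A rejected reply simply leaves control with the local mission loop, which matches the abstract's point that safety-critical execution stays local.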

2606.14609 2026-06-15 cs.RO 新提交

Safe Reinforcement Learning of Autonomous Highway Driving: A Unified Framework for Safety and Efficiency

高速公路自动驾驶的安全强化学习:安全与效率的统一框架

Chufei Yan, Zhihao Cui, Yiyan Lv, Taojie Chen, Ning Bian, Yulei Wang

发表机构 * School of Physics, Northeast Normal University(东北师范大学物理学院) Clean Energy Automotive Engineering Center, School of Automotive Studies, Tongji University(同济大学汽车学院清洁能源汽车工程中心) Mengshi Automobile Technology Company, Dongfeng Motor Corporation(东风汽车公司猛士汽车技术公司)

AI总结 提出MoE-RM-SRL框架,通过安全距离、奖励机器和混合专家机制,在训练和部署中同时保证安全与效率,在CARLA和VR平台实验中优于现有方法。

Comments 20 pages, 5 figures, 7 tables. Preprint version

AI中文摘要

深度强化学习(DRL)为高级自动驾驶车辆(AV)的决策提供了一条引人注目的途径,但其试错特性使得在训练过程中难以保证安全性,并在部署时难以同时实现安全与效率。我们提出了一个统一的安全强化学习(SRL)框架,该框架集成了安全距离(SD)、奖励机器(RM)和混合专家(MoE),称为MoE-RM-SRL。在部署中,SD和RM共同塑造了一个规则感知的奖励,编码了高速公路交通规则和阶段目标,从而在不牺牲效率的情况下实现安全可靠的行为。在训练中,我们引入了一个稀疏门控的MoE层,包含多达11个深度Q网络(DQN);基于SD的门控规则激活一组最小的专家用于车道保持和车道变换,减轻了在不同控制器(如MPC/基于规则的模块和学习策略)之间切换时常见的不稳定性、不连续性和脉冲瞬态。我们在CARLA中实现了所提出的架构,并将其与一个6自由度驾驶员在环虚拟现实(DiL-VR)平台集成。在随机双车道交通中的实验表明,MoE-RM-SRL在安全性和效率上显著优于最先进的基线,并且该框架自然地扩展到多车道驾驶以及匝道合流和驶出场景。

英文摘要

Deep reinforcement learning (DRL) offers a compelling route to decision-making for advanced autonomous vehicles (AVs), yet its trial-and-error nature makes it difficult to guarantee safety during training and to achieve both safety and efficiency at deployment. We propose a unified safe reinforcement learning (SRL) framework that integrates safe distance (SD), reward machines (RM), and mixture-of-experts (MoE), termed MoE-RM-SRL. For deployment, SD and RM jointly shape a rule-aware reward that encodes highway traffic regulations and stage-wise objectives, enabling safe and reliable behavior without sacrificing efficiency. For training, we introduce a sparsely gated MoE layer comprising up to 11 deep Q-networks (DQNs); an SD-based gating rule activates a minimal set of experts for lane-keeping and lane-changing, mitigating the instability, discontinuities, and impulsive transients commonly induced by switching between heterogeneous controllers (e.g., MPC/rule-based modules and learned policies). We implement the proposed architecture in CARLA and integrate it with a 6-DoF driver-in-the-loop virtual-reality (DiL-VR) platform. Experiments in stochastic two-lane traffic show that MoE-RM-SRL substantially improves safety and efficiency over state-of-the-art baselines, and the framework naturally extends to multi-lane driving as well as on-ramp merging and exiting scenarios.
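To make the idea of safe-distance gating concrete, here is a small sketch in which a textbook stopping-distance formula decides which expert is active. The abstract does not give the paper's SD definition or its gating rule, so the formula, the parameter values, and the expert names below are assumptions.

```python
def safe_distance(ego_speed, lead_speed, reaction_time=1.0, max_decel=6.0, standstill_gap=2.0):
    """Gap [m] needed for the ego vehicle to stay behind a braking lead vehicle.

    Textbook form: standstill gap + reaction distance + difference in braking distances.
    """
    braking_term = (ego_speed ** 2 - lead_speed ** 2) / (2.0 * max_decel)
    return standstill_gap + ego_speed * reaction_time + max(0.0, braking_term)

def active_experts(gap, ego_speed, lead_speed):
    """Sparse gate: activate only the expert relevant to the current gap."""
    if gap >= safe_distance(ego_speed, lead_speed):
        return ["lane_keeping"]
    return ["lane_changing"]
```

With both vehicles at 20 m/s the required gap is 22 m under these assumed parameters, so a 30 m gap keeps the lane-keeping expert active while a 10 m gap hands over to the lane-changing expert.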

2606.13794 2026-06-15 eess.SY cs.AI cs.RO cs.SY 交叉投稿

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

过驱动飞行器的可解释控制效能学习与非线性控制分配集成方法

Umut Demir, Aamir Ahmad, Walter Fichter

发表机构 * University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)(斯图加特大学航空航天工程与大地测量学院飞行力学与控制研究所)

AI总结 提出一种基于稀疏非线性动力学辨识的学习控制效能映射方法,结合在线自适应机制,实现过驱动飞行器的高效非线性控制分配,兼具可解释性和低计算成本。

AI中文摘要

非线性动力学以及多个执行器之间产生的强耦合削弱了传统线性控制分配技术背后的假设。当飞行进入非线性效应主导的工况时,线性分配器因模型失配增加而精度下降,进而降低飞行控制系统的性能和鲁棒性。高保真机载模型和黑箱数据驱动方法可以在整个飞行包线内恢复精度,但前者带来实时分配难以承受的计算负担,后者牺牲了验证和故障诊断所需的可解释性。本文通过使用稀疏非线性动力学辨识从代表性飞行数据中学习显式的、受物理约束的控制效能映射解析模型,解决了这些限制。所得映射紧凑、可解释,并具有解析导数,从而能够在同时考虑执行器动力学的非线性求解器中高效计算,且无需机载模型。在线自适应机制监控预测残差,并在检测到被控对象发生显著变化时刷新模型,从而在执行器故障和工况变化下实现平稳重构。该方法在一款高保真非线性基准飞行器上经过一系列激进机动评估,达到了与完整非线性机载模型相当的精度,同时相对于现有基线显著降低了计算成本。

英文摘要

Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High-fidelity onboard models and black-box data-driven approaches can recover accuracy across the flight envelope, but the former impose computational burdens prohibitive for real-time allocation and the latter sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics-constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high-fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.
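Sparse Identification of Nonlinear Dynamics (commonly abbreviated SINDy) is usually solved with sequentially thresholded least squares. The sketch below applies that generic procedure to a synthetic control-effectiveness example. The candidate library, the toy relation `moment = 2.0*delta - 0.5*alpha*delta`, and the threshold are assumptions made for illustration; the paper's actual library and physics constraints are not given in the abstract.

```python
import numpy as np

def stlsq(library, targets, threshold=0.1, iterations=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit the rest."""
    coeffs = np.linalg.lstsq(library, targets, rcond=None)[0]
    for _ in range(iterations):
        small = np.abs(coeffs) < threshold
        coeffs[small] = 0.0
        for k in range(targets.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                coeffs[keep, k] = np.linalg.lstsq(
                    library[:, keep], targets[:, k], rcond=None)[0]
    return coeffs

# Synthetic data: angle of attack (alpha) and effector deflection (delta).
rng = np.random.default_rng(0)
alpha = rng.uniform(-0.3, 0.3, 200)
delta = rng.uniform(-0.5, 0.5, 200)

# Candidate terms: 1, alpha, delta, alpha*delta, delta^2.
library = np.column_stack([np.ones_like(alpha), alpha, delta, alpha * delta, delta ** 2])
moment = (2.0 * delta - 0.5 * alpha * delta).reshape(-1, 1)

coeffs = stlsq(library, moment, threshold=0.05)
```

Because the identified model is a short sum of known terms, its derivative with respect to `delta` can be written down by hand, which is the property the abstract highlights for use inside nonlinear solvers.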

2503.14331 2026-06-15 cs.RO cs.CV cs.SY eess.SY 版本更新

ADAPT: An Autonomous Forklift for Construction Site Operation

ADAPT:一种用于建筑工地作业的自主叉车

Johannes Huemer, Markus Murschitz, Matthias Schörghuber, Lukas Reisinger, Thomas Kadiofsky, Christoph Weidinger, Mario Niedermeyer, Benedikt Widy, Marcel Zeilinger, Csaba Beleznai, Tobias Glück, Andreas Kugi, Patrik Zips

发表机构 * Center for Vision, Automation and Control(视觉、自动化与控制中心) AIT Austrian Institute of Technology GmbH(奥地利技术研究所) Automation and Control Institute(自动化与控制研究所) Technische Universität Wien(维也纳技术大学)

AI总结 提出ADAPT自主叉车,结合AI感知与经典方法,在非结构化建筑工地实现近人类水平的物流操作,提升安全与效率。

AI中文摘要

高效的物料物流在控制建筑行业的成本和进度中起着关键作用。然而,人工物料搬运仍然容易出现效率低下、延误和安全风险。自主叉车提供了一种有前景的解决方案,以简化现场物流,减少对人类操作员的依赖并缓解劳动力短缺。本文介绍了ADAPT(自主动态全地形托盘运输车)的开发与评估,这是一种专为建筑环境设计的全自主越野叉车。与结构化的仓库环境不同,建筑工地面临重大挑战,包括动态障碍物、非结构化地形和多变的天气条件。为应对这些挑战,我们的系统将AI驱动的感知技术与传统的决策、规划和控制方法相结合,实现了在复杂环境中的可靠操作。我们通过广泛的真实世界测试验证了该系统,并在各种天气条件下将其连续性能与经验丰富的人类操作员进行了比较。我们的研究结果表明,自主户外叉车可以达到接近人类水平的性能,为更安全、更高效的建筑物流提供了一条可行路径。

英文摘要

Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of ADAPT (Autonomous Dynamic All-terrain Pallet Transporter), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its continuous performance against an experienced human operator across various weather conditions. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.

9. 软体机器人与硬件设计 2 篇

2606.13746 2026-06-15 cs.RO 新提交

Scalable Dynamic Tactile Sensing Enabled by Passive and Flexible Acoustic Waveguides

可扩展动态触觉传感:基于被动柔性声波导

Guimin Long, Changhong Linghu, Chuanping Liu, Ke Xu, Xingjian Jing

发表机构 * Department of Mechanical Engineering, City University of Hong Kong(香港城市大学机械工程系)

AI总结 提出一种基于深亚波长声波导的被动分布式触觉传感范式,通过弹性膜帽亥姆霍兹谐振器和弹簧增强微管网络实现弯曲不变性,结合稀疏麦克风阵列与轻量神经网络,在4个麦克风64节点阵列中实现4mm空间分辨率和>99%定位精度,支持低频信号波形重建,并展示指尖阵列、触觉手套和大面积皮肤等原型。

Comments 40 pages, 6 figures

AI中文摘要

人工动态触觉传感需要灵敏度、鲁棒性和柔顺性,但现有技术在大面积阵列扩展时面临权衡,加上布线复杂性和成本。本文报告了一种使用深亚波长声波导的被动分布式范式,将性能与结构柔性解耦。弹性膜帽封装的亥姆霍兹谐振器由弹簧增强微管互连,形成封闭网络,在宏观弯曲下保持声学传输不变。通过稀疏嵌入麦克风,系统实现了低频信号(<100 Hz)的实时定位(4 mm最高空间分辨率;4个麦克风64节点传感阵列中准确率>99%)和波形重建。快速连续小波变换和轻量神经网络可在5.5 ms内完成推理。我们展示了适形原型——指尖阵列、触觉手套和大面积皮肤——可检测从单根头发接触到5 mg颗粒撞击、动脉脉搏波、羽毛触摸和手指接触的刺激。这为下一代人机界面建立了一种可扩展、灵活、低成本的范式。

英文摘要

Artificial dynamic tactile sensing requires sensitivity, robustness, and compliance, yet existing technologies face trade-offs when scaling to large-area arrays, compounded by wiring complexity and cost. Here, we report a passive distributed paradigm using deep sub-wavelength acoustic waveguides that decouples performance from structural flexibility. Elastic-membrane-capped Helmholtz resonators interconnected by spring-reinforced microtubes form an enclosed network with invariant acoustic transmission under macroscopic bending. By sparsely embedding microphones, the system achieves real-time localization (4 mm highest spatial resolution; >99% accuracy in a 4-microphone, 64-node sensing array) and waveform reconstruction of low-frequency signals (<100 Hz). Fast Continuous Wavelet Transform and a lightweight neural network enable inference within 5.5 ms. We demonstrate conformable prototypes (fingertip arrays, a tactile glove, and large-area skins) that detect stimuli ranging from single-hair contact to 5-mg particle impacts, arterial pulse waves, feather touches, and finger contact. This establishes a scalable, flexible, low-cost paradigm for next-generation human-machine interfaces.

2606.14070 2026-06-15 cs.RO 新提交

Development of a 3 in Sewer Pipe Inspection Robot with an Articulated Differential Mechanism using X-shaped Linkages

基于X形连杆铰接差动机构的3英寸下水道管道检测机器人开发

Shoya Umemura, Ryota Taniguchi, Atsushi Kakogawa

发表机构 * Ritsumeikan University(立命馆大学)

AI总结 提出一种改进的3英寸下水道管道检测机器人,通过铰接差动机构提升牵引力和越障能力,并设计基于驱动轮电流检测的线缆松弛控制方法,实验验证了其越障性能。

Comments The 23rd International Conference on Ubiquitous Robots (UR 2026), 15-18 July, Osaka Ibaraki Campus, Ritsumeikan University, Ibaraki, Osaka, Japan

AI中文摘要

本文提出了一种改进版的3英寸下水道管道检测机器人,配备紧急撤离机构。第一版中牵引力低和越障能力差的问题,通过简单连接推进单元得到了改善。耦合的推进单元具有差动机构,能够通过单根线缆实现姿态变化,从而适应管道直径变化。为了穿越管道接头等障碍物,设计了一种控制方法,通过驱动轮电机的电流负载检测障碍物接触并放松线缆。该方法通过模拟管道实验进行了验证,并利用驱动轮的电流波形进行了负载比较。我们提出的控制方法显著提高了新型铰接式机器人的越障能力。

英文摘要

This paper proposes an improved version of the 3-inch sewer pipe inspection robot equipped with an emergency evacuation mechanism. The low traction force and poor step-over capability, which were challenges of the first version, have been improved by simply connecting the propulsion units. The coupled propulsion units feature a differential mechanism capable of posture changes via a single wire, enabling adaptation to pipe diameter variations. To traverse obstacles like pipe joints, a control method was devised that detects obstacle contact through current load on the drive wheel motors and slackens the wire. This method was verified through simulated pipe experiments. Load comparisons were made using current waveforms applied to the drive wheels. Our proposed control method significantly improved the step-over capability of the new articulated robot.
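One simple way to turn a motor-current signal into a contact event is a debounced threshold, sketched below. The abstract only states that contact is detected from the current load; the threshold value, the consecutive-sample debounce, and the function name are assumptions, not the authors' controller.

```python
def detect_contact(currents_a, threshold_a, min_samples=3):
    """Return the index at which current has exceeded the threshold for
    `min_samples` consecutive samples, or None if that never happens."""
    run = 0
    for i, current in enumerate(currents_a):
        run = run + 1 if current > threshold_a else 0
        if run >= min_samples:
            return i  # a controller could slacken the wire from this sample on
    return None
```

Requiring several consecutive samples keeps a single current spike from triggering a wire release.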

10. 仿真、数据集与评测 7 篇

2606.13877 2026-06-15 cs.RO 新提交

ContactWorld: What Matters in Vision-Tactile World Models for Contact-Rich Manipulation

ContactWorld: 视觉-触觉世界模型中什么对接触丰富操作至关重要

Zhiyuan Zhang, Pokuang Zhou, Kaidi Zhang, Adeesh Desai, Temitope Amosa, Davood Soleymanzadeh, Jiuzhou Lei, Minghui Zheng, Yu She

发表机构 * School of Industrial Engineering, Purdue University(普渡大学工业工程学院) Department of Mechanical Engineering, Texas A&M University(德克萨斯农工大学机械工程系)

AI总结 通过12项接触丰富操作任务,发现空间结构化和时间连续的表征(如点云)能显著提升规划成功率,且触觉传感的有效性依赖于跨模态表征兼容性。

Comments 32 pages, 12 figures, supplementary material included

AI中文摘要

接触丰富操作需要世界模型从多模态感官观测中推理复杂的接触动力学。然而,哪些表征属性从根本上支持接触丰富环境下的稳定长时域规划仍不清楚。在本文中,我们提出了ContactWorld,一个涵盖12项接触丰富操作任务(包括插入、拆卸、拧紧和探索性交互)的基准和系统性实证研究。通过大量实验,我们发现同时具有空间结构化和时间连续性的表征始终能实现最强的规划性能。特别地,点云观测将平均规划成功率从腕部视角观测的20.7%和前方视角观测的22.0%提升至32.1%。我们进一步发现,触觉传感的有效性关键取决于跨模态表征兼容性,而非仅模态规模。将点云观测与保留更丰富空间结构和交互动力学的触觉力场表征相结合,进一步将性能提升至36.1%,在所有评估任务中实现了最强的整体规划性能。此外,在长时域规划目标下,触觉传感变得越来越重要,因为复合预测误差和接触不确定性随时间累积。总之,这些发现强调了表征结构、多模态兼容性和长时域鲁棒性在面向接触丰富机器人操作的视觉-触觉世界模型中的重要性。

英文摘要

Contact-rich manipulation requires world models to reason over complex contact dynamics from multimodal sensory observations. However, it remains unclear which representation properties fundamentally support stable long-horizon planning in contact-rich settings. In this paper, we present ContactWorld, a benchmark and systematic empirical study of vision-tactile world models spanning 12 contact-rich manipulation tasks, including insertion, disassembly, screwing, and exploratory interaction. Across extensive experiments, we find that representations that are both spatially structured and temporally continuous consistently achieve the strongest planning performance. In particular, point-cloud observations improve average planning success rates from 20.7% with wrist-view observations and 22.0% with front-view observations to 32.1%. We further find that the effectiveness of tactile sensing depends critically on cross-modal representation compatibility rather than modality scaling alone. Combining point-cloud observations with tactile force-field representations, which preserve richer spatial structure and interaction dynamics, further improves performance to 36.1%, yielding the strongest overall planning performance across all evaluated tasks. Moreover, tactile sensing becomes increasingly important under long-horizon planning objectives, where compounding prediction errors and contact uncertainty accumulate over time. Together, these findings highlight the importance of representation structure, multimodal compatibility, and long-horizon robustness in vision-tactile world models for contact-rich robotic manipulation.

2606.14058 2026-06-15 cs.RO 新提交

ReactSim-Bench: Benchmarking Reactive Behavior World Model Simulation in Autonomous Driving

ReactSim-Bench:自动驾驶中反应性行为世界模型模拟的基准测试

Zhiyuan Zhang, Yanlun Peng, Jianing Zhang, Xianda Guo, Zehan Huang, Haoran Liu, Qifeng Li, Shaofeng Zhang, Xiaosong Jia, Junchi Yan

发表机构 * School of Computer Science & School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学计算机科学与技术学院、人工智能学院) Great Wall Motor(长城汽车) Institute of Trustworthy Embodied AI (TEAI), Fudan University(复旦大学可信具身人工智能研究所) School of Computer Science, Wuhan University(武汉大学计算机学院) University of Science and Technology of China(中国科学技术大学)

AI总结 提出ReactSim-Bench,通过解耦自车与周围智能体控制,使用偏离日志的自车行为作为输入,评估行为世界模型模拟的反应性能力,并基于碰撞、地图和运动学指标系统评测多种模型。

AI中文摘要

反应能力是自动驾驶仿真系统中数据驱动行为世界模型模拟器的一个关键特性。具备这种能力,模拟世界中的智能体能够对不同于日志的自车行为做出可行的响应。然而,现有的行为仿真基准测试并未直接衡量反应能力。它们通常让模拟器联合控制自车和周围智能体,并通过日志相似性或开环预测指标来评估真实性。在这项工作中,我们引入了ReactSim-Bench,用于评估自动驾驶中行为世界模型模拟的反应能力。我们将智能体和自车的控制解耦,使用偏离日志的自车行为作为独立的自车输入,要求智能体做出响应。为了获得这些自车行为,我们构建了一个流程,使用自车规划器模型生成候选行为,并通过规则和人工验证筛选数据。采用碰撞指标、基于地图的指标和运动学可行性指标来评估反应性响应的安全性和规则合规性。我们构建了包含三个类别的2,636个测试场景,并对多种架构的最先进模型进行了系统评估,包括基于Transformer、扩散和下一令牌预测的模型。我们进一步分析了重新规划频率对性能的影响,并为未来研究提供了见解。

英文摘要

Reactive capability is a key property of data-driven behavior world model simulators for autonomous driving simulation systems. With this capability, simulated world agents can respond feasibly to autonomous vehicle (AV) behaviors that differ from the log. However, existing behavior simulation benchmarks do not directly measure reactive capability. They often let the simulator jointly control the AV and surrounding agents and evaluate realism through log similarity or open-loop prediction metrics. In this work, we introduce ReactSim-Bench for evaluating the reactive capability of behavior world model simulation in autonomous driving. We decouple the control of agents and the AV, using AV behaviors that differ from the log and require agents to respond as independent AV inputs. To obtain these AV behaviors, we construct a pipeline that uses an AV planner model to generate candidate behaviors and filters the data using rules and manual verification. Collision metrics, map-based metrics, and kinematic feasibility metrics are used to evaluate the safety and rule compliance of reactive responses. We construct 2,636 test scenarios with three categories and conduct a systematic evaluation of state-of-the-art models across multiple architectures, including Transformer-based, diffusion-based, and next-token-prediction-based models. We further analyze how replan frequency affects performance and provide insights for future studies.

2606.14433 2026-06-15 cs.RO 新提交

Kine2Go: Kinematic dataset for the Unitree Go2 robot with diverse gaits and motions

Kine2Go: 面向Unitree Go2机器人的多步态运动学数据集

Władysław Pałucki, Paweł Siwak, Krzysztof Ciebiera, Marek Cygan

发表机构 * University of Warsaw(华沙大学)

AI总结 为降低四足机器人研究门槛,提出Kine2Go数据集,包含800条来自40种策略的Unitree Go2机器人步态运动学轨迹数据,通过强化学习训练策略并收集鲁棒的运动学与电机动作数据。

Comments 9 pages, 6 figures

AI中文摘要

近年来,机器人技术的普及以及机器人硬件成本的稳步下降,降低了机器人研究的入门门槛,推动了该领域的快速发展。一个典型例子是Unitree Go2四足机器人,它常被研究人员用于运动、导航、控制等领域。许多研究人员将Go2机器人与模仿学习、强化学习和行为克隆等技术结合,使机器学习系统能够完全控制机器人。同时,这些技术中的许多需要包含机器人运动学信息和施加于电机的动作的演示数据。获取此类数据十分困难,需要构建复杂流程,且耗时较长。为帮助此类工作,我们提出了Kine2Go,一个面向Unitree Go2机器人的数据集,包含800条多样化的步态运动学轨迹,源自40种不同的策略。我们的流程接受来自各种四足形态的数据,并将其转换为Go2兼容格式。然后我们使用强化学习训练跟随给定运动的策略,最后从这些策略中收集数据,从而获得鲁棒的、带有扰动的运动学数据及相应的电机级动作。

英文摘要

The recent popularity of robotics, combined with the steadily decreasing cost of robotic hardware, has lowered the entry barrier to robotics research and enabled rapid advancements in the field. One of the primary examples is the Unitree Go2 quadruped robot, which is often used by researchers in the areas of locomotion, navigation, control, and others. Many researchers use the Go2 robot in combination with techniques like imitation learning, reinforcement learning, and behavioral cloning to allow machine learning systems to take full control of the robot. At the same time, many of those techniques require demonstration data consisting of the robot's kinematics information and actions applied to the motors. Obtaining such data is difficult, requires building complex pipelines, and can take significant time. To aid in those kinds of efforts, we present Kine2Go, a dataset of 800 diverse gait kinematic trajectories for the Unitree Go2 robot, derived from 40 distinct policies. Our pipeline accepts data from various quadruped morphologies and translates them to a Go2-compatible format. Then we use reinforcement learning to train policies that follow a given motion, and finally we gather data from those policies, which yields robust, perturbed kinematic data with corresponding motor-level actions.

2606.14699 2026-06-15 cs.CV cs.GR cs.RO 交叉投稿

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Instruct-Particulate: 基于运动学控制的可扩展前馈式3D物体关节化

Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学) Nanyang Technological University(南洋理工大学)

AI总结 提出Instruct-Particulate模型,通过运动学规范(部件描述、连接性、关节类型等)指导3D网格的关节分割和运动参数预测,利用异构数据集(15万+物体)训练,实现跨类别和AI生成网格的泛化。

Comments Project page: https://instruct-particulate.github.io/

AI中文摘要

重建关节式3D物体对于动画、游戏和机器人模拟至关重要。最近的神经网络可以估计3D物体的关节结构,但其泛化能力仍然受到该任务标注数据稀缺的限制。为了解决这一差距,我们引入了Instruct-Particulate,一个模型,它接受一个3D网格以及一个目标运动学规范,包括部件描述、连接性、关节类型和可选的点提示,并预测相应的运动学部件分割和关节运动参数。运动学规范消除了任务的歧义,并允许模型针对不同粒度的标注,从而使得使用更丰富的异构训练数据成为可能。在测试时,运动学规范可以从大规模视觉-语言模型中自动获得,因此该模型可以应用于任何输入网格。为了大规模训练我们的模型,我们构建了一个包含超过15万个关节式3D物体的异构数据集,通过使用视觉-语言模型对部分其他3D模型(整体或已分解为部件)进行运动学标注,扩展了现有的公开数据集。实验表明,我们的模型在跨类别和AI生成网格上泛化更好,通过图像到3D模型实现了从真实世界图像重建关节式资产。

英文摘要

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

2602.03177 2026-06-15 cs.RO 版本更新

Estimation of Ground Reaction Forces from Kinematic Data during Locomotion

基于运动学数据估计行走过程中的地面反作用力

Gautami Golani, Dong Anh Khoa To, Ananda Sidarta, Arun-Kumar Kaliya-Perumal, Oliver Roberts, Lek Syn Lim, Jim Patton, Domenico Campolo

发表机构 * Nanyang Technological University(南洋理工大学) Agency for Science, Technology and Research(科技研究局) National Healthcare Group(国家健康集团)

AI总结 提出一种仅使用标记点运动捕捉数据估计地面反作用力的无测力台方法,通过16个身体段运动学计算质心并分解力分量,实验验证了可行性。

AI中文摘要

地面反作用力(GRFs)提供了对人体步态力学的基本洞察,并广泛用于评估关节负荷、肢体对称性、平衡控制和运动功能。尽管具有临床相关性,但由于测力台系统的实际限制,GRF在临床工作流程中的应用仍不充分。在这项工作中,我们提出了一种无测力台的方法,仅使用基于标记的运动捕捉数据来估计GRF。该方法仅凭运动学数据即可估计并分解GRF,因此非常适合广泛的临床部署。通过使用16个身体节段的运动学,我们估计质心(CoM)并计算GRF,随后通过基于最小化的方法将其分解为各个分量。通过这一框架,我们可以识别步态支撑期,并在没有专用测力台系统的情况下提供临床上有意义的动力学测量。实验结果表明,仅基于运动学数据估计CoM和GRF是可行的,支持无测力台的步态分析。

英文摘要

Ground reaction forces (GRFs) provide fundamental insight into human gait mechanics and are widely used to assess joint loading, limb symmetry, balance control, and motor function. Despite their clinical relevance, GRFs remain underutilised in clinical workflows due to the practical limitations of force plate systems. In this work, we present a force-plate-free approach for estimating GRFs using only marker-based motion capture data. Because this method estimates and decomposes GRFs from kinematics alone, it is well suited for widespread clinical deployment. By using kinematics from sixteen body segments, we estimate the centre of mass (CoM) and compute GRFs, which are subsequently decomposed into individual components through a minimization-based approach. Through this framework, we can identify gait stance phases and provide access to clinically meaningful kinetic measures without a dedicated force plate system. Experimental results demonstrate the viability of CoM and GRF estimation based solely on kinematic data, supporting force-plate-free gait analysis.
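The physical relation behind a kinematics-only estimate is Newton's second law applied to the whole-body centre of mass: the total GRF equals body mass times (CoM acceleration plus gravity). The sketch below shows that relation with a mass-weighted CoM and a central finite difference. The segment masses, the z-up axis convention, and the function names are assumptions, and the paper's minimization-based decomposition into individual components is not reproduced here.

```python
G = 9.81  # gravitational acceleration [m/s^2], z axis pointing up (assumed convention)

def centre_of_mass(segment_masses, segment_positions):
    """Mass-weighted mean of segment CoM positions, each given as (x, y, z)."""
    total = sum(segment_masses)
    return tuple(
        sum(m * p[axis] for m, p in zip(segment_masses, segment_positions)) / total
        for axis in range(3)
    )

def total_grf(total_mass, com_prev, com_now, com_next, dt):
    """Total GRF [N] from three consecutive CoM samples via a central difference."""
    accel = [(n - 2.0 * c + p) / dt ** 2
             for p, c, n in zip(com_prev, com_now, com_next)]
    return (total_mass * accel[0],
            total_mass * accel[1],
            total_mass * (accel[2] + G))
```

For a 70 kg person standing still, the CoM does not accelerate and the estimate reduces to body weight, about 686.7 N vertically. Real marker data would need low-pass filtering before the double differentiation, since finite differences amplify noise.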

2606.08881 2026-06-15 cs.RO cs.AI 版本更新

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

在SO-101上对视觉-语言-动作模型进行基准测试:失败与恢复分析

Yi Yu, Xinchuan Qiu

发表机构 * Graduate School of Advanced Science and Engineering, Hiroshima University(广岛大学先进科学与工程研究生院)

AI总结 提出SO-101低成本机器人平台基准,通过失败分类和恢复评估指标,系统比较VLA和模仿学习策略,发现执行不稳定是主要失败源。

Comments 13 pages, 9 figures

AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出强大的泛化能力,但现有评估主要在仿真或昂贵机器人平台上进行,其在低成本真实机器人上的鲁棒性尚未充分探索。我们提出了一个标准化的真实世界基准,用于在低成本SO-101机器人平台上评估代表性VLA和模仿学习策略。该基准包含四个代表性操作任务和统一评估协议,能够在具身不确定性下进行系统比较。使用真实遥操作演示,我们直接在物理平台上微调和评估$\pi_{0.5}$、SmolVLA、Wall-X和ACT。除了传统的任务成功率,该基准还包含结构化的失败分类、语义级和执行级失败分解,以及恢复感知评估指标,以表征策略鲁棒性。实验结果表明,更强的预训练VLA策略通常优于模仿学习基线,尽管在低成本机器人部署条件下性能高度依赖于任务。执行不稳定是主要的失败源,而恢复能力在不同架构间差异显著。这些结果强调了超越二元任务成功进行失败和恢复分析的重要性,并将SO-101确立为在现实低成本机器人部署条件下评估具身AI系统的实用基准。

英文摘要

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $\pi_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

2606.12349 2026-06-15 cs.RO cs.SY eess.SY 版本更新

Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles

面向无人水面艇操纵性评估的海洋机器人Unity仿真器中可追溯虚拟海试

Paria Rezayan

发表机构 * School of Engineering and Built Environment, Sheffield Hallam University(谢菲尔德哈勒姆大学工程与建筑环境学院)

AI总结 针对USV水动力导数辨识数据获取难的问题,在MARUS仿真器中建立标准化虚拟海试框架,通过TC/ZZ机动自动化执行、数据采集与后处理管道,生成符合IMO/ITTC指标的可重复数据集,案例验证了框架的有效性。

AI中文摘要

精确识别水动力导数对于无人水面艇(USV)的控制与导航至关重要,但物理海试的高保真操纵数据受成本和安全性限制。回转试验(TC)和Z形试验(ZZ)仍是IMO和ITTC评估程序的基础。本文扩展了海洋机器人Unity仿真器(MARUS),引入标准化虚拟海试框架,用于TC/ZZ机动的自动化执行和数据生成,包括可追溯的命令-执行日志记录、面向系统辨识(SI)的数据调理以及自动提取符合IMO/ITTC的操纵性指标。一个关键贡献是专用的TC/ZZ数据采集和后处理管道,提高了基于仿真的机动的可重复性和可审计性,同时生成适用于水动力导数辨识和数字孪生工作流的SI就绪数据集。另一个特点是差动推力转向的显式命令-执行分离,其中输入记录为有序的等效舵命令,而实际执行则记录为基于施加推力的执行级代理。案例研究结果表明了可重复且合规的机动行为。对于TC试验,左舷和右舷之间的归一化进距差异约为3.9%,战术直径差异约为4.6%至4.7%。对于ZZ试验,±10度和±20度机动下的第一和第二超越角超调量均保持在1度以下,满足IMO标准,而峰值偏航速率约为4.1至5.8度/秒。总体而言,该框架提供了一种可重复且可审计的虚拟海试工作流,用于生成符合IMO/ITTC的数据集,并支持系统辨识、水动力导数估计和数字孪生校准。

英文摘要

Accurate identification of hydrodynamic derivatives is essential for precise control and autonomous navigation of Unmanned Surface Vehicles (USVs). However, acquiring high-fidelity manoeuvring data from physical sea trials is often constrained by cost, safety, and environmental disturbances. Standard manoeuvring trials, particularly Turning Circle (TC) and Zig-Zag (ZZ), remain fundamental to IMO and ITTC assessment procedures because they provide comparable performance metrics reflective of underlying hydrodynamic behaviour. This paper extends the open-source Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres. The framework provides traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. The framework also provides explicit command-execution separation for differential-thrust steering, where manoeuvre inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case study results demonstrate repeatable and IMO-compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9% between port and starboard turns, while the tactical diameter differs by 4.6-4.7%. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/-10-degree and +/-20-degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 degrees/second.
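As an illustration of automated metric extraction from a Zig-Zag log, the sketch below computes the first overshoot angle in its conventional sense: how far the heading continues past the switch angle after the helm is reversed. The function name and the sample heading series are hypothetical, and the abstract does not spell out how the framework defines "overshoot excess", so this should not be read as the paper's exact metric.

```python
def first_overshoot_deg(headings_deg, switch_angle_deg):
    """Peak heading excursion beyond the switch angle after it is first reached.

    Returns None if the heading never reaches the switch angle.
    """
    peak = None
    for heading in headings_deg:
        if peak is None:
            if heading >= switch_angle_deg:
                peak = heading  # helm reversal would be commanded here
        elif heading >= switch_angle_deg:
            peak = max(peak, heading)
        else:
            break  # heading has come back below the switch angle
    return None if peak is None else peak - switch_angle_deg

# Hypothetical heading log for a 10/10 manoeuvre, in degrees.
log = [0.0, 4.0, 8.0, 10.0, 10.6, 10.8, 10.3, 9.5, 5.0]
overshoot = first_overshoot_deg(log, 10.0)
```

For this made-up log the heading peaks at 10.8 degrees, giving an overshoot of about 0.8 degrees. The second overshoot can be found the same way on the mirrored leg of the manoeuvre.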

11. 安全、鲁棒性与可信机器人 2 篇

2606.14585 2026-06-15 cs.RO cs.AI 新提交

Sensitivity Shaping for Latent Modeling

潜变量建模中的灵敏度塑造

Hongzhan Yu, Chenghao Li, Ruipeng Zhang, Henrik Christensen, Sicun Gao

发表机构 * University of California San Diego(加利福尼亚大学圣迭戈分校)

AI总结 针对生成动力学模型在策略诱导的分布外(OOD)转换检测中灵敏度不足的问题,提出支持条件控制灵敏度正则化,提升对控制输入变化的局部响应,实验验证了改进的OOD检测和更安全的闭环规划。

AI中文摘要

生成动力学模型能够在具有挑战性的机器人系统中进行规划,但安全部署需要可靠地检测策略诱导的分布外(OOD)转换。现有方法通常将学习到的动力学视为固定的,并附加事后支持代理。我们表明,当动力学对关键动作选择局部不敏感时,这些代理可能失效:不受支持的控制动作可能产生类似于演示转换的潜变量预测,尽管存在较大的真实预测误差,但仍会抑制OOD信号。为了解决这个问题,我们引入了支持条件控制灵敏度正则化,该正则化在学习动力学的高支持训练区域中促进对控制输入变化的局部敏感响应。这保留了控制引起的变异,同时限制了因弱经验支持导致的不稳定外推。在基于视觉的避障、操作和真实机器人导航中的实验表明,OOD检测和更安全的闭环规划得到了改进。

英文摘要

Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution (OOD) transitions. Existing methods typically treat the learned dynamics as fixed and attach post hoc support surrogates. We show that these surrogates can fail when the dynamics are locally insensitive to critical action choices: unsupported control actions may produce latent predictions that resemble demonstrated transitions, suppressing OOD signals despite large true predictive errors. To address this, we introduce support-conditioned control-sensitivity regularization, which promotes sensitive local response to control input changes in learned dynamics in high-support training regions. This preserves control-induced variation while limiting unstable extrapolation due to weak empirical support. Experiments in vision-based obstacle avoidance, manipulation, and real-robot navigation show improved OOD detection and safer closed-loop planning.
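The notion of local control sensitivity can be made concrete with a finite-difference estimate of how much a dynamics model's prediction changes when the action is perturbed, together with a penalty that only applies where data support is high. The hinge form of the penalty, the `floor` value, and the toy dynamics functions are assumptions; the abstract does not give the paper's actual regularizer.

```python
def control_sensitivity(dynamics, latent, action, eps=1e-3):
    """Finite-difference norm of d(prediction)/d(action) at one (latent, action) pair."""
    base = dynamics(latent, action)
    total = 0.0
    for i in range(len(action)):
        bumped = list(action)
        bumped[i] += eps
        out = dynamics(latent, bumped)
        total += sum(((o - b) / eps) ** 2 for o, b in zip(out, base))
    return total ** 0.5

def sensitivity_penalty(sensitivity, support_weight, floor=1.0):
    """Hinge penalty: active only when sensitivity falls below `floor`,
    scaled by how well supported the training region is (0 to 1)."""
    return support_weight * max(0.0, floor - sensitivity)

# Toy latent dynamics: one responds to the action, the other ignores it.
responsive = lambda z, a: [z[0] + 2.0 * a[0]]
insensitive = lambda z, a: [z[0]]
```

The insensitive model is the failure case described in the abstract: any action yields the same prediction, so an unsupported action looks just like a demonstrated one. Under this sketch it receives the full penalty in high-support regions, while the responsive model receives none.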

2606.14536 2026-06-15 cs.LG cs.RO cs.SY eess.SY 交叉投稿

Provably Safe, Yet Scalable Reinforcement Learning

可证明安全且可扩展的强化学习

Kai S. Yun, Zeyang Li, Navid Azizan

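Affiliations * MIT

AI summary Proposes the PS2-RL framework, a two-phase architecture that first learns a backup policy to implicitly construct a control-invariant set and then trains an RL policy through a differentiable projection layer. The result is provably safe yet scalable reinforcement learning that maintains performance and safety in state spaces of up to 10 dimensions.

AI Chinese abstract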

安全强化学习旨在学习在满足约束的同时优化奖励的策略。主流方法依赖于软约束策略优化,虽取得经验成功,但无法为学习策略提供正式安全保证。相反,具有严格保证的方法通常依赖显式证书函数,其构造需要直接综合和验证控制不变集,这一过程随状态维度扩展性差,且往往导致过于保守的行为。本文提出可证明安全且可扩展的强化学习(PS2-RL)框架,一种新颖的两阶段架构,以可扩展方式学习可证明安全的策略,旨在克服先前方法的关键瓶颈。PS2-RL不显式计算不变集,而是利用学习的备份策略前向积分系统动力学,在线生成隐式控制不变集。第一阶段,通过提出的安全到达值函数训练备份策略,该值函数刻画了用于不变集构造的最优备份策略。第二阶段,通过可微投影层端到端训练RL策略,该投影层严格强制由学习备份策略诱导的安全保证。通过在第一阶段最大化隐式控制不变集的体积,第二阶段得到的PS2策略既高效又可扩展,同时保持可证明安全性。关键的是,PS2-RL对底层RL算法无限制,可插入任何现有训练流程。我们为所提框架建立了理论保证,并在状态维度高达10的机器人控制任务上进行了评估,而在此范围内,先前可证明安全的RL方法难以应对或变得不实用。

English abstract

Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.

12. Other / General Robotics 4 papers

2605.24795 2026-06-15 math.OC cs.LG cs.RO cs.SY eess.SY Replacement

Lifted Schrödinger Bridges for Gaussian Mixture Endpoints: Projection Gaps and Path-Space Obstructions

提升的Schrödinger桥用于高斯混合端点:投影间隙与路径空间障碍

Siddhartha Ganguly, George Rapakoulias, Panagiotis Tsiotras

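Affiliations * Daniel Guggenheim School of Aerospace, Georgia Institute of Technology

AI summary For stochastic density control between Gaussian-mixture endpoint distributions, proposes a lifted path-space construction that decomposes the problem into explicit Schrödinger bridges between Gaussian components plus a finite-dimensional entropic coupling, and analyzes the label-information gap and path-space obstruction that appear after projection.

Comments 35 pages. Submitted to a journal; comments are welcome

AI Chinese abstract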

我们研究了布朗先验动力学下高斯混合端点分布之间的随机密度控制。由于高斯混合之间的直接Schrödinger桥通常没有闭式解,我们引入了一种提升路径空间构造,其中每条轨迹都增加了一个源-目标分量标签。因此,问题分解为具有显式边际、漂移和成本公式的高斯分量间Schrödinger桥,而混合级分配简化为具有Sinkhorn缩放形式的有限维熵耦合问题。然后,我们分析了通过丢弃或遗忘标签得到的投影。通过构造,投影律满足原始高斯混合端点约束,但其相对熵通常与提升相对熵相差一个非负的条件标签信息间隙。这个间隙揭示了一个路径空间障碍:提升优化器在投影后通常不能等同于直接的无标签Schrödinger桥。我们还推导了与投影边际流相关的后验平均马尔可夫漂移,证明了动能上界,并识别了一个公共路径势条件,在该条件下投影间隙消失。为了自包含的阐述,记录了几个显示密度和形状控制的数值示例。

English abstract

We study stochastic density control between Gaussian-mixture endpoint distributions under Brownian prior dynamics. Since the direct Schrödinger bridge between Gaussian mixtures is generally not available in closed form, we introduce a lifted path-space construction in which each trajectory is augmented with a source-target component label. Consequently, the problem decomposes into Gaussian component-to-component Schrödinger bridges with explicit marginal, drift, and cost formulas, while the mixture-level assignment reduces to a finite-dimensional entropic coupling problem with a Sinkhorn scaling form. We then analyze the projection obtained by discarding or forgetting the label. By construction, the projected law satisfies the original Gaussian-mixture endpoint constraints, but its relative entropy generally differs from the lifted relative entropy by a nonnegative conditional label-information gap. This gap reveals a path-space obstruction: the lifted optimizer cannot, in general, be identified with the direct unlabeled Schrödinger bridge after projection. We also derive the posterior-averaged Markov drift associated with the projected marginal flow, prove a kinetic-energy upper bound, and identify a common path-potential condition under which the projection gap vanishes. Several numerical illustrations showing density and shape control are recorded for a self-contained exposition.
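
The finite-dimensional entropic coupling with a Sinkhorn scaling form mentioned in the abstract can be illustrated with a generic sketch. The cost matrix here is a placeholder standing in for the component-to-component bridge costs, and `eps`, `iters`, and the function name are hypothetical; the paper's exact cost and scaling are not reproduced.

```python
import numpy as np

def sinkhorn_coupling(a, b, cost, eps=1.0, iters=500):
    """Entropic coupling between source weights a and target weights b.

    cost[i, j] is an assumed placeholder for the cost of bridging source
    component i to target component j; eps is the entropic regularization.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    kernel = np.exp(-np.asarray(cost, dtype=float) / eps)

    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)

    # The coupling has the scaling form diag(u) K diag(v).
    return u[:, None] * kernel * v[None, :]
```

The returned matrix assigns mass between source and target mixture components; its row sums approximate the source weights and its column sums match the target weights.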

2507.06174 2026-06-15 cs.RO cs.AI cs.SY eess.SY Replacement

Design and Experimental Validation of Sensorless 4-Channel Bilateral Teleoperation for Low-Cost Manipulators

面向低成本机械臂的无传感器四通道双侧远程操控的设计与实验验证

Koki Yamane, Yunhan Li, Masashi Konosu, Koki Inami, Junji Oaki, Toshiaki Tsuji, Sho Sakaino

Affiliations * Degree Programs in Intelligent and Mechanical Interaction Systems, University of Tsukuba; Faculty of Engineering, Information and Systems, University of Tsukuba; Department of Electrical Engineering, Electronics, and Applied Physics, Saitama University

AI summary Presents a sensorless 4-channel bilateral teleoperation framework that combines nonlinear dynamics compensation with an observer-based disturbance estimation scheme. Experiments show stable teleoperation in high-speed, contact-rich scenarios under low-cost hardware constraints, and higher success rates on imitation-learning tasks.

Comments 22 pages, 12 figures, Submitted to IEEE Access

AI Chinese abstract

低成本机械臂的远程操控作为采集模仿学习演示数据的实用手段,正受到越来越多的关注。然而,现有的大多数低成本系统依赖不带力反馈的单侧位置控制,而带力反馈的双侧远程操控难以实现,因为低成本机械臂通常只有低分辨率编码器,且没有关节扭矩传感器。本文提出了一种无传感器四通道双侧远程操控框架,将基于辨识的非线性动力学补偿与基于扰动观测器的速度和外力估计方案相结合。通过在频域中解释观测器结构,我们阐明了速度估计带宽与外力估计带宽之间的耦合,并基于阻尼比和单一截止频率推导了实用的调参准则。真实机器人实验(包括与力传感器的对比和远程操控任务)表明,所提出的框架能够提供具有实用价值的力估计,并在低成本硬件约束下实现高速和接触密集场景中的稳定远程操控。作为应用,模仿学习实验表明,在演示数据中加入估计的力信息可以提高所测试的接触密集操作任务的成功率。

English abstract

Teleoperation of low-cost manipulators is attracting increasing attention as a practical means of collecting demonstration data for imitation learning. However, most existing low-cost systems rely on unilateral position control without force feedback, while implementing force-feedback bilateral teleoperation is difficult because low-cost manipulators typically have low-resolution encoders and no joint torque sensors. This paper presents a sensorless 4-channel bilateral teleoperation framework that integrates identified nonlinear dynamics compensation with a disturbance-observer-based velocity and external-force estimation scheme. By interpreting the observer structure in the frequency domain, we clarify the coupling between the velocity- and external-force-estimation bandwidths and derive practical tuning guidelines based on the damping ratio and a single cutoff frequency. Real-robot experiments, including force-sensor comparison and teleoperation tasks, demonstrate that the proposed framework provides practically useful force estimates and enables stable teleoperation in high-speed and contact-rich scenarios under low-cost hardware constraints. As an application, imitation-learning experiments demonstrate that incorporating estimated force information into demonstrations improves task success rates in the tested contact-rich manipulation tasks.
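
As background for the disturbance-observer idea, the following is a minimal sketch of a textbook first-order disturbance observer on a single joint. It assumes the velocity is directly available, whereas the paper estimates it. The plant model J dω/dt = τ_cmd − τ_dis, all parameter values, and the function name are illustrative assumptions rather than the authors' design.

```python
def simulate_dob(inertia=0.01, cutoff=50.0, dt=1e-3, steps=1000,
                 torque_cmd=1.0, disturbance=0.5):
    """Simulate a 1-DOF joint and return disturbance-torque estimates.

    Assumed plant: inertia * d(omega)/dt = torque_cmd - disturbance.
    Observer: low-pass filter (cutoff in rad/s) of
    torque_cmd + cutoff * inertia * omega, minus cutoff * inertia * omega.
    """
    omega = 0.0   # joint velocity
    z = 0.0       # low-pass filter state
    estimates = []
    for _ in range(steps):
        # Current disturbance estimate from the filter state and velocity.
        estimates.append(z - cutoff * inertia * omega)
        # Forward-Euler update of the low-pass filter.
        z += dt * cutoff * (torque_cmd + cutoff * inertia * omega - z)
        # Forward-Euler update of the plant.
        omega += dt * (torque_cmd - disturbance) / inertia
    return estimates
```

With these settings the estimate follows the true disturbance through a first-order lag set by `cutoff`, which is the single-cutoff-frequency trade-off between noise rejection and responsiveness that observer tuning has to balance.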

2503.15496 2026-06-15 cs.HC cs.RO Replacement

Fast Multi-Party Open-Ended Conversation with a Social Robot

与社交机器人的快速多方开放性对话

Giulio Antonio Abbo, Maria Jose Pinto-Bernal, Martijn Catrycke, Tony Belpaeme

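Affiliations * University of Amsterdam

AI summary Presents a multi-party conversational system that combines multimodal perception with a large language model. The evaluation shows high engagement and accuracy in parallel conversations and group discussions, with technical limitations including audio-based speaker recognition errors and response latency.

Comments 15 pages, 5 figures, 4 tables; 2 appendices

Journal ref
Front. Robot. AI 13:1766383 (2026)
AI Chinese abstract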

多方开放性对话在人机交互中仍是一个重大挑战,特别是当机器人需要在对话重叠或快速切换的情况下识别说话者、分配话轮并作出连贯回应时。本文提出一种多方对话系统,将多模态感知(语音到达方向、说话人日志、人脸识别)与用于生成回应的大语言模型相结合。该系统在Furhat机器人上实现,并由30名参与者在两个场景中进行了评估:(i)平行的独立对话和(ii)共享的小组讨论。结果表明,该系统能维持连贯且吸引人的对话,在平行设置中实现了较高的受话者识别准确率(92.6%)和较强的人脸识别可靠性(80-94%)。参与者报告了清晰的社会临场感和积极的参与度,但基于音频的说话人识别错误和响应延迟等技术障碍影响了小组互动的流畅性。结果突显了基于LLM的多方交互的潜力和局限性,并为未来社交机器人改进多模态线索整合和响应能力指出了具体方向。

English abstract

Multi-party open-ended conversation remains a major challenge in human-robot interaction, particularly when robots must recognise speakers, allocate turns, and respond coherently under overlapping or rapidly shifting dialogue. This paper presents a multi-party conversational system that combines multimodal perception (voice direction of arrival, speaker diarisation, face recognition) with a large language model for response generation. Implemented on the Furhat robot, the system was evaluated with 30 participants across two scenarios: (i) parallel, separate conversations and (ii) shared group discussion. Results show that the system maintains coherent and engaging conversations, achieving high addressee accuracy in parallel settings (92.6%) and strong face recognition reliability (80-94%). Participants reported clear social presence and positive engagement, although technical barriers such as audio-based speaker recognition errors and response latency affected the fluidity of group interactions. The results highlight both the promise and limitations of LLM-based multi-party interaction and outline concrete directions for improving multimodal cue integration and responsiveness in future social robots.

2508.18967 2026-06-15 cs.RO cs.CV Replacement

Enhanced UAV Path Planning Using the Tangent Intersection Guidance (TIG) Algorithm

利用切线交点引导算法(TIG)增强的无人机路径规划

Hichem Cheriet, Khellat Kihel Badra, Chouraqui Samira

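AI summary Proposes the TIG algorithm, which generates feasible paths with an elliptic tangent intersection method and combines a heuristic rule with quadratic Bézier curve smoothing to achieve efficient, safe UAV path planning in static and dynamic environments.

Comments Accepted for publication in JAMRIS Journal

Journal ref
Journal of Automation, Mobile Robotics and Intelligent Systems, 20(2), 30-52 (2026)
AI Chinese abstract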

高效且安全的无人机导航对于各种应用至关重要,包括战斗支援、包裹递送和搜索救援。本文介绍了切线交点引导(TIG)算法,一种用于静态和动态环境中无人机路径规划的先进方法。该算法使用椭圆切线交点方法生成可行路径:它为每个威胁生成两条子路径,根据启发式规则选择最佳路线,并迭代优化路径,直到到达目标。考虑到无人机的运动学和动力学约束,采用基于二次贝塞尔曲线的改进平滑技术生成平滑且高效的路径。实验结果表明,在静态环境中,与A*、PRM、RRT*、切线图和静态APPATT算法相比,TIG算法能以更短的时间(从0.01秒起)生成最短路径,且转向角度更少。此外,在完全未知和部分已知的环境中,TIG展示了高效的实时避障路径规划能力,优于APF和动态APPATT算法。

English abstract

Efficient and safe navigation of Unmanned Aerial Vehicles (UAVs) is critical for various applications, including combat support, package delivery and Search and Rescue Operations. This paper introduces the Tangent Intersection Guidance (TIG) algorithm, an advanced approach for UAV path planning in both static and dynamic environments. The algorithm uses the elliptic tangent intersection method to generate feasible paths. It generates two sub-paths for each threat, selects the optimal route based on a heuristic rule, and iteratively refines the path until the target is reached. Considering the UAV kinematic and dynamic constraints, a modified smoothing technique based on quadratic Bézier curves is adopted to generate a smooth and efficient route. Experimental results show that the TIG algorithm can generate the shortest path in less time, starting from 0.01 seconds, with fewer turning angles compared to A*, PRM, RRT*, Tangent Graph, and Static APPATT algorithms in static environments. Furthermore, in completely unknown and partially known environments, TIG demonstrates efficient real-time path planning capabilities for collision avoidance, outperforming APF and Dynamic APPATT algorithms.
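
The quadratic Bézier smoothing step can be illustrated with a generic corner-rounding sketch. This is not the paper's modified technique: the offset distance `d`, the function names, and the choice of the corner waypoint as the control point are assumptions made for illustration, and `d` is assumed to be shorter than both adjacent path segments.

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, n=20):
    """Sample n points on the quadratic Bezier curve with control points p0, p1, p2."""
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

def smooth_corner(prev_pt, corner, next_pt, d=1.0, n=20):
    """Replace a sharp waypoint with a Bezier arc starting and ending d away from it."""
    prev_pt, corner, next_pt = (np.asarray(p, dtype=float)
                                for p in (prev_pt, corner, next_pt))
    # Entry and exit points lie on the two segments, a distance d from the corner.
    entry = corner + d * (prev_pt - corner) / np.linalg.norm(prev_pt - corner)
    exit_ = corner + d * (next_pt - corner) / np.linalg.norm(next_pt - corner)
    # The original corner acts as the control point, so the arc is tangent
    # to both segments at its endpoints.
    return quadratic_bezier(entry, corner, exit_, n)
```

For example, `smooth_corner((0, 0), (10, 0), (10, 10), d=2.0)` rounds a right-angle turn with an arc that leaves the first segment at (8, 0) and joins the second at (10, 2).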