arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28812 2026-05-28 cs.RO cs.AI cs.LG Version update

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin

Affiliations: ETH Zürich; UC Berkeley

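AI summary: Proposes Center-of-Pressure (CoP), a physics-grounded tactile representation that, combined with sensor calibration based on differentiable dynamics, enables zero-shot sim-to-real transfer on a multi-fingered hand and outperforms binary-contact and raw-taxel baselines on peg-in-hole insertion and ball balancing.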

Comments Project site: https://mpan31415.github.io/tactile_rep/

Abstract

A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features -- sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.
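
As a rough illustration of the center-of-pressure idea, the sketch below computes a force-weighted average of taxel positions on one sensing pad. The function name `center_of_pressure`, the 3D position tuples, and the scalar per-taxel normal forces are assumptions made for illustration; the abstract does not give the paper's exact formulation.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def center_of_pressure(
    positions: Sequence[Vec3],
    forces: Sequence[float],
    eps: float = 1e-9,
) -> Optional[Vec3]:
    """Force-weighted mean of taxel positions; None when there is no contact.

    Hypothetical helper: positions are taxel locations in a pad frame and
    forces are per-taxel normal force readings (both assumed inputs).
    """
    if len(positions) != len(forces):
        raise ValueError("positions and forces must have the same length")
    # Clamp negative readings to zero: a taxel cannot pull on the object.
    weights = [max(f, 0.0) for f in forces]
    total = sum(weights)
    if total < eps:
        return None  # no measurable contact on this pad
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, positions)) / total
        for axis in range(3)
    )
```

Unlike a binary contact flag, this quantity moves continuously as the contact shifts across the pad, while staying much lower-dimensional than the raw taxel array.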

2605.28736 2026-05-28 cs.RO Version update

Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following

Xucheng Wang, Zhizhou Yang, Xiaoman Zhang, Sung Eun Kim, Romain Hardy, Pranav Rajpurkar

Affiliations: Harvard Medical School; Massachusetts General Hospital

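AI summary: Presents the first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, using suture following (the grab-pull-release motion an assistant performs at every stitch) as the task. Comparing four policies (ACT, Diffusion Policy, SmolVLA, π₀) across 28 trained models, it finds that π₀ is best in data efficiency, background robustness, and trajectory smoothness, and reaches a 92% stitch completion rate in a surgeon-robot suturing trial.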

Abstract

This study presents the first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, targeting suture following: the grab-pull-release motion an assistant performs at every stitch. We collect 160 teleoperated demonstrations (32,374 frames) on an open-source robot arm and benchmark four architecturally diverse imitation learning policies (ACT, Diffusion Policy, SmolVLA, $π_0$) across 28 trained models, evaluated in 32 configurations along three clinically motivated dimensions: dataset size, camera viewpoint, and background variation. Our results demonstrate that under ideal conditions, the four policies achieve $50$-$75\%$ task success, with depth error as the dominant failure mode across all architectures. Among all policies, $π_0$ achieves the strongest results with a pretrained vision-language backbone, demonstrating superior data efficiency, greater robustness to background variation, and smoother trajectories compatible with surgical workflow. When deployed in a surgeon-robot suturing trial, $π_0$ yields a $92\%$ stitch completion rate. These findings establish collaborative robotic assistance in open surgery as a feasible target for imitation learning and highlight depth perception and end-effector design as key priorities for clinical translation.

2605.28726 2026-05-28 cs.RO cs.LG Version update

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

Krishnam Gupta

Affiliations: Independent Research

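AI summary: Using black-box action monitoring, this paper finds that vision-language-action (VLA) architectures fail in fundamentally different, predictable ways at the motor-command level, and shows that architecture-matched monitor selection is essential.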

Comments Accepted at IEEE ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots", Vienna, June 2026. Non-archival workshop. 5 pages, 2 figures, 22 references

Abstract

We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: https://github.com/krishnam94/vla-edge
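
To make the two action-level signals concrete, here is a minimal sketch for a single action dimension. The definitions are assumptions, not taken from the paper or from the SafeContract toolkit: reversal rate is taken as the fraction of consecutive non-zero steps whose sign flips, and jerk as the third finite difference of a position-like command. The function names are hypothetical.

```python
from typing import Sequence


def direction_reversal_rate(actions: Sequence[float]) -> float:
    """Fraction of consecutive step pairs whose direction of change flips."""
    deltas = [b - a for a, b in zip(actions, actions[1:])]
    # Ignore zero deltas so that pauses are not counted as reversals.
    deltas = [d for d in deltas if d != 0.0]
    if len(deltas) < 2:
        return 0.0
    flips = sum(1 for d0, d1 in zip(deltas, deltas[1:]) if d0 * d1 < 0.0)
    return flips / (len(deltas) - 1)


def max_abs_jerk(actions: Sequence[float], dt: float) -> float:
    """Largest absolute third finite difference of a position-like command."""
    if len(actions) < 4:
        return 0.0
    jerks = [
        (actions[i + 3] - 3 * actions[i + 2] + 3 * actions[i + 1] - actions[i])
        / dt**3
        for i in range(len(actions) - 3)
    ]
    return max(abs(j) for j in jerks)
```

Both functions need only the emitted action sequence, which is what makes this style of monitoring black-box: no access to model weights or internal activations is assumed.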

2605.28634 2026-05-28 cs.RO Version update

PrimitiveVLA: Learning Reusable Motion Primitives for Efficient and Generalizable Robotic Manipulation

Yutai Li, Shaohui Peng, Jiaming Guo, Di Huang, Zihao Zhang, Yuxuan Guo, Yunkai Gao, Siming Lan, Ling Li, Xing Hu, Yunji Chen

Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; Jiangsu Key Laboratory of AI for Industries, Institute of AI for Industries, CAS; University of Chinese Academy of Sciences; Cambricon Technologies; Intelligent Software Research Center, Institute of Software, CAS; University of Science and Technology of China

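AI summary: Proposes PrimitiveVLA, a framework that moves vision-language-action models from direct instruction-to-control mapping to a primitive-centric disassemble-and-assemble paradigm, using a multimodal canonical representation and an automated pipeline to improve data efficiency and achieve zero-shot generalization.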

Abstract

Vision-Language-Action (VLA) models offer a promising paradigm for generalist robotic policies, yet their adaptation is hindered by data inefficiency and poor generalization. We argue that these bottlenecks stem from the prevailing Direct Instruction-to-Control Mapping, which forces models to memorize monolithic trajectories rather than reusable motion patterns, i.e., primitives. We propose PrimitiveVLA, a framework that shifts this paradigm toward a Primitive-Centric Disassemble & Assemble paradigm. Supported by a shared Multimodal Canonical Representation (MCR), PrimitiveVLA unifies two phases: (1) Fine-tuning-phase Disassembly, which uses an automated pipeline to disassemble demonstrations into reusable primitives; and (2) Inference-phase Assembly, which employs a VLM-based planner and an LLM-generated switch module for robust closed-loop execution. By disassembling tasks into reusable primitives, PrimitiveVLA enables VLA models to learn invariant motion patterns instead of task-specific trajectories. Extensive experiments show that our framework improves data efficiency and achieves superior zero-shot generalization across unseen and long-horizon tasks.

2605.28583 2026-05-28 cs.RO cs.AI cs.LG cs.SY eess.SY Version update

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

Kangyu Wu, Peng Cui, Guoxi Chen, Ya Zhang

Affiliations: National Natural Science Foundation (NNSF) of China; National Science and Major Project

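AI summary: Proposes SARAD, a framework that combines large language models with deep reinforcement learning and uses retrieval-augmented generation and a collision prediction module to improve the safety and efficiency of autonomous driving.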

Comments 7 pages, 4 figures, accepted by IJCNN 2026

Abstract

Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Reinforcement Learning (DRL) suffers from unsafe random exploration and slow convergence, while Large Language Models (LLMs) demonstrate inherent latency in real-time inference operations. To address these limitations, this paper proposes SARAD, a novel safety-aware hybrid framework that synergizes LLMs and DRL for autonomous driving. SARAD substitutes the random exploration of DRL with Retrieval-Augmented Generation (RAG)-enhanced, LLM-guided decisions sourced from a dynamic expert knowledge repository. An attention discriminator is proposed to integrate the prior knowledge of LLMs into DRL policy optimization. A collision predictor module, fine-tuned with historical collision data, is further designed to improve vehicle safety. Extensive experiments show that SARAD achieves significant performance improvements in the Highway-Env simulator, validating the effectiveness of the proposed model in autonomous driving.

2605.28549 2026-05-28 cs.RO cs.LG Version update

SPRINT: Efficient Spectral Priors for Humanoid Athletic Sprints

Yantong Wei, Kaihong Huang, Hainan Pan, Jiawei Luo, Jiawei Zhou, Ziyan Mai, Zhiwen Zeng, Yaonan Wang, Huimin Lu

Affiliations: College of Intelligence Science and Technology, National University of Defense Technology; School of Artificial Intelligence and Robotics, Hunan University

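AI summary: Proposes SPRINT, a framework that uses frequency-adaptive spectral priors to generate kinematically feasible joint trajectories and achieves zero-shot sim-to-real transfer, reaching a peak speed of 6 m/s on the Unitree G1 platform.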

Abstract

The pursuit of humanoid athletic sprints is hindered by a scarcity of humanoid-viable kinematic reference data and the inability of existing frameworks to maintain stability during sprints. To overcome these limitations, we introduce SPRINT, a novel framework driven by efficient, frequency-adaptive spectral priors. By characterizing the fundamental periodicity of human locomotion in the frequency domain using a reference library of five discrete motion sequences, these priors generate kinematically feasible joint trajectories across a broad velocity spectrum, successfully extrapolating to speeds that exceed the reference distribution. Guided by these pretrained priors, the SPRINT policy achieves zero-shot sim-to-real transfer in field experiments on the Unitree G1 platform, reaching a peak sprinting velocity of 6 m/s and demonstrating seamless gait transitions while preserving biomimetic naturalness. Ultimately, this work establishes frequency-adaptive spectral priors as a highly data-efficient foundation for humanoid athletic sprints. The project page is available at https://anonymous.4open.science/w/SPRINT-138A/.
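
One simple way to picture a frequency-domain description of a periodic gait is a truncated Fourier series per joint, evaluated at an adjustable stride frequency. The sketch below is only that generic construction: the abstract does not state how SPRINT's priors are parameterized, and the function name, coefficient layout, and `stride_freq_hz` parameter are all assumptions.

```python
import math
from typing import Sequence


def periodic_joint_angle(
    t: float,
    stride_freq_hz: float,
    offset: float,
    cos_coeffs: Sequence[float],
    sin_coeffs: Sequence[float],
) -> float:
    """Evaluate a truncated Fourier series for one joint at time t (seconds).

    Hypothetical parameterization: harmonic k uses cos_coeffs[k-1] and
    sin_coeffs[k-1] at frequency k * stride_freq_hz.
    """
    angle = offset
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        phase = 2.0 * math.pi * k * stride_freq_hz * t
        angle += a * math.cos(phase) + b * math.sin(phase)
    return angle
```

Raising `stride_freq_hz` replays the same waveform faster, which is a crude stand-in for adapting a reference motion to a higher running speed.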

2605.28527 2026-05-28 cs.RO Version update

What Frozen VLAs Already Know About Success: A Probing Study of Value-Like Structure in Foundation Robot Policies

Jiachen Zhang, Junnan Nie, Junyi Lao, Wei Cheng, Chenghao Liu, Jiaxin Jiang, Songfang Huang

Affiliations: Peking University; China Agricultural University

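AI summary: Uses linear probes on frozen VLA features to predict Monte-Carlo outcome targets, finds that these features encode information about success, and shows that the probes can guide test-time action selection to raise success rates.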

Comments 14 pages, 1 figure, 11 tables. Equal contribution: Jiachen Zhang, Junnan Nie, and Junyi Lao. Corresponding author: Songfang Huang. Preprint

Abstract

Vision-language-action (VLA) policies are trained to imitate actions; their loss never asks them to estimate reward, progress, or future success. Their frozen representations nevertheless carry such information, and it can be read out and used to guide action choice without retraining the policy. From mixed successful and failed manipulation trajectories on LIBERO-Goal, we recover Monte-Carlo outcome targets using lightweight linear probes on frozen features. The targets are consistently predictable from OpenVLA, Pi0.5, DINOv2, and CLIP features, and substantially less so from baselines built on progress, time-to-go, task identity, or proprioception. To rule out task and temporal shortcuts, we evaluate the probes under same-task, same-timestep matched comparisons: Pi0.5 probes still reach roughly 92% pairwise ordering accuracy, while label-shuffled controls stay at chance. Used as a test-time selector over sampled Pi0.5 action prefixes, the same probe turns this offline finding into behavior: on push-plate, success rises from 26.7% under greedy decoding to 44.3%, with a second positive case on wine-rack. The gains are not universal and require additional inference compute, but the underlying finding is clean: frozen VLAs already encode information about success that their imitation objective never explicitly demands.
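
The probing and selection steps can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: the probe is assumed to be a plain dot product plus bias, the success and failure scores are assumed to come from already matched (same-task, same-timestep) pairs, and how features are obtained for each sampled action prefix is left outside the sketch. All function names are hypothetical.

```python
from typing import Sequence


def probe_score(
    weights: Sequence[float], bias: float, features: Sequence[float]
) -> float:
    """Linear probe: a dot product plus bias on frozen features."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def pairwise_ordering_accuracy(
    scores_success: Sequence[float], scores_failure: Sequence[float]
) -> float:
    """Fraction of (success, failure) pairs where the success scores higher."""
    pairs = [(s, f) for s in scores_success for f in scores_failure]
    if not pairs:
        raise ValueError("need at least one success and one failure score")
    correct = sum(1 for s, f in pairs if s > f)
    return correct / len(pairs)


def select_candidate(
    weights: Sequence[float],
    bias: float,
    candidate_features: Sequence[Sequence[float]],
) -> int:
    """Index of the sampled candidate with the highest probe score."""
    scores = [probe_score(weights, bias, f) for f in candidate_features]
    return max(range(len(scores)), key=scores.__getitem__)
```

Selection of this kind costs one policy sample and one probe evaluation per candidate, which is consistent with the abstract's note that the gains require additional inference compute.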

2605.28486 2026-05-28 cs.RO Version update

Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation

Yongchen Wang, Kangyi Lu, Lan Wei, Dandan Zhang

Affiliations: Department of Bioengineering, Imperial-X, Imperial College London

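AI summary: Proposes Mag-VLA, a model for dexterous manipulation of magnetically actuated microrobots with two robotic arms. Built on a vision-language-action framework with an Action Chunking Transformer decoder, it achieves a 90% approach success rate and transport success rates of up to 80% in real-robot experiments.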

Comments Accepted by 2026 MARSS

Abstract

Magnetically actuated microrobots have been used as wireless, non-contact manipulation tools at microscales, making them promising for minimally invasive applications. However, their control remains challenging due to indirect actuation, limited sensing, and nonlinear magnetic interactions. In this work, we propose Mag-VLA, a vision-language-action (VLA) model for dexterous magnetic microrobot manipulation using two robotic arms with mounted magnets for dynamic magnetic-field construction. Bimanual coordination enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm, but it also introduces coupled control challenges, as the policy must generate coordinated trajectories for both actuators within a shared workspace. Our framework adapts a Qwen2.5-VL-7B backbone using Low-Rank Adaptation (LoRA) to process visual observations and language instructions for action prediction. To capture task progression, we introduce a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder for temporally coherent multi-step control. We further construct a teleoperated magnetic microrobot manipulation dataset covering three task configurations. Ablation studies show that the ACT-based decoder substantially outperforms alternative generative action heads. In real-robot experiments, Mag-VLA achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. These results demonstrate that hierarchical VLA modeling provides a promising framework for magnetic microrobot manipulation.

2605.28468 2026-05-28 cs.RO Version update

EIT-Pneumatic Hybrid Robotic Skin for Practical and Accurate Force Map Reconstruction

Junhwi Cho, Sunggyu Bae, Junghyeon Ma, Hyosang Lee, Jung Kim, Kyungseo Park

Affiliations: Mechanical Engineering Department, KAIST; Department of Robotics and Mechatronics Engineering, DGIST; Mechanical Engineering Department, TU/e

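AI summary: Proposes a hybrid robotic skin that combines electrical impedance tomography (EIT) with pneumatic tactile sensing, using Tikhonov-regularized inverse reconstruction and per-pad pneumatic calibration to achieve accurate large-area tactile sensing with reduced sensitivity non-uniformity.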

Comments 8 pages, 8 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. J. Cho, S. Bae, J. Ma contributed equally

Abstract

We present a hybrid robotic skin that combines electrical impedance tomography (EIT) with pneumatic tactile sensing to improve force reconstruction capability. The developed robotic skin is fabricated entirely by 3D printing and spray coating, making it affordable and easy to build. A Tikhonov-regularized inverse reconstruction, paired with per-pad pneumatic calibration, enables accurate large-area tactile sensing with a simple measurement scheme. For validation, we conducted load-cell indentation experiments; the results showed consistent force reconstruction across locations within a pad. Compared with an EIT-only baseline, sensitivity non-uniformity was also reduced, with the coefficient of variation decreasing from 0.31 to 0.14, indicating that the proposed approach addresses a longstanding limitation of EIT. We further demonstrated chest-mounted integration on a humanoid robot and found that the pneumatic signals remained reliable across diverse contact scenarios, including multiple simultaneous contacts on the same sensing pad. These results indicate a practical path toward accurate, scalable whole-body tactile sensing in real robotic systems.
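
The non-uniformity figure quoted above is a coefficient of variation, that is, a standard deviation divided by a mean. A minimal sketch follows; the function name is hypothetical, and the choice of the population standard deviation and of per-location sensitivity values as input are assumptions, since the abstract does not specify either.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(sensitivities: Sequence[float]) -> float:
    """Population standard deviation divided by the absolute mean."""
    mean = statistics.fmean(sensitivities)
    if mean == 0.0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return statistics.pstdev(sensitivities) / abs(mean)
```

A value of 0 means every location responds identically, so the reported drop from 0.31 to 0.14 corresponds to a sensor whose response depends less on where it is pressed.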

2605.28462 2026-05-28 cs.RO Version update

Learning a Kinodynamic Trajectory Manifold for Impact-Aware Compliant Catching of Fast-Moving Objects

Guorui Pei, Mengshi Zhang, Xi Chen, Jinsong Wu, Jiaming Qi, Peng Zhou

Affiliations: College of Robotics, Taiyuan University of Technology; School of Data Science, City University of Hong Kong (Dongguan); School of Advanced Engineering, Great Bay University; Department of Mechanical Engineering, The Hong Kong Polytechnic University; College of Mechanical and Electrical Engineering, Northeast Forestry University

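AI summary: Collects successful catching trajectories via reinforcement learning in simulation, learns a low-dimensional kinodynamic trajectory manifold, and at run time maps the estimated initial object state directly to a reference catching trajectory, combined with compliant control near contact for impact-aware catching of fast-moving objects.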

Abstract

Fast catching of free-flying objects is difficult because of short reaction time, impact uncertainty, and kinodynamic constraints. We use reinforcement learning in simulation to collect successful catching trajectories and learn a low-dimensional kinodynamic trajectory manifold. At run time, the estimated object initial state is mapped directly to a reference catching trajectory without online nonlinear optimization. The trajectory is tracked with compliant control near contact for improved impact absorption and capture stability.

2605.28448 2026-05-28 cs.RO Version update

A Digital Twin Framework for Virtual Visuo-Haptic Teleoperation of Complex-Shaped Optical Microrobots

Zongcai Tan, Lan Wei, Dandan Zhang

Affiliations: Department of Bioengineering, Imperial-X AI Initiative, Imperial College London

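AI summary: Presents a digital twin framework for virtual visuo-haptic teleoperation of complex-shaped optical microrobots, integrating multi-trap optical manipulation, image-based pose estimation, microrobot motion simulation, and model-based haptic rendering. Experiments show that haptic feedback markedly reduces the standard deviations of contact force and position error and raises task success.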

Comments Accepted by 2026 MARSS

Abstract

Optical tweezers (OT) provide piconewton-scale manipulation for delicate biomedical tasks, where visuo-haptic feedback can improve operator awareness by conveying interaction-force cues and trap-stability information. However, visuo-haptic teleoperation frameworks for complex-shaped optical microrobots remain underdeveloped, particularly in multi-trap manipulation scenarios. This paper presents a digital twin framework for virtual visuo-haptic teleoperation of complex-shaped OT-driven microrobots. The framework integrates a digital twin environment, image-based pose and depth estimation, microrobot motion simulation, and model-based haptic rendering within a Robot Operating System (ROS)-connected bimanual teleoperation system. For force modeling, we combine a Multi-Sphere Distributed Manipulation (MSDM) model with optical-force estimation from the Optical Tweezers Toolbox, enabling simulator-driven visuo-haptic feedback. The framework reproduces representative microrobot motion trends and provides haptic force rendering that is numerically consistent with the fitted optical-force model. In simulated cell-delivery tasks, haptic feedback reduced the standard deviations of the contact-force metric and the microrobot-to-trap-center distance metric by 53.2% and 55.2%, respectively, and improved task success from 30% to 80%. These results demonstrate the framework's effectiveness for evaluating visuo-haptic teleoperation strategies for complex-shaped optical microrobots.

2605.28412 2026-05-28 cs.RO cs.LG Version update

Tactile-Proprioceptive Sensor Fusion for Contact Wrench Estimation in Whole-Body Physical Human-Robot Interaction

Junha Min, Junghyeon Ma, Jiwung Kwon, Sunggyu Bae, Joohyung Kim, Kyungseo Park

Affiliations: Department of Robotics and Mechatronics Engineering, DGIST (Daegu Gyeongbuk Institute of Science and Technology); Kinetic Intelligent Machine Lab (KIMLAB), University of Illinois Urbana-Champaign

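AI summary: Proposes a tactile-proprioceptive fusion framework that uses tactile cues from pneumatic skin pads as contact indicators, fuses them with motor-current-based proprioception, and applies a temporal convolutional network to mitigate friction hysteresis, reconstructing multi-axis contact forces and improving sensitivity and responsiveness in physical human-robot interaction.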

Comments 8 pages, 6 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

Direct physical guidance is a natural means of teaching and interacting with robots, and robotic skins make a key contribution by enabling sensitive contact sensing and localization. This paper presents a tactile-proprioceptive sensor fusion framework for natural physical human-robot interaction. Tactile cues from pneumatic skin pads serve as contact indicators that bypass the ambiguity between frictional residues and applied external forces, enabling highly sensitive contact detection without explicit friction identification. We fuse these cues with motor-current-based proprioception to reconstruct multi-axis contact forces on the robot surface. To maintain accuracy during motion, we employ a temporal convolutional network (TCN) to mitigate friction hysteresis during stick-slip transitions, reducing uncertainty at contact onset and yielding smooth, responsive guidance. We validate the approach on a skin-integrated robot arm: (i) multi-axis forces are reconstructed in stationary contacts, and (ii) simultaneous force estimation and kinesthetic teaching are demonstrated. Results indicate improved sensitivity and responsiveness across diverse contact conditions compared with tactile-only and proprioceptive-only baselines, supporting tactile-proprioceptive fusion as a reliable pathway to safe, intuitive physical human-robot interaction.

2605.28372 2026-05-28 cs.LG cs.RO Version update

Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning

Meraj Mammadov, Pedro Zuidberg Dos Martires, Johannes Andreas Stork

Affiliations: Department of Computer Science, Örebro University

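AI summary: Proposes building a shared embedding space with self-supervised contrastive learning to shrink the irreducible imitation gap between teacher and student policies, improving student policy performance.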

Comments 6 pages, 5 figures. Accepted as an oral presentation at the RL4IL Workshop at ICRA 2026

Abstract

Imitation learning (IL) from a state-based reinforcement learning (RL) policy is a common approach to overcome the curse of dimensionality in complex and high-dimensional observation spaces prevalent in robotics. This paper addresses the irreducible imitation gap that emerges when teacher and student are learned in isolation, and the teacher policy has the liberty to rely on privileged state information that the student cannot infer from its observations. Instead of improving poor student performance with RL finetuning after IL, which often requires a whole new training setup, we propose a novel algorithm which learns a shared embedding space that hides agent-specific observations and thus trains imitable teacher policies by construction. We train the shared embedding space with self-supervised contrastive learning in parallel to the teacher policy and prevent it from extracting private information by limiting its gradients from updating the encoder networks. We perform evaluations on several example domains and compare to state-of-the-art baselines showing that our algorithm enables higher student performance with substantially reduced imitation gap.

2605.28362 2026-05-28 cs.RO Version update

Accelerating Robot Path Planning via Connectivity-Preserving Region Proposal Network

Zhanzheng Ma, Cancan Zhao, Shuai Zhang, Bo Ouyang

Affiliations: School of Management, Hefei University of Technology

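AI summary: Proposes the Connectivity-Preserving Region Proposal Network (CP-RPN), a segmentation model that predicts compact, topologically connected candidate regions to compress the search space, and combines a Voronoi diagram with a local A* fallback to achieve low-latency planning with a high success rate.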

Abstract

Mobile robot path planning methods are often constrained by vast search spaces, resulting in latency in sampling-based algorithms. Learning-based approaches frequently suffer from local region fragmentation and global topological inconsistency. To tackle the problem, we present the Connectivity-Preserving Region Proposal Network (CP-RPN), a segmentation-guided model designed to predict compact and topologically connected candidate regions, significantly compressing the search space. Specifically, we design a segmentation model that leverages a Deformable Attention Transformer (DAT) to capture long-range dependencies for global connectivity, with a Deconvolutional decoder to preserve fine-grained spatial details. To guarantee the connectivity of the predicted mask, we design a composite loss function that combines Cross-Entropy loss for pixel-wise supervision, a Connectivity-Aware loss to enhance local coherence, and a Topological Continuity loss based on persistent homology to enforce global connectivity. Building on these high-connectivity corridor-like regions, the Voronoi diagram is used to plan the path, backed by a local A* fallback mechanism to ensure robustness. Experimental results demonstrate that CP-RPN reduces the candidate region size by over 60.13% compared to the MPT baseline and achieves deterministic low-latency planning (avg. 0.11s) with a 99.60% success rate, outperforming traditional sampling-based algorithms in stability.
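
The benefit of searching inside a proposed region, with a fallback when that region fails to connect start and goal, can be sketched on an occupancy grid. Note the substitutions: this sketch uses grid A* for both stages instead of the paper's Voronoi-diagram planner, and it falls back to the whole free space instead of a local repair. The cell sets, function names, and 4-connectivity are assumptions for illustration.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

Cell = Tuple[int, int]


def astar(free: Set[Cell], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """4-connected A* over the given set of traversable cells."""
    if start not in free or goal not in free:
        return None

    def h(c: Cell) -> int:
        # Manhattan distance: admissible and consistent for unit-cost moves.
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from: Dict[Cell, Cell] = {}
    best_cost: Dict[Cell, int] = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if nxt in free and cost + 1 < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = cost + 1
                came_from[nxt] = cell
                heapq.heappush(open_heap, (cost + 1 + h(nxt), cost + 1, nxt))
    return None


def plan(
    free: Set[Cell], region: Set[Cell], start: Cell, goal: Cell
) -> Optional[List[Cell]]:
    """Search the proposed region first; fall back to the full free space."""
    path = astar(free & region, start, goal)
    return path if path is not None else astar(free, start, goal)
```

When the proposed region is both small and connected, the first search touches far fewer cells than a search over the full map, which is the motivation for enforcing connectivity in the predicted mask.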

2605.28352 2026-05-28 cs.RO Version update

Magnet-Based Soft Robotic Skin Using a 3D-Printed Multi-Lattice Structure and CNN-Based Tactile Super-Resolution

Yunseong Bang, Joowon Park, Suan Sim, Youngjun Ryu, Sukho Park, Kyungseo Park

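AI summary: Proposes a magnet-based robotic skin that integrates a multilayer soft lattice, Hall-effect sensor arrays, and a CNN-based tactile super-resolution model. Tunable lattice parameters allow mechanical compliance and transduction characteristics to be adjusted jointly, 3D printing enables rapid fabrication, and contact location and normal force are estimated in real time.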

Comments 6 pages, 9 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. Y. Bang and J. Park contributed equally

Abstract

This paper presents a magnet-based robotic skin that integrates a multilayer soft lattice with distributed Hall-effect sensor arrays and a tactile super-resolution model. External contact forces are converted to magnetic field changes by embedded permanent magnets, and the lattice spreads these changes across the sensing domain. This gives each sensor a large, overlapping receptive field and enables a large sensing area with minimal blind spots. Lattice parameters are tunable, enabling joint adjustment of mechanical compliance and transduction characteristics. An implicit modeling workflow and selective laser sintering (SLS) 3D printing support rapid fabrication of conformal, high-complexity structures. A convolutional neural network trained on experimental measurements estimates contact location and normal force in real time. Experiments validate localization accuracy and indicate scalability to larger surfaces, suggesting applicability to whole-body robotic skin and safe human-robot interaction.

2605.28330 2026-05-28 cs.RO 版本更新

Chance-Constrained MPPI under State and Dynamic Object Prediction Uncertainty and the Evaluation of Collision Risk Calibration

状态与动态物体预测不确定性下的机会约束MPPI及碰撞风险校准评估

Benjamin Serfling, Konrad Doll, Kati Radkhah-Lens

发表机构 * Faculty of Engineering and Informatics, University of Applied Sciences Aschaffenburg(应用科学阿施芬堡大学工程与信息学院)

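AI summary Chance-constrained MPPI control becomes overconfident or overly conservative when upstream uncertainty is poorly calibrated. To address this, the paper proposes the DUCCT-MPPI architecture, which jointly integrates localization uncertainty via the unscented transform and dynamic obstacle prediction uncertainty via Monte Carlo aggregation. In simulation it navigates robustly, with a success rate nearly 28% higher than Monte Carlo MPPI baselines.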

Comments Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

机会约束模型预测路径积分(MPPI)控制越来越多地用于动态环境中的导航,以明确限制碰撞风险。然而,这些概率保证隐含地假设来自定位和感知的上游不确定性是良好校准的。在实践中,估计器常常校准不良,导致特征性的闭环故障模式:过度自信导致系统性安全违规,而信心不足引发过度保守的冻结或概率稀释。为填补这一关键空白,我们的主要贡献是提出一种严格的评估方法,应用适当的评分规则来评估闭环执行期间预测碰撞风险的统计有效性。同时,提出了双不确定性机会约束管MPPI(DUCCT-MPPI)作为一种实时的、风险感知的规划架构。DUCCT-MPPI通过单管无迹变换(UT)近似联合集成定位不确定性,并通过蒙特卡洛聚合集成动态障碍预测不确定性。通过广泛的基于物理的仿真,该框架展示了鲁棒的故障缓解能力,在高度杂乱的环境中无缝过渡到安全、保守的机动,而不陷入功能死锁。在高度杂乱的环境中,DUCCT-MPPI实现了卓越的鲁棒性,导航成功率比已建立的蒙特卡洛MPPI基线高出近28%,同时记录了最低的行驶时间并最小化了诱导的社会力。最终,这些发现表明,自主导航中可靠的概率安全性不仅要求表达性的风险模型,还要求整个自主栈中统计上有效的不确定性估计。

英文摘要

Chance-constrained Model Predictive Path Integral (MPPI) control is increasingly adopted for navigation in dynamic environments to explicitly bound collision risk. However, these probabilistic guarantees implicitly assume that upstream uncertainties from localization and perception are well-calibrated. In practice, estimators are often miscalibrated, inducing characteristic closed-loop failure modes: overconfidence leads to systematic safety violations, while underconfidence triggers overly conservative freezing or probability dilution. To address this critical gap, our primary contribution is a rigorous evaluation methodology applying proper scoring rules to assess the statistical validity of predicted collision risks during closed-loop execution. Concurrently, Dual-Uncertainty Chance-Constrained Tube MPPI (DUCCT-MPPI) is proposed as a real-time, risk-aware planning architecture. DUCCT-MPPI jointly integrates localization uncertainty via a one-tube Unscented Transform (UT) approximation and dynamic obstacle prediction uncertainty via Monte Carlo aggregation. Through extensive physics-based simulations, the framework demonstrates robust failure-mitigation, seamlessly transitioning to safe, conservative maneuvering without succumbing to functional deadlocks in highly cluttered environments. In highly cluttered environments, DUCCT-MPPI achieves superior robustness, outperforming established Monte Carlo MPPI baselines by nearly 28% in navigation success rate, while simultaneously recording the lowest travel times and minimizing induced social forces. Ultimately, these findings establish that reliable probabilistic safety in autonomous navigation dictates not only expressive risk models but statistically valid uncertainty estimates throughout the entire autonomy stack.
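As an illustration of how a proper scoring rule exposes miscalibrated collision risk, the sketch below computes the Brier score, one standard proper scoring rule. The abstract does not say which rules the authors use, so the choice of the Brier score, the function name `brier_score`, and the toy data are all assumptions.

```python
def brier_score(predicted_risks, outcomes):
    """Mean squared difference between predicted probabilities and outcomes.

    predicted_risks: collision probabilities in [0, 1], one per time step.
    outcomes: 1 if a collision (or violation) occurred at that step, else 0.
    Lower is better; the score is minimized in expectation by the true risk.
    """
    if len(predicted_risks) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - y) ** 2 for p, y in zip(predicted_risks, outcomes))
    return sum(squared_errors) / len(outcomes)


# Hypothetical closed-loop log: 2 violations in 10 steps (true rate 0.2).
outcomes = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
calibrated = [0.2] * 10       # matches the empirical violation rate
overconfident = [0.01] * 10   # claims collisions are almost impossible
underconfident = [0.6] * 10   # claims collisions are likely

for name, risks in [("calibrated", calibrated),
                    ("overconfident", overconfident),
                    ("underconfident", underconfident)]:
    print(name, brier_score(risks, outcomes))
```

Both the overconfident and the underconfident estimator score worse than the calibrated one, which mirrors the two failure modes named in the abstract.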

2605.28320 2026-05-28 cs.RO cs.AI 版本更新

Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots

识别工业时间序列中的显式简约分段多项式关系:应用于机械臂

Mazen Alamir, Sacha Clavel

发表机构 * Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France(格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔国立理工学院、GIPSA实验室)

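AI summary Proposes an algorithm that builds explicit piece-wise representations from the set of polynomials appearing in implicit relationships, in order to identify parsimonious explicit piece-wise polynomial relationships in industrial time series. The algorithm is applied to inverse-model identification for manipulator robots, and the experiments compare the generalization capability of the resulting parsimonious models with that of deep neural networks.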

详情
AI中文摘要

本文解决了识别可能涉及大量原始特征的简约显式分段多项式关系的问题。该算法利用最近提出的一种识别算法,该算法产生简约隐式关系,从而能够在异常检测和定位的背景下推导出正常性表征。本文提出的算法更进一步,通过使用隐式表示中涉及的多项式集构建显式分段表示。该框架在识别六轴机械臂逆模型的简约显式表示问题上得到了说明。此外,还展示了在四轴机械臂上的进一步实验,这些实验旨在研究当模型面对未见过的使用场景时,简约模型与最先进的深度神经网络结构相比的泛化能力。

英文摘要

This paper addresses the problem of identifying parsimonious explicit piece-wise polynomial relationships that might involve a relatively large number of raw features. The algorithm leverages a recently proposed identification algorithm that yields parsimonious implicit relationships, from which a normality characterization can be derived in the context of anomaly detection and localization. The algorithm proposed in this paper goes a step further by deriving explicit piece-wise representations that are built using the set of polynomials involved in the implicit representations. The framework is illustrated on the problem of identifying parsimonious explicit representations of the inverse model of a 6-axis manipulator robot. Further experiments on a 4-axis robot investigate the generalization capability of parsimonious models compared to state-of-the-art DNN structures when the models face unseen contexts of use.

2605.28312 2026-05-28 cs.RO cs.CV 版本更新

EventShiftFlow: Towards Hardware-efficient FPGA-based Flow Estimation

EventShiftFlow:面向硬件高效的基于FPGA的流估计

Arianna Alonso Bizzi, Fernando Cladera, C. J. Taylor

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

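AI summary Proposes an event-camera flow estimation method that discretizes events into time bins, builds a 1-bit spatial occupancy grid, and evaluates velocity hypotheses in parallel using only fixed-width integer logic. It needs no frame reconstruction, floating-point arithmetic, or iterative optimization, and targets low-latency robotic perception.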

Comments 10 pages, 5 figures. Accepted to the IEEE ICRA 2026 Workshop on Challenges and Opportunities of Neuromorphic Field Robotics and Automation

详情
AI中文摘要

基于事件的视觉传感器提供异步、高时间分辨率的测量,适用于低延迟机器人感知,但许多基于事件的运动估计方法计算密集且难以映射到FPGA硬件。我们提出一种流式速度估计器,将异步事件离散化为固定持续时间的时间片,构建1位空间占用网格,并并行评估多个速度假设,仅使用固定宽度整数逻辑——移位寄存器、计数器、比较器和小型LUT映射乘法——无除法器且无DSP模块。它不需要帧重建、浮点运算或迭代优化。该方法有意将密集亚像素光流替换为每个活动像素的稀疏量化速度估计,适用于尺寸、重量和功率受限平台上的反应式避障等低延迟任务。在具有已知真实速度的噪声合成数据上,该方法恢复了幅度和方向,其中当不同速度的物体相交时幅度估计最具挑战性。在真实事件相机序列上,所有四个评估运动段的方向准确率达到99.5%,在10-40%的占用密度范围内性能保持稳健。我们表征了算法的密度依赖行为,进行了参数敏感性分析,表明所提出的数据路径需要小于2 kB的存储,并在低成本Xilinx Artix-7上实现了单轴原型。

英文摘要

Event-based vision sensors offer asynchronous, high-temporal-resolution measurements that are attractive for low-latency robotic perception, but many event-based motion estimation methods are computationally intensive and difficult to map to FPGA hardware. We present a streaming velocity estimator that discretizes asynchronous events into fixed-duration time bins, constructs a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using only fixed-width integer logic - shift registers, counters, comparators, and small LUT-mapped multiplies - with no dividers and no DSP blocks. It requires no frame reconstruction, no floating-point arithmetic, and no iterative optimization. The method deliberately trades dense sub-pixel optical flow for a sparse, quantized velocity estimate at each active pixel, suited to low-latency tasks such as reactive obstacle avoidance on size-, weight-, and power-constrained platforms. On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimates being most challenged when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across all four evaluated motion segments, with performance remaining robust across occupancy densities in the 10-40% range. We characterize the algorithm's density-dependent behavior, present a parameter sensitivity analysis, show that the proposed datapath requires less than 2 kB of storage, and implement a single-axis prototype on a low-cost Xilinx Artix-7.
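The hypothesis-testing idea can be sketched in one dimension with integers only. The code below is a simplified illustration under stated assumptions, not the paper's datapath: it scores a whole row rather than each active pixel, it compares only two consecutive time bins, and the names `shift_row` and `best_velocity` are hypothetical. Each occupancy row is a Python integer used as a bit vector, where bit i set means pixel i saw at least one event in that bin.

```python
def shift_row(row, velocity, width):
    """Shift a 1-bit occupancy row by `velocity` pixels, dropping overflow."""
    mask = (1 << width) - 1
    if velocity >= 0:
        return (row << velocity) & mask
    return row >> (-velocity)


def best_velocity(prev_row, curr_row, hypotheses, width):
    """Pick the velocity (pixels per time bin) that best explains the motion.

    For each hypothesis, the previous row is shifted and ANDed with the
    current row; the popcount of the overlap is the match score. Only
    shifts, ANDs, counters, and comparisons are used.
    """
    best_v, best_score = None, -1
    for v in hypotheses:
        overlap = shift_row(prev_row, v, width) & curr_row
        score = bin(overlap).count("1")
        if score > best_score:  # ties keep the earlier hypothesis
            best_v, best_score = v, score
    return best_v, best_score


# Hypothetical example: an edge at pixels 2-3 moves to pixels 4-5.
previous_bin = 0b00001100
current_bin = 0b00110000
print(best_velocity(previous_bin, current_bin, [-2, -1, 0, 1, 2], width=8))
```

In hardware, every hypothesis would be evaluated in parallel rather than in a loop; the loop here only keeps the sketch short.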

2605.28279 2026-05-28 cs.RO 版本更新

IMU Propagation as Preintegration

IMU传播作为预积分

Jianzhu Huai

发表机构 * State Key Lab of Info Engineering in Surveying, Mapping and Remote Sensing(信息工程测绘遥感国家重点实验室) Wuhan University(武汉大学)

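AI summary Shows that IMU preintegration and IMU propagation are computationally equivalent, and presents a convention-agnostic view in which the preintegrated measurement, bias Jacobians, and covariance are obtained by wrapping an existing propagation routine, and vice versa. This simplifies code reuse and supports consistency checks.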

Comments 6 pages, 2 figures, to present in ISPRS2026 Thematic Session 10 on Radar Perception

详情
AI中文摘要

IMU预积分广泛用于基于因子图的视觉-惯性、激光-惯性和雷达-惯性状态估计,但通常被视为与常规IMU传播分离的专门实现。本文表明IMU预积分和传播是同一底层计算的不同实现。我们提出一种与约定无关的视角,其中预积分测量、偏差雅可比和协方差可以通过包装现有的IMU传播例程获得,而预积分模块反过来可以恢复状态转移矩阵和传播协方差。这种视角简化了现有传播代码的复用,支持跨不同误差状态定义的转换,并为预积分实现提供实用的一致性检查。随机IMU序列的实验表明,基于RK4的传播实现与GTSAM的切空间和流形预积分模块在恢复的雅可比、协方差和转移矩阵上高度一致。

英文摘要

IMU preintegration is widely used in factor-graph-based visual-inertial, lidar-inertial, and radar-inertial state estimation, yet it is often treated as a specialized implementation separate from conventional IMU propagation. This note shows that IMU preintegration and propagation are equivalent realizations of the same underlying computation. We present a convention-agnostic view in which the preintegrated measurement, bias Jacobians, and covariance can be obtained by wrapping an existing IMU propagation routine, while a preintegration module can conversely recover state-transition matrices and propagated covariances. This perspective simplifies the reuse of existing propagation code, supports translation across different error-state definitions, and provides practical consistency checks for preintegration implementations. Experiments with random IMU sequences demonstrate close agreement between an RK4-based propagation implementation and GTSAM's tangent and manifold preintegration modules in the recovered Jacobians, covariances, and transition matrices.
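The wrapping idea can be illustrated in a deliberately reduced setting. The sketch below assumes a 1D accelerometer model with no rotation, gravity, bias, or noise, which is far simpler than the SO(3) setting the note addresses; the function names `propagate`, `preintegrate`, and `predict` are hypothetical. Under those assumptions, calling the propagation routine from a zero initial state yields the preintegrated deltas, and combining them with the actual initial state reproduces direct propagation.

```python
def propagate(position, velocity, accels, dt):
    """Conventional propagation: integrate accelerations from a given state."""
    for a in accels:
        position = position + velocity * dt + 0.5 * a * dt * dt
        velocity = velocity + a * dt
    return position, velocity


def preintegrate(accels, dt):
    """Preintegration obtained by wrapping `propagate` with a zero state.

    Returns (delta_p, delta_v), which depend only on the measurements.
    """
    return propagate(0.0, 0.0, accels, dt)


def predict(p_i, v_i, delta_p, delta_v, total_time):
    """Apply preintegrated deltas to an arbitrary initial state."""
    return p_i + v_i * total_time + delta_p, v_i + delta_v


# Hypothetical measurement sequence and initial state.
accels = [0.5, -0.2, 1.0, 0.3]
dt = 0.1
p0, v0 = 1.0, 2.0

direct = propagate(p0, v0, accels, dt)
delta_p, delta_v = preintegrate(accels, dt)
via_preintegration = predict(p0, v0, delta_p, delta_v, dt * len(accels))
print(direct, via_preintegration)
```

Because the deltas do not depend on `p0` or `v0`, they can be reused when an optimizer changes the initial state, which is the practical appeal of preintegration in factor graphs.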

2605.28254 2026-05-28 cs.RO cs.SY eess.SY math.DS 版本更新

Natural Locomotion: Principle and Method

自然运动:原理与方法

Mirado Mortel, Luc Jaulin, Lionel Lapierre, Simon Rohou

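AI summary Formulates natural locomotion as an exchange principle for systems whose motion is mediated by environmental constraints or interactions, constructs the Natural Locomotion Manifold (NLM) with a closed/open construction, and demonstrates the principle on ideal nonholonomic no-slip systems.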

Comments Preprint. 20 pages, 7 figures

详情
AI中文摘要

当机构利用被动动力学、柔顺性和共振而非跟踪预定轨迹时,机器人运动可以变得高效。本文将自然运动表述为一种交换原理,适用于运动由环境约束或相互作用介导的系统。当内部振荡器周期性返回、身体姿态漂移且平均推进-振荡器交换功率(POE功率)在一个周期内为零时,运动是自然的。所选族是自然运动流形(NLM)。我们针对连续理想环境约束发展了该原理的保守实现:约束不做外部功,总机械能守恒,零平均POE功率是与环境介导的推进通道的内部交换,而非外部能量输入。该方法是一种闭/开构造。首先关闭推进通道以揭示有效的内部振荡器,该振荡器由一个有效自由度中的标量作用-角结构或多个自由度中的非线性模态扇区组织。然后重新打开通道,重建姿态,接受的周期必须保持内部递归和零平均POE功率。我们在两个理想非完整无滑移系统上演示了该原理:一个Chaplygin雪橇/摆驱动小车和一个三体扩展。在标量情况下,POE闭合等价于缺失的内部返回条件,从而给出NLM族的定理支持计算。在多自由度情况下,POE闭合仍然是必要的,但必须由模态恒等性、内部返回、动力学一致性、相同的固定被动架构和非零位移来补充。自然运动成为一个设计问题:哪些被动架构支持零个、一个或多个经过认证的NLM族?

英文摘要

Robotic locomotion can become efficient when mechanisms exploit passive dynamics, compliance, and resonance rather than track prescribed trajectories. This paper formulates natural locomotion as an exchange principle for systems whose motion is mediated by environmental constraints or interactions. A motion is natural when an internal oscillator returns periodically, the body pose drifts, and the mean Propulsion-Oscillator Exchange power (POE power) vanishes over one cycle. The selected family is a Natural Locomotion Manifold (NLM). We develop the conservative realization of this principle for continuous ideal environmental constraints: the constraints do no external work, total mechanical energy is conserved, and zero mean POE power is an internal exchange with the environment-mediated propulsive channel, not external energy input. The method is a closed/open construction. The propulsive channel is first closed to reveal an effective internal oscillator, organized by scalar action-angle structure in one effective degree of freedom or by nonlinear modal sectors in several degrees of freedom. The channel is then reopened, pose is reconstructed, and accepted cycles must preserve internal recurrence and zero mean POE power. We demonstrate the principle on two ideal nonholonomic no-slip systems: a Chaplygin-sleigh / pendulum-driven car and a three-body extension. In the scalar case, POE closure is equivalent to the missing internal return condition, giving a theorem-backed computation of the NLM family. In the multi-degree case, POE closure remains necessary but must be completed by modal identity, internal return, dynamics consistency, same fixed passive architecture, and nonzero displacement. Natural locomotion becomes a design question: which passive architectures support no, one, or several certified NLM families?

2605.28237 2026-05-28 cs.RO cs.CV 版本更新

POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

POINav: 在真实世界视觉语言导航中基准测试与增强最终米级到达

Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

发表机构 * Amap CV Lab, Alibaba Group(阿里集团阿里的Amap视觉实验室)

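AI summary Targets the "final-meters" challenge of real-world POI navigation with POINav-Bench, the first benchmark for closed-loop evaluation of this task, and a Brain-Action framework supported by 70K real-world signage-entrance pairs.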

Comments 25 pages, 9 figures

详情
AI中文摘要

真实世界导航本质上由兴趣点(POI)驱动,然而到达精确的POI仍然是一个关键的“最后几米”挑战。现有的POI目标导航的视觉语言导航(VLN)基准通常由于生成的场景而存在粗粒度或显著的模拟到现实差距。为弥合这一差距,我们提出了POINav-Bench,这是第一个专为真实世界POI目标导航闭环评估设计的基准。它包含使用3D高斯泼溅(3DGS)从真实世界捕获重建的11个商业区域,总面积达126,398平方米,涵盖163个不同的POI。通过可通行性感知标注和参考轨迹,POINav-Bench能够在真实、POI丰富的现实环境中对导航智能体进行高保真评估。在此基础上,我们提出了POINav脑-动作框架,其中脑模块执行基于POI的推理以指导动作模块预测用于真实世界执行的连续航点。我们进一步整理了POINav-Dataset,包含70K个真实世界标志-入口对。实验表明,我们的框架为改进真实世界POI目标导航提供了一条可行路径。

英文摘要

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 m² in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework where a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

2605.28231 2026-05-28 cs.RO cs.LG 版本更新

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

ProgVLA:进度感知的机器人操作技能学习

Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

发表机构 * NAVER LABS(NAVER实验室) NAVER LABS Europe(NAVER实验室欧洲)

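AI summary Proposes ProgVLA, a compact vision-language-action model that uses an explicit representation of task progress and a two-stage Perceiver resampling scheme to process long multi-modal sequences under limited compute and memory, matching or exceeding much larger models on multi-task manipulation benchmarks.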

详情
AI中文摘要

我们提出了ProgVLA,一种紧凑的视觉-语言-动作(VLA)模型,专为在严格的计算和内存预算下进行可靠的机器人操作而设计。该模型特别关注通过维护任务进度的显式表示来高效处理长多模态序列。为此,ProgVLA集成了两个关键组件。首先,一个带有两阶段Perceiver重采样方案的多模态编码器将可变长度的视觉、语言和本体感受流压缩为一组固定的控制就绪上下文令牌,在保持跨模态基础的同时大幅减少序列长度。其次,一组辅助的进度头通过离线强化学习(RL)目标进行训练,以联合学习针对归一化剩余水平目标的批评者。这为策略提供了任务进度的内部估计,并实现了优势加权和成功加权的流匹配模仿学习。在两个成熟的多任务机器人操作基准上,一个0.1B参数的ProgVLA模型达到了与显著更大的预训练基线相当的成功率,并且在长时域和更困难的任务层级上超过了它们。消融实验表明,学习到的上下文重采样器和任务自适应视觉微调是最大的单一贡献者,而进度感知训练提供了集中在长时域和多对象任务上的一致额外增益。我们还在真实世界的玩具厨房环境中进一步验证了该方法。

英文摘要

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.

2605.28202 2026-05-28 cs.RO 版本更新

Natural Functional Gradients for Smooth Trajectory Optimization

平滑轨迹优化的自然函数梯度

Kisang Park, Chanwoo Kim, Kyungjae Lee, Sungjoon Choi

发表机构 * Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea(韩国大学人工智能系,首尔,大韩民国) Department of Statistics, Korea University, Seoul, Republic of Korea(韩国大学统计系,首尔,大韩民国)

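AI summary Proposes a trajectory optimization framework based on natural functional gradients. Geometry-aware updates in function space, combined with a Monte Carlo estimator, yield smoother and more feasible motion trajectories when analytic gradients are unavailable.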

详情
AI中文摘要

生成无碰撞且平滑的运动仍然是机器人操作中的一个核心挑战,尤其是在杂乱环境和狭窄通道中,可行区域高度受限且碎片化。我们提出了一种轨迹优化框架,该框架使用自然函数梯度直接在函数空间中进行几何感知更新。该方法优化了一个高斯平滑的替代目标,通过平滑轨迹扰动正则化优化景观,同时保留轨迹级结构。由于更新在函数空间内固有定义,轨迹规则性可以独立于特定时间离散化进行控制。我们推导了自然函数梯度的实用蒙特卡洛估计器,仅需黑盒轨迹评估,使得该方法在由于碰撞检测和接触丰富的仿真导致解析梯度不可用或不可靠时适用。在受限机器人操作任务上的实验表明,与代表性的规划和轨迹优化基线相比,所提出的方法在几何间隙狭窄的环境中提高了轨迹可行性并生成了更平滑的运动。更多结果、视频和实现细节可在项目页面获取:https://kisangpark.github.io/natural-functional-gradient/

英文摘要

Generating collision-free and smooth motions remains a central challenge in robotic manipulation, particularly in cluttered environments and narrow passages where feasible regions are highly constrained and fragmented. We propose a trajectory optimization framework that performs geometry-aware updates directly in function space using natural functional gradients. The method optimizes a Gaussian-smoothed surrogate objective that regularizes the optimization landscape through smooth trajectory perturbations while preserving trajectory-level structure. Because the updates are defined intrinsically in function space, trajectory regularity can be controlled independently of a particular time discretization. We derive a practical Monte-Carlo estimator of the natural functional gradient that requires only black-box trajectory evaluations, making the method applicable when analytic gradients are unavailable or unreliable due to collision checking and contact-rich simulation. Experiments on constrained robotic manipulation tasks demonstrate that the proposed method improves trajectory feasibility and produces smoother motions than representative planning and trajectory optimization baselines in environments with narrow geometric clearances. Additional results, videos, and implementation details are available at the project page: https://kisangpark.github.io/natural-functional-gradient/

2605.28186 2026-05-28 cs.RO cs.AI 版本更新

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

可视化运动策略中的潜在相位结构:基于时间特征扩展的多环境研究

Daisuke Yasui, Toshitaka Matuki, Hiroshi Sato

发表机构 * Mathematics and Computer Science National Defense Academy of Japan(日本防卫大学校数学与计算机科学系)

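AI summary Proposes a framework that reveals clearer and more regular latent motion phase structures in deep reinforcement learning locomotion policies by extending the clustering features to include actions, next states, and next actions, and by introducing a method for choosing the number of clusters that suppresses self-transitions.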

详情
AI中文摘要

深度强化学习(DRL)已被证明在MuJoCo基准测试(如HalfCheetah、Ant和Walker2D)的运动控制任务中表现出高性能。然而,可视化由深度神经网络实现的训练策略函数内部获得的运动结构仍然具有挑战性。从生物力学及相关领域可知,运动控制是通过重复运动相位(如站立相和摆动相)实现的。在本研究中,我们提出一个框架,用于从运动控制策略通过与环境交互生成的轨迹中揭示潜在的相位结构。所提出的方法将聚类特征从仅状态观测扩展到包括动作、下一状态和下一动作的增强特征,并引入一种抑制自转移的聚类数确定方法。将所提出的方法应用于三个环境——Ant-v5、HalfCheetah-v5和Walker2D-v5,我们成功识别出比现有方法具有更清晰和更规则转换规则的相位结构。

英文摘要

Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments -- Ant-v5, HalfCheetah-v5, and Walker2D-v5 -- we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
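A minimal sketch of the two ingredients named in the abstract follows: building augmented features from a trajectory, and measuring how often consecutive time steps stay in the same cluster. The abstract does not specify the clustering algorithm or the exact criterion for choosing the number of clusters, so the helper names `augment` and `self_transition_rate`, and the idea of comparing this rate across candidate cluster counts, are assumptions.

```python
def augment(states, actions):
    """Build (s_t, a_t, s_{t+1}, a_{t+1}) feature vectors for t = 0..T-2.

    states and actions are equal-length lists of per-step vectors (lists).
    The last step has no successor, so T steps yield T-1 feature vectors.
    """
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    return [
        list(states[t]) + list(actions[t])
        + list(states[t + 1]) + list(actions[t + 1])
        for t in range(len(states) - 1)
    ]


def self_transition_rate(labels):
    """Fraction of consecutive label pairs that stay in the same cluster."""
    if len(labels) < 2:
        return 0.0
    same = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return same / (len(labels) - 1)


# Hypothetical tiny trajectory: 2-D states, 1-D actions, 3 time steps.
states = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
actions = [[0.1], [0.2], [0.3]]
features = augment(states, actions)

# Hypothetical cluster labels assigned to a longer trajectory.
labels = [0, 0, 1, 1, 1, 2]
print(features[0], self_transition_rate(labels))
```

The feature vectors would be passed to any clustering method; a candidate cluster count whose labels produce many self-transitions describes dwell time rather than phase changes, which is presumably why the authors suppress them.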

2605.28172 2026-05-28 cs.RO 版本更新

Provably Guaranteed Polytopic Uncertainty Quantification for SLAM

具有可证明保证的多面体不确定性量化用于SLAM

Guangyang Zeng, Yulong Gao, Yuan Shen, Lingpeng Chen, Haoying Li, Guodong Shi, Junfeng Wu

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen(数据科学学院,香港中文大学(深圳)) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(人工智能学院,香港中文大学(深圳)) Department of Electrical and Electronic Engineering, Imperial College London(电子与电气工程系,帝国理工学院伦敦分校) School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney(航空航天、机械与机电工程学院,悉尼大学)

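AI summary Proposes uncertainty quantification algorithms based on polytopic set representations. Three modules (forward UQ for mapping, backward UQ for pose tracking, and pose compound) provide provable deterministic guarantees for 3D-3D landmark-based SLAM, and conformal prediction is incorporated to improve practical usability.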

Comments 16 pages, 10 figures; accepted by Robotics: Science and Systems 2026

详情
AI中文摘要

在安全关键的机器人应用中,感知中保证且实用的不确定性量化至关重要。许多现有工作要么没有提供正式包含保证,要么依赖限制性建模假设,要么只关注位姿估计而非完整的SLAM流水线。本文提出了用于基于3D-3D路标的SLAM的可证明保证的不确定性量化算法。该算法由三个基本的不确定性量化模块组成:用于建图的前向不确定性量化、用于位姿跟踪的后向不确定性量化以及位姿复合。每个模块生成一个认证的不确定性集;当输入不确定性边界是确定性的时,输出集继承确定性保证,即它们可证明地包含真实位姿和路标。具体来说,我们使用多面体表示不确定性集,从而实现易处理的计算和对位姿不确定性的统一处理。为了提高算法的实际可用性,我们结合了共形预测,从数据中以规定概率校准测量不确定性。仿真和实验表明,所提出的算法既提供了强大的理论保证,又具有实际可用性。代码开源在 https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification。

英文摘要

In safety-critical robotics applications, guaranteed and practical uncertainty quantification (UQ) in perception is vital. Many existing works either offer no formal containment guarantee, rely on restrictive modeling assumptions, or focus only on pose estimation rather than a complete SLAM pipeline. This paper presents provably guaranteed UQ algorithms for 3D-3D landmark-based SLAM. The algorithms consist of three basic UQ modules: forward UQ for mapping, backward UQ for pose tracking, and pose compound. Each module produces a certified uncertainty set; when the input uncertainty bounds are deterministic, the output sets inherit deterministic guarantees, i.e., they provably contain the true poses and landmarks. Specifically, we use polytopes to represent uncertainty sets, enabling tractable computations and a unified treatment of pose uncertainty. To enhance the algorithms' practical usability, we incorporate conformal prediction to calibrate measurement uncertainty from data with a prescribed probability. Simulations and experiments demonstrate that the proposed algorithms provide both strong theoretical guarantees and practical usability. The code is open-sourced at https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification.
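To show how conformal prediction can turn data into a measurement bound with a prescribed probability, the sketch below implements the standard split-conformal quantile on scalar residuals. The abstract does not describe the authors' exact calibration procedure, so the use of scalar residual norms, the function name `conformal_bound`, and the sample values are assumptions.

```python
import math


def conformal_bound(scores, alpha):
    """Split-conformal bound on a new score at miscoverage level alpha.

    scores: nonconformity scores (here, measurement residual norms) from a
    held-out calibration set. Returns the ceil((n + 1) * (1 - alpha))-th
    smallest score. If calibration and test scores are exchangeable, a new
    score falls below this bound with probability at least 1 - alpha.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        raise ValueError("too few calibration samples for this alpha")
    return sorted(scores)[rank - 1]


# Hypothetical calibration residuals in millimetres (20 samples).
residuals = [3, 7, 1, 9, 4, 12, 5, 8, 2, 6,
             11, 10, 15, 13, 14, 18, 16, 20, 17, 19]
bound = conformal_bound(residuals, alpha=0.1)
print(bound)
```

A bound of this kind could serve as the measurement uncertainty radius fed to the set-based modules; the guarantee is then probabilistic rather than deterministic, as the abstract distinguishes.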

2605.28154 2026-05-28 cs.HC cs.RO 版本更新

Robo-Blocks: Generative Scaffolding in End-User Design and Programming of Social Robots

Robo-Blocks:社交机器人终端用户设计与编程中的生成式支架

Arissa J. Sato, Callie Y. Kim, Nathan Thomas White, Abhinav Maneesh, Yuqing Wang, Hui-Ru Ho, Bilge Mutlu

发表机构 * Department of Computer Sciences, University of Wisconsin-Madison

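AI summary Through a Research through Design (RtD) process, presents Robo-Blocks, an LLM-based block-based programming environment that uses generative scaffolding to turn high-level ideas into executable robot behaviors, supports novice programmers, and reveals user personas and usage patterns.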

详情
AI中文摘要

由于需要规划、交互设计和编程方面的专业知识,编程社交机器人对新手机器人程序员来说具有挑战性。虽然大型语言模型(LLM)通过从自然语言描述生成代码具有巨大潜力,但它们可能掩盖编程的关键元素并取代设计者的意图,最终导致过度依赖而非发展编程技能。在本文中,我们通过研究通过设计(RtD)过程,探索基于LLM的社交机器人编程工具如何支持新手机器人程序员。我们设计并原型化了Robo-Blocks,这是一个基于积木的编程环境,利用LLM通过结构化叙述为新手机器人程序员提供生成式支架,将高级想法连接到可执行的机器人行为。通过与新手的部署,我们发现了生成式支架的新兴用户角色和使用模式,并展示了这种支架如何塑造终端用户的设计和编程策略。我们提出了有效使用生成式支架及其融入社交机器人编程实践的设计见解。

英文摘要

Programming social robots is challenging for novice robot programmers due to required expertise in planning, interaction design, and programming. While large language models (LLMs) hold significant promise through code generation from natural-language descriptions, they can obscure critical elements of programming and supplant designer intent, eventually resulting in over-reliance instead of developing programming skills. In this paper, we explore how LLM-based social-robot-programming tools can support novice robot programmers through a Research through Design (RtD) process. We designed and prototyped Robo-Blocks, a block-based programming environment that leverages LLMs to offer novice robot programmers generative scaffolding through structured narratives that connect high-level ideas to executable robot behaviors. Through deployment with novices, we discovered emerging user personas and usage patterns for generative scaffolding and showed how this scaffolding shapes end-user design and programming strategies. We present design insights for the effective use of generative scaffolding and its integration into the practice of social-robot programming.

2605.25770 2026-05-28 cs.RO 版本更新

Implicit Null-space Manifold Generation for Redundant Robotic Systems

冗余机器人系统的隐式零空间流形生成

Taiki Ishigaki, Teresa Vidal-Calleja, Ko Ayusawa, Eiichi Yoshida

发表机构 * Tokyo University of Science, Japan(日本东京科学大学) University of Technology Sydney, Australia(澳大利亚悉尼技术大学) National Institute of Advanced Industrial Science and Technology, Japan(日本国家先进工业科学与技术研究院)

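AI summary For redundant robotic systems, proposes an implicit scalar field method built on Jacobian-guided exploration. The zero-level set of the field represents the solution manifold, which allows the geometry of the solution space to be estimated and families of continuously varying tasks to be modeled.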

Comments Corrected author names in references

详情
AI中文摘要

具有冗余自由度的机器人系统可以通过多种配置实现相同的任务结果,从而形成配置空间中的解流形。现有方法通常通过基于雅可比的技术局部利用这种冗余性来计算单个解或轨迹。虽然这些方法在求解计算上有效,但它们不保留解集本身的几何结构表示。在这项工作中,我们采用以表示为中心的方法来估计解空间的几何结构。我们考虑由通用任务定义映射诱导的解流形,并在配置空间上构建一个隐式标量场,其零水平集对应于解流形。为此,我们使用雅可比引导的探索策略在解流形附近生成样本,该策略有效捕获其局部和全局结构。得到的隐式表示定义在配置空间上,并自然诱导出一个连续的距离场,编码到解流形的接近度。在平面三连杆机器人和七自由度Franka机械臂上的实验证明了所提出表示的有效性。此外,该框架能够对具有连续变化的任务族进行解空间的一致建模。

英文摘要

Robotic systems with redundant degrees of freedom can achieve the same task outcome using multiple configurations, resulting in solution sets that form manifolds in the configuration space. Existing approaches typically exploit such redundancy locally through Jacobian-based techniques to compute individual solutions or trajectories. While effective for solution computation, these methods do not retain a representation of the geometry of the solution set itself. In this work, we adopt a representation-centric approach to estimate the geometric structure of the solution space. We consider solution manifolds induced by general task-defining maps and construct an implicit scalar field over the configuration space, whose zero-level set corresponds to the solution manifold. To this end, we generate samples in the neighborhood of the solution manifold using a Jacobian-guided exploration strategy, which efficiently captures its local and global structure. The resulting implicit representation is defined over the configuration space and naturally induces a continuous distance field that encodes proximity to the solution manifold. Experiments on a planar three-link robot and a seven-degree-of-freedom Franka manipulator demonstrate the effectiveness of the proposed representation. Furthermore, the framework enables consistent modeling of solution spaces across families of tasks with continuous variation.
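For the planar three-link robot with a 2D end-effector position task, the solution manifold is locally a curve whose tangent is the null space of the task Jacobian. The sketch below takes one small step along that null direction and shows that the end effector barely moves. It is only an illustration of Jacobian-based null-space motion, not the paper's exploration strategy; unit link lengths, the sample configuration, and all function names are assumptions.

```python
import math

LINKS = (1.0, 1.0, 1.0)  # assumed link lengths


def forward_kinematics(q):
    """End-effector position (x, y) of a planar 3-link arm."""
    angle, x, y = 0.0, 0.0, 0.0
    for length, joint in zip(LINKS, q):
        angle += joint
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def jacobian(q):
    """Rows (dx/dq, dy/dq) of the 2x3 position Jacobian."""
    angles, total = [], 0.0
    for joint in q:
        total += joint
        angles.append(total)
    # Joint j rotates every link from j to the tip.
    row_x = [-sum(LINKS[k] * math.sin(angles[k]) for k in range(j, 3))
             for j in range(3)]
    row_y = [sum(LINKS[k] * math.cos(angles[k]) for k in range(j, 3))
             for j in range(3)]
    return row_x, row_y


def null_direction(q):
    """Unit vector n with J(q) n = 0: the cross product of the two rows.

    Undefined (division by zero) at singular configurations.
    """
    a, b = jacobian(q)
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])
    norm = math.sqrt(sum(c * c for c in n))
    return tuple(c / norm for c in n)


def step(q, direction, size):
    return tuple(qi + size * di for qi, di in zip(q, direction))


def displacement(q_a, q_b):
    (xa, ya), (xb, yb) = forward_kinematics(q_a), forward_kinematics(q_b)
    return math.hypot(xb - xa, yb - ya)


q = (0.3, 0.5, -0.4)
null_move = displacement(q, step(q, null_direction(q), 1e-4))
joint_move = displacement(q, step(q, (1.0, 0.0, 0.0), 1e-4))
print(null_move, joint_move)
```

The null-space step changes the end-effector position only at second order in the step size, whereas moving the first joint alone changes it at first order. Repeating such steps, with a correction back onto the task constraint, traces out samples of the solution manifold.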

2605.25010 2026-05-28 cs.RO cs.AI 版本更新

Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

经典与神经采样算法在机器人导航中的性能比较

Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui

发表机构 * dept. of Economics Oran2 Mohamed BenAhmed University(经济系奥兰2莫哈梅德·本·阿赫迈德大学)

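AI summary Compares RRT*, Neural RRT*, and Neural Informed RRT* in environments with convex and concave obstacles, and finds that neural-guided planners produce shorter (up to 14%) and smoother (55-75%) paths, with Neural Informed RRT* achieving the best overall performance.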

详情
Journal ref
Presented at the 3rd Edition of the National Conference on Applications of Artificial Intelligence (A2I'26), 2026
AI中文摘要

将人工智能(AI)集成到基于采样的运动规划中为提高自主导航效率提供了新的可能性。本文在包含不同障碍物密度的凸凹障碍物环境中实现并评估了三种算法,即RRT*、Neural RRT*和Neural Informed RRT*。结果表明,与传统RRT*算法相比,神经引导规划器提高了路径质量,生成了最多短14%的路径和55-75%更平滑的轨迹。在评估的方法中,Neural Informed RRT*在路径长度和轨迹平滑度方面实现了最佳整体性能。这些结果证明了AI引导采样策略在提高机器人和无人机导航的可靠性和轨迹效率方面的有效性,尽管计算时间略有增加。总体而言,该研究凸显了人工智能在实时机器人路径规划应用中日益增长的重要性。

英文摘要

Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation efficiency. In this paper, three algorithms, namely RRT*, Neural RRT*, and Neural Informed RRT*, are implemented and evaluated on environments containing convex and concave obstacles with different obstacle densities. The obtained results indicate that neural-guided planners improve path quality, producing up to 14% shorter paths and 55-75% smoother trajectories compared with the conventional RRT* algorithm. Among the evaluated methods, Neural Informed RRT* achieves the best overall performance in terms of path length and trajectory smoothness. These results demonstrate the effectiveness of AI-guided sampling strategies for improving reliability and trajectory efficiency in robotic and UAV navigation, despite a slight increase in computation time. Overall, the study highlights the growing importance of artificial intelligence in real-time robotic path planning applications.

2605.28136 2026-05-28 cs.CV cs.RO 版本更新

SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

SAM增强的道路数据集分割:自动驾驶中关键类别的平衡

Toomas Tahves, Mauro Bellone, Junyi Gu, Raivo Sell

发表机构 * Department of Mechanical and Industrial Engineering, Tallinn University of Technology(塔林技术大学机械与工业工程系) FinEst Centre for Smart Cities, Tallinn University of Technology(塔林技术大学智能城市研究中心) Department of Computer Science and Engineering, Universitas Mercatorum(默卡托姆大学计算机科学与工程系) Department of Computer Science and Engineering, Chalmers University of Technology(挑战者技术大学计算机科学与工程系) University of Gothenburg(哥德堡大学)

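AI summary Proposes a SAM-based annotation pipeline that converts the bounding boxes of the ZOD dataset into dense pixel-level semantic masks, evaluates different architectures under class imbalance, and shows effective transfer across sensor configurations via bidirectional transfer learning.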

详情
AI中文摘要

密集语义分割对于自动驾驶至关重要,然而许多多模态数据集缺乏像素级标注。Zenseact开放数据集(ZOD)提供丰富的多传感器数据,但仅有边界框标签,限制了其在分割研究中的应用。我们的主要贡献是一个基于Segment Anything Model(SAM)的标注流水线,通过将边界框转换为语义掩码,为ZOD生成密集的像素级标注。在这项初步研究中,我们处理了超过10万帧,并手动筛选出一个2300帧的子集(接受率36%),以建立可靠的基线。利用这些标注,我们评估了基于Transformer的CLFT和基于CNN的DeepLabV3+架构在不同天气条件下的性能,其中CLFT-Hybrid达到了48.1%的mIoU。为了解决极端类别不平衡问题(行人、骑行者、标志牌像素占比不足1%),我们探索了针对稀有类别的专门模型。我们还在Iseauto自动驾驶平台上验证了该流水线,达到了77.5%的mIoU,并展示了通过双向迁移学习,SAM导出的表示能够有效地跨传感器配置迁移。所有代码和标注均已发布,以支持可重复研究。

英文摘要

Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.

2605.28110 2026-05-28 cs.RO 版本更新

STR Robot: Design of an Autonomous Mobile Robot from Simulation to Reality

STR机器人:从仿真到现实的自主移动机器人设计

Vinh Nguyen, Gia-Uy Le, Tien-Dat Nguyen, Tri-Tin Nguyen, Vinh-Hao Nguyen

发表机构 * Faculty of Electrical and Electronic Engineering, Ho Chi Minh City University of Technology, VNU-HCM(电子工程学院,胡志明市技术大学,VNU-HCM)

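AI summary Presents a simulation-to-real implementation of an autonomous mobile robot built on an existing mechanical platform, focusing on the onboard control, self-localization, and autonomous navigation systems, and validates its feasibility in simulation and experiments.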

详情
AI中文摘要

随着仿真工具的快速发展,自主机器人系统在实际部署前的开发和验证变得更加高效。本文介绍了一种基于现有机械平台的自主移动机器人的仿真到现实实现。我们的工作不关注机械设计,而是集中于机载控制、自定位和自主导航系统的开发。所提出的机器人配备了机载感知和计算能力,以估计其姿态并在环境中自主导航。整个框架首先在仿真中开发和测试,然后部署在真实机器人上进行实验评估。结果证明了所提出方法的可行性,并表明仿真为开发可靠的自主移动机器人系统提供了有效基础。源代码将在 https://ntdathp.github.io/outdoor-robot-web 发布。

English abstract

Rapid progress in simulation tools has made it more efficient to develop and validate autonomous robotic systems before real-world deployment. This paper presents a simulation-to-real implementation of an autonomous mobile robot based on an existing mechanical platform. Instead of focusing on mechanical design, our work concentrates on the development of the onboard control, self-localization, and autonomous navigation systems. The proposed robot is equipped with onboard sensing and computation to estimate its pose and navigate autonomously in the environment. The overall framework is first developed and tested in simulation, and then deployed on the real robot for experimental evaluation. The results demonstrate the feasibility of the proposed approach and show that simulation provides an effective foundation for developing reliable autonomous mobile robot systems. The source code will be released at https://ntdathp.github.io/outdoor-robot-web.

2605.28097 2026-05-28 cs.RO Version update

ICAN-Deploy: Identity-Stable Canary Deployment for Safety-Critical Embodied Agents

ICAN-Deploy:面向安全关键具身智能体的身份稳定金丝雀部署

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

Affiliations * Harbin Institute of Technology; Heriot-Watt University, Malaysia Campus; Fraunhofer Institute for Applied Information Technology; Soochow University

AI summary Proposes ICAN-Deploy, a middleware that separates capability names from capability versions so that the identity hash stays invariant during canary deployment of safety-critical embodied agents, avoiding re-certification.

Comments 14 pages, 6 figures, 4 tables

Details
AI Chinese abstract

金丝雀部署将一小部分流量路由到新软件版本,监控指标,并在出现回归时回滚。主流控制器(Argo Rollouts、Spinnaker、Flagger)在金丝雀窗口期间会改变部署系统的加密身份。这种漂移对于无状态微服务是无害的,但对于安全关键的具身智能体,它打破了“你认证的智能体仍然是你拥有的智能体”这一声明,迫使每次金丝雀部署都要重新认证。我们提出了ICAN-Deploy(身份稳定的金丝雀部署),这是一种中间件构造,其状态机通过分离能力名称(冻结、哈希化)和能力版本(可变运行时状态),在金丝雀窗口期间保持身份哈希不变。我们在LLM驱动的机器人的运行时治理层中实现了ICAN-Deploy,并通过封闭式证明、AST lint和TLA+模型检查验证了不变性,然后在MuJoCo中的Franka Panda手臂上通过N=100个真实金丝雀周期进行了验证(零漂移;入口延迟95% BCa CI [1.52, 2.01] ms)。一个将版本折叠到清单中的功能标志稻草人在相同工作负载下失败。在身份创建时一次性认证的系统,可以在同一认证下,在版本和名称范围内,交付任意能力演化。

English abstract

Canary deployment routes a fraction of traffic to a new software version, monitors metrics, and rolls back on regression. Mainstream controllers (Argo Rollouts, Spinnaker, Flagger) change the deployed system's cryptographic identity during the canary window. The drift is harmless for stateless microservices but breaks the claim that "the agent you certified is still the agent you have" for safety-critical embodied agents, forcing re-certification per canary. We present ICAN-Deploy (Identity-stable CANary Deployment), a middleware construction whose state machine holds the identity hash invariant across the canary window by separating capability names (frozen, hashed) from capability versions (mutable runtime state). We implement ICAN-Deploy inside a runtime governance layer for LLM-driven robots and verify invariance by closed-form proof, AST lint, and TLA+ model-checking, then corroborate over N=100 real canary cycles on a Franka Panda arm in MuJoCo (zero drift; entry latency 95% BCa CI [1.52, 2.01] ms). A feature-flagged strawman that folds versions into the manifest falsifies on the same workload. A system certified once at identity-creation time can then ship arbitrary capability evolution under that same certification, within the version-and-name envelope.
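The core idea, hashing frozen capability names while keeping versions as mutable runtime state, can be illustrated in a few lines. This is only a sketch under assumed names: `identity_hash`, `strawman_hash`, the manifest layout, and the use of SHA-256 over sorted JSON are hypothetical and not taken from the ICAN-Deploy implementation.

```python
import hashlib
import json


def identity_hash(capability_names):
    """Hash only the frozen capability names (order-independent)."""
    payload = json.dumps(sorted(capability_names)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def strawman_hash(capabilities):
    """Strawman: folds versions into the hashed manifest."""
    payload = json.dumps(sorted(capabilities.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical capability manifests before and during a canary window.
before = {"grasp": "1.0.0", "navigate": "2.3.1"}
during = {"grasp": "1.1.0-canary", "navigate": "2.3.1"}

print(identity_hash(before) == identity_hash(during))   # names unchanged
print(strawman_hash(before) == strawman_hash(during))   # version changed
```

In this toy setup the name-only hash is unaffected by the canary version bump, while the strawman hash drifts as soon as any version string changes.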

2605.28092 2026-05-28 cs.RO Version update

An Operator-Based Approach to STL

一种基于算子的STL方法

Panagiotis Rousseas, Dimos V. Dimarogonas

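Affiliations * Department of Decision and Control Systems, School of Electrical Engineering and Computer Science, Royal Institute of Technology (KTH)

AI summary Proposes a new STL framework based on an operator acting on reachability value functions, which handles complex multi-nested formulae by developing operator-based nesting rules directly and enables online control synthesis.

Details
AI Chinese abstract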

信号时序逻辑(STL)因其在自主规划和控制中的丰富表达能力而近年来得到广泛发展。然而,现有的验证和控制综合方法在公式的复杂性和嵌套程度方面存在局限性。在这项工作中,我们提出了一种基于作用于可达性值函数的算子的STL新方法。这构成了一个处理复杂多嵌套公式的新理论框架,同时为在线控制综合提供了工具。与专注于设计基于STL的可达性(或控制障碍)函数不同,我们直接开发基于算子的嵌套规则。我们的方法的表达能力在理论上得到了证明,从中提取了STL公式满足的充要条件,并在复杂片段的仿真中得到了验证。

English abstract

Signal Temporal Logic (STL) has recently seen extensive development, owing to its rich expressiveness for autonomous planning and control. Nevertheless, existing verification and control synthesis methods are limited with respect to the complexity and degree of nesting of the formulae. In this work, we propose a novel approach to STL based on an operator acting on reachability value functions. This constitutes a new theoretical framework for handling complex multi-nested formulae while at the same time providing tools for on-line control synthesis. Rather than focusing on the design of STL-based reachability (or control barrier) functions, we develop operator-based nesting rules directly. Our method's expressiveness is demonstrated both theoretically, where necessary and sufficient conditions for STL formula satisfaction are extracted, and in simulations with complex fragments.
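For readers unfamiliar with nested STL formulae, the sketch below evaluates the textbook discrete-time robustness of a nested formula, "always within [0, 2], eventually within [0, 1], x > 0". It is not the operator on reachability value functions proposed in the paper; it only shows what nesting means. The function names and the truncation of windows at the end of the signal are assumptions made for illustration.

```python
def predicate(signal, threshold):
    """Robustness of `x > threshold` at every time step."""
    return [x - threshold for x in signal]


def eventually(rho, a, b):
    """F_[a,b]: max of rho over steps t+a..t+b (window truncated at signal end)."""
    n = len(rho)
    return [max(rho[t + a:min(t + b, n - 1) + 1]) for t in range(n - a)]


def always(rho, a, b):
    """G_[a,b]: min of rho over steps t+a..t+b (window truncated at signal end)."""
    n = len(rho)
    return [min(rho[t + a:min(t + b, n - 1) + 1]) for t in range(n - a)]


# Hypothetical sampled signal.
signal = [-1.0, 2.0, -0.5, 1.0, -2.0]
inner = eventually(predicate(signal, 0.0), 0, 1)
nested = always(inner, 0, 2)
print(nested[0] > 0)  # positive robustness at t = 0 means the formula holds there
```

Each temporal operator maps one robustness trace to another, so nesting is function composition; the cost of evaluating deeper formulae grows with every added layer.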

2605.28087 2026-05-28 cs.RO Version update

Whose Is This?: Context-Aware Object Ownership Inference with Uncertainty-Guided Questioning

这是谁的?:基于不确定性引导提问的上下文感知物体所有权推断

Saki Hashimoto, Akira Taniguchi, Shoichi Hasegawa, Yoshinobu Hagiwara, Tadahiro Taniguchi

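Affiliations * Kyutech

AI summary Proposes COIN, a context-aware ownership inference framework that combines a large language model with conformal prediction and uses uncertainty-guided interactive questioning to estimate object ownership with high accuracy in a simulated home environment.

Comments Under review in Advanced Robotics. Project page is https://emergentsystemlabstudent.github.io/COIN/

Details
AI Chinese abstract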

服务机器人必须推断物体所有权才能正确解释诸如“把我的杯子拿来”之类的指令。然而,所有权是一个无法直接观察的潜在属性,现有方法通常依赖有限线索(如近期使用),在临时共享等场景中不可靠。我们提出一种具有不确定性引导交互的上下文感知所有权推断框架(COIN)。该方法使用大语言模型(LLM)整合用户背景信息和物体使用历史来估计所有权分数。为处理不确定性,我们应用共形预测构建一组可能的拥有者,并在预测不确定时选择性生成用户查询。在模拟家庭环境中的实验表明,所提方法始终优于基线方法,子集准确率达到0.988,平均Jaccard指数达到0.991。该方法在临时使用和共享所有权场景中也保持高性能。结果表明,结合上下文推理与不确定性感知交互提高了估计准确性和鲁棒性。项目页面见https://emergentsystemlabstudent.github.io/COIN/。

English abstract

Service robots must infer object ownership to correctly interpret instructions such as "bring me my cup." However, ownership is a latent attribute that cannot be directly observed, and existing methods often rely on limited cues such as recent usage, making them unreliable in scenarios such as temporary sharing. We propose a framework for context-aware ownership inference with uncertainty-guided interaction (COIN). The method integrates user background information and object usage history using a large language model (LLM) to estimate ownership scores. To handle uncertainty, we apply conformal prediction to construct a set of plausible owners and selectively generate user queries when the prediction is uncertain. Experiments in a simulated home environment show that the proposed method consistently outperforms baseline approaches, achieving a Subset Accuracy of 0.988 and a Mean Jaccard index of 0.991. The method also maintains high performance in scenarios involving temporary use and shared ownership. The results demonstrate that combining contextual reasoning with uncertainty-aware interaction improves both estimation accuracy and robustness. The project page is available at https://emergentsystemlabstudent.github.io/COIN/.
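The uncertainty-guided step can be sketched with plain split conformal prediction: calibrate a threshold on held-out nonconformity scores, build the set of plausible owners, and ask the user only when that set is not a single person. Everything concrete here is an assumption for illustration (the score values, the `1 - score` nonconformity measure, the single-owner simplification, and the rule "ask unless the set has exactly one member"); the paper's actual scoring and query policy may differ.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def prediction_set(owner_scores, threshold):
    """Owners whose nonconformity (1 - ownership score) is within the threshold."""
    return {o for o, s in owner_scores.items() if 1.0 - s <= threshold}


def needs_question(pred_set):
    """Hypothetical rule: query the user unless exactly one owner is plausible."""
    return len(pred_set) != 1


# Hypothetical calibration scores (1 - score assigned to the true owner).
calibration = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60]
tau = conformal_threshold(calibration, alpha=0.25)

confident = prediction_set({"Alice": 0.9, "Bob": 0.1}, tau)
ambiguous = prediction_set({"Alice": 0.65, "Bob": 0.62, "Carol": 0.2}, tau)
print(confident, needs_question(confident))
print(sorted(ambiguous), needs_question(ambiguous))
```

Querying only when the set is ambiguous keeps the number of questions low in clear cases while still covering temporary-use situations where two people look equally plausible.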

2605.28048 2026-05-28 cs.RO Version update

SAFEVPR: Patch-Based Conformal Verification for Safe Cross-Condition Sequence Visual Place Recognition

SAFEVPR: 基于补丁的共形验证用于安全跨条件序列视觉地点识别

Ha Sier, Jiaqiang Zhang, Zhuo Zou, Xianjia Yu, Tomi Westerlund

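Affiliations * Turku Intelligent Embedded and Robotic Systems (TIERS) Lab; University of Turku; School of Information Science and Technology

AI summary Proposes SAFEVPR, a training-free verification-and-calibration pipeline that uses mutual-nearest-neighbour patch-matching scores and Mondrian conformal LTT calibration to target finite-sample FDR control for sequence VPR under cross-condition deployment; experiments show empirical validity in all 23 cross-condition setups.

Details
AI Chinese abstract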

基于序列的视觉地点识别(VPR)用于SLAM和机器人重定位必须决定检索到的top-1候选是否安全可接受。共形预测是这种接受/拒绝决策的自然框架,但其有限样本保证依赖于校准数据和部署(测试)数据之间的可交换性,这在跨条件部署下被违反。我们引入了SAFEVPR,一种无需训练的验证与校准流程,用于安全的跨条件序列VPR。SAFEVPR将标准的骨干余弦相似度替换为从冻结的DINOv2 ViT特征计算出的互近邻(MNN)补丁匹配分数,并将平坦的Learn-Then-Test校准替换为Mondrian共形LTT,为不同分数区间拟合独立的Bonferroni校正阈值。在可交换性下,这些阈值将提供有限样本的假发现率(FDR)控制;在条件偏移下,我们评估每个部署的经验有效性。在来自Oxford RobotCar、NCLT和St Lucia数据集的23个跨条件设置中,使用三个冻结的VPR骨干,SAFEVPR在目标FDR alpha=0.10下,在23/23的设置中经验有效,平均接受FDR为0.014,平均真阳性率(TPR)为0.75。结果表明,仅凭原始区分度不足以实现共形有效性:AnyLoc-VLAD和Super-Point+LightGlue达到了可比的ROC曲线下面积(AUROC),但在相同校准下失败的设置更多。在无纹理重复场景中,SAFEVPR安全地弃权,而不是接受不可靠的匹配。代码可在https://github.com/Hasar12139/SafeVPR获取。

English abstract

Sequence-based visual place recognition (VPR) for SLAM and robot relocalization must decide whether the retrieved top-1 candidate is safe to accept. Conformal prediction is a natural framework for this accept/reject decision, but its finite-sample guarantees rely on exchangeability between calibration and deployment (test) data, which is violated under cross-condition deployment. We introduce SAFEVPR, a non-trainable verification-and-calibration pipeline for safe cross-condition sequence VPR. SAFEVPR replaces the standard backbone cosine similarity with a mutual-nearest-neighbour (MNN) patch-matching score computed from frozen DINOv2 ViT features, and replaces flat Learn-Then-Test calibration with Mondrian conformal LTT, fitting separate Bonferroni-corrected thresholds across score bins. Under exchangeability, these thresholds would provide finite-sample false-discovery-rate (FDR) control; under condition shift, we evaluate empirical validity per deployment. Across 23 cross-condition setups from Oxford RobotCar, NCLT, and St Lucia datasets, using three frozen VPR backbones, SAFEVPR is empirically valid on 23/23 setups at target FDR alpha = 0.10, achieving mean accepted FDR 0.014 and mean true-positive rate (TPR) 0.75. The results show that raw discrimination alone is not sufficient for conformal validity: AnyLoc-VLAD and SuperPoint+LightGlue reach comparable area under the receiver operating characteristic curve (AUROC) but fail more setups under the same calibration. On textureless repetitive scenery, SAFEVPR safely abstains rather than accepting unreliable matches. Code is available at https://github.com/Hasar12139/SafeVPR.
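Mutual-nearest-neighbour matching keeps a patch pair only when each patch is the other's best match, which filters out one-sided matches caused by repetitive texture. The sketch below uses tiny hand-written 2-D vectors in place of DINOv2 patch features; the normalisation of the score by the smaller patch count is a hypothetical choice, since the abstract does not specify how the MNN score is defined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnn_matches(query, ref):
    """Pairs (i, j) where query patch i and reference patch j pick each other."""
    sim = [[cosine(q, r) for r in ref] for q in query]
    best_ref = [max(range(len(ref)), key=lambda j: sim[i][j])
                for i in range(len(query))]
    best_query = [max(range(len(query)), key=lambda i: sim[i][j])
                  for j in range(len(ref))]
    return [(i, j) for i, j in enumerate(best_ref) if best_query[j] == i]


def mnn_score(query, ref):
    """Fraction of mutually matched patches (hypothetical normalisation)."""
    return len(mnn_matches(query, ref)) / min(len(query), len(ref))


# Toy patch descriptors standing in for frozen ViT features.
query_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_patches = [[0.9, 0.1], [0.1, 0.9]]
print(mnn_matches(query_patches, ref_patches))
```

The third query patch is similar to both reference patches but is nobody's best match, so it is dropped; only the two unambiguous pairs contribute to the score.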

2605.28033 2026-05-28 cs.RO Version update

How Should We Teach Robots? A Comparison of Kinesthetic, Joystick, and Gesture-Based Teaching

我们应如何教机器人?动觉、摇杆和手势教学的比较

Petr Vanc, Jan Kristof Behrens, Václav Hlaváč, Karla Stepanova

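Affiliations * Czech Institute of Informatics, Robotics and Cybernetics (CIIRC CTU)

AI summary A user study comparing three demonstration modalities (kinesthetic guidance, joystick teleoperation, and hand-gesture teaching), evaluating success rate, workload, and common errors on manipulation tasks.

Comments 7 pages, 3 figures, 3 tables, presented at Cognition and Artificial Life (CAL/KUZ) 2026 conference at Chateau Trest

Details
AI Chinese abstract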

通过不同的教学方式可以指导机器人从示范中学习,每种方式在可用性和性能上各有权衡。本文在八名参与者的用户研究中比较了动觉引导、摇杆遥操作和手势教学。我们评估了三种操作任务中的重放成功率、改进的NASA-TLX工作负载和常见教学错误。动觉引导在更注重方向和接触的任务中产生了最短的示范、最低的工作负载和最高的成功率。摇杆遥操作在简单的拾取销钉任务中表现最佳。手势教学虽然整体可靠性较低,但表现优于预期,在某些情况下达到了与动觉引导相当的结果。

English abstract

Instructing robots from demonstrations can be done through different teaching modalities, each with different usability and performance trade-offs. This paper compares kinesthetic guidance, joystick teleoperation, and hand gestures in a user study with eight participants. We evaluate replay success, modified NASA-TLX workload, and common teaching errors across three manipulation tasks. Kinesthetic guidance produced the shortest demonstrations, lowest workload, and highest success on the more orientation-sensitive and contact-rich tasks. Joystick teleoperation performed best on simple peg picking. Hand-gesture teaching, although less reliable overall, performed better than expected and in some cases achieved results comparable to kinesthetic guidance.

2605.27972 2026-05-28 cs.RO Version update

Simultaneous Contact Selection and Planning for Contact-Rich Manipulation with Cascaded Optimization

基于级联优化的接触丰富操作中同时接触选择与规划

Zhe Zhang, Xingrong Diao, Haoxiang Liang, Han Yang, Bi-Ke Zhu, Dandan Zhang, Jiankun Wang

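AI summary Proposes SCSP, a cascaded optimization framework that achieves active contact location selection and trajectory planning for contact-rich manipulation through contact selection optimization and contact planning optimization.

Comments 20 pages, 18 pages

Details
AI Chinese abstract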

我们提出了一种基于优化的鲁棒接触丰富操作框架。最近的接触隐式方法能够实现跨接触模式的在线混合规划,允许针对给定的目标状态以及机器人和物体的接触位置序列进行闭环操作。然而,大多数现有方法缺乏自主推理和生成多样化接触位置序列及操作轨迹的能力,即主动接触位置选择,这限制了它们对相对简单任务的适用性。由于接触动力学中的互补性和稀疏梯度,主动接触位置选择具有挑战性,使得设计统一的接触选择与规划框架变得困难。为了解决这些挑战,我们引入了同时接触选择与规划(SCSP),这是一个级联优化框架,包括接触选择优化(CSO)和接触规划优化(CPO)。CSO利用代理接触模型和离散-连续优化来有效解决接触选择中的非光滑性和耦合问题,实现最优接触位置的在线全局搜索。CPO通过评估CSO产生的参考接触位置,并实时生成冗余机械臂对应的操作轨迹,执行先验引导的接触规划。大量的仿真和真实世界实验表明,SCSP在不准确的动力学和感知噪声下能够产生多样化的操作行为和鲁棒控制。我们进一步在具有挑战性的操作任务上验证了该框架的泛化能力。 项目网站:https://sites.google.com/view/scsp-robot。

English abstract

We propose an optimization-based framework for robust contact-rich manipulation. Recent contact-implicit methods enable online hybrid planning across contact modes, allowing closed-loop manipulation for a given target state and contact location sequence of the robot and object. However, most existing approaches lack the ability to autonomously reason about and generate diverse contact location sequences and manipulation trajectories, i.e., active contact location selection, which limits their applicability to relatively simple tasks. Active contact location selection is challenging due to complementarity in contact dynamics and sparse gradients, making the design of a unified framework for contact selection and planning difficult. To address these challenges, we introduce Simultaneous Contact Selection and Planning (SCSP), a cascaded optimization framework comprising Contact Selection Optimization (CSO) and Contact Planning Optimization (CPO). CSO leverages a surrogate contact model and discrete-continuous optimization to efficiently resolve the nonsmoothness and coupling in contact selection, enabling online global searching of optimal contact locations. CPO performs prior-guided contact planning by evaluating the reference contact locations produced by CSO and generating corresponding manipulation trajectories in real time for redundant manipulators. Extensive simulations and real-world experiments demonstrate that SCSP produces diverse manipulation behaviors and robust control under inaccurate dynamics and perceptual noise. We further validate the generalization of the framework on challenging manipulation tasks. Project website: https://sites.google.com/view/scsp-robot.

2605.27952 2026-05-28 cs.CV cs.RO Version update

Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry

Con-DSO:学习RGB-D直接稀疏里程计的短时一致性先验

Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang, Nak Young Chong

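Affiliations * School of Information Science, Japan Advanced Institute of Science and Technology; College of Information Engineering, Shenyang University of Chemical Technology

AI summary Proposes Con-DSO, a framework that predicts photometric and depth-geometric consistency uncertainty to drive quality-aware pixel selection and weighting, improving the robustness of RGB-D direct sparse odometry in challenging environments with dynamic objects and occlusions.

Comments Submitted

Details
AI Chinese abstract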

视觉里程计(VO)是机器人和增强现实中的基础组件。RGB-D直接VO受益于度量深度测量,但在动态物体、遮挡、光照变化和不可靠深度违反直接对齐所使用的短时光度和深度几何一致性假设的挑战环境中,性能会下降。现有方法通过语义过滤、显式遮挡推理、光照适应或手工几何准则来缓解这些问题,但通常依赖外部模块或针对个别故障模式的固定假设,限制了其灵活性和以统一方式处理多样挑战的能力。本文提出Con-DSO,一种一致性感知的RGB-D直接稀疏里程计框架,从时间相邻的RGB-D帧对预测密集的光度和深度几何一致性不确定性。一致性网络通过流引导的光度误差和投影深度一致性误差进行训练,使得一致性违规可表示为像素级不确定性。这些成对不确定性预测被转换为关键帧跟踪的主机侧质量先验。该先验随后通过质量感知的支持像素选择和位姿估计中的解耦光度-几何加权应用于VO,使得不可靠观测持续衰减,而非硬拒绝或基于阈值的门控。在五个公开RGB-D基准上的实验表明,与直接RGB-D VO基线相比,在ICL-NUIM上绝对轨迹误差降低超过20%,在RGB-D Scenes V2、TUM/Bonn Dynamic和OpenLORIS序列上降低50%-80%。

English abstract

Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50% to 80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.

2605.27948 2026-05-28 cs.RO Version update

VLM-Based Advanced Rider Assistance System for Motorcycle Safety

基于VLM的摩托车安全高级骑手辅助系统

Mohamed Elnoor, Francesca Baldini, Ananya Trivedi, Faizan M. Tariq, Jovin D'sa, David Isele, Sangjae Bae, Dinesh Manocha, Yosuke Sakamoto

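Affiliations * HRI Honda Research Institute; University of Maryland; Northeastern University

AI summary Proposes an advanced rider assistance system for motorcycles that uses vision-language models for semantic perception and risk-aware planning, building dense risk maps and applying a sampling-based planner to achieve higher success rates and lower hazard exposure in the CARLA simulator.

Comments Accepted to IEEE IV 2026

Details
AI Chinese abstract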

与汽车相比,摩托车由于防护有限且对路面危险更敏感,面临不成比例的高碰撞风险,然而高级骑手辅助系统(ARAS)相对于高级驾驶辅助系统(ADAS)仍不发达。我们提出一种新颖的ARAS,通过语义感知和风险感知规划来提升摩托车安全性。我们的方法利用视觉语言模型(VLM)进行上下文危险推理,并将其与基于分割的检测相结合,以构建密集风险地图。这些地图编码了语义特征(例如,坑洼严重程度、水坑湿滑度)和物理属性(例如,大小、深度),从而产生捕捉摩托车特定风险的逐像素危险成本。这些地图被一个针对摩托车动力学定制的基于采样的规划器使用,以推荐油门和转向动作,在向目的地前进的同时最小化危险暴露。我们在CARLA模拟器的不同场景中评估了我们的系统。与基线方法相比,我们的方法实现了更高的成功率和更低的危险暴露,同时定性结果展示了可解释的风险地图和安全的轨迹推荐。

English abstract

Motorcycles face disproportionately high crash risks compared to cars due to limited protection and heightened sensitivity to surface hazards, yet Advanced Rider Assistance Systems (ARAS) remain underdeveloped relative to Advanced Driver Assistance Systems (ADAS). We propose a novel ARAS that enhances motorcycle safety through semantic perception and risk-aware planning. Our approach leverages Vision-Language Models (VLMs) for contextual hazard reasoning and integrates them with segmentation-based detection to construct dense risk maps. These maps encode both semantic characteristics (e.g., pothole severity, puddle slipperiness) and physical attributes (e.g., size, depth), which produce per-pixel hazard costs that capture motorcycle-specific risks. These maps are used by a sampling-based planner tailored to motorcycle dynamics to recommend throttle and steering actions that minimize hazard exposure while advancing toward the destination. We evaluate our system in different scenarios in the CARLA simulator. Compared to the baseline method, our method achieves higher success rates and lower hazard exposure, while qualitative results demonstrate interpretable risk maps and safe trajectory recommendations.

2605.27947 2026-05-28 cs.RO Version update

SANTS: A State-Adaptive Scheduler for World Action Models

SANTS:面向世界动作模型的状态自适应调度器

Yirui Sun, Guangyu Zhuge, Keliang Liu, Jie Gu, Xinyu Bing, Zhongxue Gan, Chunxu Tian

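Affiliations * Fudan University; Harbin Institute of Technology; Deep Computing Era Technology Co., Ltd

AI summary Proposes the State-Adaptive Noise Trajectory Scheduler (SANTS), which optimizes video-to-action diffusion policies by dynamically choosing the denoising depth based on the video state, substantially reducing inference latency while preserving control performance.

Comments 17 pages, 5 figures, 8 tables. Project page: https://advanced-robotics-lab.github.io/SANTS/

Details
AI Chinese abstract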

世界动作模型(WAMs)通过使用基于视频的未来表示来条件化动作生成,从而改进机器人操作。然而,在像素空间WAM中,最佳动作条件不一定是完全去噪的视频。受控去噪深度扫描显示,视频细化可以降低动作误差,直到一个状态依赖的点,此后当后期预测变得与动作相关性较低或物理上不可靠时,增益可能饱和甚至逆转。这表明动作生成应使用沿视频噪声轨迹的状态依赖点,而不是固定的终端去噪深度。我们引入了状态自适应噪声轨迹调度器(SANTS),一种用于视频到动作扩散策略的轻量级调度器。在每个视频决策点,SANTS读取当前视频状态表示和噪声水平,然后联合预测累积停止风险和相对噪声进展比率。SANTS在冻结的动作分支生成最终动作块后,通过路径级奖励进行后训练,因此调度器针对下游动作质量而非中间视频保真度进行优化,同时显式惩罚冗余的视频状态更新。实验表明,SANTS在RoboTwin 2.0上达到94.4%的整体成功率,在七个真实机器人任务上平均成功率为73.1%,同时相对于完全视频去噪分别降低了81.7%和79.0%的延迟。这些结果表明,沿视频噪声轨迹的自适应选择可以保留WAM式未来推理的控制优势,同时消除其大部分冗余推理成本。

English abstract

World Action Models (WAMs) improve robot manipulation by using video-based future representations to condition action generation. In pixel-space WAMs, however, the best action condition is not necessarily the fully denoised video. Controlled denoising-depth scans show that video refinement can reduce action error up to a state-dependent point, after which the gain may saturate or even reverse when late predictions become less action-relevant or physically unreliable. This suggests that action generation should use a state-dependent point along the video noise trajectory rather than a fixed terminal denoising depth. We introduce the State-Adaptive Noise Trajectory Scheduler (SANTS), a lightweight scheduler for video-to-action diffusion policies. At each video decision point, SANTS reads the current video-state representation and noise level, then jointly predicts a cumulative stopping hazard and a relative noise-progression ratio. SANTS is post-trained with a path-level reward computed after the frozen action branch generates the final action chunk, so the scheduler is optimized for downstream action quality rather than intermediate video fidelity, while redundant video-state updates are explicitly penalized. Experiments show that SANTS reaches 94.4% overall success on RoboTwin 2.0 and 73.1% average success across seven real-robot tasks, while reducing latency by 81.7% and 79.0% relative to full video denoising, respectively. These results indicate that adaptive selection along the video noise trajectory can preserve the control benefits of WAM-style future reasoning while removing much of its redundant inference cost.

2605.27919 2026-05-28 cs.RO cs.LG Version update

Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

通过子频率流形遍历的频率引导动作扩散

Junlin Wang

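Affiliations * School of Engineering and Applied Science, University of Pennsylvania

AI summary Proposes the Frequency Guidance Operator (FGO), which steers a diffusion policy step by step through sub-frequency manifolds to generate smooth actions, improving action smoothness and temporal consistency on 15 robotic manipulation tasks.

Comments A preprint version of FGO

Details
AI Chinese abstract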

通过行为克隆学习视觉运动策略通常涉及模仿人类操作员收集的专家演示。然而,自然的人类演示固有地包含高频噪声,例如间歇性抖动、暂停和动作抖动。训练策略直接模仿这些原始轨迹不可避免地会导致模型继承这些次优行为。这种病理在基于扩散的策略中尤为明显,其中迭代去噪步骤可能无意中放大高频伪影而牺牲有意义的细粒度细节。为了解决这些限制,我们提出了一种新颖的基于频率的算法,该算法能够实现隐式频谱操控和平滑动作生成。我们的方法,频率引导算子(FGO),通过逐步将噪声样本通过具有扩展频谱带的中间子频率流形驱动,来引导扩散策略的生成过程。在来自5个基准测试的15个机器人操作任务上验证,FGO在增强动作平滑性和时间一致性方面取得了优越性能,同时保留了成功执行任务所需的细节。项目网站:https://henrywjl.github.io/frequency-guidance-operator/

English abstract

Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high-frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion-based policies, where iterative denoising steps can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. To address these limitations, we present a novel frequency-based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion policies by progressively driving the noisy samples through intermediate sub-frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: https://henrywjl.github.io/frequency-guidance-operator/
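To give a feel for "expanding spectral bands", the sketch below projects a 1-D action sequence onto its lowest k frequency components for increasing k, using a naive discrete Fourier transform. This is only an illustration of band-limited projection; how FGO actually couples such bands to the diffusion denoising steps is not described in the abstract, and the function name, band schedule, and toy trajectory are assumptions.

```python
import cmath


def low_pass(seq, k):
    """Keep DFT components with frequency index <= k (plus mirrored conjugates)."""
    n = len(seq)
    spectrum = [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(seq))
        for f in range(n)
    ]
    kept = [c if min(f, n - f) <= k else 0 for f, c in enumerate(spectrum)]
    # Inverse DFT; the result is real because conjugate pairs are kept together.
    return [
        sum(c * cmath.exp(2j * cmath.pi * f * t / n) for f, c in enumerate(kept)).real / n
        for t in range(n)
    ]


# Hypothetical jittery action trajectory: a slow ramp with alternating noise.
actions = [0.0, 0.3, 0.1, 0.5, 0.3, 0.7, 0.5, 0.9]

# Coarse-to-fine bands: mean only, then progressively more detail.
for band in (0, 1, 2, 4):
    print(band, [round(v, 3) for v in low_pass(actions, band)])
```

With band 0 only the mean survives, and with the full band (here 4, half the sequence length) the original trajectory is recovered, so intermediate bands trade smoothness against detail.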

2605.27917 2026-05-28 cs.RO Version update

A Surveillance Evasion Game with Continuous Sensor Redeployment via Bilevel Optimization

基于双层优化的连续传感器重新部署的监视规避博弈

Jaehyeok Kim, Kartik A. Pant, Joseph Kinerson, Kylie Sommer-Kohrt, Worawis Sribunma, Li-Yu Lin, James M. Goppert

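Affiliations * School of Aeronautics and Astronautics, Purdue University

AI summary Addresses UAS infiltration of restricted airspace through spatiotemporal gaps in sensor coverage by using bilevel optimization to let sensors slide continuously along building boundaries, with a log-sum-exp smooth approximation that preserves differentiability; the joint optimization converges to a local Nash equilibrium.

Comments 8 pages, 8 figures, submitted to IEEE Robotics and Automation Letters (RA-L)

Details
AI Chinese abstract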

无人航空系统(UAS)已成为关键基础设施安全日益增长的威胁,利用传感器周界的时空间隙未被探测地渗透受限空域。我们将这种交互建模为对抗性UAS与由定向和全向传感器组成的异构传感器网络之间的两人零和微分博弈。与早期将防御者限制在离散放置图或固定配置的博弈论方法不同,我们引入了一种连续传感器重新部署技术,其中每个传感器沿凸建筑边界自由滑动。这是通过对数-求和-指数平滑近似实现的,该近似在多边形顶点处保持可微性,从而能够使用基于梯度的方法进行优化。攻击者的最佳响应通过两步法计算,结合STP-RRT*进行可行轨迹初始化和非线性规划进行检测最小化细化。联合优化通过交替双层优化收敛到局部纳什均衡(LNE),为两个参与者推导出解析的一阶平稳性条件,从而为CUAS任务中的异构传感器放置建立了可部署的基线。

English abstract

Uncrewed Aerial Systems (UASs) have become a growing threat to the security of critical infrastructure, exploiting spatiotemporal gaps in sensor perimeters to infiltrate restricted airspace undetected. We formulate this interaction as a two-player zero-sum differential game between an adversarial UAS and a heterogeneous sensor network of directional and omnidirectional sensors. Unlike earlier game-theoretic approaches that restrict the defender to discrete placement graphs or fixed configurations, we introduce a continuous sensor redeployment technique in which each sensor slides freely along the convex building boundaries. This is enforced via a log-sum-exp smooth approximation that preserves differentiability at polygon vertices, enabling optimization with gradient-based methods. The attacker's best response is computed via a two-step approach combining STP-RRT* for feasible trajectory initialization and nonlinear programming for detection-minimization refinement. The joint optimization converges to a Local Nash Equilibrium (LNE) via alternating bilevel optimization, with analytical first-order stationarity conditions derived for both players, thereby establishing a deployable baseline for heterogeneous sensor placements in CUAS missions.
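The log-sum-exp trick mentioned above replaces a non-differentiable `max` with a smooth upper bound, which is what makes gradient-based optimization possible at kinks such as polygon vertices. The sketch shows only the generic approximation; how the paper applies it to the building boundary is not specified in the abstract, and the function name and `beta` sharpness parameter are illustrative.

```python
import math


def smooth_max(values, beta):
    """Log-sum-exp approximation of max(values); larger beta gives a sharper fit.

    Satisfies max(values) <= smooth_max <= max(values) + log(len(values)) / beta.
    """
    m = max(values)  # shift by the maximum for numerical stability
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


values = [1.0, 3.0, 2.0]
for beta in (1.0, 10.0, 50.0):
    print(beta, smooth_max(values, beta))
```

A small `beta` gives smooth but loose gradients, while a large `beta` tracks the true maximum closely at the cost of sharper curvature near ties.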

2605.27909 2026-05-28 cs.RO Version update

S-Cheetah: A Novel Quadrupedal Robot with a 3-DOF Active Spine Learning Agile Locomotion

S-Cheetah:一种具有3自由度主动脊柱学习敏捷运动的新型四足机器人

Zimu Li, Weibang Bai

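Affiliations * School of Information Science and Technology, ShanghaiTech University

AI summary Presents S-Cheetah, a quadrupedal robot with a 3-DOF bio-inspired active spine, together with a reinforcement learning framework that enables agile locomotion including high-speed running, in-place turning, and aerial self-righting.

Comments Project website: https://himmy-robotics.github.io/scheetah

Details
AI Chinese abstract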

四足动物的生物脊柱能够实现矢状面的屈伸、侧向弯曲和轴向旋转,在高度敏捷和灵巧的运动中起着关键作用。尽管许多研究已将主动脊柱关节集成到四足机器人中以增强敏捷性,但大多数设计通过减少脊柱自由度来简化控制复杂性,未能实现生物脊柱的空间三轴旋转特性。因此,复制多自由度仿生脊柱并有效利用它来赋能四足机器人的敏捷运动仍然是一个重要的研究挑战。在本研究中,我们提出了S-Cheetah,一种具有3自由度仿生串联主动脊柱的四足机器人,能够实现仿生空间三轴旋转。为了使机器人充分利用这一主动脊柱,我们开发了一个专门的强化学习框架,通过整合加速度课程学习策略和定制的奖励函数(如奔跑步态奖励、脊柱波动奖励和脊柱转向奖励),积极促进引入的脊柱的参与并最大化机器人的运动能力。实验结果表明,S-Cheetah使用旋转G2奔跑步态可以达到6.9米/秒的峰值速度,原地转向率为7.2弧度/秒。此外,该系统展现出一种新兴的、受猫启发的空中自翻能力,使其在自由落体过程中能够从任意方向稳定地四足着地。最后,通过在不同运动任务中的广泛评估,我们证明了所提出的3自由度脊柱的引入全面增强了四足机器人的运动敏捷性。项目网站:himmy-robotics.github.io/scheetah

English abstract

The biological spine of quadrupeds enables sagittal flexion/extension, lateral bending, and axial rotation, playing a crucial role in highly agile and dexterous locomotion. While numerous studies have integrated active spinal joints into quadrupedal robots to enhance agility, most designs simplify control complexity by reducing spinal degrees of freedom (DOF), failing to achieve the spatial tri-axial rotation characteristic of biological spines. Consequently, replicating a multi-DOF biomimetic spine and effectively leveraging it to empower the agile locomotion of quadrupedal robots remains a significant research challenge. In this study, we present S-Cheetah, a quadrupedal robot featuring a 3-DOF bio-inspired serial active spine capable of biomimetic spatial tri-axial rotation. To empower the robot to fully utilize this active spine, we developed a specialized reinforcement learning framework to actively promote the engagement of the introduced spine and maximize the robot's locomotive capabilities by integrating an acceleration curriculum learning strategy with tailored reward functions, such as a gallop gait reward, a spine undulation reward, and a spine steering reward. Experimental results demonstrate that S-Cheetah can achieve a peak speed of 6.9 m/s using the rotary G2 gallop gait and an in-place turning rate of 7.2 rad/s. In addition, the system exhibits an emergent, feline-inspired aerial self-righting capability, allowing it to land stably on four feet from arbitrary orientations during free fall. Finally, through extensive evaluations across diverse locomotion tasks, we show that the introduction of the proposed 3-DOF spine comprehensively enhances the locomotive agility of quadrupedal robots. Project website: himmy-robotics.github.io/scheetah

2605.27886 2026-05-28 cs.RO Version update

Tabero: Learning Gentle Manipulation with Closed-Loop Force Feedback from Vision, Touch, and Language

Tabero: 通过视觉、触觉和语言闭环力反馈学习轻柔操作

Qiwei Wu, Rui Zhang, Xin Xiang, Tao Li, Weihua Zhang, Junjie Lai, Renjing Xu

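Affiliations * Hong Kong University of Science and Technology (Guangzhou); Nvidia, Beijing, China

AI summary To address the inability of existing vision-language-action models to manipulate gently because they lack tactile feedback, proposes the Tabero benchmark and model suite, which generates vision-tactile-language tasks with a data-efficient pipeline and uses the Tabero-VTLA architecture with a decoupled force-position command interface, reducing average grip force under gentle instructions by over 70% while maintaining high task success.

Comments Code: https://github.com/NathanWu7/Tabero

Details
AI Chinese abstract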

触觉感知对于机器人实现类人轻柔操作至关重要。然而,现有的视觉-语言-动作(VLA)模型由于缺乏对齐的视觉-触觉-语言数据以及有效的闭环力反馈机制,难以利用触觉反馈进行轻柔操作。为解决这些挑战,我们引入了Tabero,一个用于轻柔、语言条件化机器人操作的基准和模型套件,该操作要求细粒度的接触力感知。首先,Tabero基准通过提出一种数据高效的管道来解决触觉数据稀缺问题,该管道重新利用开源机器人操作轨迹生成多样化的视觉-触觉-语言任务,并建立了一个多维评估协议,同时衡量任务成功率和物理交互质量。其次,我们提出了Tabero-VTLA,一种具有解耦力-位置命令接口的架构;生成的力-位置命令由固定的混合控制器执行,以实现实时的力感知操作。在Tabero上评估,我们的模型在保持高任务成功率的同时,在轻柔指令下将平均夹持力降低了70%以上,展示了其基于多模态经验调节交互力的能力。我们的代码公开在 https://github.com/NathanWu7/Tabero。

English abstract

Tactile sensing is essential for robots to achieve human-like gentle manipulation. However, existing Vision-Language-Action (VLA) models struggle to exploit tactile feedback for gentle manipulation due to scarce aligned vision-tactile-language data and the lack of effective closed-loop force feedback mechanisms. To address these challenges, we introduce Tabero, a benchmark and model suite for gentle, language-conditioned robotic manipulation that demands fine-grained contact force perception. First, the Tabero benchmark addresses the scarcity of tactile data by presenting a data-efficient pipeline that repurposes open-source robot manipulation trajectories to generate diverse vision-tactile-language tasks, and establishes a multidimensional evaluation protocol that measures task success alongside physical interaction quality. Second, we propose Tabero-VTLA, an architecture with a decoupled force-position command interface; the resulting force-position commands are executed by a fixed hybrid controller to enable real-time, force-aware manipulation. Evaluated on Tabero, our model maintains high task success while reducing average grip force by over 70% under gentle instructions, demonstrating its ability to modulate interaction forces based on multimodal experience. Our code is publicly available at https://github.com/NathanWu7/Tabero.

2605.27817 2026-05-28 cs.RO cs.AI cs.CV cs.LG Version update

Turning Video Models into Generalist Robot Policies

将视频模型转化为通用机器人策略

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

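Affiliations * MIT; CMU; Amazon FAR

AI summary Proposes VERA, a decoupled video-to-action policy that combines an action-free video world model with an inverse dynamics model based on the robot Jacobian to achieve zero-shot, cross-embodiment robot control.

Comments project page: https://vera.csail.mit.edu

Details
AI Chinese abstract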

视频生成模型已成为一种有前景的机器人骨干网络,能够生成描绘跨本体和环境完成复杂任务的视频。最近的工作提出了机器人基础模型,通过使用带有动作标签的数据微调视频模型,联合预测未来观测和动作。在本文中,我们测试了一种替代方法的极限:保持视频规划器不变,同时训练一个特定本体的逆动力学模型(IDM)。这种解耦带来了几个自然的好处:视频规划器保持本体无关,不同的视频模型可以轻松互换而无需重新训练IDM,并且IDM可以独立地使用现成的自对弈数据进行训练。我们提出了一种闭环的视频到动作策略,该策略将无动作视频世界模型与基于机器人本体雅可比矩阵的精心设计的IDM相结合。我们证明了我们的IDM设计既数据高效又可扩展到高维动作空间。我们将该策略命名为视频到具身机器人动作模型(VERA),在模拟和真实世界基准测试中取得了强劲的性能,包括零样本的Panda机械臂操作和16自由度Allegro灵巧手立方体重新定向。通过将相同的视频规划器与不同的本体特定IDM配对,可以在多个本体上使用。我们的结果表明,解耦的视频规划加上忠实的视频到动作翻译是实现零样本、跨本体和可泛化机器人控制的可行替代途径。更多结果请访问我们的项目网站:https://vera.csail.mit.edu。

English abstract

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

2605.27759 2026-05-28 cs.RO 版本更新

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

Colosseum V2:视觉语言动作模型的泛化能力基准测试

Jeremy Morgan, Prajwal Vijay, Hyeonho Oh, Jincen Song, Ashvin Arora, Alina Du, Gaurav Sukhatme, Jesse Thomason, Ishika Singh

发表机构 * Department of Computer Science, University of Southern California(南加州大学计算机科学系) Department of Electrical Engineering, Indian Institute of Technology Madras(印度理工学院马德拉斯分校电气工程系) Fu Foundation School of Engineering and Applied Science, Columbia University(哥伦比亚大学工程与应用科学学院)

AI总结 提出Colosseum V2大规模仿真基准,通过28个任务和两种机器人形态,系统评估VLA模型在分布偏移下的泛化能力,揭示其在高层次理解与鲁棒行为之间的差距。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在大规模视觉和语言预训练的推动下,在机器人操作中展现出有前景的泛化能力。然而,这种进展可能具有误导性。尽管VLA具有零样本感知和语言能力,但它们的整体任务性能在分布偏移下常常下降,揭示了这些系统将高层次理解转化为鲁棒行为方面的差距。为了系统地研究这一差距,我们引入了Colosseum V2,这是一个大规模仿真基准,用于评估机器人学习中VLA在不同条件下的泛化能力。该基准包含28个任务,涵盖13个任务类别和两种机器人形态,覆盖了广泛的操作原语和长时域行为。基于ManiSkill仿真器构建,Colosseum V2支持快速、GPU并行化的评估,并支持大规模域内和域外测试。我们评估了包括Action Chunking Transformers (ACT)和Pi0.5在内的最先进方法,揭示了它们在基础性能和泛化方面的局限性。我们展示了仿真与真实世界指标之间的强相关性,支持了该基准的生态效度。通过在统一基准中标准化任务、指标和评估协议,Colosseum V2实现了可重复和公平的比较,降低了评估开销,并加速了向通用机器人策略的进展。

英文摘要

Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.

2605.27724 2026-05-28 cs.RO cs.AI 版本更新

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

HumanoidMimicGen: 通过全身规划生成行走操作数据

Kevin Lin, Ajay Mandlekar, Caelan Reed Garrett, Nikita Chernyadev, Yu Fang, Runyu Ding, Yuqi Xie, Justin Tran, Linxi Fan, Yuke Zhu

发表机构 * NVIDIA The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出HumanoidMimicGen方法,通过全身规划自动生成人形机器人行走操作演示数据,在模拟基准上使联合训练的策略性能提升20%。

Comments website: https://humanoidmimicgen.github.io/

详情
AI中文摘要

模仿学习是训练人形机器人行走和操作的一种有前景的方法,但它需要大量演示,而这些演示通过遥操作收集耗时且困难。现有的数据生成算法可以自动合成机械臂的演示,但它们在人形机器人上效果不佳,因为其高维复合动作空间涉及手臂、腿和躯干。我们提出HumanoidMimicGen,一种生成人形机器人腿式行走操作数据的方法。我们的方法将少量源演示中的接触丰富的全身技能适应到新状态,并泛化到物体姿态的变化。通过将这些单臂和双臂技能与全身运动规划和操作规划交替进行,该方法在多样化的场景和布局中生成稳定、无碰撞的数据。为了评估我们的方法,我们引入了一个新的模拟行走操作基准,包含九个测试人形机器人行走操作能力的多样化任务。在该基准上,我们证明HumanoidMimicGen能够自动生成用于模仿学习的大规模数据集,并支持系统研究数据生成和策略学习决策如何影响模型性能。我们表明,与仅使用真实世界数据训练的策略相比,与HumanoidMimicGen生成的数据联合训练的全身视觉运动策略性能提升20%。

英文摘要

Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demonstrations, which are time-intensive and difficult to collect via teleoperation. Existing data-generation algorithms can automatically synthesize demonstrations for manipulators, but they are ineffective on humanoids because their high-dimensional composite action spaces involve arms, legs, and torsos. We present HumanoidMimicGen, a method for generating humanoid legged loco-manipulation data. Our method adapts contact-rich whole-body skills from a handful of source demonstrations to new states, generalizing across changes in object pose. By interleaving these single- and dual-arm skills with whole-body locomotion and manipulation planning, the method generates stable, collision-free data across diverse scenes and layouts. To evaluate our approach, we introduce a new simulated loco-manipulation benchmark containing nine diverse tasks that test humanoid loco-manipulation capabilities. There, we demonstrate that HumanoidMimicGen automatically generates large datasets for imitation learning and enables a systematic study of how data generation and policy learning decisions impact model performance. We show that whole-body visuomotor policies co-trained with data generated by HumanoidMimicGen outperform those trained only on real-world data by 20%.

2605.27699 2026-05-28 cs.RO 版本更新

AURA: Asymptotically Optimal Uncertainty-Robust Replanning Algorithm for Kinodynamic Systems

AURA: 面向运动动力学系统的渐近最优不确定性鲁棒重规划算法

Seyedali Golestaneh, Zhuoyun Zhong, Donghyung Lee, Constantinos Chamzas

发表机构 * Department of Robotics Engineering, Worcester Polytechnic Institute (WPI)(伍斯特理工学院机器人工程系)

AI总结 提出AURA元规划框架,通过在线重规划和优化控制输入,在运动不确定性下实现渐近最优轨迹规划与跟踪精度提升。

详情
AI中文摘要

基于采样的运动规划器为动力学运动规划提供了一种实用且可扩展的方法,尤其适用于高维、欠驱动或非完整系统。然而,这些规划器通常离线使用,要求在执行开始前完成轨迹计算。此外,在存在运动不确定性的情况下,规划轨迹可能无法被准确跟踪,导致偏离名义解。本文在统一框架AURA中解决了这些局限性,这是一个渐近最优的元规划器框架,在执行过程中同时提高路径质量和跟踪性能。除了主执行线程外,该框架包含一个重规划方法,在执行过程中持续探索状态空间并优化轨迹,以及一个优化过程,用于优化未来控制输入以减少跟踪误差。这些组件共同使AURA能够在线利用渐近最优规划,同时在不确定性下提高执行精度。所提出的方法在多个系统的仿真和真实环境中进行了评估,与基线方法相比,在轨迹质量、跟踪精度和整体性能方面表现出一致的改进。

英文摘要

Sampling-based motion planners offer a practical and scalable approach to kinodynamic motion planning, notably for high-dimensional, underactuated, or non-holonomic systems. However, these planners are typically used offline, requiring execution to begin only after the trajectory has been computed. In addition, the planned trajectory may not be accurately tracked in the presence of motion uncertainty, leading to deviations from the nominal solution. In this work, these limitations are addressed within a unified framework, AURA, an asymptotically optimal meta-planner that improves both path quality and tracking performance during execution. In addition to the main execution thread, this framework comprises a replanning method that continuously explores the state space and refines the trajectory during execution, and an optimization process that refines future control inputs to reduce tracking error. Together, these components enable AURA to leverage asymptotically optimal planning online while improving execution accuracy under uncertainty. The proposed approach is evaluated in both simulation and real-world environments across multiple systems, demonstrating consistent improvements in trajectory quality, tracking accuracy, and overall performance compared with baseline methods.

2605.27697 2026-05-28 cs.RO cs.AI cs.LG 版本更新

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

仿真引导的扩散方法用于去中心化多机器人运动规划

Jinhao Liang, Sven Koenig, Ferdinando Fioretto

发表机构 * University of Virginia(弗吉尼亚大学) University of California, Irvine(加州大学尔湾分校)

AI总结 提出一种基于约束感知扩散模型的去中心化框架SID,通过仿真邻居未来轨迹并利用安全约束规划自身轨迹,在密集场景下实现高效协调。

详情
AI中文摘要

去中心化多机器人运动规划要求每个机器人仅根据局部观测生成无碰撞轨迹,无需全局感知或可靠通信。然而,大多数现有规划器(无论是经典方法还是基于学习的方法)都是从局部观测的静态快照生成轨迹,这限制了它们预测相邻机器人未来行为的能力。随着机器人数量增加和环境变得更加拥挤,这一限制变得至关重要。为了克服这一挑战,本文引入了仿真引导的扩散(SID),这是一种基于约束感知扩散模型(CADM)的去中心化框架。SID首先使用CADM从当前观测状态仿真相邻机器人的未来轨迹,然后利用这些仿真提供的安全约束,使用相同的CADM规划每个机器人自身的轨迹。关键的是,对邻居的精确仿真使得一种最小通信方案成为可能,该方案仅在高度拥挤的场景中必要时触发协调。在多种环境中的实验表明,SID在规划有效性和约束满足方面始终优于基线方法,并且可扩展到108个机器人和160个障碍物的场景。

英文摘要

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot's own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

2605.27661 2026-05-28 cs.RO 版本更新

Design of a Real-time Asynchronous Monocular Odometry for Planetary Exploration

面向行星探测的实时异步单目里程计设计

Benat Inigo, Florian Steidle, Wolfgang Stuerzl

发表机构 * Institute of Robotics and Mechatronics(机器人与机电研究所) German Aerospace Center (DLR)(德国航空航天中心(DLR)) University of Zaragoza(萨拉戈萨大学)

AI总结 针对行星探测中计算资源受限、环境复杂且高动态范围光照的挑战,提出一种基于误差状态卡尔曼滤波(ESKF)的实时异步事件相机单目里程计,利用异步事件流和RATE特征跟踪器实现连续相机运动估计。

详情
AI中文摘要

我们描述了一种面向行星探测的实时、异步、基于事件的单目里程计的初步设计。在严格的计算约束下运行,行星探测器经常遇到复杂、不可预测的环境,需要高速感知和对高动态范围(HDR)光照的鲁棒性。事件相机通过以微秒级分辨率报告异步、像素级的亮度变化来满足这些需求,显著降低数据带宽,同时在极端光照条件下保持鲁棒性。我们提出了一种基于误差状态卡尔曼滤波(ESKF)的方法,利用异步事件流连续估计相机自运动。相机状态通过RATE(一种实时异步特征跟踪器)生成的每个跟踪位置输出进行更新。

英文摘要

We describe our preliminary design of a real-time asynchronous event-based monocular odometry for planetary exploration. Operating under strict computational constraints, planetary rovers frequently encounter complex, unpredictable environments that demand high-speed sensing and robustness to high dynamic range (HDR) lighting. Event cameras address these needs by reporting asynchronous, pixel-wise brightness changes with microsecond resolution, significantly reducing data bandwidth while maintaining robustness in extreme lighting conditions. We propose an approach based on an Error-State Kalman Filter (ESKF) that leverages this asynchronous event stream to continuously estimate camera ego-motion. The camera state is updated with every tracked position output generated by RATE, a real-time asynchronous feature tracker.
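
To make the error-state idea concrete, here is a minimal sketch, not the authors' filter. It is a scalar toy in which a nominal position is propagated by a motion model and each asynchronously arriving measurement is applied as soon as it arrives, mirroring the abstract's "updated with every tracked position output". The class `ScalarESKF`, its fields, and the constant-velocity motion model are all assumptions made for illustration; a real ESKF for camera ego-motion would also carry orientation and other quantities in its state.

```python
from dataclasses import dataclass


@dataclass
class ScalarESKF:
    """Toy 1-D error-state Kalman filter (hypothetical, position only)."""
    x_nom: float  # nominal state estimate
    P: float      # variance of the error state
    Q: float      # process noise variance added per prediction
    R: float      # measurement noise variance

    def predict(self, velocity: float, dt: float) -> None:
        # Propagate the nominal state with the (assumed) motion model.
        # The error-state mean stays at zero; only its variance grows.
        self.x_nom += velocity * dt
        self.P += self.Q

    def update(self, z: float) -> None:
        # Apply a single measurement as soon as it arrives.
        innovation = z - self.x_nom      # residual w.r.t. the nominal state
        K = self.P / (self.P + self.R)   # Kalman gain (measurement matrix H = 1)
        dx = K * innovation              # estimated error state
        self.x_nom += dx                 # inject the error into the nominal state
        self.P = (1.0 - K) * self.P      # shrink variance; error state resets to 0


f = ScalarESKF(x_nom=0.0, P=1.0, Q=0.0, R=1.0)
f.update(2.0)                     # one measurement, equal trust in prior and sensor
f.predict(velocity=1.0, dt=0.5)   # nominal state moves forward between measurements
```

Because the error is injected into the nominal state after every update, the error-state mean is always zero at the start of the next step, which is what keeps the linearization small in full-scale versions of this filter.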

2605.27644 2026-05-28 cs.RO cs.AI cs.LG 版本更新

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Trinity:通过利用合成数据统一非结构化户外环境中的类无关地形与语义分割

Marcus G Müller, Wout Boerdijk, Maximilian Durner, Riccardo Giubilato, Abel Gawel, Wolfgang Stürzl, Roland Siegwart, Rudolph Triebel

发表机构 * Institute of Robotics and Mechatronics, German Aerospace Center (DLR)(机器人与机电系统研究所,德国航空航天中心(DLR)) Federal Institute of Technology Zurich (ETH Zurich)(苏黎世联邦理工学院(ETH Zurich)) Robotics and AI Institute (RAI)(机器人与人工智能研究所(RAI))

AI总结 提出基于Transformer的统一网络Trinity,联合执行类特定语义分割和类无关地形分割,利用合成数据集RUGDSynth和真实数据集EXTerra实现机器人无关的地形先验学习。

详情
AI中文摘要

地形理解对于在非结构化户外环境中运行的移动机器人至关重要。现有的基于视觉的可通行性估计方法依赖于机器人特定的标注或语义类别映射,限制了跨平台的迁移性,并在机器人能力变化时需要昂贵的重新标注,而标准的语义分割方法仅关注特定的预定义类别,无法捕捉地形的多样性。在这项工作中,我们提出了一种基于Transformer的架构,在统一网络Trinity中联合执行类特定语义分割和类无关地形分割。地形区域仅基于视觉外观进行分割,无需预定义的语义标签或机器人相关的可通行性分数。这种公式使得学习机器人无关的视觉地形先验成为可能,这些先验可以与机器人特定的经验相结合,用于下游任务,如可通行性估计、视觉里程计和任务规划。为了实现具有多样地形外观的大规模训练,我们扩展了OAISYS模拟器,并引入了RUGDSynth,这是一个受RUGD启发、包含类无关地形样本的合成数据集。此外,我们提出了EXTerra数据集,提供了带有类特定和类无关地形标签的真实世界图像。实验证明了所提出任务的可行性以及我们的联合分割方法在复杂户外环境中的有效性。代码和数据集将在本出版物发布后(经过审查)公开。

英文摘要

Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability estimation methods rely on robot-specific annotations or semantic class mappings, limiting transferability across platforms and requiring costly re-annotation when robot capabilities change, while standard semantic segmentation methods only focus on specific predefined classes, which do not capture the variety of terrains. In this work, we propose a transformer-based architecture that jointly performs class-specific semantic segmentation and class-agnostic terrain segmentation within a unified network, called Trinity. Terrain regions are segmented based solely on visual appearance, without predefined semantic labels or robot-dependent traversability scores. This formulation enables the learning of robot-agnostic visual terrain priors that can be combined with robot-specific experience for downstream tasks such as traversability estimation, visual odometry, and mission planning. To enable large-scale training with diverse terrain appearances, we extend the OAISYS simulator and introduce RUGDSynth, a synthetic dataset inspired by RUGD with class-agnostic terrain samples. Furthermore, we present the EXTerra Dataset, providing real-world images annotated with both class-specific and class-agnostic terrain labels. Experiments demonstrate the feasibility of the proposed task and the effectiveness of our joint segmentation approach in complex outdoor environments. Code and datasets will be released with this publication (after review).

2605.27643 2026-05-28 cs.RO physics.optics 版本更新

Agentic Language-to-Objective Synthesis for Optofluidic Assembly

面向光流控组装的智能体式语言到目标合成

Ivan Saraev, Elena Erben, Weida Liao, Fan Nan, Gerhard Neumann, Eric Lauga, Moritz Kreysing

发表机构 * Institute of Biological and Chemical Systems, Karlsruhe Institute of Technology, Germany(生物与化学系统研究所,卡尔斯鲁厄理工学院,德国) Department of Applied Mathematics and Theoretical Physics, University of Cambridge, UK(应用数学和理论物理系,剑桥大学,英国) Department of Mathematics, Imperial College London, UK(数学系,伦敦帝国理工学院,英国) Institute of Anthropomatics and Robotics (IAR), Karlsruhe Institute of Technology, Germany(人机学与机器人研究所(IAR),卡尔斯鲁厄理工学院,德国)

AI总结 提出Speak-to-Objective模块化智能流水线,利用条件大语言模型将口语或书面指令转换为可微目标函数,实现光流控微粒子组装,并支持用户反馈学习。

Comments 21 pages, 5 figures

详情
AI中文摘要

基于光的先进制造日益需要可编程、闭环工具,将人类设计意图转化为小尺度上的可执行操作。然而,在机器人和制造模式中仍存在一个关键瓶颈:将用户意图转化为机器可读且可靠执行的目标。尽管微机器人通过光驱动流体提供了多功能操控,但数学上可处理的目标规范仍然手动且难以重用。本文介绍Speak-to-Objective,一个模块化智能流水线,使用条件大语言模型将口语或书面指令转换为完全可微的目标函数,用于在约束感知逆求解器(SLSQP)和实验光流控平台上组装微粒。该方法采用紧凑循环——感知→组合→提议→行动→报告与学习——将目标作为意图与驱动之间的接口,分离组装或图案化什么与如何驱动,同时从用户反馈中学习。流水线组合几何、间距和分配/拓扑项,生成鲁棒的描述性目标,从部分轨迹组装并在扰动后恢复,以及用于精确定位的显式目标,所有均以执行器无关的方式。使用激光诱导热粘性流作为物理驱动模式,我们展示了自然语言可编程的、基于光的微尺度粒子图案组装在微流控环境中。除了对可编程微组装的直接影响,以及使用激光诱导光流控驱动作为降复杂度实验平台,我们的工作指向自驱动、AI辅助的光学制造平台,其中自然语言、可微目标和激光驱动耦合为可重复使用的数字工作流。

英文摘要

Light-based advanced manufacturing increasingly requires programmable, closed-loop tools that translate human design intent into executable operations at small length scales. Yet a key bottleneck persists across robotic and manufacturing modalities: turning user intent into machine-readable objectives that are reliably executable. While micro-robotics offers versatile manipulation via optical actuation of fluids, mathematically tractable goal specification remains manual and hard to reuse. Here, we introduce Speak-to-Objective, a modular agentic pipeline that uses a conditioned Large Language Model (LLM) to translate spoken or written commands into fully differentiable objective functions for assembling microparticles in a constraint-aware inverse solver (SLSQP) and on an experimental optofluidic platform. The approach employs a compact loop - perceive -> compose -> propose -> act -> report & learn - that treats the objective as the interface between intent and actuation, separating what to assemble or pattern from how to actuate, while learning from user feedback. The pipeline composes geometry, spacing, and assignment/topology terms to generate robust descriptive objectives that assemble from partial traces and recover after perturbations, as well as explicit objectives for precise placement, all in an actuator-agnostic fashion. Using laser-induced thermoviscous flows as the physical actuation modality, we demonstrate natural-language-programmable, light-based microscale assembly of particle patterns in a microfluidic environment. Beyond its immediate impact on programmable microassembly, and using laser-induced optofluidic actuation as a reduced-complexity experimental platform, our work points toward self-driving, AI-assisted optical manufacturing platforms in which natural language, differentiable objectives, and laser-based actuation are coupled into a reusable digital workflow.
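
The idea of composing geometry and spacing terms into one objective can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's pipeline. The term definitions, weights, and function names are hypothetical, the gradient is taken by finite differences, and plain gradient descent stands in for the constraint-aware SLSQP solver named in the abstract.

```python
import math

# Particles are 2-D points stored as a flat list [x0, y0, x1, y1, ...].
# All terms, weights, and names below are illustrative assumptions.

def geometry_term(p, radius=1.0):
    """Penalize particles that are off a circle of the given radius."""
    return sum((math.hypot(p[i], p[i + 1]) - radius) ** 2
               for i in range(0, len(p), 2))

def spacing_term(p, d_min=0.5):
    """Penalize pairs of particles closer together than d_min."""
    total = 0.0
    for i in range(0, len(p), 2):
        for j in range(i + 2, len(p), 2):
            dist = math.hypot(p[i] - p[j], p[i + 1] - p[j + 1])
            total += max(0.0, d_min - dist) ** 2
    return total

def objective(p, w_geom=1.0, w_space=1.0):
    """Composite objective: weighted sum of the individual terms."""
    return w_geom * geometry_term(p) + w_space * spacing_term(p)

def numeric_gradient(f, p, h=1e-6):
    """Central finite differences, used here instead of analytic gradients."""
    grad = []
    for k in range(len(p)):
        up = p[:k] + [p[k] + h] + p[k + 1:]
        down = p[:k] + [p[k] - h] + p[k + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def minimize(f, p0, lr=0.05, steps=500):
    """Plain gradient descent; a stand-in for SLSQP, without constraints."""
    p = list(p0)
    for _ in range(steps):
        g = numeric_gradient(f, p)
        p = [pk - lr * gk for pk, gk in zip(p, g)]
    return p

start = [0.2, 0.0, 0.0, 0.3, -0.1, -0.1]   # three particles near the origin
result = minimize(objective, start)         # particles move out onto the circle
```

Keeping each term as a separate function is what lets an objective be assembled from parts: adding a new requirement means adding one more weighted term, without touching the solver.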

2605.27582 2026-05-28 cs.RO cs.CV 版本更新

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Uni-LaViRA:面向统一具身导航的语言-视觉-机器人动作翻译

Hongyu Ding, Sizhuo Zhang, Ziming Xu, Jinwen Guo, Hongxiu Liu, Xingzhi Cheng, Zixuan Chen, Haifei Qi, Duo Wang, Hao Xu, Jieqi Shi, Yifan Zhang, Jing Huo, Jian Cheng, Yang Gao, Jiebo Luo

发表机构 * Nanjing University(南京大学) Beihang University(北京航空航天大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) BMW (Nanjing) Information Technology Co., Ltd.(宝马(南京)信息技术有限公司) University of Rochester(罗切斯特大学)

AI总结 提出Uni-LaViRA统一智能体架构,通过语言-视觉-机器人动作翻译结构,结合待办列表记忆和二次机会回溯机制,在零训练下实现四类导航任务和四种真实机器人的零样本泛化,性能匹配或超越近期训练式导航基础模型。

Comments Project page: https://xetroubadour.github.io/Uni-LaViRA/

详情
AI中文摘要

具身导航要求智能体将语言和视觉观测映射为一系列空间动作,驱动真实机器人在未见环境中移动。主流方法是在不断增大的机器人轨迹数据集上扩展视觉-语言-动作(VLA)基础模型。本文认为,对于导航而言,通用性可以通过结构获得,而不仅仅依赖数据规模。导航的底层决策结构可简化为单一的语言-视觉-机器人动作翻译。语言动作发出语义级方向指令,视觉动作发出像素级视觉目标。这两个输出都位于预训练多模态大语言模型(MLLM)的自然输出流形内,因此任务可以由智能体推理而非从机器人数据中学习。为此,我们提出Uni-LaViRA,一种统一的智能体架构,将相同的见解零样本地扩展到四个任务族(VLN-CE、ObjectNav、EQA和Aerial-VLN)和四种异构真实机器人(轮式、四足、人形机器人和自建无人机)。两种智能体循环机制使这种统一变得实用。待办列表记忆(TDM)在每一步重写待办子目标的结构化检查清单,将未完成项重新注入智能体的最近注意力窗口。二次机会回溯(SCB)将机器人回滚到错误前状态,并基于失败的子轨迹调整智能体的下一步计划,将单次导航转变为自我纠正过程。无需任何训练,Uni-LaViRA在VLN-CE R2R上达到60.7%的成功率(SR),在VLN-CE RxR上达到51.3%,在HM3D-v2上达到77.7%,在HM3D-OVON上达到60.0%,在MP3D-EQA上达到54.7%,在OpenUAV上达到40.0%,匹配甚至超越了近期消耗数百万样本和数千GPU小时的训练式导航基础模型。

英文摘要

Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action (VLA) foundation models on ever-larger collections of robot trajectories. This paper argues that, for navigation specifically, generality can be obtained structurally, not only through data scale. The underlying decision structure of navigation reduces to a single Language-Vision-Robot Actions Translation. The language action emits a semantic-level directional command and the vision action emits a pixel-level visual target. Both outputs lie inside the natural output manifold of pretrained multimodal large language models (MLLMs), so the task can be reasoned about by an agent rather than learned from robot data. Therefore, we present Uni-LaViRA, a unified agentic architecture that extends the same insight to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (a wheeled robot, a quadruped, a humanoid, and a self-built UAV) in a zero-shot manner. Two agent-loop mechanisms make this unification practical. TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step, reciting the unfinished items back into the agent's most recent attention window. Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent's next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA reaches 60.7% SR on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent training-based navigation foundation models that consume millions of samples and thousands of GPU-hours.
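
A checklist memory of the kind TDM describes could look roughly like the sketch below. It is a guess at the general shape, not the authors' implementation. The class `TodoListMemory`, the helper `build_prompt`, and the prompt layout are hypothetical, and the sub-goal strings are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TodoListMemory:
    """Hypothetical structured checklist of navigation sub-goals."""
    items: list
    done: set = field(default_factory=set)

    def mark_done(self, index: int) -> None:
        self.done.add(index)

    def pending(self) -> list:
        return [t for i, t in enumerate(self.items) if i not in self.done]

    def render(self) -> str:
        # Rewrite the whole checklist on every call, ticking finished items.
        lines = []
        for i, text in enumerate(self.items):
            mark = "x" if i in self.done else " "
            lines.append(f"[{mark}] {text}")
        return "\n".join(lines)


def build_prompt(instruction: str, memory: TodoListMemory) -> str:
    # Put the checklist last so it sits in the most recent part of the context.
    return f"Instruction: {instruction}\nTODO:\n{memory.render()}"


mem = TodoListMemory(["exit the bedroom", "turn left at the hallway", "stop at the sofa"])
mem.mark_done(0)
prompt = build_prompt("Go to the sofa.", mem)
```

Re-rendering the full list at each step, rather than appending a delta, keeps the unfinished items in the latest prompt text regardless of how long the episode has run.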

2605.27539 2026-05-28 cs.RO 版本更新

Synthetic Emotions vs. Gamification: Exploring Engagement Strategies for Small Social Robots in Different Age Groups

合成情感 vs. 游戏化:探索不同年龄段小型社交机器人的参与策略

Morten Roed Frederiksen, Kasper Støy

发表机构 * Data Systems & Robotics Department of The IT-University of Copenhagen(哥本哈根信息技术大学数据系统与机器人学系)

AI总结 本研究通过两项实验(6-8岁儿童偏好评估和20-27岁大学生行为研究)比较了触觉机器人使用合成情感反馈与积分奖励两种参与策略的效果,发现儿童偏好情感参与,而大学生在积分系统下任务准确率更高且表现持久,揭示了不同年龄组在参与策略有效性上的差异。

Comments 7 pages

详情
AI中文摘要

许多儿童在情绪调节和社交互动方面面临挑战,这限制了他们在日常活动和治疗项目中的参与。为了使社交辅助机器人在这一背景下有效,儿童必须保持持续且有意义的参与。我们探索了一种触觉机器人的参与策略,该机器人旨在通过日常互动支持患有焦虑症的儿童。机器人提供合成情感反馈或积分奖励以鼓励用户参与。我们通过两项研究评估了这些策略:一项是对16名6-8岁学龄儿童的偏好评估,另一项是在自然环境中对14名20-27岁大学生的行为研究。对学龄儿童的研究表明,他们更倾向于情感参与而非基于积分的方法。对大学生进行全天互动的后续研究显示了对比结果:基于积分的系统产生了显著更高的任务准确率(p < 0.05)并保持了持续的表现。来自不同用户群体的发现表明,陈述的偏好和行为结果可能因参与环境而异,这凸显了通过观察互动来验证设计假设的重要性。这项工作为人类-机器人交互设计中参与策略有效性的年龄相关差异提供了见解。

英文摘要

Many children experience challenges in emotional regulation and social interaction, which can limit their participation in everyday activities and therapeutic programs. For socially assistive robots to be effective in this context, it is essential that children remain consistently and meaningfully engaged. We explore engagement strategies for a tactile robot designed to support children suffering from anxiety disorders through daily interactions. The robot delivers either synthetic emotional feedback or point rewards to encourage user participation. We evaluated these strategies through two studies: a preference assessment with 16 school children aged 6-8 years, and a behavioral study with 14 university students aged 20-27 years in naturalistic environments. The study with school children indicated a preference for emotional engagement over points-based approaches. The follow-up study with university students across a full day of interactions revealed contrasting results: points-based systems produced significantly higher task accuracy (p < 0.05) and sustained performance over time. Findings from different user groups suggest that stated preferences and behavioral outcomes can diverge depending on engagement context, highlighting the importance of validating design assumptions through observed interaction. This work contributes insights into age-related differences in engagement strategy effectiveness in human-robot interaction design.

2605.27533 2026-05-28 cs.RO 版本更新

Inducing Calmness With Pocket-Sized Robotics: Reducing Movement and Heart Rate in Children through Hand-Held Tactile Interactions

用口袋大小的机器人诱导平静:通过手持触觉交互降低儿童的心率和运动

Morten Roed Frederiksen, Kasper Støy, Maja Matarić

发表机构 * Data Systems and Robotics, IT-University of Copenhagen(数据系统与机器人,哥本哈根信息技术大学) Interaction Lab, University of Southern California(交互实验室,南加州大学)

AI总结 本研究通过手持触觉设备上的节奏振动匹配游戏,发现触觉交互能显著降低儿童的生理唤醒(心率下降3.56 bpm)和身体躁动(整体运动减少38%),从而促进平静和专注状态。

Comments 34 pages, 2 tables, 7 figures

详情
AI中文摘要

高唤醒或躁动期会干扰儿童的注意力、自我调节和身体平静能力。通过触觉交互鼓励具身自我调节的技术可能提供一种简单易行的方法来促进平静。本文研究了与口袋大小的触觉设备交互如何影响典型发育儿童的生理和行为平静标记。基于先前关于心率调节的研究,我们提出了关于触觉交互如何影响全身运动和姿势稳定性的新发现。我们使用一种设备,通过手持节奏振动匹配游戏吸引儿童,旨在集中注意力并鼓励静止。18名儿童参与了一项受试者内研究,包括两种条件:有和没有手持设备的触觉交互,同时记录他们的心率和身体运动。结果表明,触觉游戏交互降低了生理唤醒(心率下降3.56 bpm,p < 0.01)和身体躁动(整体运动减少38%,p < 0.05),与注意力相关的身体区域向静止变化最大(运动减少45%)。这些发现表明,与手持设备的短暂触觉游戏式参与可以下调生理激活,促进平静和专注状态,从而有助于持续注意力和行为调节。

英文摘要

Periods of heightened arousal or restlessness can interfere with children's ability to focus, self-regulate, and physically calm down. Technologies that encourage embodied self-regulation through tactile interaction may provide a simple and accessible means of promoting calmness. This paper investigates how interaction with a pocket-sized tactile device influences physiological and behavioral markers of calmness in typically developing children. Building on prior work examining heart rate modulation, we present new findings on how tactile interaction affects full-body movement and postural stability. We employ a device that engages children through a hand-held rhythmic vibration-matching game, designed to focus attention and encourage stillness. Eighteen children participated in a within-subjects study with two conditions, with and without tactile interaction with a hand-held device, while their heart rate and body movement were recorded. Results show that the tactile game interaction reduced physiological arousal (heart rate decreased by 3.56 bpm, p < 0.01) and physical restlessness (overall movement decreased by 38%, p < 0.05), with attention-related body regions showing the greatest change toward stillness (45% reduction in movement). These findings demonstrate that brief tactile game-like engagement with a hand-held device can down-regulate physiological activation, promoting calm and focused states that support sustained attention and behavior regulation.

2605.27532 2026-05-28 cs.RO 版本更新

SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

SCALE-COMM:用于多智能体强化学习通信的共享、对比对齐潜在嵌入

Mahmoud Abouelyazid, Eman Hammad

AI总结 提出SCALE-COMM框架,通过自监督学习紧凑、稳定的潜在通信表示,解耦通信学习与策略优化,提升多智能体协调的稳定性和样本效率。

Comments IEEE IV 2026

详情
AI中文摘要

涌现通信使得部分可观测的自主移动机器人(AMR)能够在去中心化多智能体强化学习(MARL)环境中有效协调。然而,现有方法常常面临通信协议不稳定、消息语义无根基以及通信学习与策略优化之间的干扰,导致协调性能随时间下降。我们提出SCALE-COMM(用于通信的共享、对比对齐潜在嵌入),一种自监督框架,用于学习紧凑、稳定且与策略相关的通信表示。SCALE-COMM通过训练低维潜在消息来解耦通信学习与策略优化,这些消息捕获与任务相关的规划和交通信息,同时跨智能体和时间强制执行一致性。在标准MARL基准测试和一个现实的仓库协调任务中,SCALE-COMM在表示质量和任务性能方面均持续优于现有通信框架。学习到的通信空间在策略微调下展现出改进的稳定性、样本效率和吞吐量,证明了表示驱动的通信对于可扩展多智能体协调的有效性。

英文摘要

Emergent communication enables partially observant Autonomous Mobile Robots (AMRs) to coordinate effectively in decentralized multi-agent reinforcement learning (MARL) settings. However, existing approaches often struggle with unstable communication protocols, ungrounded message semantics, and interference between communication learning and policy optimization, leading to degraded coordination over time. We propose SCALE-COMM (Shared, Contrastively-Aligned Latent Embeddings for COMMunication), a self-supervised framework for learning compact, stable, and policy-relevant communication representations. SCALE-COMM decouples communication learning from policy optimization by training low-dimensional latent messages that capture task-relevant planning and traffic information, while enforcing consistency across agents and time. Across standard MARL benchmarks and a realistic warehouse coordination task, SCALE-COMM consistently outperforms existing communication frameworks in both representation quality and task performance. The learned communication space yields improved stability, sample efficiency, and throughput under policy fine-tuning, demonstrating the effectiveness of representation-driven communication for scalable multi-agent coordination.

2605.27491 2026-05-28 cs.RO 版本更新

GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

GE-Sim 2.0:迈向机器人操作综合闭环视频世界模拟器的路线图

Boxiang Qiu, Liliang Chen, Yue Liao, Nan Wang, Lintao Wang, Jiayi Luo, Wenzhi Zhao, Shengcong Chen, Di Chen, Ye Li, Chen Gao, Shuicheng Yan, Si Liu, Maoqing Yao, Guanghui Ren

发表机构 * AgiBot BUAA(北京航空航天大学) LV-NUS Lab(新加坡国立大学LV实验室) TJU(天津大学)

AI总结 提出GE-Sim 2.0,一种基于动作条件视频生成的闭环视频世界模拟器,通过重新训练数千小时真实机器人数据并新增状态专家、世界裁判和加速框架三个模块,实现高保真动作跟随和轨迹覆盖,在WorldArena排行榜上以2B参数超越专用模型和通用视频生成器,并验证了基于其生成轨迹和奖励训练的策略在真实世界中的有效性。

详情
AI中文摘要

我们介绍了GE-Sim 2.0(Genie Envisioner世界模拟器2.0),一种用于机器人操作的闭环视频世界模拟器。基于Genie Envisioner的动作条件视频生成框架,GE-Sim 2.0在数千小时的真实机器人数据上重新训练,涵盖遥操作、接触丰富交互和机载策略部署,显著提高了动作跟随保真度和轨迹覆盖范围。在此基础之上,三个新模块实现了从视频模拟到策略学习的闭环:一个状态专家,从视频潜在表示中解码本体感觉状态,以支持下游VLA策略的下一块预测;一个世界裁判,根据任务指令对生成的轨迹进行评分,提供机器可验证的成功信号和奖励,取代人工检查;以及一个加速框架,在单个H100上以2.3秒生成25帧轨迹,并在推理时支持高达4倍跳帧以实现长程评估。GE-Sim 2.0仅以2B参数便登顶公开的WorldArena排行榜,超越了专用机器人世界模型和闭源通用视频生成器,并且基于其生成轨迹和奖励训练的策略可转化为可测量的真实世界收益,确立了GE-Sim 2.0作为可扩展评估和操作策略闭环学习的实用平台。

英文摘要

We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed-loop video world simulator for robotic manipulation. Building on the action-conditioned video generation framework of Genie Envisioner, GE-Sim 2.0 is re-trained on thousands of hours of real-world robot data spanning teleoperation, contact-rich interaction, and on-robot policy deployment, substantially improving action-following fidelity and trajectory coverage. On top of this foundation, three new modules close the loop from video simulation to policy learning: a state expert that decodes proprioceptive state from video latents to support next-chunk prediction by downstream VLA policies; a world judge that scores generated rollouts against task instructions, yielding machine-verifiable success signals and rewards in place of manual inspection; and an acceleration framework that delivers a 25-frame rollout in 2.3 seconds on a single H100, with up to 4× frame skipping at inference for long-horizon evaluation. GE-Sim 2.0 tops the public WorldArena leaderboard at only 2B parameters, outperforming both dedicated robotic world models and closed-source general video generators, and policies trained against its rollouts and rewards translate into measurable real-world gains, establishing GE-Sim 2.0 as a practical platform for scalable evaluation and closed-loop learning of manipulation policies.

2605.27461 2026-05-28 cs.RO 版本更新

A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons

工业包装任务的VLA流水线工厂部署案例研究:工作流、故障与经验教训

Brian Zhu, Philipp Schmitt, Philine Meister, Lukas Gensler, Momen Khalil, Emmanuele Poggi, Johannes Hechtl, Carsten Braunroth, Kai Wurm, Gokul Narayanan, Eugen Solowjow, Georg von Wichert, Andre Scholz, Felix Albrecht, Maxmillian Metzner

发表机构 * Siemens Corporation(西门子公司)

AI总结 本研究通过在西门子工厂部署预训练Pi0.5策略执行工业包装任务,迭代微调并收集2535个现场数据片段,总结了VLA流水线部署中的常见故障模式与改进工作流的经验教训。

详情
AI中文摘要

视觉-语言-动作(VLA)策略展示了有前景的操作能力,但其实际影响常受限于现实部署的可靠性要求。我们展示了西门子工厂(德国埃尔朗根GWE)中一项工业包装任务的部署研究:机器人必须从杂乱堆中拾取透明配件袋,将其插入纸板包装的剩余空腔,并确保袋子及其内容物保持在闭合平面以下。我们的目标是理解通过迭代微调和部署驱动的改进,将预训练的Pi0.5策略适配到单一工厂任务所需的实际工作量。该流水线包括数据收集、整理、微调、评估和针对性恢复数据收集的重复循环。我们从现场工厂设置中积累了2535个片段(10小时)。在本文中,我们贡献了一个工厂级VLA部署的实证报告,重点介绍了常见的故障模式和有助于改进部署工作流的经验教训。

英文摘要

Vision-Language-Action (VLA) policies have shown promising manipulation capabilities, yet their practical impact is often limited by the reliability demands of real-world deployment. We present a deployment study of an industrial packaging task at Siemens Factory (GWE, Erlangen, Germany), where a robot must pick a transparent accessory bag from a cluttered pile, insert it into the remaining cavity of a cardboard package, and ensure that the bag and its contents remain below the closing plane. Our goal is to understand the practical effort required to adapt a pretrained Pi0.5 policy to a single factory-floor task through iterative fine-tuning and deployment-driven refinement. The pipeline consists of repeated loops of data collection, curation, fine-tuning, evaluation, and targeted recovery data collection. We have accumulated 2535 episodes (10 hours) from the on-site factory settings. In this paper, we contribute an empirical account of a factory-floor VLA deployment, highlighting recurring failure modes and lessons that inform how to improve the deployment workflow.

2605.27418 2026-05-28 cs.MA cs.RO 版本更新

Differentiable Model Predictive Safety for Heterogeneous Mobility at Urban Intersections

城市交叉口异构移动体的可微分模型预测安全

Wenzhe Song, Hao Zhang

发表机构 * School of Business(商学院) Department of Mechanical Engineering(机械工程系) Stevens Institute of Technology(史蒂文斯理工学院) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出可微分模型预测安全(DMPS)框架,将模型预测控制的前瞻性嵌入数据驱动的端到端强化学习架构,通过可微分安全评价器实现精确在线安全校正,在高密度混合交通仿真中将碰撞率降至5.6%以下。

Comments 6 pages. Published in IEEE IARCE 2025

详情
Journal ref
2025 IEEE 5th International Conference on Industrial Automation, Robotics and Control Engineering (IARCE), Chongqing, China, 2025, pp. 1-6
AI中文摘要

自动驾驶车辆和移动机器人在城市环境中的即将集成对未来的智能交通系统提出了严峻的安全挑战。本文解决了在无信号交叉口协调具有不同动力学的异构智能体的复杂问题。我们引入了一种新颖的框架,称为可微分模型预测安全(DMPS),它将模型预测控制的前瞻性嵌入到数据驱动的端到端强化学习架构中。DMPS智能体学习一个潜在动力学模型,以预测依赖于其动作的未来轨迹。然后,一个学习到的可微分安全评价器评估这些轨迹的风险。关键的是,通过利用通过整个展开预测模型的反向传播,智能体可以高效地计算未来安全性相对于当前动作的梯度,从而实现最小且精确的在线安全校正。集成到多智能体训练方案中,DMPS在高密度混合车辆-机器人交通仿真中几乎消除了碰撞,碰撞率低于5.6%,在不牺牲能量和交通效率的情况下展示了最先进的安全性。

英文摘要

The imminent integration of autonomous vehicles and mobile robots in urban settings presents a critical safety challenge for future intelligent transportation systems. This paper addresses the complex problem of coordinating heterogeneous agents with disparate dynamics at unregulated intersections. We introduce a novel framework, differentiable model predictive safety (DMPS), which embeds the foresight of model-predictive control into a data-driven, end-to-end reinforcement learning architecture. DMPS agents learn a latent dynamics model to predict future trajectories contingent on their actions. A learned, differentiable safety critic then evaluates the risk of these trajectories. Crucially, by leveraging backpropagation through the entire unrolled predictive model, agents can efficiently compute the gradient of future safety with respect to their current action, enabling a minimal and precise online safety correction. Integrated into a multi-agent training scheme, DMPS reduces the collision rate to below 5.6% in high-density, mixed vehicle-robot traffic simulations, demonstrating state-of-the-art safety without compromising energy and traffic efficiency.
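
The core mechanism, taking the gradient of predicted risk with respect to the current action and nudging the action only as far as needed, can be illustrated with a hand-written toy. Everything here is an assumption made for illustration: a 1-D constant-velocity rollout replaces the learned latent dynamics model, a squared-hinge distance penalty replaces the learned safety critic, and the chain rule is written out by hand instead of using automatic backpropagation.

```python
def rollout(x0, action, dt=0.1, horizon=10):
    """Toy dynamics: constant-velocity motion, x_t = x0 + action * t * dt."""
    return [x0 + action * t * dt for t in range(1, horizon + 1)]


def risk_and_grad(x0, action, x_obs, margin, dt=0.1, horizon=10):
    """Risk of the predicted trajectory and its gradient w.r.t. the action."""
    risk, grad = 0.0, 0.0
    for t, x in enumerate(rollout(x0, action, dt, horizon), start=1):
        diff = x - x_obs
        hinge = max(0.0, margin - abs(diff))  # positive only inside the margin
        risk += hinge ** 2
        sign = 1.0 if diff >= 0 else -1.0
        d_risk_dx = -2.0 * hinge * sign       # derivative of hinge^2 w.r.t. x
        dx_da = t * dt                        # chain rule through the rollout
        grad += d_risk_dx * dx_da
    return risk, grad


def correct_action(x0, action, x_obs, margin, lr=0.1, tol=1e-6, max_iters=100):
    """Take small gradient steps and stop as soon as the risk is negligible."""
    for _ in range(max_iters):
        risk, grad = risk_and_grad(x0, action, x_obs, margin)
        if risk <= tol:
            break
        action -= lr * grad
    return action


safe = correct_action(0.0, 1.0, x_obs=1.2, margin=0.3)       # slows down slightly
unchanged = correct_action(0.0, 0.5, x_obs=1.2, margin=0.3)  # already safe
```

Stopping at the first action whose predicted risk falls below the tolerance is what keeps the correction small: an action that is already safe is returned untouched.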

2605.27365 2026-05-28 cs.CV cs.AI cs.LG cs.RO 版本更新

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

LocateAnything: 基于并行框解码的快速高质量视觉-语言定位

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Princeton University(普林斯顿大学) Nanjing University(南京大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出并行框解码(PBD)方法,将边界框和点作为原子单元单步解码,结合大规模数据集LocateAnything-Data,实现高效统一的目标定位与检测,在保持高精度同时显著提升解码吞吐量。

Comments fix github link

详情
AI中文摘要

视觉语言模型(VLM)通常将视觉定位和检测表述为坐标令牌生成问题,将每个2D框序列化为多个1D令牌,这些令牌在很大程度上独立学习和解码。这种逐令牌解码与框几何的耦合结构不匹配,并且由于严格的顺序生成而造成了实际的推理瓶颈。我们引入了LocateAnything,一个基于并行框解码(PBD)的统一生成式定位和检测框架。通过将边界框和点等几何元素作为原子单元单步解码,LocateAnything保持了框内几何一致性并实现了显著的并行性。我们证明PBD提高了解码吞吐量和定位精度。我们进一步开发了一个可扩展的数据引擎,并策划了LocateAnything-Data,这是一个包含超过1.38亿个训练样本的大规模数据集,大大增加了高精度定位的数据多样性。大量评估表明,LocateAnything推进了速度-精度前沿,在多个基准测试中实现了显著更高的解码吞吐量,同时提高了高IoU定位质量。结果突显了并行框解码和大规模训练数据在实现高效精确的统一视觉定位和检测中的互补优势。

英文摘要

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.

2605.19257 2026-05-28 cs.RO 版本更新

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

PRISM-SLAM: 面向尺度感知度量SLAM的概率射线基础推理

Eunsoo Im, Gyeonggwan Lee, Junghun Suh

发表机构 * KakaoMobility, South Korea(韩国 KakaoMobility)

AI总结 提出PRISM-SLAM框架,通过将视觉基础模型先验集成到贝叶斯因子图中,利用Plücker射线距离因子和动态场景不确定性门控机制,实现无尺度漂移的实时单目度量SLAM。

详情
AI中文摘要

单目SLAM历来在动态环境中存在尺度模糊和跟踪失败的问题。虽然最近的视觉基础模型(VFM)提供了显著的零样本深度先验,但简单地整合这些确定性预测忽略了预测不确定性和帧间尺度不一致性。我们提出了PRISM-SLAM,一个实时框架,将VFM先验严格集成到结构化的贝叶斯因子图中,以实现尺度感知、度量一致的定位与建图。具体来说,我们引入了Plücker射线距离因子,将单目观测锚定在全局一致的度量坐标系中的绝对空间,通过使度量尺度Fisher可识别,从数学上解决了尺度漂移。为了处理环境动态,我们从时间深度一致性中推导出认知不确定性代理,并设计了动态场景不确定性门控(DSUG)机制。这种软门控方法概率性地降低动态干扰物的权重,而不会产生与传统语义分割掩码相关的高计算开销。通过采用多进程架构异步处理VFM推理和几何跟踪,PRISM-SLAM仅使用RGB输入即可在30 FPS下提供验证的度量输出,弥合了基础模型与现实机器人应用之间的差距。在TUM RGB-D和7-Scenes基准上的评估表明,PRISM-SLAM的度量$SE(3)$绝对轨迹误差(ATE)几乎与其对齐的$Sim(3)$误差相同。这表明我们的系统能够生成可直接部署的度量轨迹,无需任何后处理尺度校正。项目页面:https://prismslam-cmd.github.io/prismslam_pr/

英文摘要

Monocular SLAM historically suffers from scale ambiguity and tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide remarkable zero-shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistencies. We propose PRISM-SLAM, a real-time framework that rigorously integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Plücker Ray-Distance Factor to anchor monocular observations in absolute space within a globally consistent metric coordinate system, mathematically resolving scale drift by making the metric scale Fisher-identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft-gating approach probabilistically down-weights dynamic distractors without incurring the heavy computational overhead associated with traditional semantic segmentation masks. By employing a multi-process architecture that asynchronously processes VFM inference and geometric tracking, PRISM-SLAM provides verified metric output at 30 FPS using solely RGB input, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric $SE(3)$ Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned $Sim(3)$ error. This demonstrates that our system can produce deployment-ready metric trajectories by delivering robust metric SLAM solutions without any post-hoc scale correction. Project page: https://prismslam-cmd.github.io/prismslam_pr/
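
As background for the Plücker Ray-Distance Factor, the underlying geometric primitive can be written in a few lines. This is a sketch of the standard point-to-line distance in Plücker coordinates, not the factor defined in the paper. The function names are illustrative, and the residual the authors actually optimise may differ.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction):
    """Return (d, m): unit direction d and moment m = origin x d."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    return d, cross(origin, d)

def point_to_ray_distance(point, ray):
    """Distance from a point to the infinite line (d, m): ||point x d - m||."""
    d, m = ray
    r = cross(point, d)
    return math.sqrt(sum((ri - mi) ** 2 for ri, mi in zip(r, m)))
```

Because this distance is expressed in the same units as the map, a residual of this kind changes when the whole map is rescaled. That is one way to see why such a term can constrain metric scale, whereas a purely reprojection-based cost cannot.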

2605.17929 2026-05-28 cs.RO 版本更新

TacSE3: Equivariant SE(3) Motion Estimation from Low-Texture Visuotactile Images for In-Gripper Tracking and Compensation

TacSE3: 基于低纹理视触觉图像的等变SE(3)运动估计用于夹爪内跟踪与补偿

Zhongyuan Liao, Junzhe Wang, Qingyang Liu, Zhenmin Huang, Jun Ma, Yi Cai, Fei Meng, Haobo Liang, Michael Yu Wang

发表机构 * Hong Kong Center for Construction Robotics, The Hong Kong University of Science and Technology, Hong Kong, China(香港建设机器人中心,香港科技大学,中国香港) The Hong Kong University of Science and Technology (Guangzhou), China(香港科技大学(广州),中国) School of Engineering, Great Bay University, Dongguan, China(工程学院,大湾区大学,中国东莞)

AI总结 提出TacSE3,一种将低纹理视触觉观测转化为解耦三维力场并估计SE(3)增量刚体运动的触觉运动估计流程,通过双传感器感知减少平移-旋转歧义,实现夹爪内跟踪与补偿。

详情
AI中文摘要

机器人手内操作需要在频繁视觉遮挡下可靠地跟踪物体运动,然而低纹理视触觉图像为传统的图像或几何匹配方法提供的稳定对应点很少。本文提出TacSE3,一种触觉运动估计流程,将低纹理视触觉观测转化为解耦的三维力场,并估计SE(3)上的增量刚体运动。该方法从接触质心运动推导平面平移,并主要从剪切相关的触觉响应估计旋转,从而为夹爪内跟踪和补偿提供物理可解释的信号。使用成对DM-Tac指尖传感器的实验表明,双传感器感知减少了平移-旋转歧义,支持跨轴和物体几何形状的旋转跟踪,并提供轻量级补偿信号,在不重新训练基础策略的情况下提高了下游操作任务中的扰动容忍度。

英文摘要

Robotic in-hand manipulation requires reliable object-motion tracking under frequent visual occlusion, yet low-texture visuotactile images provide few stable correspondences for conventional image- or geometry-matching methods. This paper presents TacSE3, a tactile motion-estimation pipeline that converts low-texture visuotactile observations into a decoupled three-dimensional force field and estimates incremental rigid-body motion on SE(3). The method derives planar translation from contact-centroid motion and estimates rotation primarily from shear-related tactile responses, yielding a physically interpretable signal for in-gripper tracking and compensation. Experiments with paired DM-Tac fingertip sensors show that dual-sensor sensing reduces translation-rotation ambiguity, supports rotation tracking across axes and object geometries, and provides a lightweight compensation signal that improves disturbance tolerance in downstream manipulation tasks without retraining the base policy.
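
The idea of deriving planar translation from contact-centroid motion can be sketched as follows. This is a minimal illustration under assumptions, not the TacSE3 pipeline. It treats one channel of the force field as a 2-D list of non-negative normal-force values, and `mm_per_pixel` is a hypothetical calibration constant.

```python
def contact_centroid(pressure):
    """Force-weighted centroid (row, col) of a 2-D normal-force map, or None."""
    total = sum(sum(row) for row in pressure)
    if total <= 0:
        return None  # no contact detected
    r = sum(i * sum(row) for i, row in enumerate(pressure)) / total
    c = sum(j * v for row in pressure for j, v in enumerate(row)) / total
    return r, c

def planar_translation(prev_map, curr_map, mm_per_pixel=1.0):
    """Incremental in-plane shift of the contact patch between two frames."""
    p, q = contact_centroid(prev_map), contact_centroid(curr_map)
    if p is None or q is None:
        return None
    return ((q[0] - p[0]) * mm_per_pixel, (q[1] - p[1]) * mm_per_pixel)
```

A centroid shift alone cannot distinguish translation from rotation about a distant axis. This is consistent with the abstract's point that rotation is estimated mainly from shear responses and that a second sensor reduces the ambiguity.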

2605.16030 2026-05-28 cs.LG cs.RO 版本更新

Mind Dreamer: Untethering Imagination via Active Causal Intervention on Latent Manifolds

Mind Dreamer: 通过潜在流形上的主动因果干预释放想象力

Shaojun Xu, Xiaoling Zhou, Yihan Lin, Yapeng Meng, Xinglong Ji, Luping Shi, Rong Zhao

发表机构 * Center for Brain-Inspired Computing Research, Department of Precision Instrument, Tsinghua University, Beijing, China(类脑计算研究中心,精密仪器系,清华大学,北京,中国) College of Computer Science and Technology, Zhejiang University, Hangzhou, China(计算机科学与技术学院,浙江大学,杭州,中国) Pen-Tung Sah Institute of Micro-Nano Science and Technology, Xiamen University, Xiamen, China(萨本栋微米纳米科学技术研究院,厦门大学,厦门,中国)

AI总结 针对基于模型的强化学习中历史束缚导致策略优化滞后的问题,提出Mind Dreamer框架,通过主动因果干预生成非连续潜在跳跃,并推导中继价值函数与中继不确定性函数,实现样本效率提升。

Comments 34 pages, 7 figures, ICML 2026 accepted

详情
AI中文摘要

基于模型的强化学习通过潜在想象实现样本效率,但仍受限于历史束缚:想象通常从观测状态初始化。这造成了学习不对称,即世界模型的流形发现速度超过策略的稀疏奖励优化速度。我们提出Mind Dreamer (MD)框架,实例化主动因果干预以超越马尔可夫连续性。MD将发现重新定义为全局中继期望自由能的最小化。它不是从历史数据初始化,而是从对抗生成器$s_0 \sim p_{gen}(\cdot)$中抽取初始状态,产生到认知盲点的非连续潜在跳跃,这些盲点物理上合理但认知上具有挑战性。我们推导了中继价值函数和中继不确定性函数,以解决这些空间断裂中的信用分配悖论。将合成锚点视为干预性中间状态,这些势能通过贝尔曼式备份传播实用和认知价值。值得注意的是,我们证明了跨不连续性的不确定性传播需要二次折扣$\gamma^2$,建立了形式化的认知视野。理论上,MD近似一个方差最小化重要性采样器,扩大了流形的谱间隙,减少了到达关键瓶颈状态的命中时间。实验上,MD在DeepMind Control Suite上比DreamerV3平均加速1.67倍,在稀疏奖励任务中达到8.8倍。

英文摘要

Model-Based Reinforcement Learning yields sample efficiency via latent imagination, yet remains constrained by Historical Tethering: imagination is typically initialized from observed states. This creates a learning asymmetry, where the world model's manifold discovery outpaces the policy's sparse-reward optimization. We propose Mind Dreamer (MD), a framework that instantiates Active Causal Intervention to transcend Markovian continuity. MD reformulates discovery as the minimization of a global Relay Expected Free Energy. Instead of initializing from historical data, it draws initial states from an adversarial generator $s_0 \sim p_{gen}(\cdot)$, creating non-continuous latent jumps to epistemic blind spots that are physically plausible yet cognitively challenging. We derive a Relay Value Function and a Relay Uncertainty Function to resolve the credit assignment paradox across these spatial ruptures. Treating synthesized anchors as interventional intermediary states, these potentials propagate pragmatic and epistemic value through Bellman-style backups. Notably, we prove that uncertainty propagation across discontinuities necessitates a quadratic discount $\gamma^2$, establishing a formal epistemic horizon. Theoretically, MD approximates a variance-minimizing importance sampler that expands the manifold's spectral gap, reducing the hitting time to critical bottleneck states. Empirically, MD achieves a 1.67$\times$ average speedup over DreamerV3 on DeepMind Control Suite, reaching 8.8$\times$ in sparse-reward tasks.
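
The quadratic discount has a simple textbook analogue that may help build intuition. For independent per-step terms, the variance of a discounted sum obeys a Bellman-style recursion with factor gamma squared, because Var(gamma * X) = gamma^2 * Var(X). The sketch below shows only this standard identity. It is not the paper's Relay Uncertainty Function, and the independence assumption is ours.

```python
def backup_mean(rewards, gamma):
    """Backward recursion V_t = r_t + gamma * V_{t+1} for expected values."""
    v = 0.0
    for r in reversed(rewards):
        v = r + gamma * v
    return v

def backup_variance(variances, gamma):
    """Backward recursion U_t = s2_t + gamma**2 * U_{t+1} for independent terms."""
    u = 0.0
    for s2 in reversed(variances):
        u = s2 + gamma ** 2 * u
    return u
```

One consequence is a shorter effective horizon for the uncertainty term: the geometric weights sum to 1/(1 - gamma^2) rather than 1/(1 - gamma), which for gamma = 0.99 is roughly 50 steps instead of 100.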

2602.05198 2026-05-28 cs.RO 版本更新

Informative Path Planning with Guaranteed Estimation Uncertainty

具有保证估计不确定性的信息性路径规划

Kalvik Jakkala, Saurav Agarwal, Jason O'Kane, Srinivas Akella

发表机构 * Texas A&M University(德克萨斯农工大学) Indian Institute of Technology Bombay(孟买印度理工学院) University of North Carolina at Charlotte(夏洛特北卡罗来纳大学)

AI总结 提出一种三阶段方法,通过高斯过程模型将不确定性约束转化为覆盖图,并规划近最短路径,在复杂环境中实现具有保证估计不确定性的信息性路径规划。

Comments 15 pages, 11 figures, RSS 2026

详情
AI中文摘要

环境监测机器人通常需要在严格的资源约束下估计数据场(例如盐度、温度、水深)。经典的往复式割草机式测量提供了几何覆盖保证,但可能因过度采样可预测区域而浪费精力。相比之下,信息性路径规划(IPP)方法利用空间相关性减少过度采样,但通常不提供估计质量的保证。本文通过解决复杂环境中具有保证估计不确定性的IPP来弥合这些方法:计算最短路径,其测量确保高斯过程(GP)后验方差——一种内在的不确定性度量,在GP模型下下界均方预测误差——在监测区域上由用户指定的阈值上界。 我们提出了一种高效环境监测的三阶段方法:(i)从先验信息学习GP模型;(ii)将GP核转换为二元覆盖图,识别不确定性可降低到目标阈值以下的位置;(iii)规划一条近最短路径以满足全局不确定性约束。我们的方法结合非平稳核以捕捉异质现象中空间变化的关联性,并适应具有障碍物的非凸环境。我们为传感位置选择以及在旅行预算下的联合选择与路由问题提供了近最优近似保证。在真实世界地形数据上的实验表明,我们的规划器以比代表性基线更少的传感位置和更短的旅行距离实现了不确定性目标。此外,使用自主水面和水下车辆进行的现场实验验证了该方法的实际可行性。我们的代码可在www.sgp-tools.com获取。

英文摘要

Environmental monitoring robots often need to estimate data fields (e.g., salinity, temperature, bathymetry) under tight resource constraints. Classical boustrophedon lawnmower surveys provide geometric coverage guarantees but can waste effort by oversampling predictable regions. In contrast, informative path planning (IPP) methods leverage spatial correlations to reduce oversampling, yet typically offer no guarantees on estimation quality. This paper bridges these approaches by addressing IPP with guaranteed estimation uncertainty in complex environments: computing the shortest path whose measurements ensure that the Gaussian process (GP) posterior variance -- an intrinsic uncertainty measure that lower-bounds the mean-squared prediction error under the GP model -- is upper bounded by a user-specified threshold over the monitoring region. We propose a three-stage approach for efficient environmental monitoring: (i) learning a GP model from prior information; (ii) transforming the GP kernel into binary coverage maps that identify locations where uncertainty can be reduced below a target threshold; and (iii) planning a near-shortest route to satisfy the global uncertainty constraint. Our approach incorporates non-stationary kernels to capture spatially varying correlations in heterogeneous phenomena and accommodates non-convex environments with obstacles. We provide near-optimal approximation guarantees for both sensing-location selection and the joint selection-and-routing problem under a travel budget. Experiments on real-world topographic data demonstrate that our planners achieve uncertainty targets with fewer sensing locations and shorter travel distances than representative baselines. Furthermore, field experiments with autonomous surface and underwater vehicles validate the real-world feasibility of the approach. Our code is available at: www.sgp-tools.com
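
Stage (ii), turning a GP kernel into a binary coverage map, can be illustrated with the closed-form posterior variance after a single noisy measurement. This sketch shows one plausible reading rather than the authors' construction. It assumes a stationary RBF kernel, whereas the paper also uses non-stationary kernels, and one sensing location at a time. All parameter values are illustrative.

```python
import math

def rbf(x, y, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two points given as tuples."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return signal_var * math.exp(-d2 / (2 * lengthscale ** 2))

def posterior_variance(x, sensor, noise_var=0.01):
    """GP posterior variance at x after one noisy measurement at `sensor`."""
    k_xs = rbf(x, sensor)
    return rbf(x, x) - k_xs ** 2 / (rbf(sensor, sensor) + noise_var)

def coverage_map(grid, sensor, threshold, noise_var=0.01):
    """True where a measurement at `sensor` brings the variance under threshold."""
    return [posterior_variance(x, sensor, noise_var) <= threshold for x in grid]
```

With such maps precomputed for each candidate sensing location, choosing locations so that every grid cell is covered becomes a set-cover style problem. This matches the abstract's framing of selection followed by routing.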

2605.08758 2026-05-28 cs.RO cs.AI math.OC 版本更新

Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

基于全尺度学习的料箱搬运机器人系统订单履行序贯决策框架

Jiaxin Liu, Peng Yang, Yuping Li, Xinyue Xie

发表机构 * Institution of Data and Information, Shenzhen International Graduate School, Tsinghua University, Nanshan District, Shenzhen 518055, China(数据与信息研究所,深圳国际研究生院,清华大学,南山区,深圳518055,中国)

AI总结 针对料箱搬运机器人系统的订单履行决策,提出一种结合结构化组合优化与多智能体强化学习的通用可扩展序贯决策框架OLSF-TRS,在小规模系统上平均最优性差距低于3.5%,在大规模场景中相比启发式基线减少8-12%的料箱移动,并保持实时响应。

Comments 35 pages, 5 figures

详情
AI中文摘要

受电子商务和小批量生产的快速扩张推动,成品、半成品和原材料的内部物流负载单元规模正在稳步缩小。料箱正逐渐取代托盘成为主要的搬运和存储容器。这一转变将料箱搬运机器人系统推向了自动化订单履行中心的前沿。料箱搬运机器人系统的订单履行决策具有共同的订单-料箱-机器人序贯决策性质。现有研究主要针对特定系统的决策机制,难以泛化或迁移到其他场景。我们提出了一种基于全尺度学习的料箱搬运机器人系统订单履行序贯决策框架(OLSF-TRS),这是一个通用且可扩展的序贯决策框架,结合了结构化组合优化与多智能体强化学习,以协调订单、料箱和机器人决策。在小规模料箱搬运机器人系统上,OLSF-TRS在两种不同的系统配置下实现了接近最优的性能,平均最优性差距低于3.5%。在大规模场景中,OLSF-TRS在两种不同类型的系统上始终优于启发式基线,与基于规则的最先进方法相比,总料箱移动量减少了8-12%和超过30%,同时保持实时响应。这些改进转化为切实的运营效益,包括成本降低、能耗降低和吞吐量稳定性增强。所提出的框架为广泛部署的料箱搬运机器人系统提供了一种高效且统一的订单履行决策框架,支持电子商务和工业物流领域的高质量订单履行。

英文摘要

Driven by the rapid expansion of e-commerce and small-batch production, the size of the intralogistics load unit of finished goods, semi-finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote-handling robotic systems to the forefront of automated order fulfillment centers. The order-fulfillment decisions of tote-handling robotic systems share a common order-tote-robot sequential decision-making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems (OLSF-TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning to coordinate order, tote, and robot decisions. On small-scale tote-handling robotic systems, OLSF-TRS achieves near-optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large-scale scenarios, OLSF-TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8-12% and over 30% compared to SOTA rule-based approaches, while maintaining real-time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework delivers an efficient and unified order fulfillment decision-making framework for widely deployed tote-handling robotic systems, supporting high-quality order fulfillment in both e-commerce and industrial logistics sectors.

2604.05673 2026-05-28 cs.RO cs.AI 版本更新

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

整流薛定谔桥匹配用于少步视觉导航

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) College of Computer Science, Chongqing University(重庆大学计算机学院) Department of Computer Science, University of Liverpool(利物浦大学计算机科学系) Changchun GenY Technology Co., Ltd.(长春GenY科技有限公司)

AI总结 提出整流薛定谔桥匹配(RSBM)框架,利用速度场结构不变性和线性方差减少,在仅3步积分中实现高保真生成策略,满足具身AI低延迟需求。

Comments 18 pages, 7 figures, 10 tables. Code available at https://github.com/WuyangLuan/RSBM

详情
AI中文摘要

视觉导航是具身AI中的核心挑战,要求自主智能体将高维感官观测转化为连续的、长视界动作轨迹。基于扩散模型和薛定谔桥(SB)的生成策略能有效捕捉多模态动作分布,但由于高方差随机传输,需要数十个积分步骤,这对实时机器人控制构成了关键障碍。我们提出整流薛定谔桥匹配(RSBM),该框架利用标准薛定谔桥(ε=1,最大熵传输)与确定性最优传输(ε→0,如条件流匹配)之间共享的速度场结构,由单一熵正则化参数ε控制。我们证明两个关键结果:(1)条件速度场的函数形式在整个ε谱上保持不变(速度结构不变性),使单一网络能够服务于所有正则化强度;(2)减小ε线性降低条件速度方差,实现更稳定的粗步ODE积分。基于缩短传输距离的学习条件先验,RSBM在中间ε下运行,平衡多模态覆盖和路径直线性。实验表明,标准桥需要≥10步才能收敛,而RSBM在仅3个积分步骤中实现了超过94%的余弦相似度和92%的成功率——无需蒸馏或多阶段训练——显著缩小了高保真生成策略与具身AI低延迟需求之间的差距。

英文摘要

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ($\varepsilon=1$, maximum-entropy transport) and deterministic Optimal Transport ($\varepsilon\to 0$, as in Conditional Flow Matching), controlled by a single entropic regularization parameter $\varepsilon$. We prove two key results: (1) the conditional velocity field's functional form is invariant across the entire $\varepsilon$-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing $\varepsilon$ linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate $\varepsilon$ that balances multimodal coverage and path straightness. Empirically, while standard bridges require $\geq 10$ steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps -- without distillation or multi-stage training -- substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.
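
The role of the regularization parameter can be illustrated with a scalar Brownian bridge whose conditional variance is eps * t * (1 - t). The sketch below assumes the probability-flow form of the conditional velocity that appears in the Schrödinger-bridge matching literature. Whether RSBM uses exactly this parameterisation is an assumption, and all function names are illustrative.

```python
import math
import random

def bridge_sample(x0, x1, t, eps, rng):
    """Draw x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1)."""
    mean = (1 - t) * x0 + t * x1
    return mean + math.sqrt(eps * t * (1 - t)) * rng.gauss(0.0, 1.0)

def conditional_velocity(x, x0, x1, t):
    """Probability-flow velocity of the bridge; eps does not appear explicitly."""
    mean = (1 - t) * x0 + t * x1
    return (1 - 2 * t) / (2 * t * (1 - t)) * (x - mean) + (x1 - x0)

def velocity_variance(t, eps):
    """Variance of conditional_velocity over bridge samples at time t."""
    return eps * (1 - 2 * t) ** 2 / (4 * t * (1 - t))
```

In this toy setting the velocity formula has the same form for every eps, the variance of the regression target is proportional to eps, and eps = 0 recovers the straight-line interpolant used in conditional flow matching. These properties mirror, in simplified form, the two results stated in the abstract.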

2603.13003 2026-05-28 cs.RO cs.SY eess.SY 版本更新

From Passive Monitoring to Active Defence: Resilient Control of Manipulators Under Cyberattacks

从被动监测到主动防御:网络攻击下机械臂的弹性控制

Gabriele Gualandi, Alessandro V. Papadopoulos

发表机构 * Department of Computer Science and Engineering, Mälardalen University(计算机科学与工程系,梅拉达伦大学)

AI总结 针对虚假数据注入攻击(FDIA)下冗余机械臂的弹性控制问题,提出一种基于异常分数的主动控制级防御方法,通过单调函数衰减控制输入,显著降低攻击引起的末端执行器偏差,同时保证无攻击时的标称性能。

Comments v2: Accepted at ICRA 2026. Corrected minor typos, grammatical errors, and notation inconsistencies. Corrected the attacker's PD law in Sec. III-C: removed the feedforward acceleration term, viable only when the attacker assumes sufficient tracking precision; the active defence prevents this in our experiments, so only PD terms are used

详情
AI中文摘要

网络物理机器人系统容易受到虚假数据注入攻击(FDIAs),其中攻击者在规避基于残差的被动异常检测器(如卡方检验)的同时破坏传感器信号。这种隐蔽攻击可以在不触发警报的情况下引起显著的末端执行器偏差。本文研究了冗余机械臂对隐蔽FDIAs的弹性,并将架构从被动监测推进到主动防御。我们建立了一个闭环模型,包括反馈线性化机械臂、稳态卡尔曼滤波器和基于卡方的异常检测器。在此被动监测层的基础上,我们提出了一种主动控制级防御,通过一个新颖的驱动投影、无测量状态预测器生成的异常分数的单调函数来衰减控制输入。所提出的设计在标称驱动损失上提供了概率保证,并保持了闭环稳定性。从攻击者角度,我们推导了一个凸QCQP来计算一步最优隐蔽攻击。在六自由度平面机械臂上的仿真表明,所提出的防御显著减少了攻击引起的末端执行器偏差,同时在无攻击时保持了标称任务性能。

英文摘要

Cyber-physical robotic systems are vulnerable to false data injection attacks (FDIAs), in which an adversary corrupts sensor signals while evading residual-based passive anomaly detectors such as the chi-squared test. Such stealthy attacks can induce substantial end-effector deviations without triggering alarms. This paper studies the resilience of redundant manipulators to stealthy FDIAs and advances the architecture from passive monitoring to active defence. We formulate a closed-loop model comprising a feedback-linearized manipulator, a steady-state Kalman filter, and a chi-squared-based anomaly detector. Building on this passive monitoring layer, we propose an active control-level defence that attenuates the control input through a monotone function of an anomaly score generated by a novel actuation-projected, measurement-free state predictor. The proposed design provides probabilistic guarantees on nominal actuation loss and preserves closed-loop stability. From the attacker perspective, we derive a convex QCQP for computing one-step optimal stealthy attacks. Simulations on a 6-DOF planar manipulator show that the proposed defence significantly reduces attack-induced end-effector deviation while preserving nominal task performance in the absence of attacks.
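
The contrast between a passive chi-squared alarm and an active, score-dependent attenuation can be sketched as follows. This is a simplified illustration, not the paper's design. It assumes a diagonal innovation covariance, reuses a single chi-squared style score for both layers (the paper computes its anomaly score from a separate measurement-free predictor), and uses a hypothetical attenuation law `1 / (1 + gain * score)`.

```python
def anomaly_score(residual, innovation_var):
    """Chi-squared statistic r^T S^-1 r for a diagonal covariance S (assumed)."""
    return sum(r * r / s for r, s in zip(residual, innovation_var))

def passive_alarm(score, threshold=5.991):
    """Passive monitor: 5.991 is the 95% chi-squared quantile for 2 dof."""
    return score > threshold

def attenuation(score, gain=0.5):
    """Hypothetical monotone decreasing map from score to a factor in (0, 1]."""
    return 1.0 / (1.0 + gain * score)

def defended_input(u_nominal, score, gain=0.5):
    """Active defence: scale the nominal control input by the attenuation."""
    a = attenuation(score, gain)
    return [a * u for u in u_nominal]
```

A residual that stays below the alarm threshold still shrinks the control input here. This is the qualitative behaviour the abstract describes: a stealthy attack that evades the detector has less actuation authority to exploit.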

2603.09882 2026-05-28 cs.RO cs.AI 版本更新

Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

杂乱场景中通过动力学感知策略学习涌现的外在灵巧性

Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Galbot(银河通用机器人) Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出动力学感知策略学习框架,通过显式世界建模学习接触诱导物体动力学表示并用于强化学习,使杂乱场景中的外在灵巧性无需手工启发式或复杂奖励塑造即可涌现。

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project page: https://pku-epic.github.io/DAPL/

详情
AI中文摘要

外在灵巧性利用环境接触来克服抓取操作的局限性。然而,在杂乱场景中实现这种灵巧性仍然具有挑战性且未被充分探索,因为它需要选择性地利用多个相互作用的物体之间的接触,而这些物体具有内在耦合的动力学。现有方法缺乏对这种复杂动力学的显式建模,因此在杂乱环境中的非抓取操作方面表现不足,这反过来限制了它们在现实环境中的实际应用。在本文中,我们介绍了一种动力学感知策略学习(DAPL)框架,该框架可以利用在杂乱环境中学习到的接触诱导物体动力学的表示来促进策略学习。这种表示通过显式世界建模学习,并用于条件化强化学习,使得外在灵巧性无需手工制作的接触启发式或复杂的奖励塑造即可涌现。我们在仿真和现实世界中评估了我们的方法。在具有不同密度的未见过的仿真杂乱场景中,我们的方法在成功率上比抓取操作、人类遥操作和基于先前表示的策略高出25%以上。在10个杂乱场景中,现实世界的成功率达到了约50%,而实际杂货部署进一步证明了稳健的仿真到现实迁移和适用性。

英文摘要

Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.

2603.05642 2026-05-28 cs.RO cs.AI 版本更新

Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search

基于3D场景图的开放世界交互式物体搜索的关系语义推理

Imen Mahdi, Matteo Cassinelli, Fabien Despinoy, Tim Welschehold, Abhinav Valada

发表机构 * University of Freiburg(弗赖堡大学) Toyota Motor Europe(丰田欧洲公司)

AI总结 提出SCOUT方法,通过从LLM蒸馏的关系探索启发式直接搜索3D场景图,实现高效开放世界交互式物体搜索,性能匹配LLM且计算高效。

详情
AI中文摘要

家庭环境中的开放世界交互式物体搜索需要理解物体与其周围环境之间的语义关系,以有效引导探索。先前的方法要么依赖视觉-语言嵌入相似性,这不能可靠地捕获任务相关的关系语义,要么依赖大型语言模型(LLM),这对于实时部署来说太慢且成本高昂。我们提出SCOUT:基于场景图探索的开放世界交互式物体搜索学习效用,这是一种新颖的方法,通过使用关系探索启发式(如房间-物体包含和物体-物体共现)为房间、前沿和物体分配效用分数,直接搜索3D场景图。为了在不牺牲开放词汇泛化能力的情况下使其实用,我们提出了一种离线程序化蒸馏框架,将LLM中的结构化关系知识提取到轻量级模型中,用于机器人上的推理。此外,我们提出了SymSearch,一个用于评估交互式物体搜索任务中语义推理的可扩展符号基准。在符号和模拟环境中的广泛评估表明,SCOUT优于基于嵌入相似性的方法,并在保持计算效率的同时达到LLM级别的性能。最后,真实世界实验证明了向物理环境的有效迁移,在现实感知和导航约束下实现了开放世界交互式物体搜索。

英文摘要

Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods either rely on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.
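
The relational heuristics named in this abstract (room-object containment and object-object co-occurrence) can be combined into a utility score roughly as follows. This is a hypothetical sketch. The lookup tables stand in for the distilled lightweight models, and the additive scoring rule and `weight` are assumptions rather than the SCOUT formulation.

```python
def room_utility(target, room, seen_objects, containment, cooccurrence, weight=0.5):
    """Score a room by containment prior plus the best co-occurrence cue seen."""
    base = containment.get((room, target), 0.0)
    cue = max((cooccurrence.get((obj, target), 0.0) for obj in seen_objects),
              default=0.0)
    return base + weight * cue

def rank_rooms(target, rooms, containment, cooccurrence):
    """Return room names sorted from most to least promising for the target."""
    scores = {name: room_utility(target, name, objs, containment, cooccurrence)
              for name, objs in rooms.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because scoring reduces to table lookups over scene-graph nodes, each decision is cheap compared with querying an LLM online, which is the efficiency argument made in the abstract.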

2603.01766 2026-05-28 cs.RO 版本更新

Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models

神经隐式动作场:从离散路点到连续函数的视觉-语言-动作模型

Haoyun Liu, Jianzhuang Zhao, Xinyuan Chang, Tianle Shi, Chuanzhang Meng, Jiayuan Tan, Feng Xiong, Tong Lin, Dongjie Huo, Mu Xu, SongLin Dong, Zhiheng Ma, Yihong Gong, Sheng Zhong

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) Faculty of Computility Microelectronics, Shenzhen University of Advanced Technology(深圳理工大学算力微电子学院) Guangdong Provincial Key Laboratory of Computility Microelectronics(广东省算力微电子重点实验室) Amap, Alibaba Group(阿里巴巴集团高德地图) Shenzhen University(深圳大学) Xi'an Jiaotong University(西安交通大学) Beijing University of Chemical Technology(北京化工大学)

AI总结 针对视觉-语言-动作模型预测离散动作路点与物理运动连续性不匹配的问题,提出神经隐式动作场(NIAF),通过将动作表示从离散路点重构为连续函数,实现任意时间分辨率的连续动作流形合成,支持解析求导和显式速度监督,提升控制平滑性和物理合理性。

Comments Accepted at ICML 2026

详情
AI中文摘要

尽管视觉-语言-动作(VLA)模型取得了快速进展,但将动作块预测为离散路点的普遍做法在结构上与物理运动的内在连续性不一致。这种离散化自然源于固定频率的机器人数据收集和大语言模型的逐词预测范式,但将动作绑定到固定的采样率,不能自然支持解析一致的高阶导数,并引入量化伪影,阻碍精确、柔顺的交互。我们提出神经隐式动作场(NIAF),将块级动作表示从离散路点重构为连续动作函数。通过使用视觉-语言模型作为可学习运动先验上的分层频谱调制器,NIAF 合成具有任意时间分辨率的连续时间动作流形。这种公式支持解析微分,允许显式监督速度和正则化高阶导数信号,以促进数学一致性、物理合理性和控制平滑性。我们的方法在 CALVIN 和 LIBERO 上跨多种骨干网络取得了强劲结果。真实世界实验进一步证实 NIAF 支持稳定的阻抗控制,桥接了策略侧动作生成和执行侧平滑控制。

英文摘要

Despite the rapid progress of vision-language-action (VLA) models, the prevailing practice of predicting action chunks as discrete waypoints remains structurally misaligned with the intrinsic continuity of physical motion. This discretization arises naturally from fixed-rate robot data collection and the token-by-token prediction paradigm of large language models, but ties actions to rigid sampling rates, does not naturally support analytically consistent higher-order derivatives, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), which reformulates chunk-level action representation from discrete waypoints to continuous action functions. Using a vision-language model as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes continuous-time action manifolds with arbitrary temporal resolution. This formulation enables analytical differentiation, allowing explicit supervision of velocity and regularization of higher-order derivative signals to promote mathematical consistency, physical plausibility, and control smoothness. Our approach achieves strong results on CALVIN and LIBERO across diverse backbones. Real-world experiments further confirm that NIAF supports stable impedance control, bridging policy-side action generation and execution-side smooth control.
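
The benefit of representing an action chunk as a continuous function can be shown with a small stand-in: a truncated Fourier series whose coefficients play the role of the network output. This is not the NIAF architecture. The basis, the coefficient layout, and the function names are assumptions chosen so that the derivative is available in closed form.

```python
import math

def action(t, coeffs, period=1.0):
    """a(t) = c0 + sum_k (a_k cos(w_k t) + b_k sin(w_k t)), w_k = 2*pi*k/period."""
    c0, harmonics = coeffs
    value = c0
    for k, (a_k, b_k) in enumerate(harmonics, start=1):
        w = 2 * math.pi * k / period
        value += a_k * math.cos(w * t) + b_k * math.sin(w * t)
    return value

def velocity(t, coeffs, period=1.0):
    """Analytic time derivative of action(t); no finite differencing needed."""
    _, harmonics = coeffs
    value = 0.0
    for k, (a_k, b_k) in enumerate(harmonics, start=1):
        w = 2 * math.pi * k / period
        value += -a_k * w * math.sin(w * t) + b_k * w * math.cos(w * t)
    return value
```

The same coefficients can be queried at any control rate, for instance `[action(i / 500, coeffs) for i in range(500)]`, and the analytic `velocity` gives a target for explicit velocity supervision without differencing noisy waypoints.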

2403.11852 2026-05-28 cs.RO cs.AI 版本更新

Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency

考虑随机通信延迟的高速公路匝道合流延迟感知强化学习

Amin Tabrizian, Zhitong Huang, Arsyi Aziz, Peng Wei

发表机构 * Department of Computer Science, George Washington University, Washington, D.C.(计算机科学系,乔治华盛顿大学,华盛顿特区) Connected and Automated Vehicle Program Manager, Traffic Operations Division, Virginia Department of Transportation(网联与自动驾驶车辆项目经理,交通运营处,弗吉尼亚州交通部) Department of Mechanical & Aerospace Engineering, George Washington University, Washington, D.C.(机械与航空航天工程系,乔治华盛顿大学,华盛顿特区)

AI总结 针对V2I通信随机延迟导致状态观测延迟的问题,提出DAROM框架,通过随机延迟MDP建模和延迟感知编码器恢复马尔可夫性,结合物理安全控制器实现鲁棒控制。

详情
AI中文摘要

延迟和部分可观测的状态信息给现实自动驾驶中基于强化学习(RL)的控制带来了重大挑战。在高速公路匝道合流中,路侧单元(RSU)可以感知附近交通,进行边缘感知,并通过车到基础设施(V2I)链路将状态估计传输给自车。随着智能交通基础设施和边缘计算的最新进展,这种RSU辅助感知越来越现实,并已部署在现代互联道路系统中。然而,边缘处理时间和无线传输可能引入随机的V2I通信延迟,违反马尔可夫假设并显著降低控制性能。在这项工作中,我们提出了DAROM,一种对随机延迟鲁棒的高速公路匝道合流延迟感知强化学习框架。我们将问题建模为随机延迟马尔可夫决策过程(RDMDP),并开发了一个统一的RL智能体用于联合纵向和横向控制。为了在延迟观测下恢复马尔可夫表示,我们引入了一个延迟感知编码器,该编码器以延迟观测、掩蔽动作历史和观测延迟幅度为条件来推断当前潜在状态。我们进一步集成基于物理的安全控制器以减少合流过程中的碰撞风险。在模拟城市交通(SUMO)模拟器中,使用下一代仿真(NGSIM)数据集的真实交通数据进行的实验表明,DAROM在各种交通密度下始终优于标准RL基线。特别是,基于门控循环单元(GRU)的编码器在高达2.0秒的随机V2I延迟的高密度交通中实现了超过99%的成功率。

英文摘要

Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world autonomous driving. In highway on-ramp merging, a roadside unit (RSU) can sense nearby traffic, perform edge perception, and transmit state estimates to the ego vehicle over vehicle-to-infrastructure (V2I) links. With recent advancements in intelligent transportation infrastructure and edge computing, such RSU-assisted perception is increasingly realistic and already deployed in modern connected roadway systems. However, edge processing time and wireless transmission can introduce stochastic V2I communication delays, violating the Markov assumption and substantially degrading control performance. In this work, we propose DAROM, a Delay-Aware Reinforcement Learning framework for On-ramp Merging that is robust to stochastic delays. We model the problem as a random delay Markov decision process (RDMDP) and develop a unified RL agent for joint longitudinal and lateral control. To recover a Markovian representation under delayed observations, we introduce a Delay-Aware Encoder that conditions on delayed observations, masked action histories, and observed delay magnitude to infer the current latent state. We further integrate a physics-based safety controller to reduce collision risk during merging. Experiments in the Simulation of Urban MObility (SUMO) simulator using real-world traffic data from the Next Generation Simulation (NGSIM) dataset demonstrate that DAROM consistently outperforms standard RL baselines across traffic densities. In particular, the gated recurrent unit (GRU)-based encoder achieves over 99% success in high-density traffic with random V2I delays of up to 2.0 seconds.
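
The three inputs the Delay-Aware Encoder conditions on (delayed observation, masked action history, delay magnitude) can be assembled into a fixed-length vector along the following lines. This is one possible reading, not the DAROM code. It assumes scalar actions, a delay measured in whole control steps, and a mask that marks which action slots were actually issued after the delayed observation was captured.

```python
def encoder_input(delayed_obs, action_history, delay_steps, max_delay):
    """Build [observation | padded actions | mask | normalised delay]."""
    if not 0 <= delay_steps <= max_delay:
        raise ValueError("delay_steps must lie in [0, max_delay]")
    # Actions applied since the delayed observation was taken (most recent last).
    recent = list(action_history[-delay_steps:]) if delay_steps else []
    pad = max_delay - len(recent)
    actions = [0.0] * pad + recent
    mask = [0.0] * pad + [1.0] * len(recent)  # 1.0 marks a real action slot
    return list(delayed_obs) + actions + mask + [delay_steps / max_delay]
```

The output length is always `len(delayed_obs) + 2 * max_delay + 1`, so a recurrent encoder such as a GRU can consume it at every step regardless of how large the sampled delay is.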

2602.03668 2026-05-28 cs.RO cs.CV 版本更新

MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

MVP-LAM:通过跨视角重建学习以动作为中心的潜在动作

Jung Min Lee, Dohyeok Lee, Seokhun Ju, Taehyun Cho, Jin Woo Koo, Li Zhao, Sangwoo Hong, Jungwoo Lee

发表机构 * Seoul National University, Seoul, South Korea(首尔国立大学,首尔,韩国) Konkuk University, Seoul, South Korea(建国大学,首尔,韩国) Microsoft Research Asia, Beijing, China(微软亚洲研究院,北京,中国) HodooAI Labs, Seoul, South Korea(HodooAI实验室,首尔,韩国)

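AI summary: Proposes MVP-LAM, which uses a cross-viewpoint reconstruction objective on multi-view videos to learn latent actions that are highly informative about ground-truth actions, improving action prediction and downstream manipulation performance.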

详情
AI中文摘要

从多样化人类视频中学习的潜在动作作为视觉-语言-动作(VLA)预训练的伪标签,但只有当它们对底层真实动作保持信息量时才能提供有效监督。为了有效监督,潜在动作应包含关于底层动作的信息,尽管这些信息不可直接获取。我们提出多视角潜在动作模型(MVP-LAM),该模型从多视角视频中学习与真实动作高度相关的潜在动作。MVP-LAM通过跨视角重建目标训练潜在动作,使得一个视角的潜在动作必须解释另一个视角的未来,从而减少对视角特定线索的依赖。在Bridge V2上,MVP-LAM生成更以动作为中心的潜在动作,与真实动作的互信息更高,动作预测性能提升,包括在分布外评估下。最后,使用MVP-LAM潜在动作预训练VLA模型提高了各种基准上的下游操作性能。代码和训练好的检查点可在https://jmsnu.github.io获取。

英文摘要

Latent actions learned from diverse human videos serve as pseudo-labels for vision-language-action (VLA) pretraining, but provide effective supervision only if they remain informative about the underlying ground-truth actions. For effective supervision, latent actions should contain information about the underlying actions even though they are inaccessible. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns latent actions that are highly informative about ground-truth actions from multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on various benchmarks. The code and trained checkpoints are available at https://jmsnu.github.io.

2512.14340 2026-05-28 cs.RO 版本更新

Field evaluation and optimization of a lightweight autonomous lidar-based UAV system based on a rigorous experimental setup in boreal forest environments

基于严格实验设置的轻量级自主激光雷达无人机系统在北方森林环境中的现场评估与优化

Aleksi Karhunen, Teemu Hakala, Väinö Karjalainen, Eija Honkavaara

Affiliations: Finnish Geospatial Research Institute in National Land Survey of Finland

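AI summary: Proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems and demonstrates it with a lightweight lidar-based quadrotor over 93 real-world flights in boreal forests; the optimized system achieved success rates of 12/15 and 15/15 at 1 m/s and 2 m/s in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest.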

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

近年来,利用自主无人机进行林下森林遥感引起了越来越多的兴趣,导致科学文献中发表了大量自主飞行算法。为了支持此类算法的选择和开发,基于已发表研究对现有方法进行可靠比较至关重要。然而,由于实验设置差异很大且报告实践不完整,目前可靠比较面临挑战。本研究提出了一种标准化的实验设置,用于评估自主林下无人机系统,以填补这一空白。所提出的设置强调森林复杂性的定量报告、测试环境的可视化表示、多次重复飞行的执行,以及飞行成功率与定性飞行结果的报告。此外,鼓励在多个目标速度下飞行,并报告实际飞行速度、任务完成时间和点对点飞行距离。该设置通过采用最先进开源算法的轻量级激光雷达四旋翼进行演示,并在两个天然北方森林环境中进行了大量实验评估。基于对原始系统的系统评估,引入了若干改进。随后对优化后的系统重复相同的实验协议,总共进行了93次真实世界飞行。优化后的系统在中难度森林中,目标飞行速度为1 m/s和2 m/s时分别实现了12/15和15/15的成功率,在困难森林中分别为12/15和5/15。采用所提出的实验设置将有助于基于文献的自主林下飞行系统比较,并支持未来基于无人机的森林机器人解决方案的系统性能改进。

英文摘要

Interest in utilizing autonomous uncrewed aerial vehicles (UAVs) for under-canopy forest remote sensing has increased in recent years, resulting in the publication of numerous autonomous flight algorithms in the scientific literature. To support the selection and development of such algorithms, a reliable comparison of existing approaches based on published studies is essential. However, reliable comparisons are currently challenging due to widely varying experimental setups and incomplete reporting practices. This study proposes a standardized experimental setup for evaluating autonomous under-canopy UAV systems to fill this gap. The proposed setup emphasizes quantitative reporting of forest complexity, visual representation of test environments, execution of multiple repeated flights, and reporting of flight success rates alongside qualitative flight results. In addition, flights at multiple target speeds are encouraged, with reporting of realized flight speed, mission completion time, and point-to-point flight distance. The proposed setup is demonstrated using a lightweight lidar-based quadrotor employing state-of-the-art open-source algorithms, evaluated through extensive experiments in two natural boreal forest environments. Based on a systematic evaluation of the original system, several improvements were introduced. The same experimental protocol was then repeated with the optimized system, resulting in a total of 93 real-world flights. The optimized system achieved success rates of 12/15 and 15/15 at target flight speeds of 1 m/s and 2 m/s, respectively, in a medium-difficulty forest, and 12/15 and 5/15 in a difficult forest. Adoption of the proposed experimental setup would facilitate the literature-based comparison of autonomous under-canopy flight systems and support systematic performance improvement of future UAV-based forest robotics solutions.
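The reported success counts come from 15 flights per condition, so the point estimates carry wide uncertainty. As a minimal sketch (not part of the paper's protocol), the snippet below turns the four counts quoted in the abstract into success rates with Wilson score intervals. The function name, the dictionary layout, and the choice of a 95% Wilson interval are assumptions made for illustration.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Success counts quoted in the abstract for the optimized system:
# (forest difficulty, target speed in m/s) -> (successful flights, total flights)
RESULTS = {
    ("medium", 1.0): (12, 15),
    ("medium", 2.0): (15, 15),
    ("difficult", 1.0): (12, 15),
    ("difficult", 2.0): (5, 15),
}

def summarize(results):
    """Return (forest, speed, success rate, interval low, interval high) rows."""
    rows = []
    for (forest, speed), (ok, n) in results.items():
        low, high = wilson_interval(ok, n)
        rows.append((forest, speed, ok / n, low, high))
    return rows

if __name__ == "__main__":
    for forest, speed, rate, low, high in summarize(RESULTS):
        print(f"{forest:9s} {speed:.0f} m/s  rate={rate:.2f}  95% CI=[{low:.2f}, {high:.2f}]")
```

Reporting an interval next to each rate makes it easier to judge whether two systems compared across papers really differ, which is the kind of literature-based comparison the proposed setup is meant to support.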

2307.06240 2026-05-28 cs.LG cs.AI cs.RO cs.SY eess.SY 版本更新

DSSE: a drone swarm search environment

DSSE:无人机群搜索环境

Manuel Castanares, Luis F. S. Carrete, Enrico F. Damiani, Leonardo D. M. de Abreu, José Fernando B. Brancalion, Fabrício J. Barth

Affiliations: Insper; Embraer

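AI summary: A PettingZoo-based multi-agent reinforcement learning environment in which drones search for targets using dynamic probability inputs.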

Comments 7 pages

详情
AI中文摘要

无人机群搜索项目是一个基于PettingZoo的环境,用于多智能体(或单智能体)强化学习算法。在该环境中,智能体(无人机)必须找到目标(海难人员)。智能体不知道目标的位置,也不接收与自身到目标距离相关的奖励。然而,智能体会收到目标位于地图某个单元格的概率。该项目旨在辅助研究需要动态概率作为输入的强化学习算法。描述该软件第二版的同行评审论文已发表在JOSS上:https://doi.org/10.21105/joss.06746。

英文摘要

The Drone Swarm Search project is an environment, based on PettingZoo, intended for use with multi-agent (or single-agent) reinforcement learning algorithms. In this environment, the agents (drones) have to find the targets (shipwrecked people). The agents do not know the positions of the targets and do not receive rewards related to their own distance to the targets. However, the agents receive the probabilities of the targets being in a certain cell of the map. The aim of this project is to aid in the study of reinforcement learning algorithms that require dynamic probabilities as inputs. A peer-reviewed paper describing version 2 of this software has been published in JOSS: https://doi.org/10.21105/joss.06746.

2512.12649 2026-05-28 cs.RO cs.SY eess.SY 版本更新

Bayesian Optimization Parameter Tuning Framework for a Lyapunov Based Path Following Controller

基于Lyapunov的路径跟踪控制器的贝叶斯优化参数调优框架

Zhewen Zheng, Wenjing Cao, Hongkang Yu, Mo Chen, Takashi Suzuki

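AI summary: Manual tuning is inefficient for nonlinear geometric controllers whose parameters are interdependent. The paper proposes a data-efficient tuning method that treats the closed-loop system as a black box and runs Bayesian optimization with a Gaussian-process surrogate, and validates on Honda's AI-Formula three-wheeled robot that it improves controller performance within 32 trials.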

Comments The authors request withdrawal because the current arXiv version does not reflect the complete and finalized authorship record of the manuscript. The author list and contribution record require correction before further public dissemination

详情
AI中文摘要

实际实验中的参数调优受限于硬件上有限的评估预算。本文研究的路径跟踪控制器反映了非线性几何控制器中的典型情况,其中多个增益通过耦合非线性项影响动力学。这种相互依赖性使得手动调优效率低下,且在实际试验次数内难以获得令人满意的性能。为应对这一挑战,我们提出了一种贝叶斯优化(BO)框架,该框架将闭环系统视为黑箱,并使用高斯过程代理模型选择控制器增益。BO提供了无模型探索、量化不确定性和数据高效搜索,使其非常适合每次评估成本高昂的调优任务。该框架在Honda的AI-Formula三轮机器人上实现,并通过在固定测试轨道上重复全圈实验进行评估。结果表明,BO在32次试验内(包括15次预热初始评估)提升了控制器性能,表明它能够在实际条件下高效定位参数空间中的高性能区域。这些发现证明,BO为真实机器人平台上的非线性路径跟踪控制器提供了一种实用、可靠且数据高效的调优方法。

英文摘要

Parameter tuning in real-world experiments is constrained by the limited evaluation budget available on hardware. The path-following controller studied in this paper reflects a typical situation in nonlinear geometric controllers, where multiple gains influence the dynamics through coupled nonlinear terms. Such interdependence makes manual tuning inefficient and unlikely to yield satisfactory performance within a practical number of trials. To address this challenge, we propose a Bayesian optimization (BO) framework that treats the closed-loop system as a black box and selects controller gains using a Gaussian-process surrogate. BO offers model-free exploration, quantified uncertainty, and data-efficient search, making it well suited for tuning tasks where each evaluation is costly. The framework is implemented on Honda's AI-Formula three-wheeled robot and assessed through repeated full-lap experiments on a fixed test track. The results show that BO improves controller performance within 32 trials, including 15 warm-start initial evaluations, indicating that it can efficiently locate high-performing regions of the parameter space under real-world conditions. These findings demonstrate that BO provides a practical, reliable, and data-efficient tuning approach for nonlinear path-following controllers on real robotic platforms.
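The black-box tuning loop described in the abstract can be illustrated with a small, dependency-free sketch: a Gaussian-process surrogate with a squared-exponential kernel and an expected-improvement acquisition, run for 32 evaluations of which the first 15 are initial samples. Everything specific here is an assumption for illustration: the two-gain search space on the unit square, the kernel length scale, the random initial design (the paper uses warm-start evaluations), and `toy_lap_cost`, a hypothetical stand-in for a hardware lap. It is not the authors' implementation.

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two gain vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_sub_T(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(X, y, noise=1e-4):
    """Factor the kernel matrix once so that predictions are cheap."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    mean = sum(y) / n
    alpha = backward_sub_T(L, forward_sub(L, [v - mean for v in y]))
    return L, alpha, mean

def predict(model, X, x_new):
    """Posterior mean and standard deviation at x_new."""
    L, alpha, mean = model
    k = [rbf(a, x_new) for a in X]
    mu = mean + sum(ki * ai for ki, ai in zip(k, alpha))
    v = forward_sub(L, k)
    var = max(1e-12, 1.0 - sum(vi * vi for vi in v))
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best observed cost (minimization)."""
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (best - mu) * cdf + sigma * pdf

def toy_lap_cost(gains):
    """Hypothetical tracking cost with coupled gains; replaces a real lap."""
    k1, k2 = gains
    return (k1 - 0.3) ** 2 + (k2 - 0.7) ** 2 + 0.5 * (k1 - 0.3) * (k2 - 0.7)

def tune(cost, n_init=15, n_total=32, n_candidates=200, seed=0):
    """Run BO and return (best gains, best cost, full cost history)."""
    rng = random.Random(seed)
    X = [(rng.random(), rng.random()) for _ in range(n_init)]
    y = [cost(x) for x in X]
    while len(X) < n_total:
        model, best = fit_gp(X, y), min(y)
        candidates = [(rng.random(), rng.random()) for _ in range(n_candidates)]
        x_next = max(candidates, key=lambda c: expected_improvement(*predict(model, X, c), best))
        X.append(x_next)
        y.append(cost(x_next))
    i_best = min(range(len(y)), key=y.__getitem__)
    return X[i_best], y[i_best], y

if __name__ == "__main__":
    gains, best_cost, history = tune(toy_lap_cost)
    print("best gains:", gains, "cost:", best_cost, "evaluations:", len(history))
```

Maximizing the acquisition over random candidates keeps the sketch short. On hardware, each call to `cost` would be one full lap, which is why the evaluation budget rather than compute time dominates the design.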

2501.01669 2026-05-28 cs.LG cs.RO 版本更新

Inversely Learning Transferable Rewards via Abstracted States

通过抽象状态逆向学习可迁移奖励

Yikang Gui, Prashant Doshi

发表机构 * THINC Lab, School of Computing University of Georgia(THINC实验室,计算学院,佐治亚大学) School of Computing and Institute for AI University of Georgia(计算学院和人工智能研究所,佐治亚大学)

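AI summary: Proposes a method that inversely learns an abstract reward function from behavior trajectories and verifies its transferability on previously unseen instances of the domain.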

Comments Accepted at IJCAI 2026

详情
AI中文摘要

逆向强化学习(IRL)在从行为数据中准确学习离散和连续领域中的潜在奖励方面取得了显著进展。下一步的进展是学习内在偏好,以在与观察到的设置或任务不同但一致的情况下产生有用行为。在机器人应用的背景下,这有助于将机器人集成到涉及新任务(具有共享的内在偏好)的处理线中,而无需从头编程。我们提出了一种方法,从领域中的两个或更多不同实例的行为轨迹中逆向学习一个抽象奖励函数。然后,该抽象奖励函数用于在领域的另一个单独实例中学习任务行为。这一步提供了其可迁移性的证据,并验证了其正确性。我们在OpenAI的Gym测试平台和AssistiveGym中多个领域的任务轨迹上评估了该方法,结果表明,学习到的抽象奖励函数能够成功地在各自领域的未见过的实例中学习任务行为。

英文摘要

Inverse reinforcement learning (IRL) has progressed significantly toward accurately learning the underlying rewards in both discrete and continuous domains from behavior data. The next advance is to learn intrinsic preferences in ways that produce useful behavior in settings or tasks which are different but aligned with the observed ones. In the context of robotic applications, this helps integrate robots into processing lines involving new tasks (with shared intrinsic preferences) without programming from scratch. We introduce a method to inversely learn an abstract reward function from behavior trajectories in two or more differing instances of a domain. The abstract reward function is then used to learn task behavior in another separate instance of the domain. This step offers evidence of its transferability and validates its correctness. We evaluate the method on trajectories in tasks from multiple domains in OpenAI's Gym testbed and AssistiveGym and show that the learned abstract reward functions can successfully learn task behaviors in instances of the respective domains, which have not been seen previously.

2510.20480 2026-05-28 cs.RO 版本更新

Degradation-Aware Cooperative Multi-Modal GNSS-Denied Localization Leveraging LiDAR-Based Robot Detections

基于激光雷达机器人检测的退化感知协同多模态GNSS拒止定位

Václav Pritzl, Xianjia Yu, Tomi Westerlund, Petr Štěpán, Martin Saska

发表机构 * Multi-robot Systems Group, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague(布拉格捷克技术大学电气工程学院控制学系多机器人系统组) Turku Intelligent Embedded and Robotic Systems (TIERS) Lab, University of Turku(图尔库大学智能嵌入式与机器人系统实验室)

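AI summary: Proposes an adaptive multi-modal multi-robot cooperative localization method in a factor-graph framework that fuses asynchronous VIO, LIO, and 3D inter-robot detections, handles sensor degradation through an interpolation-based factor and Wasserstein-distance weighting, and significantly improves localization accuracy.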

Comments Preprint version. This work has been submitted to Elsevier for possible publication

详情
AI中文摘要

在无全球导航卫星系统(GNSS)环境中,使用机载传感器进行精确长期定位对机器人至关重要。虽然互补传感器可以缓解个体退化,但在单个机器人上携带所有可用传感器类型会显著增加尺寸、重量和功耗需求。将传感器分布在多个机器人上可增强部署能力,但引入了来自独立移动平台的异步多模态数据融合的挑战。我们提出了一种新颖的自适应多模态多机器人协同定位方法,采用因子图公式,以松耦合方式融合来自不同机器人的异步视觉-惯性里程计(VIO)、激光雷达-惯性里程计(LIO)和3D机器人间检测。该方法适应变化的条件,利用可靠数据帮助受传感器退化影响的机器人。一种新颖的基于插值的因子实现了非同步测量的融合。LIO退化基于近似扫描匹配Hessian进行评估。提出了一种根据连续VIO输出之间的Wasserstein距离按比例加权里程计数据的新方法。提供了理论分析,研究了各种条件下(主要是在传感器退化存在时)的协同定位问题。该方法已在使用无人地面车辆(UGV)和无人飞行器(UAV)异构团队收集的真实世界数据上进行了广泛评估,结果表明该方法在各种传感器退化情况下显著提高了定位精度。

英文摘要

Accurate long-term localization using onboard sensors is crucial for robots operating in Global Navigation Satellite System (GNSS)-denied environments. While complementary sensors mitigate individual degradations, carrying all the available sensor types on a single robot significantly increases the size, weight, and power demands. Distributing sensors across multiple robots enhances the deployability but introduces challenges in fusing asynchronous, multi-modal data from independently moving platforms. We propose a novel adaptive multi-modal multi-robot cooperative localization approach using a factor-graph formulation to fuse asynchronous Visual-Inertial Odometry (VIO), LiDAR-Inertial Odometry (LIO), and 3D inter-robot detections from distinct robots in a loosely-coupled fashion. The approach adapts to changing conditions, leveraging reliable data to assist robots affected by sensory degradations. A novel interpolation-based factor enables fusion of the unsynchronized measurements. LIO degradations are evaluated based on the approximate scan-matching Hessian. A novel approach of weighting odometry data proportionally to the Wasserstein distance between the consecutive VIO outputs is proposed. A theoretical analysis is provided, investigating the cooperative localization problem under various conditions, mainly in the presence of sensory degradations. The proposed method has been extensively evaluated on real-world data gathered with heterogeneous teams of an Unmanned Ground Vehicle (UGV) and Unmanned Aerial Vehicles (UAVs), showing that the approach provides significant improvements in localization accuracy in the presence of various sensory degradations.
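To make the Wasserstein-based weighting more concrete, here is a minimal sketch of the closed-form 2-Wasserstein distance between two Gaussians with diagonal covariances, applied to two consecutive odometry estimates. The abstract does not say how the distance is computed or mapped to a weight, so the diagonal-covariance simplification, the function names, and the `inflated_sigma` rule (noise grows linearly with the distance) are all assumptions for illustration.

```python
import math

def w2_diag_gaussians(mean_a, std_a, mean_b, std_b):
    """2-Wasserstein distance between N(mean_a, diag(std_a^2)) and N(mean_b, diag(std_b^2)).

    For diagonal covariances the squared distance reduces to
    ||mean_a - mean_b||^2 + sum_i (std_a[i] - std_b[i])^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum((sa - sb) ** 2 for sa, sb in zip(std_a, std_b))
    return math.sqrt(mean_term + cov_term)

def inflated_sigma(base_sigma, distance, gain=1.0):
    """Hypothetical weighting rule: trust odometry less as the distance grows."""
    return base_sigma * (1.0 + gain * distance)

if __name__ == "__main__":
    # Two consecutive (made-up) VIO position estimates with per-axis standard deviations.
    d = w2_diag_gaussians([0.0, 0.0, 0.0], [0.05, 0.05, 0.05],
                          [0.3, 0.4, 0.0], [0.05, 0.05, 0.05])
    print("W2 distance:", d, "sigma used in the factor:", inflated_sigma(0.02, d))
```

A large jump between consecutive outputs, or a sudden change in reported uncertainty, both increase the distance, so a single scalar can flag either kind of degradation.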

2503.01450 2026-05-28 cs.LG cs.AI cs.RO 版本更新

Investigating Memory in Model-Free RL with POPGym Arcade

基于POPGym Arcade的无模型强化学习中的记忆研究

Zekang Wang, Zhe He, Borong Zhang, Edan Toledo, Steven Morad

发表机构 * Faculty of Science and Technology, University of Macau(澳门大学科技学院) Centre for AI, University College London(伦敦大学学院人工智能中心)

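AI summary: Introduces analysis tools and the POPGym Arcade environment suite to study memory in deep reinforcement learning, finds that value functions smear credit over irrelevant history, and shows how out-of-distribution scenarios can contaminate memory.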

Comments Appearing at ICML 2026 as a Spotlight paper

详情
AI中文摘要

如何分析深度强化学习中的记忆?我们引入了在部分可观测性下分析策略的工具,并揭示智能体如何利用记忆做出决策。为了利用这些工具,我们提出了POPGym Arcade,这是一个受Atari启发的、硬件加速的环境集合,共享单一观测和动作空间。每个环境都提供完全和部分可观测的变体,从而实现对可观测性的反事实研究。我们发现,受控研究对于公平比较是必要的,并识别出一种病理现象,即价值函数将信用过度分配到无关历史。利用这种病理现象,我们展示了分布外场景如何污染记忆,从而在遥远的未来扰动策略。我们的代码可在https://github.com/bolt-research/popgym-arcade获取。

英文摘要

How should we analyze memory in deep RL? We introduce tools for analyzing policies under partial observability and revealing how agents use memory to make decisions. To utilize these tools, we present POPGym Arcade, a collection of Atari-inspired, hardware-accelerated environments sharing a single observation and action space. Each environment provides fully and partially observable variants, enabling counterfactual studies on observability. We find that controlled studies are necessary for fair comparisons and identify a pathology where value functions smear credit over irrelevant history. Using this pathology, we demonstrate how out-of-distribution scenarios can contaminate memory, perturbing the policy far into the future. Our code is available at https://github.com/bolt-research/popgym-arcade.

2508.21046 2026-05-28 cs.CV cs.RO 版本更新

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

CogVLA: 通过指令驱动路由与稀疏化实现认知对齐的视觉-语言-动作模型

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区计算机科学与技术学院)

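AI summary: Proposes the CogVLA framework, which uses instruction-driven routing and sparsification to reach success rates of 97.4% on the LIBERO benchmark and 70.0% on real-world robotic tasks while reducing training cost by 2.5-fold and inference latency by 2.8-fold compared to OpenVLA.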

Comments Accepted to NeurIPS 2025, Project Page: https://jiutian-vl.github.io/CogVLA-page

详情
AI中文摘要

最近基于预训练视觉-语言模型(VLM)构建的视觉-语言-动作(VLA)模型需要大量后训练,导致计算开销高,限制了可扩展性和部署。我们提出CogVLA,一个认知对齐的视觉-语言-动作框架,利用指令驱动路由和稀疏化来提高效率和性能。CogVLA受人类多模态协调启发,引入了一个3阶段渐进式架构。1)基于编码器-FiLM的聚合路由(EFA-Routing)将指令信息注入视觉编码器,以选择性聚合和压缩双流视觉标记,形成指令感知的潜在表示。2)基于这种紧凑的视觉编码,基于LLM-FiLM的剪枝路由(LFP-Routing)通过剪枝与指令无关的视觉接地标记将动作意图引入语言模型,从而实现标记级稀疏性。3)为确保压缩的感知输入仍能支持准确且连贯的动作生成,我们引入了V-L-A耦合注意力(CAtten),它将因果视觉-语言注意力与双向动作并行解码相结合。在LIBERO基准和真实机器人任务上的大量实验表明,CogVLA实现了最先进的性能,成功率分别为97.4%和70.0%,同时与OpenVLA相比,训练成本降低了2.5倍,推理延迟降低了2.8倍。CogVLA已开源,可在https://github.com/JiuTian-VL/CogVLA获取。

英文摘要

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
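Both routing stages are named after FiLM (feature-wise linear modulation), in which a conditioning vector produces a per-channel scale and shift that are applied to every token. The sketch below shows only that generic mechanism in plain Python. It is not CogVLA's architecture: the tiny dimensions, the weight values, and the function names are hypothetical.

```python
def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(tokens, instruction, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: out[t][c] = gamma[c] * tokens[t][c] + beta[c].

    gamma and beta are predicted from the instruction embedding, so the same
    visual tokens are modulated differently for different instructions.
    """
    gamma = linear(w_gamma, b_gamma, instruction)
    beta = linear(w_beta, b_beta, instruction)
    return [[g * f + b for f, g, b in zip(token, gamma, beta)] for token in tokens]

if __name__ == "__main__":
    visual_tokens = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, two channels
    instruction = [1.0]                        # one-dimensional instruction embedding
    out = film(visual_tokens, instruction,
               w_gamma=[[2.0], [0.0]], b_gamma=[0.0, 1.0],
               w_beta=[[0.0], [1.0]], b_beta=[0.0, 0.0])
    print(out)
```

Because the modulation is per channel, the conditioning cost is independent of the number of tokens, which is one reason FiLM is a common choice for injecting language into a vision encoder.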

2509.14075 2026-05-28 cs.RO cs.SY eess.SY 版本更新

RCM Constraint-Consistent Dynamic Control in Surgical Robots

手术机器人中的RCM约束一致性动态控制

Yu Li, Hamid Sadeghian, Zewen Yang, Valentin Le Mesle, Sami Haddadin

发表机构 * Munich Institute of Robotics and Machine Intelligence, Technical University of Munich, Germany(慕尼黑机器人与机器智能研究所,慕尼黑技术大学,德国) Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE(穆罕默德·本·扎耶德人工智能大学,阿布扎比,阿联酋)

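AI summary: Models the remote center of motion (RCM) as a rheonomic holonomic constraint and integrates it into a projection-based inverse-dynamics controller, achieving constraint-consistent torque-level control with lower RCM residuals and smoother torque profiles.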

Comments Accepted at ICRA 2026

详情
AI中文摘要

机器人辅助微创手术(RAMIS)需要精确执行远程运动中心(RCM)约束,以确保通过套管针的安全工具运动。现有的虚拟RCM控制器通常在运动学层面或作为任务空间目标进行公式化,这使得在套管针运动和物理交互下难以一致地制定扭矩层面的执行。本文将RCM建模为流变完整约束,并将其纳入基于投影的逆动力学控制器中,具有显式的约束/自由运动扭矩分解。所得公式在扭矩层面统一了运动学RCM执行和任务空间跟踪,同时为残差调节和零空间顺应性保留了约束一致的结构。所提出的控制器在仿真和RAMIS训练平台上与代表性的基于投影和约束动力学基线进行了验证。在螺旋跟踪、变化插入深度、移动套管针条件和人类交互中,该方法实现了更低的RCM残差和更平滑的扭矩曲线,同时保持准确的工具尖端跟踪。这些结果支持使用约束一致扭矩控制来实现手术机器人中可靠的虚拟RCM执行。项目页面位于https://rcmpc-cube.github.io。

英文摘要

Robotic-assisted minimally invasive surgery (RAMIS) requires accurate enforcement of the remote center of motion (RCM) constraint to ensure safe tool motion through a trocar. Existing virtual RCM controllers are commonly formulated either at the kinematic level or as task-space objectives, which makes torque-level enforcement under trocar motion and physical interaction difficult to formulate consistently. This paper models the RCM as a rheonomic holonomic constraint and incorporates it into a projection-based inverse-dynamics controller with explicit constrained/free-motion torque decomposition. The resulting formulation unifies kinematic RCM enforcement and task-space tracking at the torque level, while preserving a constraint-consistent structure for residual regulation and null-space compliance. The proposed controller is validated in simulation and on a RAMIS training platform against representative projection-based and constrained-dynamics baselines. Across spiral tracking, varying insertion depth, moving trocar conditions, and human interaction, the method achieves lower RCM residuals and smoother torque profiles while maintaining accurate tool-tip tracking. These results support the use of constraint-consistent torque control for reliable virtual RCM enforcement in surgical robotics. The project page is available at https://rcmpc-cube.github.io

2509.13177 2026-05-28 cs.RO 版本更新

ROOM: A Physics-Based Continuum Robot Simulator for Photorealistic Medical Datasets Generation

ROOM: 基于物理的连续体机器人模拟器,用于生成逼真的医学数据集

Salvatore Esposito, Matías Mattamala, Daniel Rebain, Francis Xiatian Zhang, Kevin Dhaliwal, Mohsen Khadem, Subramanian Ramamoorthy

发表机构 * University of Edinburgh, UK(爱丁堡大学,英国) University of British Columbia, Canada(不列颠哥伦比亚大学,加拿大)

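AI summary: Proposes the ROOM simulation framework, which uses patient CT scans to generate multi-modal bronchoscopy training data, and validates the data on pose estimation and depth estimation tasks.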

详情
Journal ref
International Conference on Robotics and Automation 2026
AI中文摘要

连续体机器人通过进入复杂的肺气道并进行靶向干预,正在推进支气管镜手术。然而,由于缺乏真实的训练和测试环境,其发展受到限制:由于伦理约束和患者安全问题,真实数据难以收集,而开发自主算法需要逼真的成像和物理反馈。我们提出了ROOM(医学中的逼真光学观察),一个用于生成逼真支气管镜训练数据的综合模拟框架。通过利用患者CT扫描,我们的流程渲染多模态传感器数据,包括具有真实噪声和光斑的RGB图像、度量深度图、表面法线、光流和点云,这些数据在医学相关尺度上生成。我们在两个医学机器人学的典型任务中验证了ROOM生成的数据:多视图姿态估计和单目深度估计,展示了最先进方法在迁移到这些医学环境时必须克服的多种挑战。此外,我们表明ROOM生成的数据可用于微调现有深度估计模型以克服这些挑战,并支持其他下游应用,如导航。我们期望ROOM能够在不同患者解剖结构和临床环境中难以捕获的手术场景中实现大规模数据生成。代码和数据:https://github.com/iamsalvatore/room。

英文摘要

Continuum robots are advancing bronchoscopy procedures by accessing complex lung airways and enabling targeted interventions. However, their development is limited by the lack of realistic training and test environments: Real data is difficult to collect due to ethical constraints and patient safety concerns, and developing autonomy algorithms requires realistic imaging and physical feedback. We present ROOM (Realistic Optical Observation in Medicine), a comprehensive simulation framework designed for generating photorealistic bronchoscopy training data. By leveraging patient CT scans, our pipeline renders multi-modal sensor data including RGB images with realistic noise and light specularities, metric depth maps, surface normals, optical flow and point clouds at medically relevant scales. We validate the data generated by ROOM in two canonical tasks for medical robotics: multi-view pose estimation and monocular depth estimation, demonstrating diverse challenges that state-of-the-art methods must overcome to transfer to these medical settings. Furthermore, we show that the data produced by ROOM can be used to fine-tune existing depth estimation models to overcome these challenges, also enabling other downstream applications such as navigation. We expect that ROOM will enable large-scale data generation across diverse patient anatomies and procedural scenarios that are challenging to capture in clinical settings. Code and data: https://github.com/iamsalvatore/room.

2311.02304 2026-05-28 cs.RO 版本更新

Imitating and Finetuning Model Predictive Control for Robust and Symmetric Quadrupedal Locomotion

模仿与微调模型预测控制实现鲁棒且对称的四足运动

Donghoon Youm, Hyunyoung Jung, Hyeongjun Kim, Jemin Hwangbo, Hae-Won Park, Sehoon Ha

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) Georgia Institute of Technology(佐治亚理工学院)

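AI summary: Proposes the Imitating and Finetuning Model Predictive Control (IFM) framework, which combines model predictive control with imitation learning and reinforcement learning to improve quadruped locomotion performance, gait symmetry, and energy efficiency on challenging terrains.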

详情
Journal ref
IEEE Robotics and Automation Letters (Volume: 8, Issue: 11, November 2023)
AI中文摘要

腿式机器人的控制是一个具有挑战性的问题,已有多种方法进行研究,如基于模型的控制和学习算法。本文提出了一种新颖的模仿与微调模型预测控制(IFM)框架,以结合两种方法的优势。该框架首先使用微分动态规划和Raibert启发式方法开发一个传统的模型预测控制器(MPC),作为专家策略。然后,通过模仿学习训练MPC的克隆,使控制器可学习。最后,利用有限探索的深度强化学习在更具挑战性的地形上进一步微调策略。通过全面的仿真和硬件实验,我们证明了所提出的IFM框架能够显著提高给定MPC控制器在粗糙、湿滑和传送带等需要仔细协调步态的地形上的性能。我们还展示了与普通强化学习相比,IFM能够以最小的奖励塑造负担高效地产生更对称、周期性和节能的步态。

英文摘要

Control of legged robots is a challenging problem that has been investigated by different approaches, such as model-based control and learning algorithms. This work proposes a novel Imitating and Finetuning Model Predictive Control (IFM) framework to combine the strengths of both approaches. Our framework first develops a conventional model predictive controller (MPC) using Differential Dynamic Programming and the Raibert heuristic, which serves as an expert policy. Then we train a clone of the MPC using imitation learning to make the controller learnable. Finally, we leverage deep reinforcement learning with limited exploration for further finetuning the policy on more challenging terrains. By conducting comprehensive simulation and hardware experiments, we demonstrate that the proposed IFM framework can significantly improve the performance of the given MPC controller on rough, slippery, and conveyor terrains that require careful coordination of footsteps. We also showcase that IFM can efficiently produce more symmetric, periodic, and energy-efficient gaits compared to vanilla RL with a minimal burden of reward shaping.
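The Raibert heuristic mentioned in the abstract is a classic foot-placement rule: step half a stance period ahead of the hip along the current velocity, then correct in proportion to the velocity error. The sketch below shows the textbook planar form only. The function name, the feedback gain, and the example numbers are assumptions, and the paper's expert MPC may use a different variant.

```python
def raibert_foothold(hip_xy, vel_xy, vel_des_xy, stance_time, k_fb=0.03):
    """Planar touchdown target from the Raibert heuristic.

    target = hip + 0.5 * stance_time * v + k_fb * (v - v_des)
    The first term keeps the stance phase symmetric about the hip; the second
    moves the foot further out when the body is faster than desired.
    """
    return tuple(
        h + 0.5 * stance_time * v + k_fb * (v - vd)
        for h, v, vd in zip(hip_xy, vel_xy, vel_des_xy)
    )

if __name__ == "__main__":
    # Body moving at 1.2 m/s along x while 1.0 m/s is commanded, 0.4 s stance.
    print(raibert_foothold((0.0, 0.1), (1.2, 0.0), (1.0, 0.0), stance_time=0.4))
```

A rule this simple gives the MPC a reasonable footstep plan on flat ground, which is consistent with the abstract's point that terrains requiring careful footstep coordination benefit from the later finetuning stage.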

2409.13058 2026-05-28 cs.HC cs.RO 版本更新

Mixed Reality Tele-Ultrasound over 750 km: A Feasibility Study

混合现实远程超声检查跨越750公里:可行性研究

Ryan Yeung, David Black, Patrick B. Chen, Victoria Lessoway, Janice Reid, Sergio Rangel-Suarez, Silvia D. Chang, Septimiu E. Salcudean

发表机构 * School of Biomedical Engineering, The University of British Columbia(生物医学工程学院,不列颠哥伦比亚大学) Department of Electrical and Computer Engineering, The University of British Columbia(电气与计算机工程系,不列颠哥伦比亚大学) The University of British Columbia(不列颠哥伦比亚大学) Department of Radiology, The University of British Columbia(放射学系,不列颠哥伦比亚大学)

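AI summary: Presents and evaluates a human teleoperation tele-ultrasound system based on mixed reality and haptic feedback, in which novice followers perform abdominal ultrasound exams under remote expert control; at a distance of 754 km, 92% of the acquired images had sufficient quality.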

Comments 8 pages, 11 figures

详情
AI中文摘要

为解决偏远社区缺乏超声检查的问题,先前工作引入了人机远程操作,一种基于混合现实和触觉的远程超声系统。该方法中,新手扮演认知机器人角色,由专家通过混合现实远程控制。本文总结了该系统的新进展,并描述了一项评估其用于长距离远程腹部超声检查的可行性研究。为提供简单有效的触觉反馈,我们使用了患者椭球模型,并通过系统的位置和力传感器校准其参数。我们在加拿大海达瓜依的斯基德盖特测试了该系统,专家位于754公里外的加拿大温哥华。我们共进行了11次扫描,涉及10名新手和2名超声技师。超声技师的任务是获取上腹部区域的5个目标图像。图像采集质量由2名放射科医生评估。我们收集了对准数据,新手完成了任务负荷和可用性问卷。新手和超声技师均提供了书面和口头反馈,以指导未来的设计迭代。92%的获取图像具有足够质量,可供两位放射科医生解读。新手报告的平均任务负荷低于文献中的参考值,可用性一致获得正面评价。未发现图像质量与跟随者相对于虚拟换能器的对准误差之间存在相关性。总体而言,我们表明人机远程操作使超声技师能够以高性能执行远程腹部超声成像,即使跨越远距离且使用新手跟随者。未来工作将把人机远程操作与传统、机器人及远程指导超声进行比较。

英文摘要

To address the lack of access to ultrasound in remote communities, previous work introduced human teleoperation, a mixed reality and haptics-based tele-ultrasound system. In this approach, a novice takes the role of a cognitive robot controlled remotely by an expert through mixed reality. In this manuscript we summarize new developments to this system and describe a feasibility study assessing its use for long-distance remote abdominal ultrasound examinations. To provide simple but effective haptic feedback, we used an ellipsoid model of the patient with its parameters calibrated using our system's position and force sensors. We tested the system in Skidegate, Haida Gwaii, Canada, with the experts positioned 754 km away in Vancouver, Canada. We performed 11 total scans with 10 novices and 2 sonographers. The sonographers were tasked with acquiring 5 target images in the epigastric region. The image acquisition quality was assessed by 2 radiologists. We collected alignment data and the novices completed task load and usability questionnaires. Both the novices and sonographers provided written and verbal feedback to inform future design iterations. 92% of the acquired images had sufficient quality for interpretation by both radiologists. The mean task load reported by the novices was below reference values reported in literature and the usability was unanimously positive. No correlation was found between image quality and the follower's alignment error with the virtual transducer. Overall, we show that human teleoperation enables sonographers to perform remote abdominal ultrasound imaging with high performance, even across large distances and with novice followers. Future work will compare human teleoperation to conventional, robotic and tele-mentored ultrasound.

2506.05012 2026-05-28 cs.RO physics.comp-ph physics.flu-dyn 版本更新

Realizing Robotic Swimming with Unified Fluid-Robot Multiphysics

实现统一流体-机器人多物理场的水下机器人游泳

Jeong Hun Lee, Junzhe Hu, Sofia Kwok, Carmel Majidi, Zachary Manchester

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Massachusetts Institute of Technology(麻省理工学院)

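AI summary: Proposes a differentiable, unified fluid-robot multiphysics simulation framework that derives the coupled manipulator and incompressible Navier-Stokes equations jointly via the principle of least action, uses discrete variational mechanics and the implicit function theorem for stable, accurate co-simulation and gradient computation, optimizes undulating swimming and a highly dynamic C-start escape maneuver for a bioinspired eel robot, and validates sim-to-real transfer.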

Comments 9 pages, 10 figures, accepted to Robotics: Science and Systems 2026

详情
AI中文摘要

在水下机器人领域,实现与鱼类相当的游泳效率和敏捷性一直是一个难以达到的目标。这种运动能力依赖于机器人身体与周围流体之间复杂的涡旋相互作用。然而,模拟这些由耦合的常微分方程和偏微分方程控制的动力学,比经典刚性机器人系统的多体动力学要困难得多。我们提出了一个可微分的框架,将强耦合的流体-机器人多物理场作为一个统一的优化问题进行仿真。耦合的机械臂和不可压缩Navier-Stokes方程通过最小作用量原理从单个拉格朗日量中联合推导出来。我们采用离散变分力学,推导出一个稳定、条件良好且物理精确的方案,用于联合仿真铰接体及其周围的流体。我们利用隐函数定理计算完全耦合动力学的导数。利用这个仿真器及其梯度,我们实现了波动游泳步态,并优化了仿生鳗鱼机器人的高动态C形逃逸动作。我们在物理硬件上验证了这两种步态,展示了成功的仿真到实物迁移。仿真代码、硬件数据和鳗鱼机器人的示意图可在此处找到:https://unified-fluid-robot-multiphysics.github.io/

英文摘要

Matching the swimming efficiency and agility of fish has remained an elusive goal in underwater robotics. Such locomotion capabilities rely on complex vortex interactions between the robot's body and the surrounding fluid. However, simulating these dynamics, which are governed by coupled ordinary and partial differential equations, is significantly more difficult than the multi-body dynamics of classical rigid robotic systems. We present a differentiable framework for simulating strongly coupled fluid-robot multiphysics as a unified optimization problem. The coupled manipulator and incompressible Navier-Stokes equations are derived together from a single Lagrangian using the principle of least action. We employ discrete variational mechanics to derive a stable, well-conditioned, and physically accurate scheme for jointly simulating articulated bodies and the surrounding fluid. We leverage the implicit function theorem to compute derivatives of the fully coupled dynamics. Using this simulator and its gradients, we realize undulating swimming gaits and optimize a highly dynamic C-start escape maneuver for a bioinspired eel robot. We validate both gaits on physical hardware, demonstrating successful sim-to-real transfer. Simulation code, hardware data, and schematics for the eel robot can be found here: https://unified-fluid-robot-multiphysics.github.io/

2504.20736 2026-05-28 cs.RO cs.CV 版本更新

A Survey on Event-based Optical Marker Systems

基于事件的光学标记系统综述

Nafiseh Jabbari Tofighi, Maxime Robic, Fabio Morbidi, Pascal Vasseur

Affiliations: MIS laboratory, University of Picardie Jules Verne; DART Lab, Politecnico di Milano

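AI summary: Surveys Event-Based Optical Marker Systems (EBOMS), analyzing their asynchronous operation and robustness, and reviews applications in object detection, pose estimation, and optical communication.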

Comments 11 pages, 6 figures, 2 tables

详情
AI中文摘要

事件相机的出现,以其低延迟、高动态范围和低功耗,标志着机器感知和机器人视觉的转折点。特别是,这些神经形态传感器与广泛可用的被动或主动光学标记(例如AprilTags、闪烁LED阵列)的结合,最近开辟了一个新的机遇领域。本综述论文对基于事件的光学标记系统(EBOMS)进行了全面回顾。我们分析了这些系统所基于的基本原理和技术,特别关注其异步操作和对挑战性光照条件的鲁棒性。我们还描述了EBOMS最相关的应用,包括目标检测与跟踪、姿态估计和光通信。文章最后讨论了这一快速发展的多学科领域可能的未来研究方向。

英文摘要

The advent of event-based cameras, with their low latency, high dynamic range, and reduced power consumption, marked a turning point in machine perception and robotic vision. In particular, the combination of these neuromorphic sensors with widely available passive or active optical markers (e.g., AprilTags, arrays of blinking LEDs) has recently opened up a new field of opportunities. This survey paper provides a comprehensive review of Event-Based Optical Marker Systems (EBOMS). We analyze the underlying principles and technologies on which these systems are based, with a special focus on their asynchronous operation and robustness against challenging lighting conditions. We also describe the most relevant applications of EBOMS, including object detection and tracking, pose estimation, and optical communication. The article concludes with a discussion of possible future research directions in this rapidly emerging and multidisciplinary area.