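arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

Vision and Robotics

Robotics / Embodied Intelligence

Robotics, embodied intelligence, robot learning, manipulation, navigation, and embodied world models.

Papers indexed for today / current date: 14. Sources: cs.RO, cs.AI, cs.CV, cs.LG
2606.18363 2026-06-18 cs.RO cs.AI New submission 95%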

Guava: An Effective and Universal Harness for Embodied Manipulation

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

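Affiliations: University of Maryland College Park; University of Illinois Urbana-Champaign; University of Waterloo; Mohamed bin Zayed University of Artificial Intelligence; University of Pennsylvania; Amazon FAR

Topic match (robot manipulation): Proposes Guava, a harness for embodied manipulation that combines reasoning with external modules.

AI summary: Proposes the Guava framework, built on three key designs (iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations), and distills embodied manipulation capability into a 4B open-source model that performs comparably to frontier proprietary models in simulation and real-world environments.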

Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles hold even for small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

2606.19340 2026-06-18 cs.RO New submission 90%

Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning

Jisoo Kim, Sangwon Baik, Taeksoo Kim, Sungjoo Kim, Junyoung Lee, Mingi Choi, Hanbyul Joo

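Affiliations: Seoul National University

Topic match (robot manipulation): A zero-shot framework for dexterous manipulation in which a VLM generates 3D plans.

AI summary: Proposes a zero-shot framework that uses a VLM to generate 3D task plans from multi-view RGB images, combines triangulation with ray voting for precise 3D grounding, supports grasping and tool use, and outperforms baseline methods in real-world experiments.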

Abstract

We present a zero-shot framework for long-horizon dexterous manipulation that grounds language instructions into executable 3D task plans from calibrated multi-view RGB images. Rather than training an end-to-end policy, our system uses a vision-language model (VLM) to produce reference-frame task grounding and primitive-level 2D keypoints, then lifts them into 3D via multi-view fusion. This lifting combines triangulation of view-wise VLM groundings with reference-view ray voting, which searches along a semantic camera ray for geometrically consistent candidates across neighboring views. The resulting 3D keypoints support both pick-and-place and tool-use: for tool-use, we retrieve an object-centric atomic action corresponding to the inferred skill category and align its stored 6D tool trajectory to the scene; for dexterous execution, we expand the lifted grasp keypoint into a task-conditioned grasp affordance region and generate feasible grasp-motion pairs with an arm-hand motion generator. Real-world experiments show improved 3D grounding accuracy and execution reliability over single-view RGB-D grounding and fine-tuned VLA baselines. We further demonstrate long-horizon manipulation through closed-loop status verification and replanning, enabling zero-shot execution on unseen objects and tool-use tasks in novel scenes.
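
The lifting step in this abstract relies on multi-view triangulation. As a rough illustration only, the sketch below triangulates one keypoint from two camera rays with the midpoint method. The abstract does not say which triangulation variant the authors use, and the ray-voting stage is not covered here; the function name `triangulate_midpoint` and the idea of passing rays as origin/direction tuples are assumptions made for this example.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _along(origin: Vec3, direction: Vec3, t: float) -> Vec3:
    return (origin[0] + t * direction[0],
            origin[1] + t * direction[1],
            origin[2] + t * direction[2])


def triangulate_midpoint(o1: Vec3, d1: Vec3, o2: Vec3, d2: Vec3) -> Vec3:
    """Return the midpoint of the shortest segment between two rays o + t*d.

    Hypothetical helper: each ray is the back-projection of a 2D keypoint
    from one calibrated camera, expressed in a shared world frame.
    """
    w = _sub(o1, o2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    # Closest-point parameters on each ray.
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1, p2 = _along(o1, d1, t1), _along(o2, d2, t2)
    return ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2, (p1[2] + p2[2]) / 2)
```

With noisy 2D keypoints the two rays no longer intersect, and the midpoint gives a simple compromise between them; with more than two views a least-squares formulation would be the usual choice.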

2606.19265 2026-06-18 cs.RO New submission 90%

Shape Sensing of Continuum Robots using Direct Laser Writing

Amber K. Rothe, Nidhi Malhotra, Jaydev P. Desai

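Affiliations: Medical Robotics and Automation (RoboMed) Laboratory, Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology

Topic match (robot manipulation): Shape sensing and closed-loop control of continuum robots.

AI summary: Uses direct laser writing to fabricate strain sensors integrated into a continuum robot joint; linear and nonlinear models predict the joint angle with error as low as 1.76 degrees, and closed-loop control achieves tracking error under 3 degrees.

Comments: This work has been submitted to the IEEE for possible publication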

Abstract

Continuum robots offer a promising approach for minimally invasive and natural-orifice surgical procedures due to their inherent compliance and dexterity. However, this flexibility also makes estimating the current shape of the robot challenging. Several approaches have been used to reconstruct the shape of these robots, including imaging, optical sensing, magnetic sensing, and resistive sensing. Strain sensors fabricated using direct laser writing (DLW) could provide an alternative sensing method. This technique involves using a laser to induce carbonization of certain polymers to create graphene patterns, such as strain sensors. In this paper, we demonstrate how a flexible continuum joint and a DLW sensor can be machined as one monolithic structure using the same laser and the same setup. The fabricated sensors are characterized using linear and nonlinear models, which are used to predict the joint angle with error as low as 1.76 degrees. Furthermore, we demonstrate how a DLW sensor can be used to implement closed-loop control in a robotic joint, achieving tracking error under 3 degrees.
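
The abstract mentions a linear model that maps sensor output to joint angle. A minimal sketch of such a calibration, assuming ordinary least squares on paired (sensor reading, joint angle) samples, might look like the following. The fitting method, the function names, and the units are assumptions for illustration; the abstract does not describe the authors' actual models, and any inputs you pass in are your own calibration data, not values from the paper.

```python
import math
from typing import List, Tuple


def fit_line(readings: List[float], angles: List[float]) -> Tuple[float, float]:
    """Least-squares fit of angle = slope * reading + intercept."""
    n = len(readings)
    if n < 2 or n != len(angles):
        raise ValueError("need at least two paired samples")
    mean_r = sum(readings) / n
    mean_a = sum(angles) / n
    sxx = sum((r - mean_r) ** 2 for r in readings)
    if sxx == 0.0:
        raise ValueError("all readings are identical; slope is undefined")
    sxy = sum((r - mean_r) * (a - mean_a) for r, a in zip(readings, angles))
    slope = sxy / sxx
    return slope, mean_a - slope * mean_r


def rmse(readings: List[float], angles: List[float],
         slope: float, intercept: float) -> float:
    """Root-mean-square angle error of the fitted line on the given samples."""
    sq = [(slope * r + intercept - a) ** 2 for r, a in zip(readings, angles)]
    return math.sqrt(sum(sq) / len(sq))
```

A nonlinear variant could replace the line with a low-order polynomial fitted the same way; which form suits a given sensor is an empirical question.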

2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY New submission 90%

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Thomas M. Kwok, Nicholas Koenig, Yue Hu

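Affiliations: Department of Mechanical and Mechatronics Engineering, University of Waterloo, Canada

Topic match (robot manipulation): An arm kinematic correction method for teleoperation.

AI summary: Proposes an arm kinematic correction method that uses the constant-arm-length geometric constraint and the Pythagorean theorem to deterministically reconstruct occluded joint depths without complex modeling; validated against Vicon and applied successfully to teleoperation.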

Abstract

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
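
The constant-arm-length idea can be sketched in a few lines. The example below assumes a reliable 3D wrist position, an elbow whose lateral (x, y) position is still usable, and a known forearm length, and recovers the elbow depth from the Pythagorean relation L^2 = dx^2 + dy^2 + dz^2. The function name, the argument layout, and the flag that resolves the front/back ambiguity are all assumptions for this illustration; the abstract does not give the paper's exact formulation.

```python
import math
from typing import Tuple


def reconstruct_elbow_depth(wrist: Tuple[float, float, float],
                            elbow_xy: Tuple[float, float],
                            forearm_length: float,
                            elbow_behind_wrist: bool = True) -> float:
    """Recover the elbow depth (z) from a fixed forearm length.

    All coordinates are metres in the camera frame. The square root has two
    solutions; `elbow_behind_wrist` (a hypothetical flag) picks one of them.
    """
    dx = elbow_xy[0] - wrist[0]
    dy = elbow_xy[1] - wrist[1]
    # Pythagorean theorem: dz^2 = L^2 - dx^2 - dy^2.
    dz_sq = forearm_length ** 2 - dx * dx - dy * dy
    # Clamp small negative values caused by measurement noise.
    dz = math.sqrt(max(dz_sq, 0.0))
    return wrist[2] + dz if elbow_behind_wrist else wrist[2] - dz
```

Because the result is a closed-form expression, there are no probabilistic parameters to tune, which matches the deterministic character the abstract emphasizes.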

2606.19091 2026-06-18 cs.RO New submission 90%

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

Zanjia Tong, Wenlong Dong, Chengjie Zhang, Hong Zhang

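Affiliations: Shenzhen Key Laboratory of Robotics and Computer Vision

Topic match (robot manipulation): Task-oriented grasping with active view planning.

AI summary: Proposes the GCNGrasp-VP framework, which uses affordance field prediction to guide active view planning without scene reconstruction; a single view adjustment significantly improves task-oriented grasp success under occlusion.

Comments: Accepted to IROS 2026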

Abstract

Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.

2606.18601 2026-06-18 cs.RO New submission 90%

Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection

Antara Banerjee, Colin Acton, Xu Chen

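Affiliations: University of Washington

Topic match (robot manipulation): Proposes an admittance control framework for precise robot-to-surface alignment.

AI summary: Proposes an admittance-based real-time closed-loop control framework that fuses operator input with perception-driven alignment to precisely align the robot end-effector with the local surface; validated on a 6-DOF manipulator with stable normal tracking and a mean orientation error of 0.4°.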

Abstract

Precision visual inspection underpins quality assurance across aerospace, semiconductor, and medical manufacturing, where undetected surface anomalies on high-value parts translate directly into scrap, rework, and field failures. Robotic visual inspection requires precise alignment between the end-effector and local surface geometry in the presence of perception noise and surface irregularities. In industrial settings, a human operator is often kept in the loop via teleoperation or shared autonomy, introducing real-time adjustments that render purely offline motion planning inadequate. This motivates control architectures capable of reactive, compliant behavior under combined human and perceptual uncertainty. This paper presents a novel real-time, closed-loop robotic orientation control pipeline for precision visual inspection, with an admittance-based framework that unifies operator input and perception-driven surface alignment. We design the end-effector as a virtual sphere moving through a viscous medium, such that the resulting physically interpretable mass-damper system generates synchronized, compliant motion from orientation error and operator commands. We validate the framework on a 6-DOF manipulator, demonstrating stable normal-tracking and a final mean orientation error of 0.4°.
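
A mass-damper admittance law of the kind described here can be sketched on a single rotational axis as M * dω/dt + D * ω = τ, where τ sums a term proportional to the orientation error and an operator command. The code below is a toy discretization under that assumption; the gain, mass, and damping values, the single-axis simplification, and the function names are illustrative choices, not parameters from the paper.

```python
from typing import Callable, Tuple


def admittance_step(omega: float, torque: float,
                    mass: float, damping: float, dt: float) -> float:
    """One explicit-Euler step of M * d(omega)/dt + D * omega = torque."""
    return omega + dt * (torque - damping * omega) / mass


def align(error0: float,
          gain: float = 4.0, mass: float = 1.0, damping: float = 4.0,
          dt: float = 0.01, steps: int = 1000,
          operator_torque: Callable[[int], float] = lambda k: 0.0
          ) -> Tuple[float, float]:
    """Drive a scalar orientation error toward zero through the virtual body.

    Returns the final (error, angular velocity). The virtual torque is the
    sum of a perception term (gain * error) and a hypothetical operator term.
    """
    error, omega = error0, 0.0
    for k in range(steps):
        torque = gain * error + operator_torque(k)
        omega = admittance_step(omega, torque, mass, damping, dt)
        error -= omega * dt  # rotating toward the surface normal shrinks the error
    return error, omega
```

With the default values the closed loop M * e'' + D * e' + K * e = 0 is critically damped, so the error decays without overshoot; lowering the damping makes the motion more responsive but oscillatory.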

2606.18594 2026-06-18 cs.RO cs.AI New submission 90%

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood

Affiliations: Department of Computing Science, University of Alberta; National Research Council Canada; School of Electrical Engineering and Computer Science, University of Ottawa; Vector Institute; Alberta Machine Intelligence Institute (Amii)

Topic match (robot manipulation): Benchmarks RL action spaces for vision-based robotic manipulation.

AI summary: Evaluates four action spaces on object picking and pushing tasks with sim-to-real transfer, finds that the joint velocity action space performs best in smoothness and task performance, and offers RL practitioners guidance on choosing an action space.

Comments: 9 pages with references

Abstract

In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.

2606.19314 2026-06-18 cs.RO New submission 85%

Modeling Branches for Active Manipulation using Iterative Parameter Estimation

Madhav Rijal, Rashik Shrestha, Trevor Smith, Yu Gu

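Affiliations: Department of Mechanical and Aerospace Engineering, West Virginia University

Topic match (robot manipulation): Plant branch modeling and active manipulation.

AI summary: Proposes a method that models plant branches by iteratively estimating material parameters, using finite element simulation and a deformation-aware motion planner for delicate branch manipulation; deformation energy drops by 35.69% on average.

Comments: Accepted to IROS 2026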

Abstract

This study presents a method for modeling diverse plant branches by iteratively estimating material parameters to support delicate branch manipulation. Branch manipulation is necessary in agricultural robotics for plant repositioning, stabilizing, and clearing visual obstructions in dense foliage. The proposed method builds a tetrahedral branch model from point-cloud data and simulates its behavior using the finite element method. Using real observed deformation data, it iteratively estimates branch parameters and then computes an optimal path with a deformation-aware motion planner to move and stabilize branches within another robot's field of view. Across 30 trials on branches with varying geometries and material properties, the proposed method reduced the deformation energy by 35.69% while increasing the path length by 8.10% on average.

2606.19233 2026-06-18 cs.RO New submission 85%

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot

Yue Qin, Yulun Zhuang, Zelin Shen, Yanran Ding

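Affiliations: Department of Robotics, University of Michigan

Topic match (robot manipulation): A wheeled bipedal robot slides objects with its legs, which counts as robot manipulation.

AI summary: Proposes a hierarchical control framework that lets a wheeled bipedal robot slide planar objects with its legs, using a reduced-order three-rigid-body dynamics model and a trajectory-optimization motion planner; experiments show retrieval of a 1 kg object and sliding of a 4 kg object.

Comments: 8 pages, 7 figures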

Abstract

In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to stably execute both rolling and object manipulation behaviors. A trajectory-optimization-based robot-object motion planner is developed to generate reference motions that incorporate stick-slip transitions in ground-object contact. Two representative pedipulation motions, namely scooting and lateral sliding, are validated through real-world hardware experiments, in which the robot successfully retrieves a 1 kg object from under a desk and slides a 4 kg object over a distance of 0.228 m via scooting.

2606.19194 2026-06-18 cs.RO New submission 85%

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

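Topic match (robot manipulation): An invertible neural network for action generation in robot manipulation.

AI summary: Proposes an invertible neural network adapter that generates high-dimensional actions through a one-step denoising process, reducing inference complexity while preserving accuracy and improving efficiency in simulation and real-world experiments.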

Abstract

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
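
To see why the number of integration steps matters for flow-matching policies, consider a toy scalar sampler that integrates dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps. This is a generic illustration of flow-matching sampling cost, not the invertible adapter itself; the function names and the scalar setting are assumptions. The second helper simply restates the latency figures quoted in the abstract as a relative reduction.

```python
from typing import Callable


def euler_sample(x0: float,
                 velocity: Callable[[float, float], float],
                 steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) over t in [0, 1] with `steps` Euler steps.

    Each step costs one evaluation of the velocity network, so inference
    cost grows linearly with `steps`.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        x += dt * velocity(x, k * dt)
    return x


def latency_reduction(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, e.g. 110 ms -> 61 ms is roughly 0.445."""
    return 1.0 - after_ms / before_ms
```

When the velocity is constant along the path, a single step lands on the same endpoint as many steps; when it depends on the state (for example `lambda x, t: x`), one step and ten steps disagree, which is the gap that one-step methods have to close by other means.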

2606.19089 2026-06-18 cs.RO New submission 85%

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

Alessandro Scherl, Bernhard Neuberger, Simon Schwaiger, David Mulero-Pérez, Lucas Muster, Jose Garcia-Rodriguez

Affiliations: Department of Computer Technology, University of Alicante; Department of Industrial Engineering, UAS Technikum Vienna; Automation and Control Institute, TU Wien; Institute of Software Engineering and Artificial Intelligence, Graz University of Technology; Institute for Integrative Nature Conservation Research, University of Natural Resources and Life Sciences Vienna

Topic match (robot manipulation): Visual servoing with adaptive resolution tiling.

AI summary: Proposes ART-VS, a coarse-to-fine two-phase method that adapts feature granularity and improves visual servoing robustness and precision without task-specific training, substantially reducing positioning error and increasing speed.

Comments: Accepted at IROS 2026

Abstract

Visual servoing with self-supervised Vision Transformer (ViT) features enables training-free robotic positioning with strong generalization, but faces a fundamental trade-off between robustness and precision. Coarse patch-level descriptors provide stable correspondences yet limit positioning accuracy. Increasing image resolution improves precision but yields only marginal robustness gains: under perturbation, high-resolution processing improves convergence success rate from 76.6% to just 81.0% despite 12x more ViT patches. Therefore, we propose Adaptive Resolution Tiling Visual Servoing (ART-VS), a two-phase method that adapts feature granularity to servoing progress: a coarse phase at native ViT resolution for stable alignment, then a tiled high-resolution phase that restricts matching to local neighborhoods, improving positioning accuracy. Without any task-specific training, ART-VS achieves 95.4% convergence under perturbation, outperforming standard and full-resolution ViT-based servoing by 18.8 and 14.4 percentage points. Over the former it reduces positioning error by 53%, while running at over 10x higher speed and 27% lower VRAM than the latter. We validate ART-VS across three ViT backbones and demonstrate real-world category-level grasping of unseen object instances, achieving 95/100 on transparent bottles and 98/100 on shoes. Code is available at https://art-vs.github.io/.

2606.18883 2026-06-18 cs.RO New submission 85%

ZiMPedance: Impedance-Aware ZMP Modeling and Control for Payload Carrying with Quadruped Robots

Giovanni B. Dessy, Lorenzo Amatucci, Victor Barasuol, Claudio Semini

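Affiliations: Dynamic Legged Systems Lab, Istituto Italiano di Tecnologia (IIT)

Topic match (robot manipulation): Payload carrying with quadruped robots using impedance-aware control.

AI summary: Extends the Zero Moment Point (ZMP) formulation to include passive payload-interface dynamics and combines it with model predictive control, reducing stability violations by up to 10x and improving locomotion efficiency.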

Abstract

Load transportation with quadruped robots is strongly affected by the dynamics of the physical interface between the robot and the load. Passive spring-based arms reduce weight and complexity compared to active manipulators, but their spring-damper dynamics can introduce oscillatory forces that degrade locomotion stability. This paper derives an extended Zero Moment Point (ZMP) formulation that includes passive payload-interface dynamics, relating stiffness, damping, and payload mass to the stability margin. The analysis shows that underdamped configurations can resonate with locomotion harmonics. Based on this insight, we augment a Single Rigid Body Dynamics model with passive subsystem dynamics and integrate it into a Model Predictive Control framework. In simulation, the proposed controller reduces stability violations by up to 10x, from 7.0% to 0.7%, and increases locomotion efficiency by lowering horizontal ground reaction force effort by up to 15% compared to a nominal baseline. Hardware experiments with a 2 kg payload show stable locomotion under pull-release disturbances where the nominal controller fails. The same model also enables end-effector tracking through passive arm dynamics without direct arm actuation.
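
The resonance concern can be made concrete with textbook second-order-system formulas. The sketch below computes the natural frequency and damping ratio of a spring-damper payload interface and the steady-state amplification of a sinusoidal disturbance at a given frequency ratio. These are standard mass-spring-damper relations, not the paper's extended ZMP formulation, and the function names and any stiffness or damping values you plug in are assumptions for illustration.

```python
import math
from typing import Tuple


def interface_modes(stiffness: float, damping: float,
                    payload_mass: float) -> Tuple[float, float]:
    """Return (natural frequency in Hz, damping ratio) of the payload interface.

    Units: stiffness in N/m, damping in N*s/m, mass in kg.
    A damping ratio below 1 means the interface is underdamped.
    """
    omega_n = math.sqrt(stiffness / payload_mass)          # rad/s
    zeta = damping / (2.0 * math.sqrt(stiffness * payload_mass))
    return omega_n / (2.0 * math.pi), zeta


def amplification(freq_ratio: float, zeta: float) -> float:
    """Steady-state displacement gain at forcing/natural frequency ratio r.

    Standard result: 1 / sqrt((1 - r^2)^2 + (2 * zeta * r)^2).
    """
    return 1.0 / math.sqrt((1.0 - freq_ratio ** 2) ** 2
                           + (2.0 * zeta * freq_ratio) ** 2)
```

For example, a 2 kg payload on a hypothetical 200 N/m, 4 N*s/m interface has a natural frequency near 1.6 Hz and a damping ratio of 0.1, so a gait harmonic at that frequency would be amplified about fivefold.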

2606.18628 2026-06-18 cs.RO New submission 80%

Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

Peibo Sun, Shiyuan Dong, Shucheng Ye, Jianrong Cai, Yushan Liu, Hongen Liao, Tianqi Huang, Fang Chen

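Affiliations: Shanghai Jiao Tong University; Shenzhen International Graduate School, Tsinghua University

Topic match (robot manipulation): A fault-tolerant method for FBG force sensing in minimally invasive surgical robots.

AI summary: To address force-estimation degradation caused by channel coupling and fiber fractures in FBG sensors for minimally invasive surgical robots, proposes a unified self-supervised mask-aware Transformer that, through masked-channel reconstruction pretraining and fine-tuning with a dynamic corruption curriculum, degrades gracefully under multi-channel failures and reaches an RMSE of 0.0066 N on an 8-channel dataset.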

Abstract

In minimally invasive surgical robotics, catheter-scale Fiber Bragg Grating (FBG) sensors are promising due to their ability to estimate multi-dimensional forces by multiplexing several optical channels. However, deploying these compact multi-channel sensors introduces two critical engineering challenges: inherent nonlinear cross-axis coupling during complex deformations, and intermittent channel dropouts caused by fiber fractures in constrained workspaces. These compounding issues severely degrade force estimation. Existing fault-tolerant approaches rely on combinatorial model banks, which scale exponentially with the channel count and demand prohibitively expensive per-pattern calibration. In this paper, we propose a unified, self-supervised mask-aware Transformer that explicitly models channel availability to enable graceful degradation under diverse and dynamic sensor failures. The encoder is pretrained via masked-channel reconstruction on unlabeled data streams and fine-tuned for force regression using a balanced clean-and-corrupted-view objective alongside a dynamic corruption curriculum. Furthermore, a parallel uncertainty head, trained via heteroscedastic Gaussian negative log-likelihood, predicts per-axis confidence in a single forward pass, circumventing the overhead of multi-pass ensembles. Evaluated on a catheter-scale 8-channel FBG dataset, our single unified model achieves a nominal Root Mean Square Error (RMSE) of 0.0066 N and degrades gracefully to 0.0126 N under severe 4-channel failures. This significantly outperforms a comprehensive model bank of 255 per-pattern neural networks (0.0154 N at 4-channel loss) while eliminating pattern-specific calibration.
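
Two quantities in this abstract are easy to check numerically. First, a bank of 255 models for 8 channels is consistent with one model per non-empty subset of working channels (2^8 - 1), although the abstract does not spell out the counting. Second, the heteroscedastic Gaussian negative log-likelihood has a standard per-sample form. The sketch below shows both; the function names and the choice to parameterize the variance by its logarithm are assumptions, not details taken from the paper.

```python
import math


def num_availability_patterns(channels: int) -> int:
    """Count non-empty subsets of working channels: 2**channels - 1."""
    return 2 ** channels - 1


def gaussian_nll(target: float, mean: float, log_var: float) -> float:
    """Per-sample Gaussian negative log-likelihood with predicted variance.

    NLL = 0.5 * (log(2*pi) + log_var + (target - mean)^2 / exp(log_var)).
    Predicting log-variance keeps the variance positive without clamping.
    """
    return 0.5 * (math.log(2.0 * math.pi) + log_var
                  + (target - mean) ** 2 / math.exp(log_var))
```

The exponential growth of the first function is what makes per-pattern model banks impractical as channel counts rise, while the second shows how a single head can trade a larger predicted variance against a larger residual.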

2606.18558 2026-06-18 cs.CV New submission 70%

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

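Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

Topic match (robot manipulation): Effectiveness is validated on robot manipulation.

AI summary: Proposes language-instructed 3D point motion forecasting, building a large-scale dataset and benchmark to predict class-agnostic, view-stable motion trajectories, and validates the approach on robot manipulation and video generation.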

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.