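arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

Vision and Robotics

Robotics / Embodied Intelligence

Robotics, embodied intelligence, robot learning, manipulation, navigation, and embodied world models.

2026-06-19 to 2026-06-19, 25 entries indexed, sources: cs.RO, cs.AI, cs.CV, cs.LG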
2606.19397 2026-06-19 cs.RO New submission 95%

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

Hongkang Cui, Rui He, Haoyao Chen

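Topic match: Robot manipulation. Proposes a visual servoing method based on Diffusion Policy for robotic manipulation and navigation.

AI summary: Proposes a visual servoing method based on Diffusion Policy that generates camera velocity through conditional denoising and uses online training to improve generalization, reaching a success rate of nearly 100% in simulation and 93% in physical experiments.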

Comments 8 pages, 4 figures, 7 tables

Abstract

Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
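The input described above, normalized image coordinates of tag corners, can be illustrated with the standard pinhole relation x = (u - cx) / fx, y = (v - cy) / fy. The sketch below is not taken from the paper: the function name, the intrinsics, and the corner pixels are hypothetical, and the paper's exact feature layout may differ.

```python
def normalize_corners(corners_px, fx, fy, cx, cy):
    """Convert pixel corners (u, v) into a flat list of normalized image coordinates."""
    features = []
    for u, v in corners_px:
        features.append((u - cx) / fx)  # x = (u - cx) / fx
        features.append((v - cy) / fy)  # y = (v - cy) / fy
    return features


# Hypothetical camera intrinsics and tag corners, chosen only for illustration.
corners = [(320.0, 240.0), (420.0, 240.0), (420.0, 340.0), (320.0, 340.0)]
observation = normalize_corners(corners, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(observation)
```

Four corners give an 8-dimensional observation that does not depend on image resolution, which is one plausible reason to condition a policy on normalized rather than pixel coordinates.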

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG New submission 95%

Human Universal Grasping

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

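Affiliations: New York University; Tsinghua University; University of Michigan

Topic match: Robot manipulation. Proposes the HUG model for zero-shot robotic grasping.

AI summary: Proposes HUG, a flow-matching model trained on human grasp data (the 1M-HUGs dataset) that generates diverse grasp poses from a single RGB-D image and retargets them to robot hands for zero-shot grasping, outperforming baselines by 23% and 34% on HUG-Bench.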

Comments 28 pages, 20 figures, 7 tables

Abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
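Sampling from a flow-matching model generally means integrating a learned velocity field from noise at t = 0 to a sample at t = 1. The sketch below shows that loop with Euler steps. It is an illustration under stated assumptions, not HUG itself: the learned network is replaced by a closed-form stand-in (`straight_line_velocity`) that always flows toward one fixed target, and the 3-D "grasp" vector is a hypothetical wrist translation only.

```python
import random


def straight_line_velocity(x, t, target):
    """Velocity of the path x_t = (1 - t) * x_0 + t * target, written as a function of x_t."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample(velocity_fn, dim, steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1, so the stand-in field stays finite
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.10, -0.05, 0.30]  # hypothetical wrist translation in metres
grasp = sample(lambda x, t: straight_line_velocity(x, t, target), dim=3)
print(grasp)
```

With a trained network in place of the stand-in, different noise seeds would land on different grasps, which is how a model of this kind can produce diverse outputs for one object.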

2603.04531 2026-06-19 cs.RO Version update 95%

PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous Manipulation

Rosy Chen, Mustafa Mukadam, Michael Kaess, Tingfan Wu, Francois R Hogan, Jitendra Malik, Akash Sharma

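Affiliations: Carnegie Mellon University; University of Washington; FAIR at Meta; UC Berkeley

Topic match: Robot manipulation. Proposes a tactile distillation method for dexterous manipulation tasks.

AI summary: Proposes PTLD, which distills a robust state estimator from real-world tactile policy data to sidestep the difficulty of tactile simulation, improving on proprioception-only policies by 182% and 57% on dexterous manipulation tasks.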

Abstract

Tactile dexterous manipulation is essential to automating complex household tasks, yet learning effective control policies remains a challenge. While recent work has relied on imitation learning, obtaining high-quality demonstrations for multi-fingered hands via robot teleoperation or kinesthetic teaching is prohibitive. Alternatively, with reinforcement learning we can learn skills in simulation, but fast and realistic simulation of tactile observations is challenging. To bridge this gap, we introduce PTLD: sim-to-real Privileged Tactile Latent Distillation, a novel approach to learning tactile manipulation skills without requiring tactile simulation. Instead of simulating tactile sensors or relying purely on proprioceptive policies to transfer zero-shot sim-to-real, our key idea is to leverage privileged sensors in the real world to collect real-world tactile policy data. This data is then used to distill a robust state estimator that operates on tactile input. Our experiments demonstrate that PTLD can significantly improve proprioceptive manipulation policies trained in simulation by incorporating tactile sensing. On the benchmark in-hand rotation task, PTLD achieves a 182% improvement over a proprioception-only policy. We also show that PTLD enables learning the challenging task of tactile in-hand reorientation, where we see a 57% improvement in the number of goals reached over using proprioception alone. Website: https://akashsharma02.github.io/ptld-website/.

2606.20562 2026-06-19 cs.RO New submission 90%

MemoryWAM: Efficient World Action Modeling with Persistent Memory

Sizhe Yang, Juncheng Mu, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu, Zhengrong Xue, Zhecheng Yuan, Dahua Lin, Jiangmiao Pang, Huazhe Xu

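Affiliations: The Chinese University of Hong Kong; Tsinghua University; Zhejiang University

Topic match: Robot manipulation. World action modeling and memory in robot manipulation.

AI summary: Proposes MemoryWAM, which uses a hybrid memory design and a tailored attention mechanism to make efficient memory-dependent decisions in long-horizon robot manipulation tasks, outperforming existing VLA and WAM baselines.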

Abstract

Robust robotic manipulation in the real world requires not only an understanding of the current observation, but also memory and dynamics modeling. World action models (WAMs) possess these capabilities by jointly modeling visual foresight and actions conditioned on both current and historical observations, making them a promising paradigm for robotic manipulation. However, existing WAMs face a fundamental trade-off: methods with efficient inference typically condition only on a bounded window of recent observations and therefore struggle in non-Markovian environments, whereas methods that preserve long histories incur time and space costs that grow substantially with sequence length. To address this challenge, we introduce MemoryWAM, a world action model with efficient persistent memory. MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history. A tailored attention mechanism enables retrieval of both detailed short-term context and compressed long-term context, supporting memory-dependent decision-making with reduced inference latency and GPU memory usage. Across long-horizon, memory-dependent manipulation tasks in both simulation and the real world, MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines while maintaining favorable computational efficiency.
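The three-part memory described above (recent frames, event-boundary anchor frames, and compact gist tokens) can be pictured as a small data structure. This is a toy sketch, not MemoryWAM's implementation: the class name, the scalar "frames", and the use of a plain average as the gist are all assumptions made for illustration, and the real model learns its compression and retrieves it through attention.

```python
from collections import deque


class HybridMemory:
    """Toy memory: a bounded recent window, kept anchor frames, and averaged gists."""

    def __init__(self, recent_size=4, gist_chunk=4):
        self.recent = deque(maxlen=recent_size)  # detailed short-term context
        self.anchors = []                        # frames at event boundaries
        self.gists = []                          # compressed long-range history
        self._pending = []                       # evicted frames awaiting compression
        self.gist_chunk = gist_chunk

    def add(self, frame, is_event_boundary=False):
        if is_event_boundary:
            self.anchors.append(frame)
        if len(self.recent) == self.recent.maxlen:
            self._pending.append(self.recent[0])  # oldest frame is about to be evicted
            if len(self._pending) == self.gist_chunk:
                # Stand-in compression: average the evicted frames into one gist.
                self.gists.append(sum(self._pending) / self.gist_chunk)
                self._pending = []
        self.recent.append(frame)

    def context(self):
        return self.gists + self.anchors + list(self.recent)


memory = HybridMemory()
for step in range(12):
    memory.add(float(step), is_event_boundary=(step == 5))
print(memory.context())
```

After 12 frames the context holds 7 items rather than 12, and it grows by one gist per four evicted frames instead of one entry per frame, which is the kind of bounded growth the abstract attributes to persistent memory.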

2606.20193 2026-06-19 cs.RO New submission 90%

Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation

Boya Zhang, Andreas Zell, Georg Martius

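Affiliations: University of Tübingen; Max Planck Institute for Intelligent Systems

Topic match: Robot manipulation. A soft belt-driven gripper for dexterous in-hand manipulation.

AI summary: Proposes a double-soft-belt finger module that adds three in-hand degrees of freedom (translation, pitch, and roll) to a parallel gripper, improving dexterity while staying low-cost and easy to integrate; validated with MPC and teleoperation.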

Abstract

Parallel-jaw grippers are the default manipulator choice in robotics because they are simple, robust, and inexpensive. Their limited in-hand mobility, however, often forces large arm motions and restricts dexterous manipulation in confined workspaces. We present a parallel-gripper upgrade: a double-soft-belt-based finger module that preserves standard opening/closing while adding three in-hand degrees of freedom (DoF): translation, pitch, and roll. The mechanism is deliberately kept simple and engineered for inexpensive manufacturing and straightforward integration, preserving the reliability and precise control of traditional parallel grippers while greatly broadening the range of manipulation capabilities. To demonstrate the utility of the added DoFs, we integrate the gripper in two control pipelines. First, we adapt a model predictive controller for in-hand manipulation of known objects. Second, we introduce a lightweight teleoperation interface that enables simultaneous control of the robot arm and gripper (10 DoFs total) with minimal hardware. Across a suite of challenging manipulation tasks executed via teleoperation, MPC, and trained policies, the proposed gripper consistently improves dexterity and task feasibility compared to a conventional parallel gripper.

2606.20135 2026-06-19 cs.RO cs.AI New submission 90%

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu, Yu Li, Yishuai Cai, Yuanpei Chen, Yikun Ban, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Simin Li

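Affiliations: Beihang University; Peking University; The Chinese University of Hong Kong; PKU-Psibot Lab; Zhongguancun Laboratory; Hefei Comprehensive National Science Center

Topic match: Robot manipulation. Frequency-aware flow matching for robot action generation.

AI summary: Proposes Frequency-Aware Flow Matching (FAFM), which transforms discrete action sequences into the frequency domain with the discrete cosine transform, performs flow matching there, and regularizes the first-order temporal derivative to generate smooth, continuous actions, improving success rate, multimodal expressivity, and motion smoothness.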

Abstract

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters, and applies to standalone flow-matching policies and vision-language-action models. Across a synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, and robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at https://anonymous.4open.science/r/FAFM.
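The DCT step and the cosine-basis reconstruction can be sketched in a few lines. The example assumes an orthonormal DCT-II over a single action dimension; the abstract does not state which DCT variant or normalization FAFM uses, and the action values below are made up. Flow matching over the coefficients and the derivative regularizer are not shown.

```python
import math


def dct_coefficients(actions):
    """Orthonormal DCT-II of a 1-D action sequence."""
    n = len(actions)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(a * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, a in enumerate(actions))
        coeffs.append(scale * total)
    return coeffs


def reconstruct(coeffs, u):
    """Evaluate the cosine-basis expansion at a continuous sample position u."""
    n = len(coeffs)
    value = 0.0
    for k, c in enumerate(coeffs):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        value += scale * c * math.cos(math.pi * (u + 0.5) * k / n)
    return value


actions = [0.0, 0.1, 0.3, 0.6, 0.8, 0.9]  # hypothetical action chunk
coeffs = dct_coefficients(actions)
# Query the same trajectory at twice the original control frequency.
resampled = [reconstruct(coeffs, i / 2) for i in range(2 * len(actions) - 1)]
print(resampled)
```

Because `reconstruct` accepts any real-valued position, the same coefficients can be queried at whatever rate a controller runs, and at integer positions the expansion reproduces the original samples.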

2606.20118 2026-06-19 cs.RO cs.LG New submission 90%

Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation

Jonghoon Lee, Seong Hyeon Park, Byungwoo Jeon, Minha Lee, Jinwoo Shin

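Affiliations: KAIST; Korea University; RLWRLD

Topic match: Robot manipulation. A data augmentation framework that improves VLA policy generalization.

AI summary: Proposes Pose6DAug, a failure-driven data augmentation framework that swaps the object in successful episodes using a 3D mesh and a 6D pose trajectory, producing multi-view-consistent, physically plausible demonstrations without additional data collection and raising VLA success rates on novel objects by 16.5%.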

Abstract

Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.

2606.19980 2026-06-19 cs.AI New submission 90%

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Wenli Xiao, Jia Xie, Tonghe Zhang, Haotian Lin, Letian "Max" Fu, Haoru Xue, Jalen Lu, Yi Yang, Cunxi Dai, Zi Wang, Jimmy Wu, Guanzhi Wang, S. Shankar Sastry, Ken Goldberg, Linxi "Jim" Fan, Yuke Zhu, Guanya Shi

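Affiliations: NVIDIA; CMU; UC Berkeley

Topic match: Robot manipulation. Proposes the ENPIRE framework for robot policy self-improvement.

AI summary: Proposes ENPIRE, a framework whose closed feedback loop of environment reset, policy execution, outcome verification, and iterative refinement lets coding agents autonomously improve robot manipulation policies, reaching a 99% success rate on dexterous manipulation tasks.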

Abstract

Achieving dexterous robotic manipulation in the real world heavily relies on human supervision and algorithm engineering, which becomes a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.

2606.19897 2026-06-19 cs.RO New submission 90%

One-to-Two Acting: A Novel Framework for Single-arm Agent Action Expansion to Dual Arms

Youbin Yao, Nieqin Cao, Mingyan Li, Yan Ding, Fuqiang Gu, Chao Chen

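Affiliations: Chongqing University; Xi’an Jiaotong-Liverpool University; Lumos Robotics

Topic match: Robot manipulation. A dual-arm manipulation framework learned from single-arm supervision.

AI summary: Proposes ExS2D, a hierarchical action expansion framework that achieves dual-arm manipulation from single-arm supervision through temporal precedence extraction, subtask-guided action mapping, and collision-avoiding coordinated planning, cutting execution steps by 54.4% in simulation while maintaining the success rate.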

Comments 6 pages, 5 figures, 3 tables

Abstract

Dual-arm manipulation can improve throughput via parallel execution, but collecting bimanual demonstrations for training is costly and difficult. We present ExS2D, a hierarchical action expansion framework that enables dual-arm manipulation from single-arm supervision. ExS2D first generates structured subtasks from textual instructions while explicitly capturing temporal precedence. It then grounds each subtask into executable actions through subtask-guided action mapping in observation. Finally, a coordinator driven by a multimodal large language model performs precedence-aware action allocation and synchronized planning to select collision-free dual-arm executions. Simulation experiments demonstrate that ExS2D reduces the average number of execution steps by 54.4% while maintaining a success rate comparable to a single-arm baseline. Real-robot experiments on four tasks further demonstrate the reliability of ExS2D for dual-arm execution from few-shot single-arm samples, using zero bimanual demonstrations.
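Precedence-aware allocation can be illustrated with plain greedy list scheduling: at each step, run as many subtasks as there are arms, but only those whose prerequisites are finished. This is a generic stand-in, not ExS2D's coordinator, which is driven by a multimodal large language model and also handles collisions. The subtask names and the `requires` map are hypothetical, and every subtask is assumed to take one step.

```python
def schedule(subtasks, requires, num_arms=2):
    """Greedy list scheduling: each step runs up to num_arms ready subtasks."""
    done, steps = set(), []
    remaining = list(subtasks)
    while remaining:
        # A subtask is ready once all of its prerequisites are done.
        ready = [s for s in remaining
                 if all(p in done for p in requires.get(s, []))]
        if not ready:
            raise ValueError("cyclic precedence constraints")
        batch = ready[:num_arms]  # one subtask per available arm
        steps.append(batch)
        done.update(batch)
        remaining = [s for s in remaining if s not in done]
    return steps


subtasks = ["pick_cup", "pick_plate", "place_cup", "place_plate"]
requires = {"place_cup": ["pick_cup"], "place_plate": ["pick_plate"]}
dual = schedule(subtasks, requires, num_arms=2)
single = schedule(subtasks, requires, num_arms=1)
print(len(single), len(dual))
```

In this toy case two arms halve the step count because the two pick-and-place chains are independent; tasks with longer precedence chains would parallelize less.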

2606.19358 2026-06-19 cs.RO New submission 90%

WorkBenchMark: A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League

Wenbo Ma, Daniel Swoboda, Matteo Tschesche, Till Hofmann

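Affiliations: Chair of Machine Learning and Reasoning (i6), RWTH Aachen University; MASCOR Institute, FH Aachen University of Applied Science

Topic match: Robot manipulation. A LEGO-based robotic assembly benchmark.

AI summary: Proposes a LEGO Duplo-based robotic assembly benchmark with 400 tasks across four complexity tiers, together with a planning-based baseline that outperforms a modern vision-language-action approach on every tier.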

Comments RoboCup Symposium 2026 accepted paper

Abstract

We introduce WorkBenchMark, a LEGO Duplo-based robotic assembly benchmark motivated by the RoboCup Smart Manufacturing League. Robotic assembly couples low-level manipulation with task-level symbolic reasoning under physical constraints, a combination that current end-to-end learning methods do not yet solve reliably. The benchmark provides 400 tasks across four complexity tiers. We provide a baseline solution that combines open-vocabulary perception with an Assembly-by-Disassembly approach. Our planning-based pipeline outperforms a modern vision-language-action approach across all tiers. The benchmark, simulation environment, and baseline implementation will be released openly to support the broader robotic assembly community.
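The general idea behind assembly-by-disassembly is to plan how a finished structure could be taken apart and then reverse that order. The sketch below shows this on a tiny brick stack. It illustrates the general technique only and makes no claim about the paper's baseline: the `rests_on` map, the brick names, and the rule "a brick can be removed when nothing rests on it" are assumptions.

```python
def disassembly_order(rests_on):
    """rests_on[b] is the set of bricks that brick b sits on."""
    remaining = set(rests_on)
    order = []
    while remaining:
        # Bricks that still carry another remaining brick cannot be removed yet.
        blocked = {support for b in remaining
                   for support in rests_on[b] if support in remaining}
        free = sorted(remaining - blocked)
        if not free:
            raise ValueError("no removable brick: support relation is cyclic")
        order.append(free[0])
        remaining.remove(free[0])
    return order


def assembly_order(rests_on):
    """Reverse a valid disassembly sequence to obtain an assembly sequence."""
    return list(reversed(disassembly_order(rests_on)))


rests_on = {
    "base": set(),
    "red": {"base"},
    "blue": {"base"},
    "roof": {"red", "blue"},
}
plan = assembly_order(rests_on)
print(plan)
```

Every brick in the resulting plan is placed only after all of its supports, which is the constraint that makes the reversed sequence physically executable.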

2606.15516 2026-06-19 cs.RO New submission 90%

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

Soofiyan Atar, Yao-Ting Huang, Michael Yip

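Affiliations: University of California San Diego

Topic match: Robot manipulation. Compliant grasping across dexterous hands, a robot manipulation topic.

AI summary: Proposes a cross-embodiment force-position interface that enables contact-aware grasping across heterogeneous dexterous hands via calibrated joint torques and fingertip forces; combined with a flow-matching visuomotor policy and a hybrid force-position controller, it achieves transferable compliant grasping.

Comments Website (overview): transferring-contact-not-just-motion.github.io/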

Abstract

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N·m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.
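One standard way to relate joint torques to a fingertip force is the static relation tau = J^T f, where J is the finger Jacobian. The sketch below applies it to a planar two-link finger. The abstract does not say how the paper performs this mapping, so this is the textbook relation rather than the authors' method, and the link lengths, joint angles, and force are hypothetical.

```python
import math


def finger_jacobian(q1, q2, l1, l2):
    """Jacobian of a planar two-link finger (joint velocities -> fingertip velocity)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def torques_from_force(jac, force):
    """tau = J^T f."""
    return [jac[0][0] * force[0] + jac[1][0] * force[1],
            jac[0][1] * force[0] + jac[1][1] * force[1]]


def force_from_torques(jac, tau):
    """Solve J^T f = tau for the fingertip force (needs a non-singular pose)."""
    a, b = jac[0][0], jac[1][0]  # first row of J^T
    c, d = jac[0][1], jac[1][1]  # second row of J^T
    det = a * d - b * c
    if abs(det) < 1e-9:
        raise ValueError("finger is at a singular configuration")
    return [(d * tau[0] - b * tau[1]) / det,
            (-c * tau[0] + a * tau[1]) / det]


# Hypothetical finger: 5 cm and 4 cm links, joint angles in radians.
jac = finger_jacobian(0.4, 0.9, 0.05, 0.04)
tau = torques_from_force(jac, [0.5, -2.0])  # torques in N·m for a force in N
force = force_from_torques(jac, tau)
print(force)
```

Expressing load as a fingertip force rather than raw motor effort is what lets hands with different kinematics report comparable contact observations, since the Jacobian absorbs the hand-specific geometry.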

2510.08807 2026-06-19 cs.RO cs.LG Version update 90%

Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharov, Vitor Guizilini, Yue Wang

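Affiliations: University of Southern California; Toyota Research Institute

Topic match: Robot manipulation. Provides a dataset for dexterous humanoid manipulation with 260 tasks.

AI summary: Presents Humanoid Everyday, a multimodal dataset of 10.3k trajectories across 260 tasks for research on dexterous humanoid manipulation, human-humanoid interaction, and locomotion-integrated manipulation, accompanied by a cloud-based evaluation platform.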

Abstract

From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data spanning 260 tasks in 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.

2606.20426 2026-06-19 cs.RO New submission 85%

TaCauchy: An Extensible FEM Framework for Vision-Based Tactile Simulation

Hengfei Zhao, Yifan Xie, Junhao Gong, Yue Sun, Kai Zhu, Weihua He, Shoujie Li, Haohuan Fu, Wenbo Ding

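Affiliations: Shenzhen International Graduate School, Tsinghua University; Huawei Inc.

Topic match: Robot manipulation. A tactile simulation framework for force computation in robot manipulation.

AI summary: Proposes TaCauchy, a framework that integrates the finite element method into Isaac Sim on top of the UIPC solver, computes Cauchy stress tensors directly, and projects them into contact forces for high-fidelity tactile simulation; it supports multiple sensors and reaches SSIM above 0.93 in physical validation.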

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026

Abstract

Vision-based tactile sensors require high-fidelity simulation for reinforcement learning, yet existing approaches struggle to provide accurate mechanical stress fields within GPU-accelerated robotics platforms. We present TaCauchy, an extensible Finite Element Method (FEM) framework that integrates rigorous physics-based force computation into Isaac Sim. Built on the Unified Incremental Potential Contact (UIPC) solver, TaCauchy directly computes Cauchy stress tensors from hyperelastic constitutive laws and projects them onto contact surfaces to obtain traction forces and pressure distributions, providing mechanical ground truth from first principles rather than empirical estimation. Our framework features automatic mesh generation with geometry-aware adaptive refinement and a modular sensor interface enabling rapid integration of diverse sensors (GelSight Mini, DIGIT, 9DTact) with minimal configuration. Performance benchmarks demonstrate 33.40 FPS for single environments and 555 FPS aggregate throughput across 60 parallel environments, with stress extraction overhead under 1 ms. Physical validation experiments show strong agreement between simulated and real tactile responses across force ranges from 1.2556 N to 4.7332 N, achieving SSIM above 0.93, confirming the framework's capability to provide accurate, physically-grounded force supervision for downstream robotic manipulation tasks.
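Computing a Cauchy stress from a hyperelastic law and projecting it onto a contact surface can be shown for a single deformation gradient F. The sketch assumes a compressible Neo-Hookean material, for which sigma = (mu / J)(F F^T - I) + (lambda ln J / J) I with J = det F. The abstract does not name TaCauchy's constitutive model, so the material law, the Lamé parameters, and the 10% compression below are all assumptions.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cauchy_stress(F, mu, lam):
    """sigma = (mu / J)(F F^T - I) + (lam ln J / J) I, compressible Neo-Hookean."""
    J = det3(F)
    if J <= 0.0:
        raise ValueError("deformation gradient must have positive determinant")
    sigma = []
    for i in range(3):
        row = []
        for j in range(3):
            b_ij = sum(F[i][k] * F[j][k] for k in range(3))  # entry of F F^T
            delta = 1.0 if i == j else 0.0
            row.append(mu / J * (b_ij - delta) + lam * math.log(J) / J * delta)
        sigma.append(row)
    return sigma


def traction(sigma, normal):
    """Traction vector t = sigma n on a surface with unit normal n."""
    return [sum(sigma[i][j] * normal[j] for j in range(3)) for i in range(3)]


# Hypothetical gel element compressed by 10% along z, parameters in Pa.
F = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.9]]
sigma = cauchy_stress(F, mu=1.0e5, lam=4.0e5)
t = traction(sigma, [0.0, 0.0, 1.0])
pressure = -t[2]  # compressive normal stress reported as positive pressure
print(pressure)
```

In a finite element solver the same two steps would run per element on the contact surface, and integrating the tractions over the surface would give the net contact force.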

2606.20285 2026-06-19 cs.RO New submission 85%

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

Yandong Wang, Jiaqian Yu, Xiongfeng Peng, Lu Xu, Yamin Mao, Weiming Li, Jaewook Yoo, Dongwook Lee, Daehyun Ji, Mingbo Zhao, Chao Zhang

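Affiliations: Donghua University; Samsung R&D Institute China-Beijing (SRCB); Samsung AI Center, DS Division

Topic match: Robot manipulation. Focuses on dual-arm robot manipulation tasks.

AI summary: To address insufficient implicit coordination in tightly coupled dual-arm tasks, proposes Co-VLA, a framework that introduces explicit coordination priors through a Structured Action Expert and a Latent-Aware Controller, markedly improving success rate and efficiency in simulation and real-world settings.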

Abstract

Vision-language-action (VLA) models show strong capabilities in single and dual-arm robotic manipulation. Prior works show coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework introducing explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

2606.20120 2026-06-19 cs.RO cs.AI New submission 85%

Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform

用于将自然语言协议翻译为机器人实验室平台的双智能体跨模型验证框架

Hyeonna Choi, Jung Yup Kim, Hyuneui Lim, Seunggyu Jeon

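Affiliations * Department of Bionic Machinery, Research Institute of AI Robot, Korea Institute of Machinery & Materials

Topic match Robot manipulation: a dual-agent framework that translates natural-language protocols into commands for a robotic platform.

AI summary Proposes a dual-agent framework in which a parser formalizes the protocol, a rule-based mapping engine generates control commands, and a heterogeneous LLM validator corrects errors, converting natural-language microplate protocols into commands executable on a robotic platform; end-to-end autonomous execution is validated.

Details
AI Chinese abstract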

生物实验协议以自然语言编写,而自动化系统依赖预定义控制命令,这造成了限制自主执行的语义鸿沟。微孔板自动实验由于需要同时控制孔映射、样本-试剂组合、重复放置和平行分配而尤其具有挑战性。本研究提出一种基于智能体的协议翻译框架,将自然语言微孔板协议转换为机器人实验室平台的可执行控制命令。解析器智能体将自然语言协议形式化为结构化表示,基于规则的映射引擎确定性地融入机器人实验室平台的操作约束以生成设备级控制命令。异构LLM验证器检查完整性、参数准确性和执行顺序,并在检测到错误时触发带有结构化反馈的自校正循环。在随机选择的ELISA协议上对7个解析器和3个验证器进行扫描,评估模型规模和验证器类型在跨模型验证下对翻译准确率和通过率的影响。通过将所提框架的基于规则映射与LLM端到端直接映射进行比较,进一步验证了准确率-延迟权衡。最后,在机器人实验室平台上演示了基于Bradford法的微孔板蛋白质定量,验证了从自然语言协议到真实实验的端到端自主执行。所提框架为缩小自然语言协议与基于微孔板的自主实验室之间的语义鸿沟提供了一种灵活方法。

English abstract

Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.
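
The parse, map, validate, and self-correct loop described in this abstract can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Step` fields, the command string format, the `MAX_VOLUME_UL` limit, and the `parse` callable (a stand-in for the LLM Parser Agent) are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass

MAX_VOLUME_UL = 300.0  # assumed per-well limit, for illustration only


@dataclass
class Step:
    action: str       # e.g. "dispense"
    reagent: str
    volume_ul: float
    wells: list       # e.g. ["A1", "A2"]


def map_to_commands(steps):
    """Deterministic rule-based mapping: one command string per (step, well)."""
    commands = []
    for step in steps:
        for well in step.wells:
            commands.append(
                f"{step.action.upper()} {step.reagent} {step.volume_ul:g}uL {well}"
            )
    return commands


def validate(steps):
    """Stand-in for the validator: simple completeness and parameter checks."""
    errors = []
    for i, step in enumerate(steps):
        if not step.wells:
            errors.append(f"step {i}: no target wells")
        if not 0 < step.volume_ul <= MAX_VOLUME_UL:
            errors.append(f"step {i}: volume {step.volume_ul:g} uL out of range")
    return errors


def translate(protocol, parse, max_rounds=3):
    """Parse, map, and validate; feed errors back to the parser until clean."""
    feedback = []
    for _ in range(max_rounds):
        steps = parse(protocol, feedback)   # parser sees structured feedback
        feedback = validate(steps)
        if not feedback:
            return map_to_commands(steps)
    raise ValueError(f"unresolved validation errors: {feedback}")
```

Keeping the mapping deterministic means a validation failure can only come from the parsed structure, so the feedback passed back to the parser stays specific.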

2606.20092 2026-06-19 cs.CV New submission 85%

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

EventVLA: 面向长程视觉-语言-动作策略的事件驱动视觉证据记忆

Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

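Affiliations * University of Science and Technology of China; Shanghai AI Laboratory; Shanghai Jiao Tong University; Dalian University of Technology; Huawei Technologies Co., Ltd.; The University of Hong Kong; Tsinghua University; Peking University

Topic match Robot manipulation: a memory method for long-horizon robot manipulation.

AI summary Targets the memory bottleneck in long-horizon robot manipulation with EventVLA, a framework whose dynamic Keyframe Evidence Memory module autonomously captures task-critical visual events; average success rate improves by 40% across 17 simulated and 4 real-world tasks.

Details
AI Chinese abstract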

记忆仍然是长程机器人操作的关键瓶颈,因为标准的视觉-语言-动作(VLA)策略在任务相关线索随时间变得遮挡或不可观测时常常失败。虽然现有的记忆增强方法利用历史上下文,但它们要么遭受严重的信息瓶颈,通过解耦的双系统引入高延迟,要么依赖积累大量视觉冗余的无选择性缓冲区。为了解决这些限制,我们引入了EventVLA,一个基于稀疏视觉证据记忆概念的端到端框架,包含两个核心组件:用于保留初始和短期上下文的基础视觉锚点,以及动态关键帧证据记忆(KEM)模块。具体来说,KEM直接从VLA的潜在嵌入中预测未来关键帧概率,以自主捕获和存储稀疏的、任务关键的视觉事件。这种前瞻驱动的机制使策略能够动态评估当前观测的未来因果效用,在瞬态视觉证据变得不可观测之前将其保留。此外,我们提出了RoboTwin-MeM,一个专门设计用于评估具有交互式视觉证据的非马尔可夫操作任务的诊断基准。大量评估表明,在17个需要记忆的模拟任务和4个真实世界双臂任务中,EventVLA相比最先进的记忆增强VLA实现了平均成功率提升+40%。

English abstract

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
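
A minimal sketch of the sparse-memory idea in this abstract: keep a fixed visual anchor plus a small set of frames whose predicted keyframe probability is high. The `KeyframeMemory` class, the fixed threshold, and the bounded capacity are assumptions for illustration. In the paper the probabilities come from the VLA's latent embeddings, whereas here they are simply passed in.

```python
from collections import deque


class KeyframeMemory:
    """Toy sparse visual memory: one anchor frame plus recent keyframes."""

    def __init__(self, capacity=4, threshold=0.5):
        self.anchor = None                       # first observation, always kept
        self.keyframes = deque(maxlen=capacity)  # oldest keyframe is evicted first
        self.threshold = threshold

    def update(self, frame, keyframe_prob):
        """Store the frame if it is the first one or looks like a keyframe."""
        if self.anchor is None:
            self.anchor = frame
            return
        if keyframe_prob >= self.threshold:
            self.keyframes.append(frame)

    def context(self, current_frame):
        """Frames a policy would condition on: anchor, keyframes, current view."""
        return [self.anchor, *self.keyframes, current_frame]
```

Frames with low predicted probability never enter the buffer, which is what keeps the context short compared with an unselective history window.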

2606.19586 2026-06-19 cs.RO New submission 85%

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies

一个演示胜过千条轨迹:用于视觉运动策略的动作-视角增强

Chuer Pan, Litian Liang, Dominik Bauer, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Shuran Song

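Affiliations * Stanford University; Columbia University; Toyota Research Institute

Topic match Robot manipulation: an action-view augmentation framework that raises manipulation policy success rates.

AI summary Proposes a data augmentation framework that uses Gaussian Splatting and trajectory optimization to generate realistic fisheye image sequences and physically feasible action trajectories, improving manipulation policy success under scene changes and obstacles.

Comments Project website: https://chuerpan.com/1001-demos.github.io/. Published at CoRL 2025

Journal ref Proceedings of The 9th Conference on Robot Learning, PMLR 305:3902-3914, 2025

Details
AI Chinese abstract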

用于操作的视觉运动策略在建模复杂机器人行为方面展现出显著潜力,但机器人初始配置的微小变化和未见障碍物容易导致分布外观测。在没有大量数据收集工作的情况下,这些会导致灾难性的执行失败。在这项工作中,我们引入了一个有效的数据增强框架,该框架从真实世界的眼在手演示中生成视觉上逼真的鱼眼图像序列和相应的物理上可行的动作轨迹,这些演示使用带有单个鱼眼摄像头的便携式平行夹爪捕获。我们引入了一种新颖的高斯泼溅公式,适用于广角鱼眼摄像头,以重建和编辑带有未见物体的3D场景。我们利用轨迹优化生成平滑、无碰撞、视图渲染友好的动作轨迹,并从相应新视角渲染视觉观测。在仿真和现实世界中的综合实验表明,我们的增强框架提高了各种操作任务在相同场景和需要避障的增强场景中的成功率。

English abstract

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.

2606.18960 2026-06-19 cs.CV cs.RO New submission 85%

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

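Affiliations * Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

Topic match Robot manipulation: a memory-augmented world model for robot manipulation.

AI summary Proposes Mem-World, which uses W-VMem, a 4D wrist-view-centered surfel-indexed memory, to counter scene forgetting caused by occlusion and motion during manipulation, enabling persistent world modeling and better policy evaluation and improvement.

Details
AI Chinese abstract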

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

English abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
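
The policy-evaluation claim rests on the Pearson correlation between success rates estimated inside the world model and those measured on the real robot. The sketch below shows how that statistic is computed from two equal-length lists. The function name and its inputs are illustrative, and the abstract does not say whether the 14.5% gain is absolute or relative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)
```

A value near 1 means that ranking policies by their world-model rollouts would agree closely with ranking them by real-world trials.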

2509.10416 2026-06-19 cs.RO Version update 85%

TASC: Task-Aware Shared Control for Relational Telemanipulation

TASC:面向关系遥操作的任务感知共享控制

Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry

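Affiliations * KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics; KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images

Topic match Robot manipulation: a shared-control framework for telemanipulation, involving robot manipulation and task-level intent inference.

AI summary Proposes the TASC framework, which builds an open-vocabulary interaction graph from vision to infer task-level user intent and provides shared-control assistance guided by spatial constraints, improving the efficiency and generalization of relational telemanipulation.

Comments Accepted to IROS 2026

Details
AI Chinese abstract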

我们提出了TASC,一个面向关系遥操作的任务感知共享控制框架,该框架从仅运动输入中推断任务级用户意图并提供辅助。为了在没有预定义模板的情况下支持抓取关系任务,TASC从视觉输入构建一个开放词汇的交互图来表示功能性物体关系,并据此推断用户意图。然后,共享控制策略在抓取和物体交互过程中提供辅助,该辅助由视觉语言模型预测的空间约束引导。我们的方法解决了共享控制下关系遥操作的两个关键挑战:(1)从低级运动命令中推断任务级意图,以及(2)跨不同物体和任务的泛化辅助。在仿真和真实世界的实验表明,与先前方法相比,TASC提高了任务效率并减少了用户输入努力,同时实现了跨多种关系遥操作任务的零样本泛化。支持我们实验的代码已在 https://github.com/fitz0401/tasc 公开提供。

English abstract

We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.

2509.00271 2026-06-19 cs.RO Version update 85%

Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online

从我们所拥有的学习:在线推理过去交互的历史感知验证器

Yishu Li, Xinyi Mao, Ying Yuan, Kyutae Sim, Ben Eisner, David Held

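Affiliations * Robotics Institute, Carnegie Mellon University; Computer Science and Technology, Tsinghua University

Topic match Robot manipulation: a history-aware verifier for action selection in robot manipulation.

AI summary Proposes HAVE, a history-aware verifier that decouples action generation from verification and uses past interactions to resolve ambiguity online; theoretical analysis shows it improves expected action quality, and experiments in several simulated and real-world environments confirm its effectiveness.

Comments CoRL 2025

Details
AI Chinese abstract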

我们引入了一种新颖的历史感知验证器(HAVE),通过利用过去的交互来在线消除不确定场景中的歧义。机器人经常遇到视觉上模糊的物体,这些物体的操作结果直到物理交互之前都是不确定的。虽然仅凭生成模型理论上可以适应这种模糊性,但在实践中,即使在以动作历史为条件的情况下,它们在模糊情况下也会获得次优性能。为了解决这个问题,我们提出明确地将动作生成与验证解耦:我们使用无条件的基于扩散的生成器来提出多个候选动作,并采用我们的历史感知验证器通过推理过去的交互来选择最有希望的动作。通过理论分析,我们证明了使用验证器显著提高了期望动作质量。在多个模拟和真实环境(包括铰接物体、多模态门和不均匀物体拾取)中的实证评估和分析证实了我们方法的有效性以及对基线的改进。我们的项目网站位于:https://liy1shu.github.io/HAVE_CoRL25/

English abstract

We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until physically interacted with. While generative models alone could theoretically adapt to such ambiguity, in practice they obtain suboptimal performance in ambiguous cases, even when conditioned on action history. To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions. Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments including articulated objects, multi-modal doors, and uneven object pick-up confirm the effectiveness of our method and improvements over baselines. Our project website is available at: https://liy1shu.github.io/HAVE_CoRL25/
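
The generate-then-verify split described in this abstract can be sketched in a few lines of Python. This is a toy under stated assumptions: actions are 1-D numbers, `propose` stands in for the unconditional diffusion generator, and `verifier_score` is a hand-written heuristic (prefer candidates far from actions that already failed), not the learned verifier from the paper.

```python
import random


def propose(n, rng):
    """Generator stand-in: sample n candidate 1-D actions, ignoring history."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def verifier_score(action, history):
    """Toy history-aware score: distance to the nearest previously failed action."""
    failed = [a for a, succeeded in history if not succeeded]
    if not failed:
        return 0.0  # no evidence yet, so every candidate ties
    return min(abs(action - f) for f in failed)


def select_action(history, n=16, seed=0):
    """Propose n candidates, then let the verifier pick the most promising one."""
    rng = random.Random(seed)
    candidates = propose(n, rng)
    # max() returns the first candidate among ties, e.g. when history is empty.
    return max(candidates, key=lambda a: verifier_score(a, history))
```

Only the verifier reads the interaction history here, so it can be retrained or replaced without touching the generator.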

2605.25005 2026-06-19 cs.RO Version update 80%

Stiffness Optimization for Concentrated Bending in Magnetically Actuated Catheters: Maintaining Steerability under Gradient Stiffness

磁驱动导管集中弯曲的刚度优化:在梯度刚度下保持可操控性

Jiewen Tan, Junnan Xue, Shing Shin Cheng, Shuang Song, Erli Lyu, Jiaole Wang

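Affiliations * Harbin Institute of Technology (Shenzhen); The Chinese University of Hong Kong; Macao Polytechnic University

Topic match Robot manipulation: stiffness optimization and steering of magnetically actuated soft catheters.

AI summary To address the trade-off between pushability and proximally concentrated bending in magnetically actuated soft catheters, proposes a stiffness-optimized multi-segment magnetically actuated catheter (SO-MAC) whose decoupled steering-advancement mechanism and gradient-stiffness architecture give stable bending about a proximal pivot during advancement while the distal section passively self-straightens to transmit propulsion.

Details
AI Chinese abstract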

对于磁驱动软导管,实现高效的推送性(推进力传递)和近端集中弯曲以保持可操控性具有挑战性:较高的轴向/弯曲刚度可改善力传递但降低可操控性,而较低的刚度可实现大的近端集中弯曲,但在压缩推送载荷下增加扭结/屈曲风险。为了解决这一权衡,我们提出了一种刚度优化的多段磁驱动导管(SO-MAC),它集成了解耦的转向-推进机构与梯度刚度架构。SO-MAC在推进过程中将弯曲集中在稳定的近端枢轴周围,而远端部分通过优化的刚度分布和弹簧骨架的弹性恢复抵抗摩擦引起的扭结/屈曲,被动自直以传递推进力。在$0{-}180^{\circ}$的组合转向和推进过程中,枢轴保持稳定,远端尖端几乎直线地向目标方向推进。直径为1.5 mm的SO-MAC在其10 mm尖端处实现了高达$180^{\circ}$的转向,弯曲半径为3 mm,平均形状误差为$1.39 \pm 0.56$ mm,转向枢轴误差为$0.35 \pm 0.10$ mm。在支气管模型中的视觉反馈控制进一步验证了通过高度弯曲的分叉路径的鲁棒导航。

English abstract

Achieving both efficient pushability (propulsion transmission) and proximally concentrated bending for steerability is challenging for magnetically actuated soft catheters: higher axial/bending stiffness improves force transmission but reduces steerability, whereas lower stiffness enables large, proximally concentrated bending yet increases kinking/buckling risk under compressive push loads. To address this trade-off, we propose a stiffness-optimized multi-segment magnetically actuated catheter (SO-MAC) that integrates a decoupled steering-advancement mechanism with a gradient-stiffness architecture. The SO-MAC concentrates bending about a stable proximal pivot during advancement while the distal section passively self-straightens to transmit propulsion, aided by the optimized stiffness distribution and elastic recovery of the spring backbone against friction-induced kinking/buckling. Over $0{-}180^{\circ}$ combined steering and advancement, the pivot remained stable and the distal tip advanced near-straight toward the target direction. A 1.5 mm-diameter SO-MAC achieved up to $180^{\circ}$ steering with a 3 mm bending radius at its 10 mm tip, with an average shape error of $1.39 \pm 0.56$ mm and a steering-pivot error of $0.35 \pm 0.10$ mm. Visual feedback control in a bronchial phantom further confirmed robust navigation through highly curved, bifurcating paths.
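
As a quick plausibility check on the reported geometry, assume the bent segment follows a constant-curvature arc (an assumption; the abstract does not state the bending model). A $180^{\circ}$ turn at a 3 mm radius then uses an arc length of:

```latex
s = r\,\theta = 3\,\text{mm} \times \pi\,\text{rad} \approx 9.42\,\text{mm} \le 10\,\text{mm}
```

Under that assumption the full $180^{\circ}$ bend fits within the 10 mm tip, which is consistent with the figures quoted in the abstract.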

2508.21677 2026-06-19 cs.RO Version update 80%

Robust Convex Model Predictive Control with collision avoidance guarantees for robot manipulators

具有碰撞避免保证的机器人操作器鲁棒凸模型预测控制

Bernhard Wullt, Johannes Köhler, Per Mattsson, Mikael Norrlöf, Thomas B. Schön

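Affiliations * ABB Robotics; Department of Mechanical Engineering, Imperial College London; Department of Information Technology, Uppsala University

Topic match Robot manipulation: robust MPC for collision-free motion of industrial robots.

AI summary Proposes a convex MPC scheme combining robust tube MPC with a corridor planning algorithm, achieving fast, collision-free motion of industrial robots under model uncertainty and outperforming benchmark methods.

Details
AI Chinese abstract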

工业操作器通常在杂乱环境中运行,安全运动规划至关重要。然而,模型不确定性使任务更加复杂,导致保守的速度限制以减少干扰影响。因此,需要能够保证快速执行安全运动的控制方法。我们通过为操作器提出一种新颖的模型预测控制(MPC)方案来解决这一问题,其中两个主要组件是鲁棒管MPC和用于获得无碰撞运动的走廊规划算法。我们的方案形成凸MPC公式,可以快速求解,使方法具有实际应用价值。我们在模拟环境中展示了方法的有效性,该环境包含一个6自由度工业机器人在具有不确定模型参数的杂乱环境中运行。通过容忍更高水平的模型不确定性同时实现更快的运动,我们优于基准方法。

English abstract

Industrial manipulators typically operate in cluttered environments, where safe motion planning is critical. However, model uncertainties further complicate this task, which leads to conservative speed limits to reduce the influence of disturbances. Hence, there is a need for control methods that can guarantee safe motions which are executed fast. We address this by suggesting a novel model predictive control (MPC) solution for manipulators, where our two main components are a robust tube MPC and a corridor planning algorithm to obtain collision-free motion. Our solution results in a convex MPC formulation, which we can solve fast, making our method practically useful. We demonstrate the efficacy of our method in a simulated environment with a 6 DOF industrial robot operating in cluttered environments with uncertain model parameters. We outperform benchmark methods by tolerating higher levels of model uncertainty while achieving faster motion.
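
For readers unfamiliar with tube MPC, the core idea is constraint tightening: plan a nominal trajectory inside constraints shrunk by the worst-case tracking error, so the real system stays inside the original constraints. The sketch below shows this for a scalar error model, which is a textbook simplification and not the formulation used in the paper. The names `a_closed_loop` and `w_max` are assumptions for illustration.

```python
def tube_radius(a_closed_loop, w_max):
    """Bound on |e_k| for e_{k+1} = a * e_k + w_k with |w_k| <= w_max, e_0 = 0.

    The geometric series w_max * (1 + |a| + |a|^2 + ...) converges to
    w_max / (1 - |a|) whenever the closed-loop error dynamics are stable.
    """
    if not abs(a_closed_loop) < 1:
        raise ValueError("closed-loop error dynamics must be stable (|a| < 1)")
    if w_max < 0:
        raise ValueError("disturbance bound must be non-negative")
    return w_max / (1 - abs(a_closed_loop))


def tighten(lower, upper, radius):
    """Shrink [lower, upper] so nominal state plus any tube error stays inside."""
    if upper - lower < 2 * radius:
        raise ValueError("tube is wider than the constraint set")
    return lower + radius, upper - radius
```

A larger disturbance bound widens the tube and leaves less room for the nominal plan, which is why large model uncertainty usually forces conservative motion.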

2504.15535 2026-06-19 cs.RO Version update 80%

VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation

VibeCheck: 使用主动声学触觉传感进行接触丰富的操作

Kaidi Zhang, Do-Gon Kim, Eric T. Chang, Hua-Hsuan Liang, Zhanpeng He, Kathryn Lampo, Philippe Wu, Ioannis Kymissis, Matei Ciocarlie

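Affiliations * Dept. of Mechanical Engineering; Dept. of Computer Science; Dept. of Electrical Engineering; Columbia University

Topic match Robot manipulation: active acoustic tactile sensing for contact-rich manipulation tasks.

AI summary Builds an active acoustic sensing gripper with two piezoelectric fingers that sends acoustic vibrations through an object to sense its acoustic properties and contact state; the system is used for object classification, grasp position estimation, internal-structure pose estimation, and extrinsic contact type classification, and the contact classification model enables a robust peg insertion task.

Comments Published at IROS 2025. 8 pages, 7 figures

Details
AI Chinese abstract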

物体的声学响应可以揭示其全局状态,例如材料属性或与外界的外部接触。在这项工作中,我们构建了一个主动声学传感夹爪,配备两个压电手指:一个用于生成信号,另一个用于接收信号。通过将一个手指的声学振动通过物体传递到另一个手指,我们能够洞察物体的声学特性和接触状态。我们使用该系统进行物体分类、估计抓取位置、估计内部结构的姿态,以及分类物体与环境的外部接触类型。利用我们的接触类型分类模型,我们解决了一个标准的长时域操作问题:插销插入。我们基于传感器的性能使用一个简单的模拟转移模型来训练一个模仿学习策略,该策略对分类器的不完美预测具有鲁棒性。最后,我们在UR5机器人上演示了该策略,仅使用主动声学传感作为反馈。视频可在 https://roamlab.github.io/vibecheck 找到。

English abstract

The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. In this work, we build an active acoustic sensing gripper equipped with two piezoelectric fingers: one for generating signals, the other for receiving them. By sending an acoustic vibration from one finger to the other through an object, we gain insight into an object's acoustic properties and contact state. We use this system to classify objects, estimate grasping position, estimate poses of internal structures, and classify the types of extrinsic contacts an object is making with the environment. Using our contact type classification model, we tackle a standard long-horizon manipulation problem: peg insertion. We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. We finally demonstrate the policy on a UR5 robot with active acoustic sensing as the only feedback. Videos can be found at https://roamlab.github.io/vibecheck .

2512.20014 2026-06-19 cs.RO cs.AI Version update 75%

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Bring My Cup! 使用视觉注意力提示个性化视觉-语言-动作模型

Sangoh Lee, Sangwoo Mo, Wook-Shin Han

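Affiliations * GSAI, POSTECH; IME, POSTECH

Topic match Robot manipulation: robot manipulation of personal objects.

AI summary To address VLA models' difficulty with personalized instructions, proposes Visual Attentive Prompting (VAP), a training-free method that treats reference images as a non-parametric memory, localizes the personal object through open-vocabulary detection and embedding matching, and injects the result into the model as a visual prompt, markedly improving success rate and correct-object manipulation across several simulated and real-world settings.

Comments ICML 2026. Project page: https://vap-project.github.io/

Details
AI Chinese abstract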

尽管视觉-语言-动作(VLA)模型能够很好地泛化到通用指令,但在处理个性化命令(如“bring my cup”)时却存在困难,因为机器人必须在视觉相似的物体中识别并操作特定实例。我们研究了这种操作个人物品的场景,其中VLA必须仅使用少量参考图像来识别并控制训练中未见过的用户特定物体。为了解决这一挑战,我们提出了视觉注意力提示(VAP),一种简单而有效的无需训练的感知适配器,为冻结的VLA模型赋予自上而下的选择性注意力。VAP将参考图像视为非参数视觉记忆,通过开放词汇检测和基于嵌入的匹配将个人物品定位到场景中,然后通过突出显示该物体并重写指令,将这种定位作为视觉提示注入模型。我们构建了两个仿真基准(Personalized-SIMPLER和Personalized-VLABench)以及一个真实桌面基准,用于评估多个机器人和任务上的个性化操作。实验表明,VAP在成功率和正确物体操作方面始终优于通用策略和令牌学习基线,有助于弥合语义理解与实例级控制之间的差距。

English abstract

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
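
The embedding-based matching step can be illustrated with plain cosine similarity: compare each detected object's embedding against the user's reference embeddings and keep the best match above a threshold. The function names, the `(box, embedding)` detection format, and the 0.5 threshold are assumptions for this sketch; the abstract does not specify the encoder or the matching rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_personal_object(reference_embs, detections, threshold=0.5):
    """Return the index of the detection best matching any reference, or None.

    detections is a list of (box, embedding) pairs from some object detector.
    """
    best_idx, best_sim = None, threshold
    for i, (_box, emb) in enumerate(detections):
        sim = max(cosine(emb, ref) for ref in reference_embs)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx
```

Returning `None` when nothing clears the threshold lets a caller fall back to the unmodified instruction instead of highlighting the wrong object.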

2606.19451 2026-06-19 cs.LG cs.CV cs.RO New submission 70%

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP:自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

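Affiliations * Carnegie Mellon University

Topic match Robot manipulation: 3D latent particles for downstream robot manipulation.

AI summary Proposes 3D-DLP, a self-supervised model that decomposes scene-level RGB-D or voxel observations into 3D latent particles, each encoding disentangled attributes; it yields interpretable per-particle segmentation maps and supports scene manipulation and downstream robot manipulation.

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

Details
AI Chinese abstract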

我们引入了3D-DLP,一种自监督的物体中心表示学习模型,它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子(DLP)框架,每个粒子编码解耦的属性,包括3D关键点位置、边界框尺寸和外观特征,并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明,学习到的潜在空间是可解释和可控的:通过操纵粒子位置并解码,我们可以生成新颖的场景配置。此外,我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作,相比缺乏显式3D信息或依赖无物体中心结构的密集3D输入的基线方法,性能有所提升。代码和视频可在以下网址获取:https://eubooks3003.github.io/3d-dlp 。

English abstract

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.