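arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Robotics / Embodied AI

Robotics, embodied AI, robot learning, manipulation, navigation, and embodied world models.

Entries for today / the current date: 5. Sources: cs.RO, cs.AI, cs.CV, cs.LG
2606.18888 2026-06-18 cs.AI New submission 90%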

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

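Affiliations: University of Manchester; Aalto University

Topic match: Embodied navigation: navigation in partially observable environments, combining diffusion models with MPC

AI summary: Proposes the BeliefDiffusion framework, which combines diffusion models with model predictive control to explicitly model multimodal belief distributions and plan ahead; in synthetic map environments it significantly outperforms model-free reinforcement learning and other generative methods.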

Details
Abstract

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and uses Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on the observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
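
The two steps named in this abstract (imagine plausible maps, then plan across them) can be illustrated with a toy grid-world sketch. This is not the authors' implementation: `sample_maps` is a hypothetical stand-in that draws random occupancy grids instead of running a diffusion model, and `choose_action` is a one-step receding-horizon planner that scores each first move by its mean breadth-first-search cost-to-go over the samples. All names, the 4-connected grid, and the `unreachable_cost` penalty are assumptions made for illustration.

```python
from collections import deque

import numpy as np

# 4-connected moves on a grid, as (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def sample_maps(rng, known_free, shape, k, p_wall=0.2):
    """Hypothetical belief sampler (stand-in for a generative model).

    Draws k random occupancy grids (True = wall) that agree with the
    cells already observed to be free.
    """
    maps = rng.random((k, *shape)) < p_wall
    for r, c in known_free:
        maps[:, r, c] = False
    return maps


def steps_to_goal(grid, start, goal):
    """Breadth-first search; returns the shortest step count, or None."""
    if grid[start] or grid[goal]:
        return None
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]
            if inside and not grid[nxt] and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def choose_action(maps, pos, goal, unreachable_cost=50.0):
    """Score each in-bounds first move by its mean cost over all sampled maps."""
    scores = {}
    for name, (dr, dc) in MOVES.items():
        nxt = (pos[0] + dr, pos[1] + dc)
        if not (0 <= nxt[0] < maps.shape[1] and 0 <= nxt[1] < maps.shape[2]):
            continue
        costs = []
        for grid in maps:
            dist = steps_to_goal(grid, nxt, goal)
            costs.append(unreachable_cost if dist is None else 1 + dist)
        scores[name] = float(np.mean(costs))
    best = min(scores, key=scores.get)
    return best, scores
```

In a receding-horizon loop, the agent would execute only the returned first move, add any newly observed free cells to `known_free`, resample the maps, and plan again.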

2606.18426 2026-06-18 cs.RO New submission 90%

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

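Affiliations: University of Maryland, College Park

Topic match: Embodied navigation: trains navigation VLA models from egocentric video

AI summary: Proposes VEGA, which reconstructs scene geometry from unlabeled egocentric video to generate obstacle-aware trajectories and trains a flow-matching VLA navigation policy; on VEGA-Bench collisions fall by 33.0%, and real-world success improves by at least 150.0%.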

Details
Abstract

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints), and generating obstacle-aware trajectories using the reconstructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGA-Bench, while improving success by at least 150.0%, reducing collisions by at least 66.7%, and improving obstacle clearance by at least 60.0% in real-world trials. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.
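
For readers unfamiliar with flow matching, the sketch below shows how training targets are built in one common linear-path variant. It is an illustration under stated assumptions, not VEGA's training code: the abstract does not specify the path, the trajectory shape, or the conditioning on images and goals, so the `(batch, horizon, 2)` waypoint layout and both function names are hypothetical, and conditioning is omitted entirely.

```python
import numpy as np


def flow_matching_batch(rng, trajectories):
    """Build (x_t, t, target_velocity) for one training batch.

    trajectories: data samples x1 with assumed shape (batch, horizon, 2),
    i.e. a short sequence of 2D waypoints per example.
    """
    x1 = np.asarray(trajectories, dtype=float)
    x0 = rng.standard_normal(x1.shape)       # noise sample paired with x1
    t = rng.random((x1.shape[0], 1, 1))      # one time in [0, 1) per example
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight noise-to-data path
    target = x1 - x0                         # constant velocity along that path
    return x_t, t, target


def flow_matching_loss(predicted_velocity, target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target) ** 2))
```

A policy network would take `x_t`, `t`, and its conditioning inputs, predict a velocity, and be trained to minimize `flow_matching_loss`; at inference time a trajectory is obtained by integrating the predicted velocity from noise at `t = 0` to `t = 1`.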

2606.18847 2026-06-18 cs.AI New submission 85%

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

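Affiliations: HKUST(GZ); HKUST; Knowin

Topic match: Embodied navigation: long-horizon embodied agent benchmark, household task planning

AI summary: Proposes the WorldLines benchmark, which builds temporally extended household traces (dialogues, actions, state changes, and more) to evaluate the long-term memory and task planning of embodied agents, and designs the ObsMem memory framework to improve state-aware decision-making.

Comments: 27 pages, 18 figures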

Details
Abstract

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, and object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

2606.18634 2026-06-18 cs.RO cs.AI New submission 85%

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

Zecheng Yin, Benedict Jun Ma

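Affiliations: Systems Hub of Intelligence Transportation, HKUST(GZ)

Topic match: Embodied navigation: fuses depth and vision-language for object goal navigation

AI summary: Proposes the EffiNav framework, which fuses depth information with a vision-language model and guides navigation by predicting exploration frontiers and semantic priors; it matches or surpasses baselines on the HM3D and OVON datasets, improving path efficiency and generalization.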

Details
Abstract

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works have aimed to address this core challenge and achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, these results indicate that the framework is more balanced and generalizable for efficient ObjNav.
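
The two metrics named in this abstract can be computed as in the sketch below, which follows the standard definition of SPL: the per-episode success indicator times the ratio of the shortest-path length to the longer of the agent's path and the shortest path, averaged over episodes. The function names and the example numbers are illustrative and are not taken from the paper; shortest-path lengths are assumed to be positive.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(1 for s in successes if s) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over all episodes.

    successes:        per-episode success flags (truthy = success)
    shortest_lengths: shortest-path length from start to goal (assumed > 0)
    path_lengths:     length of the path the agent actually travelled
    """
    total = 0.0
    for ok, shortest, travelled in zip(successes, shortest_lengths, path_lengths):
        if ok:
            # A failed episode contributes 0; a success contributes at most 1.
            total += shortest / max(travelled, shortest)
    return total / len(successes)
```

With three hypothetical episodes, `spl([1, 1, 0], [10, 10, 10], [10, 20, 5])` averages 1.0, 0.5, and 0.0: the second episode succeeds but takes a path twice the optimal length, so it earns only half credit, while SR counts it as a full success.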

2606.19122 2026-06-18 cs.RO New submission 70%

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

Yukai Ma, Joe Lin, Liu Liu, Honglin He, Lulu Ricketts, Brad Squicciarini, Yong Liu, Bolei Zhou

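Affiliations: University of California, Los Angeles; Zhejiang University; Coco Robotics; Massachusetts Institute of Technology

Topic match: Embodied navigation: sidewalk robot navigation, which falls under embodied navigation

AI summary: Proposes the WalkOCC framework for monocular 3D occupancy perception via hybrid ray marching; it combines learning from paired LiDAR-RGB data with large-scale unpaired monocular images to improve prediction accuracy and generalization for sidewalk robot navigation.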

Details
Abstract

Sidewalks in the real world are crowded, cluttered, and less structured than roads, making 3D occupancy prediction a key ingredient for the safe navigation of mobile robots such as delivery bots and electric wheelchairs. Existing occupancy learning pipelines are largely designed for on-road autonomous driving and often train on large-scale paired LiDAR-RGB datasets with dense 3D supervision and multiple camera inputs, which are costly to collect and do not adequately capture sidewalk-specific characteristics. We propose WalkOCC, a hybrid ray-marching monocular 3D occupancy perception framework for robots operating on sidewalks. WalkOCC explicitly couples geometric grounding from LiDAR-RGB paired data with scalable learning from large-scale unpaired monocular images. It bootstraps pseudo occupancy supervision from paired sequences and jointly learns image-level representations on additional 2D-only data. This yields stable optimization and improved generalization without requiring costly 3D occupancy annotations. Extensive experiments demonstrate consistent gains in prediction accuracy, fine-grained segmentation of subtle urban structures such as curbs and gutters, and robustness to environmental and cross-embodiment shifts compared with self-supervised image-based baselines. To facilitate evaluation and benchmarking, we also introduce Sidewalk3D, a large-scale sidewalk perception dataset with LiDAR-camera paired sequences collected across multiple locations and time periods, along with 3D semantic occupancy annotations for evaluation. Code and data will be made available.