Robotics / Embodied Intelligence

2606.18634 2026-06-18 cs.RO cs.AI New submission 85%

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

Zecheng Yin, Benedict Jun Ma

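Affiliations: Systems Hub of Intelligence Transportation, HKUST(GZ)

Topic match: Embodied navigation. Fuses depth and vision-language for object goal navigation.

AI summary: Proposes EffiNav, a framework that fuses depth information with a vision-language model and guides navigation using predicted exploration frontiers and semantic priors. On the HM3D and OVON datasets it matches or outperforms baselines, improving path efficiency and generalization.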

Abstract

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works have aimed to address this core challenge and have achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, these results indicate that the framework is more balanced and generalizable for efficient ObjNav.
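
The abstract names SR and SPL without defining them. As a minimal sketch, assuming the commonly used definition of SPL (each successful episode is weighted by the ratio of the shortest-path length to the length the agent actually travelled), the two metrics can be computed as follows. The `Episode` class and the sample values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool         # did the agent stop close enough to the goal?
    shortest_path: float  # shortest-path length from start to goal (m)
    agent_path: float     # length of the path the agent actually travelled (m)


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # A failed episode contributes 0; a detour shrinks the ratio below 1.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Hypothetical episodes, for illustration only.
episodes = [
    Episode(True, 5.0, 5.0),   # optimal path: contributes 1.0
    Episode(True, 4.0, 8.0),   # path twice as long as needed: contributes 0.5
    Episode(False, 6.0, 3.0),  # failure: contributes 0.0
]
print(success_rate(episodes), spl(episodes))
```

Under this definition an agent can have a high SR and a low SPL if it reaches the goal only after long detours, which is why the abstract treats the two metrics separately.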

2606.19122 2026-06-18 cs.RO New submission 70%

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

Yukai Ma, Joe Lin, Liu Liu, Honglin He, Lulu Ricketts, Brad Squicciarini, Yong Liu, Bolei Zhou

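Affiliations: University of California, Los Angeles; Zhejiang University; Coco Robotics; Massachusetts Institute of Technology

Topic match: Embodied navigation. Sidewalk robot navigation falls under embodied navigation.

AI summary: Proposes WalkOCC, a hybrid ray-marching monocular 3D occupancy perception framework that combines LiDAR-RGB paired data with learning from large-scale unpaired monocular images, improving prediction accuracy and generalization for sidewalk robot navigation.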

Abstract

Sidewalks in the real world are crowded, cluttered, and less structured than roads, making 3D occupancy prediction a key ingredient for the safe navigation of mobile robots such as delivery bots and electric wheelchairs. Existing occupancy learning pipelines are largely designed for on-road autonomous driving and often train on large-scale paired LiDAR-RGB datasets with dense 3D supervision and multiple camera inputs, which are costly to collect and do not adequately capture sidewalk-specific characteristics. We propose WalkOCC, a hybrid Ray-marching monocular 3D occupancy perception framework for robots operating on sidewalks. WalkOCC explicitly couples geometric grounding from LiDAR-RGB paired data with scalable learning from large-scale unpaired monocular images. It bootstraps pseudo occupancy supervision from paired sequences and jointly learns image-level representations on additional 2D-only data. It yields stable optimization and improved generalization without requiring costly 3D occupancy annotations. Extensive experiments demonstrate consistent gains in prediction accuracy, fine-grained segmentation of subtle urban structures such as curbs and gutters, and robustness to environmental and cross-embodiment shifts compared with self-supervised image-based baselines. To facilitate evaluation and benchmarking, we also introduce Sidewalk3D, a large-scale sidewalk perception dataset with LiDAR-camera paired sequences collected across multiple locations and time periods, along with 3D semantic occupancy annotations for evaluation. Code and data will be made available.

2606.18632 2026-06-18 cs.RO New submission 85%

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

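Affiliations: Institute of AI for Industries, Chinese Academy of Sciences; University of Science and Technology of China

Topic match: Robot foundation models. A safety dataset for embodied foundation models, aimed at preventing human injury.

AI summary: To address the difficulty of safely collecting data of robots harming humans, proposes a safety data construction pipeline grounded in real observations and builds ROBOSHACKLES, a dataset of 10,000 video clips covering direct-harm and indirect-harm categories. In the evaluation, all six tested models produced unsafe actions in the safety-critical scenarios (a 100% unsafe action generation rate).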

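AI-generated abstract (translated from Chinese)

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. However, their safety alignment for human-injury prevention remains underexplored, mainly because real-world data of robots harming humans or creating hazardous household situations cannot be collected safely or ethically. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention. Starting from real DROID observations, the pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robot rollout videos from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a dataset of 10,000 robot video clips derived from real DROID observations, covering two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. The results show that all evaluated models produced unsafe actions in the tested safety-critical scenarios, for an unsafe action generation rate of 100%. ROBOSHACKLES can serve as a scalable benchmark and training resource for refusal learning and for hazard anticipation before robot action execution. The dataset is publicly available at https://roboshackles.github.io.

Abstract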

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.

2606.18610 2026-06-18 cs.RO cs.CV New submission 85%

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

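Affiliations: University of Toronto; Vector Institute; NVIDIA; Physical Intelligence; Stanford University; UC Berkeley; Allen Institute for AI

Topic match: Robot foundation models. Evaluates robot foundation models via self-consistent video generation.

AI summary: Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to adapt a pre-trained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across seven real-world policies.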

Abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
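
The abstract reports a Pearson correlation and an MMRV score between simulated and real policy performance but does not define either. The sketch below assumes the Mean Maximum Rank Violation definition used in earlier simulation-based evaluation work: for each policy, take the largest real-world performance gap among the policies whose relative order the evaluator gets wrong, then average over policies. The function names and success rates are hypothetical, and the paper's exact protocol may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mmrv(real, sim):
    """Mean Maximum Rank Violation (assumed definition, lower is better)."""
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            # A violation occurs when the evaluator orders a pair differently
            # from the real world; its cost is the real performance gap.
            flipped = (sim[i] < sim[j]) != (real[i] < real[j])
            if flipped:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Hypothetical success rates of three policies.
real = [0.2, 0.5, 0.8]
sim_good = [0.3, 0.4, 0.9]  # same ranking as the real world
sim_bad = [0.5, 0.3, 0.9]   # first two policies swapped
print(pearson(real, sim_good), mmrv(real, sim_good), mmrv(real, sim_bad))
```

Correlation rewards evaluators whose scores move with real performance, while MMRV penalizes ranking mistakes in proportion to how much they would matter in practice, so the two numbers are complementary.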

2606.17030 2026-06-18 cs.CV New submission 75%

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

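Affiliations: Qwen Team

Topic match: Robot foundation models. An embodied world model for tasks such as robotic manipulation.

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale Embodied World Knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, ranking first overall on EWMBench and DreamGen Bench.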

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.18625 2026-06-18 cs.RO New submission 85%

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

Xiaowen Hu, Linqi Ye, Yudi Zhu, Chenyue Shao, Rankun Li, Qingdu Li, Yan Peng

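Affiliations: Institute of Artificial Intelligence, Shanghai University; Institute of Machine Intelligence, University of Shanghai for Science and Technology

Topic match: Robot learning. Combines the SLIP model with reinforcement learning for agile jumping.

AI summary: Proposes the SRL framework, which combines the physically grounded baseline of the SLIP model with the adaptability of reinforcement learning, optimizing robot jumping through feedforward control signals plus real-time feedback. It substantially reduces training time while maintaining accurate tracking.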

Comments 17 pages, 12 figures

Abstract

Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within +/-3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.
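
The abstract describes SRL as SLIP-based feedforward control combined with RL-driven feedback, without giving the control law. One generic way to read this is a residual scheme: a spring-like leg force serves as the feedforward term, and a bounded learned correction is added on top. The sketch below illustrates only that idea; the function names, stiffness, rest length, and residual limit are hypothetical values and are not taken from the paper.

```python
def slip_feedforward(leg_length, rest_length=0.5, stiffness=2000.0):
    """Spring-like leg force (N) along the leg axis; pushes only when compressed."""
    compression = rest_length - leg_length
    return stiffness * max(compression, 0.0)


def srl_command(leg_length, rl_residual, residual_limit=100.0):
    """Feedforward SLIP force plus a clipped correction from a learned policy."""
    residual = max(-residual_limit, min(residual_limit, rl_residual))
    return slip_feedforward(leg_length) + residual


# Leg compressed by 5 cm; the policy asks for a large correction, which is clipped.
print(srl_command(leg_length=0.45, rl_residual=250.0))
# Leg extended past its rest length; only the learned residual acts.
print(srl_command(leg_length=0.60, rl_residual=-30.0))
```

Bounding the residual keeps the learned part from overriding the physical prior, which is one plausible reason such a hybrid needs less exploration than a policy trained from scratch.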

2606.18589 2026-06-18 cs.RO New submission 85%

DREAM-Chunk: Reactive Action Chunking with Latent World Model

Wenxi Chen, Kaidi Zhang, Chi Lin, Zhiyuan Zhang, Yu She, Yuejiang Liu, Raymond A. Yeh, Shaoshuai Mou, Yan Gu

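Affiliations: Purdue University; Stanford University

Topic match: Robot learning. DREAM-Chunk makes action-chunking policies more robust.

AI summary: Proposes DREAM-Chunk, which uses a lightweight latent world model at test time to sample multiple candidate action chunks and select the best one to execute, improving the robustness of action-chunking policies under stochastic dynamics.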

Abstract

Action chunking has become a common interface for vision-language-action (VLA) models, enabling low-frequency policy inference to drive high-frequency robot execution. However, once an action chunk is committed, its open-loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM-Chunk, a test-time scaling method that augments chunking-based policies with a lightweight latent world model, without requiring additional policy fine-tuning. At test time, DREAM-Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM-Chunk uses additional test-time computation to cover multiple plausible stochastic futures and improve reactivity during long-horizon chunk execution. On the Kinetix benchmark, DREAM-Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM-Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM-Chunk improves the robustness of action-chunking policies in stochastic dynamics.
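
The selection step described in the abstract (sample candidate chunks, roll out their predicted latent futures, act from the chunk whose prediction best matches what is observed) can be sketched as follows. This is one reading of the abstract rather than the authors' implementation: the latent dynamics `toy_step`, the Euclidean distance, and all names are assumptions made for illustration.

```python
import math


def rollout_latents(z0, chunk, step_fn):
    """Predict the latent state after each action of a chunk."""
    z, preds = z0, []
    for action in chunk:
        z = step_fn(z, action)
        preds.append(z)
    return preds


def select_action(candidates, z0, z_obs, steps_done, step_fn):
    """Pick the chunk whose predicted latent after `steps_done` actions is
    closest to the observed latent; return (chunk index, next action)."""
    best_idx, best_err = None, math.inf
    for idx, chunk in enumerate(candidates):
        predicted = rollout_latents(z0, chunk, step_fn)[steps_done - 1]
        err = math.dist(predicted, z_obs)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, candidates[best_idx][steps_done]


# Toy latent dynamics (hypothetical): a 1-D position driven by a velocity action.
def toy_step(z, action):
    return (z[0] + action[0],)


candidates = [
    [(1.0,), (1.0,)],    # chunk whose predicted future moves right
    [(-1.0,), (-1.0,)],  # chunk whose predicted future moves left
]
# After one executed step the encoded observation sits near -0.8.
best, action = select_action(candidates, (0.0,), (-0.8,), 1, toy_step)
print(best, action)
```

Because the policy itself is only sampled, not retrained, the extra cost is the world-model rollouts, which matches the abstract's framing of the method as test-time scaling.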

2606.19161 2026-06-18 cs.RO New submission 80%

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

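Affiliations: Beihang University; Rimbot; BUPT; ShanghaiTech University; Tsinghua University; CAS

Topic match: Robot learning. A tactile representation benchmark for learning dexterous robot manipulation.

AI summary: Proposes the HT-Bench multi-task benchmark and the HandTouch encoder, built on large-scale egocentric vision and full-hand tactile data, and validates the tactile representations on tasks including tactile similarity retrieval, masked inpainting, and vision-to-tactile synthesis.

Comments 9 pages, 4 figures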

Abstract

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish such a benchmark, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose HandTouch, a vector-quantized vision-tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65% to 85.23%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.
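
The retrieval result above is reported as Recall@5. As a rough sketch, assuming the usual definition (a query counts as a hit if any of its K nearest gallery items shares its label) and cosine similarity between embeddings, the metric can be computed as below. The embeddings and labels are hypothetical, and the benchmark's exact retrieval protocol is not described in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recall_at_k(queries, query_labels, gallery, gallery_labels, k=5):
    """Fraction of queries with at least one correct match in their top-k results."""
    hits = 0
    for q, q_label in zip(queries, query_labels):
        ranked = sorted(range(len(gallery)),
                        key=lambda i: cosine(q, gallery[i]), reverse=True)
        if any(gallery_labels[i] == q_label for i in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Hypothetical 2-D tactile embeddings with contact-type labels.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_labels = ["grasp", "pinch", "press"]
queries = [[0.9, 0.1], [0.1, 0.9]]
query_labels = ["grasp", "pinch"]
print(recall_at_k(queries, query_labels, gallery, gallery_labels, k=1))
```

On this reading, the reported change from 74.65% to 85.23% is a gain of 10.58 percentage points in the share of queries that retrieve a correct match.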

2606.19088 2026-06-18 cs.RO New submission 80%

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

Affiliations: Graz University of Technology, Institute of Software Engineering and Artificial Intelligence; University of Applied Sciences Technikum Wien, Department of Industrial Engineering; University of Alicante, Department of Computer Technology; University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research

Topic match: Robot learning. Spatially consistent semantics for language-conditioned robotic tasks.

AI summary: Proposes ReSiReg, which reconstructs features from spatially consistent VLM intermediates to improve dense language-grounded retrieval, improves spatial consistency in OVSS and 3D mapping, and releases a compact 25M-parameter VLM.

Abstract

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io.
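
The reconstruction step in the abstract (each patch becomes a soft mixture of prototype-level language embeddings) can be sketched as follows. The sketch assumes the prototypes have already been obtained by clustering intermediate features, and it assumes cosine similarity with a softmax temperature for the mixture weights; those choices, along with every name and number here, are illustrative and not confirmed by the source.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]


def reconstruct_patch(patch_feat, prototypes, proto_text_emb, temperature=0.1):
    """Rebuild one patch as a soft mixture of prototype language embeddings."""
    p = normalize(patch_feat)
    sims = [sum(a * b for a, b in zip(p, normalize(c))) for c in prototypes]
    weights = softmax([s / temperature for s in sims])
    dim = len(proto_text_emb[0])
    return [sum(w * emb[d] for w, emb in zip(weights, proto_text_emb))
            for d in range(dim)]


# Two hypothetical visual prototypes and their language descriptors.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
proto_text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patch = reconstruct_patch([0.9, 0.1], prototypes, proto_text_emb)
print(patch)
```

Since neighbouring patches tend to fall near the same prototype, their reconstructed embeddings end up close to each other, which is one way to see why this could reduce spatial noise in dense retrieval.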

2606.19067 2026-06-18 cs.RO cs.CV New submission 80%

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

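Affiliations: Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT); Institute for Intelligent Systems, Esslingen University of Applied Sciences; Department of Computer Science, University of Freiburg

Topic match: Robot learning. Evaluation of multimodal SLAM on quadruped robots.

AI summary: Systematically evaluates visual, visual-inertial, and LiDAR-visual-inertial SLAM methods to study sensor configuration under quadruped locomotion. Stereo cameras and global shutters improve localization robustness, while standard inertial integration can degrade primarily vision-based frameworks under harsh legged locomotion.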

Abstract

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

2606.18836 2026-06-18 cs.HC cs.AI New submission 80%

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

Taewoon Kim, Emma van Zoelen, Mark Neerincx

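Affiliations: HumemAI, The Netherlands; Vrije Universiteit Amsterdam, The Netherlands; TNO, The Netherlands

Topic match: Robot learning. Human-robot teamwork, with episodic memory improving rescue performance.

AI summary: Stores historical collaboration patterns as knowledge-graph episodic memories and uses graph representation learning to select a representative memory for initializing the robot. In the MATRX USAR environment, this raises rescue success from 25.7% to 41.3% and cuts average task time by 283 seconds.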

Abstract

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

2606.18786 2026-06-18 cs.AI New submission 80%

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii

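Affiliations: Graduate School of Informatics, Nagoya University; School of Information and Data Sciences, Nagasaki University

Topic match: Robot learning. A multi-agent reinforcement learning environment for robot soccer.

AI summary: Proposes the R2D-RL environment, which connects RCSS2D to a Python MARL interface through shared-memory communication and cycle-level synchronization. It supports full-field and scenario-based training and provides configurable opponents, discrete and hybrid action spaces, EPV-based reward shaping, and parallel execution.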

Comments Code is available at: https://github.com/open-starlab/R2DRL

Abstract

Robot soccer is a challenging testbed for multi-agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long-horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot-soccer platform, but its competition-oriented server-client architecture is difficult to use directly with modern Python-based MARL workflows. We introduce R2D-RL, a reinforcement learning environment that connects RCSS2D and HELIOS-based player clients to a Python MARL interface through shared-memory communication and cycle-level synchronization. R2D-RL supports full-field and scenario-based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)-based reward shaping, and parallel execution. We provide front-goal scenarios and an 11-vs-11 full-field benchmark, together with baseline results.

2606.18516 2026-06-18 cs.RO New submission 80%

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

Matthew D. Osburn, Cameron K. Peterson, John L. Salmon

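Affiliations: Electrical and Computer Engineering; Mechanical Engineering

Topic match: Robot learning. Multi-agent task allocation and motion planning.

AI summary: For multi-agent task planning in dynamic, cluttered environments, combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation, yielding safe, time-efficient trajectories and coordinated task assignment.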

Comments 15 pages single column, 10 figures, AIAA-Scitech 2027 Submission

Abstract

Multi-agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time-efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time-extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision-making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.
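
The abstract's point that allocation depends on when and where a moving task can be reached can be illustrated with a much simpler stand-in than the paper's method. The sketch below is neither CBBA nor GCS: it ignores obstacles, assumes the task moves at constant velocity, and uses the straight-line earliest intercept time as each agent's bid. All names and numbers are hypothetical.

```python
import math


def intercept_time(agent_pos, agent_speed, task_pos, task_vel):
    """Earliest t > 0 with |task_pos + task_vel * t - agent_pos| = agent_speed * t,
    or None if the agent can never catch the task."""
    d = [t - a for t, a in zip(task_pos, agent_pos)]
    a = sum(v * v for v in task_vel) - agent_speed ** 2
    b = 2.0 * sum(di * vi for di, vi in zip(d, task_vel))
    c = sum(di * di for di in d)
    if c == 0.0:
        return 0.0  # already at the task
    if abs(a) < 1e-12:  # equal speeds: the quadratic degenerates to a line
        return -c / b if b < 0 else None
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None
    roots = [(-b - math.sqrt(disc)) / (2.0 * a), (-b + math.sqrt(disc)) / (2.0 * a)]
    positive = [r for r in roots if r > 0]
    return min(positive) if positive else None


# Two agents bid on one rendezvous task that moves in the +x direction.
agents = {"a1": ((0.0, 0.0), 2.0), "a2": ((10.0, 0.0), 2.0)}
task_pos, task_vel = (4.0, 0.0), (1.0, 0.0)
bids = {name: intercept_time(pos, speed, task_pos, task_vel)
        for name, (pos, speed) in agents.items()}
reachable = {name: t for name, t in bids.items() if t is not None}
winner = min(reachable, key=reachable.get)
print(bids, winner)
```

Here the agent that starts farther away wins because the task is moving toward it, which is the kind of timing effect a purely distance-based allocation would miss. In the paper, the time estimates come from trajectories planned in the 3D+time configuration space rather than from straight-line intercepts.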


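2606.18861 2026-06-18 cs.CV cs.AI new submission 75%

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

Xinze Zhang

Affiliations: University of Southern California

Topic match: robot learning (reconstructs simulation-ready digital twins for robotics)

AI summary: Proposes KinemaForge, a pipeline that jointly estimates part shape, joint topology, and joint parameters from RGB-D sequences using differentiable joint inference and energy-consistent verification, reducing joint-axis error and simulation drift.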

Details

Abstract

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
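The relative reductions quoted in the abstract follow directly from the reported joint-axis errors. The following is a minimal arithmetic check that uses only the numbers stated above; the helper name `relative_change_pct` is ours and does not come from the paper.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Joint-axis errors in degrees, as reported in the abstract.
paris_error, ditto_error, kinemaforge_error = 4.52, 5.30, 2.83

vs_paris = relative_change_pct(paris_error, kinemaforge_error)
vs_ditto = relative_change_pct(ditto_error, kinemaforge_error)

print(f"vs PARIS: {vs_paris:.1f}%")  # -37.4%
print(f"vs Ditto: {vs_ditto:.1f}%")  # -46.6%
```

Both values agree with the figures in the abstract.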


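2606.18537 2026-06-18 cs.LG new submission 75%

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

Affiliations: University of Washington; NVIDIA

Topic match: robot learning (learning universal behaviors from heterogeneous agents)

AI summary: Proposes GRID, which extracts a general reward from heterogeneous demonstrators pursuing different goals and trains a generalist agent on it to learn universal environmental competencies, avoiding mode-averaging bias and making fine-tuning to downstream tasks more efficient.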

Details

Abstract

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.


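2606.18519 2026-06-18 cs.RO cs.AI new submission 75%

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Marcos Abel Zuzuárregui, Stefano Carpin

Affiliations: University of California, Merced

Topic match: robot learning (LLM mission planning for precision-agriculture robots)

AI summary: To address natural-language ambiguity, proposes an LLM mission-planning system with feedback loops based on linear temporal logic (LTL), splitting specification generation and verification between two LLMs to make mission planning in precision agriculture more reliable.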

Journal ref Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)

Details

Abstract

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.


2606.19297 2026-06-18 cs.LG cs.RO new submission 70%

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

Affiliations: CogAI Lab; FusionBrain Lab; IAI MSU; Lomonosov MSU; NUST MISIS; Applied AI Institute; HSE University; Generalizable AI Systems; ISP RAS; MIRAI; Domain-specific NLP Group

Topic match: robot learning (evaluates commonsense knowledge of VLA models on robot tasks)

AI summary: Proposes the Act2Answer protocol, which evaluates knowledge retention in VLA models by having them answer through actions; finds that the models do well on simple concepts but show gaps on richer semantic categories, and that VQA co-training is associated with better knowledge retention.

Comments Project page: https://tttonyalpha.github.io/act2answer/

Details

Abstract

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.


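2606.18514 2026-06-18 cs.RO cs.LG new submission 70%

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin

Affiliations: Department of Computer Science and Engineering, University of California, Merced

Topic match: robot learning (neural combinatorial optimization for the stochastic orienteering problem)

AI summary: Proposes the N(CO)$^2$ framework, which uses reinforcement learning to solve the Stochastic Orienteering Problem without hand-crafted heuristics, optimizing path selection under uncertainty with performance competitive with a MILP baseline.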

Journal ref In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2025

Details

Abstract

Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.


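2606.18308 2026-06-18 cs.LG cs.AI new submission 70%

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang

Affiliations: Peking University; Xiamen University; National Taiwan University; WHU; THU / Jimei University

Topic match: robot learning (a provably safe multi-agent reinforcement learning framework)

AI summary: Targets the coupling among hybrid discrete-continuous actions, training-time safety constraints, and physical dynamics; proposes the TRIDENT framework, which combines a Richardson-Romberg gradient correction, a Lyapunov-constrained sequential trust-region update, and a physics-informed residual critic to obtain provable convergence guarantees, sharply reducing training-time violations while improving reward.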

Comments 16 pages, 4 figures

Details

Abstract

Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these three features form a directed cycle of biases that defeats any naive composition of off-the-shelf modules, and formalize this as a three-way coupling lemma. We then introduce TRIDENT, the first MARL framework whose three components are co-designed to cancel each leak: a Richardson-Romberg gradient correction reducing Gumbel-Softmax bias from O(tau) to O(tau^2), a Lyapunov-constrained sequential trust-region update enforcing per-iterate feasibility, and a physics-informed residual critic that decomposes value rather than reward. We prove an O~(1/sqrt(K)) convergence rate to a constrained Nash equilibrium and an O(sqrt(K)) cumulative-violation bound. On multi-UAV mobile-edge computing, autonomous intersection management, and a hybrid SMAC variant, TRIDENT cuts training-time violations by 95.5% over MADDPG and 76.3% over MACPO, while improving reward by 13.5% over the strongest unconstrained baseline.
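The reduction of bias from O(tau) to O(tau^2) is the generic effect of Richardson extrapolation. The sketch below is not TRIDENT's estimator. It assumes a toy quantity whose bias expands as c1*tau + c2*tau^2 (the function `biased_estimate` and its coefficients are invented for illustration) and shows that combining evaluations at tau and tau/2 cancels the first-order term.

```python
def biased_estimate(tau: float) -> float:
    """Toy estimator: true value 1.0 plus a bias of 0.5*tau + 0.3*tau**2.

    Hypothetical stand-in for a temperature-dependent relaxation; the
    coefficients are illustrative and not taken from the paper.
    """
    return 1.0 + 0.5 * tau + 0.3 * tau**2


def richardson(estimate, tau: float) -> float:
    """Combine estimates at tau and tau/2 so the O(tau) bias term cancels."""
    return 2.0 * estimate(tau / 2.0) - estimate(tau)


TRUE_VALUE = 1.0
for tau in (0.4, 0.2, 0.1):
    plain_bias = biased_estimate(tau) - TRUE_VALUE
    corrected_bias = richardson(biased_estimate, tau) - TRUE_VALUE
    # Halving tau roughly halves plain_bias but quarters corrected_bias.
    print(f"tau={tau}: plain {plain_bias:+.4f}, corrected {corrected_bias:+.4f}")
```

For this toy expansion the corrected bias is exactly -0.15*tau^2, so it shrinks quadratically as the temperature is lowered, at the cost of two evaluations per estimate.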


2606.19154 2026-06-18 cs.RO new submission 65%

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

Affiliations: Örebro University; AASS research centre; Robot Navigation and Perception Lab

Topic match: robot learning (data collected on a robot platform for autonomous-navigation perception)

AI summary: Presents what is described as the first multi-sensor forest dataset to include 4D imaging radar, provides MinkowskiUNet baselines for semantic segmentation of radar and lidar point clouds, and evaluates how trunk segmentation quality relates to tree size.

Comments 33 pages, 11 figures

Details

Abstract

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.
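Per-class IoU scores such as those quoted above are computed from paired predicted and ground-truth labels. The following is a minimal sketch, assuming one integer class label per point; the label values and the tiny arrays are invented for illustration, and this is not the dataset's evaluation code.

```python
def class_iou(pred: list[int], target: list[int], cls: int) -> float:
    """Intersection over union for one class over paired per-point labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p == cls and t == cls for p, t in zip(pred, target))
    union = sum(p == cls or t == cls for p, t in zip(pred, target))
    # Convention assumed here: a class absent from both is scored as 1.0.
    return intersection / union if union else 1.0


# Hypothetical labels: 0 = ground, 1 = canopy, 2 = trunk.
target = [0, 0, 0, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 2, 0, 2]

for cls, name in enumerate(["ground", "canopy", "trunk"]):
    print(f"{name}: IoU = {class_iou(pred, target, cls):.2f}")
```

Because the union counts every false positive and false negative, thin structures with few points, such as trunks, lose IoU quickly from a handful of mislabeled points.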


2606.18315 2026-06-18 cs.LG cs.AI new submission 65%

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Tianyu Wang, Ying Wang, Zhihao Liu, Xi Vincent Wang, Lihui Wang

Affiliations: KTH Royal Institute of Technology; Department of Production Engineering, KTH Royal Institute of Technology; Department of Decision and Control Systems, KTH Royal Institute of Technology

Topic match: robot learning (a dynamical decoder for generating robot action sequences)

AI summary: Proposes Ghost Attractor Networks, a theoretically derived dynamical decoder whose basin-attractor structure supports efficient closed-loop sequential generation; on robotic action decoding, a 2.3M-parameter model matches the offline accuracy of a 1.07B-parameter Diffusion Transformer with 32 times lower latency.

Details

Abstract

Sequential output generation with large-scale Transformer and diffusion decoders pays a memory cost that grows with sequence length, plus iterative per-step computation. Replacing them with small feed-forward decoders restores efficiency but produces unstructured latent representations that limit closed-loop control: phase-conditioned action generation and cross-step latent carry-over both require a latent geometry with stable basins. This article proposes Ghost Attractor Networks, a theoretically derived dynamical decoder whose latent evolves under a learned potential with drift and produces a basin-attractor structure by construction. Three desiderata (multi-modality, decoder-level single-pass switching, and constant memory) motivate the potential-drift form, and mode transitions arise as saddle-node bifurcations with ghost-attractor escape. A hierarchical phase-space decomposition separates first-order basin convergence from second-order proprioceptive refinement. Empirically, a Ghost trained end-to-end with a behavioral-cloning and contrastive objective exhibits the predicted gradient-flow contraction in its potential, with the gradient norm decaying by 67 percent across five integration steps on 1430 held-out samples. Ghost is evaluated as a robotic action decoder. A 2.3-million-parameter Ghost matches the offline accuracy of a 1.07-billion-parameter Diffusion Transformer at 462 times fewer parameters and 32 times lower latency, and beats five alternative 2M-parameter decoders (MLP, Neural ODE, CVAE, Transformer, 1-step Diffusion) on offline mean squared error by 5.9 to 29 percent. On the LIBERO-10 closed-loop benchmark, phase conditioning on Ghost's basin-structured latent yields a 13.5 percentage-point success-rate gain over a feed-forward MLP baseline, and persistent-latent ensembling reaches a 95.7 percent final success rate.


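2606.17639 2026-06-18 cs.RO cs.CV new submission 85%

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Hong Yang, Basura Fernando

Affiliations: Centre for Frontier AI Research, Agency for Science, Technology and Research; College of Computing and Data Science, Nanyang Technological University

Topic match: embodied reasoning (a diagnostic benchmark for reasoning in embodied AI)

AI summary: Proposes the ERQA-Plus benchmark, comprising 1,766 question-answer instances grounded in robot-centric images and covering perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning, for diagnosing the reasoning abilities of embodied AI.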

Details

Abstract

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and an SBERT score of 61.4, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.


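2606.18664 2026-06-18 cs.SD cs.AI new submission 80%

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie

Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University

Topic match: other robotics (a hybrid framework for robot sound source localization)

AI summary: Proposes NeuralMUSIC, a hybrid framework in which a neural network estimates the spatial covariance matrix that feeds a classical MUSIC subspace pipeline; frequency attention fusion and self-supervised learning improve the robustness and cross-domain generalization of robot sound source localization.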

Details

Abstract

Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.
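The classical stage of the pipeline described above, eigenvalue decomposition of a covariance matrix followed by a pseudo-spectrum scan, can be sketched in a few lines. This is a sketch under stated assumptions, not the paper's code: in NeuralMUSIC the covariance is predicted by a neural network and the result passes through the FAF module, whereas here the covariance is synthetic, the array is assumed to be a uniform linear array with half-wavelength spacing (the abstract does not state the geometry), and NumPy is required.

```python
import numpy as np


def steering_vector(theta_deg: float, n_mics: int) -> np.ndarray:
    """Far-field steering vector of a uniform linear array, half-wavelength spacing."""
    m = np.arange(n_mics)
    return np.exp(-1j * np.pi * m * np.sin(np.deg2rad(theta_deg)))


def music_spectrum(cov: np.ndarray, n_sources: int, grid_deg: np.ndarray) -> np.ndarray:
    """Classical MUSIC pseudo-spectrum from a spatial covariance matrix."""
    n_mics = cov.shape[0]
    # eigh returns eigenvalues in ascending order, so the first
    # n_mics - n_sources eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(cov)
    noise = eigvecs[:, : n_mics - n_sources]
    spectrum = np.empty(len(grid_deg))
    for i, theta in enumerate(grid_deg):
        a = steering_vector(theta, n_mics)
        proj = noise.conj().T @ a  # projection onto the noise subspace
        # Small constant avoids division by zero at an exact match.
        spectrum[i] = 1.0 / (np.real(np.vdot(proj, proj)) + 1e-12)
    return spectrum


# Synthetic covariance: one source at 20 degrees plus white noise.
n_mics, true_doa = 6, 20.0
a0 = steering_vector(true_doa, n_mics)
cov = np.outer(a0, a0.conj()) + 0.1 * np.eye(n_mics)

grid = np.linspace(-90.0, 90.0, 181)
estimated_doa = grid[np.argmax(music_spectrum(cov, n_sources=1, grid_deg=grid))]
print(f"estimated DOA: {estimated_doa:.1f} degrees")
```

With an exact covariance the peak is sharp. The low-SNR degradation mentioned in the abstract arises when the covariance is estimated from few noisy snapshots, which is the step the neural estimator is meant to improve.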


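2606.18688 2026-06-18 cs.LG cs.AI new submission 70%

Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

Akshay Hazare

Affiliations: Independent Researcher

Topic match: other robotics (world-model representation learning with dual-channel grounding)

AI summary: Proposes Dual-Channel Grounded World Modeling (DCGWM), which uses a partitioned latent space and inward-only gradient flow to structurally prevent the objective interference collapse that arises when a joint embedding predictive architecture is grounded against multiple objectives.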

Comments Position paper. Experimental validation in progress

Details

Abstract

Joint Embedding Predictive Architectures (JEPAs) are a leading approach to world model representation learning. We identify a failure mode in JEPA-based world models grounded against two qualitatively distinct external signals: physical dynamics (sparse, high-magnitude, constraint-satisfying gradient corrections) and social-behavioral dynamics (diffuse, distribution-matching corrections). We term this Objective Interference Collapse (OIC): we argue that joint learning in a shared latent space causes the dominant channel to systematically collapse the subordinate channel's representational subspace, in a manner not resolvable by loss weighting alone. We propose Dual-Channel Grounded World Modeling (DCGWM), designed to structurally prevent OIC through a partitioned latent space (physical subspace Z_p, behavioral subspace Z_b) with inward-only gradient flow. A Physical Grounding Channel updates only Z_p via VICReg-style alignment to physical measurements; a Social-Behavioral Grounding Channel updates only Z_b via alignment to trajectories from an emergent multi-agent simulation. An Inter-Channel Interface Module couples the subspaces at the task level without cross-subspace gradients. An Asymmetric Grounding Adherence Loss penalizes rollout drift with a hard hinge for physical violations and a soft KL for behavioral divergence. A Generative Rendering Layer is architecturally isolated from the latent world model. We present three theoretical results: the partition removes the gradient-interference pathway implicated in OIC; each grounded subspace inherits anti-collapse guarantees from its alignment objective; and generative isolation is necessary under a stated assumption on the generative objective's geometry. This manuscript establishes the problem formulation and architecture; experimental validation is ongoing and will be reported in a future revision.


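2606.18532 2026-06-18 cs.CR cs.AI cs.RO cs.SE new submission 60%

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

Inderjeet Singh, Haitham Mahmoud, Andrés Murillo

Affiliations: Fujitsu Research of Europe

Topic match: other robotics (covers physical AI and embodied autonomous systems)

AI summary: Proposes a threat model, taxonomy, and measurement framework for AI sandboxes; formalizes the sandbox boundary and a weakest-link rule, defines a cyber-physical threat model, and instantiates the framework on three case studies.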

Comments 50 pages, 8 figures, 10 tables

Details

Abstract

AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.


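2606.18628 2026-06-18 cs.RO new submission 80%

Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

Peibo Sun, Shiyuan Dong, Shucheng Ye, Jianrong Cai, Yushan Liu, Hongen Liao, Tianqi Huang, Fang Chen

Affiliations: Shanghai Jiao Tong University; Shenzhen International Graduate School, Tsinghua University

Topic match: robot manipulation (fault-tolerant FBG force sensing for minimally invasive surgical robots)

AI summary: Addresses degraded force estimation in FBG sensors for minimally invasive surgical robots caused by channel coupling and fiber fractures; proposes a unified self-supervised mask-aware Transformer, with masked-channel reconstruction pretraining and fine-tuning under a dynamic corruption curriculum, that degrades gracefully under multi-channel failures and reaches an RMSE of 0.0066 N on an 8-channel dataset.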

Details

Abstract

In minimally invasive surgical robotics, catheter-scale Fiber Bragg Grating (FBG) sensors are promising due to their ability to estimate multi-dimensional forces by multiplexing several optical channels. However, deploying these compact multi-channel sensors introduces two critical engineering challenges: inherent nonlinear cross-axis coupling during complex deformations, and intermittent channel dropouts caused by fiber fractures in constrained workspaces. These compounding issues severely degrade force estimation. Existing fault-tolerant approaches rely on combinatorial model banks, which scale exponentially with the channel count and demand prohibitively expensive per-pattern calibration. In this paper, we propose a unified, self-supervised mask-aware Transformer that explicitly models channel availability to enable graceful degradation under diverse and dynamic sensor failures. The encoder is pretrained via masked-channel reconstruction on unlabeled data streams and fine-tuned for force regression using a balanced clean-and-corrupted-view objective alongside a dynamic corruption curriculum. Furthermore, a parallel uncertainty head, trained via heteroscedastic Gaussian negative log-likelihood, predicts per-axis confidence in a single forward pass, circumventing the overhead of multi-pass ensembles. Evaluated on a catheter-scale 8-channel FBG dataset, our single unified model achieves a nominal Root Mean Square Error (RMSE) of 0.0066 N and degrades gracefully to 0.0126 N under severe 4-channel failures. This significantly outperforms a comprehensive model bank of 255 per-pattern neural networks (0.0154 N at 4-channel loss) while eliminating pattern-specific calibration.
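Two quantities in the abstract can be made concrete with a short sketch. First, 255 equals the number of non-empty subsets of 8 channels (2^8 - 1), which suggests one model per availability pattern; that reading is an inference from the abstract, not a stated detail. Second, a heteroscedastic Gaussian negative log-likelihood scores a prediction by both its error and its predicted variance. The function name, the log-variance parameterization, and the sample numbers below are assumptions made for illustration.

```python
import math
from itertools import product


def gaussian_nll(y: float, mean: float, log_var: float) -> float:
    """Negative log-likelihood of y under N(mean, exp(log_var)).

    Predicting log-variance keeps the variance positive; a large predicted
    variance down-weights the squared error but is penalised by the log term.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / var)


# Availability masks for 8 channels: 1 = channel alive, 0 = dropped.
n_channels = 8
masks = [m for m in product((0, 1), repeat=n_channels) if any(m)]
print(len(masks))  # 255 patterns with at least one live channel

# Same 0.02 N error, scored under a confident and an uncertain prediction.
confident = gaussian_nll(y=0.10, mean=0.12, log_var=math.log(0.01**2))
uncertain = gaussian_nll(y=0.10, mean=0.12, log_var=math.log(0.05**2))
print(f"confident: {confident:.3f}, uncertain: {uncertain:.3f}")
```

In this example the error is two standard deviations under the confident prediction, so the overconfident prediction receives the higher (worse) loss. That pressure is what lets a single forward pass report a usable per-axis confidence.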


2606.18558 2026-06-18 cs.CV New submission 70%

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

MolmoMotion: 基于语言指令的3D点轨迹预测

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

Topic match: Robot manipulation (effectiveness validated on robot manipulation)

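AI summary: Proposes a language-instructed method for forecasting 3D point motion. By building a large-scale dataset and benchmark, it achieves class-agnostic, view-stable trajectory prediction and validates the approach on robot manipulation and video generation.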

Details

AI Chinese abstract

运动预测是视觉智能的核心：智能体必须预测物体如何运动，以规划行动、推理物理交互并合成逼真的未来场景。我们认为，世界坐标系中的3D点提供了一种通用表示，具有类无关、视角稳定、紧凑且对下游任务直接有用的特性。我们形式化了目标条件3D点运动预测任务：给定一段短视觉历史、目标物体上的一组3D查询点以及预期目标的语言描述，模型预测每个点的未来3D轨迹。我们引入了一个完整的堆栈来大规模研究此任务：(1) MolmoMotion-1M是一个大型语料库，包含从116万无约束视频中标注的动作描述、物体锚定的3D点轨迹；(2) PointMotionBench是一个人工验证的基准，涵盖111个物体类别和61种运动类型；(3) MolmoMotion是一个通用运动预测模型，支持自回归坐标预测和基于流匹配的轨迹生成。MolmoMotion能准确预测不同语言指令下的多样运动模式，并在PointMotionBench上显著优于现有运动预测基线。最后，我们展示了学习到的3D运动先验能很好地迁移到下游应用：它提高了机器人操作的训练效率和泛化能力，其预测轨迹为生成模型提供了有效的运动指导，以合成具有更真实物体运动的视频。

English abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.


1. Embodied navigation, 2 papers

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

2. Robot foundation models, 3 papers

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

3. Robot learning, 16 papers

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

DREAM-Chunk: Reactive Action Chunking with Latent World Model

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

4. Embodied reasoning, 1 paper

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

5. Other robotics, 3 papers

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

6. Robot manipulation, 2 papers

Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction