Multimodal Information Fusion

2606.18772 2026-06-18 cs.RO New submission 70%

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

Zehui Zhao, Yuxuan Zhao, Gaojing Zhang, Chenxi Liu, Maolin Zheng, Wenzhao Lian

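Affiliations: Shanghai Jiao Tong University; University of Sussex; East China University of Science and Technology

Topic match: Multi-sensor fusion. Humanoid loco-manipulation that combines active perception with multi-sensor data.

AI summary: Proposes the HALOMI framework, which extends the Universal Manipulation Interface (UMI) to support active perception and uses a manifold-constrained controller with observation-action alignment; on a Unitree G1 humanoid it reaches an 85% average success rate across the three quantitatively evaluated tasks, out of five real-world tasks.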

Abstract

Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under out-of-distribution (OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.

2606.18439 2026-06-18 cs.CV cs.RO New submission 70%

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

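Affiliations: University of Pennsylvania; University of California, Irvine; Nanyang Technological University

Topic match: Multi-sensor fusion. VGGT recovers 3D scenes from multi-view images, which involves multi-view fusion.

AI summary: Proposes RegimeVGGT, which removes redundancy through layer-wise U-shaped compression (Saliency-Guided Banded Merging and Selectively Protected K/V Downsampling), achieving a 6.7x speedup over VGGT* at matched reconstruction quality.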

Comments: 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author.

Abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.08206 2026-06-18 cs.CV cs.LG New submission 70%

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

Maciej Wielgosz, Stefano Puliti, Rasmus Astrup

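Affiliations: Norwegian Institute of Bioeconomy Research (NIBIO)

Topic match: Multi-sensor fusion. Tree instance segmentation across sensors and platforms, combining data from different LiDAR systems.

AI summary: Proposes SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds that combines a Point Transformer v3 backbone, a lightweight semantic head, and a tree-focused cross-attention mask decoder. It reaches 90.5% precision and 80.2% recall on the FOR-instanceV2 test split and shows strong cross-domain generalization.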

Comments: 25 pages, 6 figures, 10 tables. Corrected bibliography metadata and minor typographical issues; results unchanged.

Abstract

We present SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone with a lightweight semantic head and a tree-focused cross-attention mask decoder. Semantic predictions restrict instance decoding to tree-class voxels, while instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR-instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR-instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites further demonstrates strong cross-domain generalization.
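The reported F1 can be cross-checked against the reported precision and recall, since F1 is the harmonic mean of the two. The following is a minimal Python sketch of that arithmetic; the function name `f1_score` is illustrative and not taken from the paper, and the inputs are simply the 90.5% precision and 80.2% recall quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract: 90.5% precision, 80.2% recall.
f1 = f1_score(0.905, 0.802)
print(f"F1 = {f1:.1%}")  # prints "F1 = 85.0%"
```

The result agrees with the 85.0% F1 stated in the abstract, so the three detection metrics are mutually consistent.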

2606.19122 2026-06-18 cs.RO New submission 65%

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

Yukai Ma, Joe Lin, Liu Liu, Honglin He, Lulu Ricketts, Brad Squicciarini, Yong Liu, Bolei Zhou

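Affiliations: University of California, Los Angeles; Zhejiang University; Coco Robotics; Massachusetts Institute of Technology

Topic match: Multi-sensor fusion. Combines paired LiDAR-RGB data with learning from monocular images.

AI summary: Proposes WalkOCC, a hybrid ray-marching framework for monocular 3D occupancy perception that couples paired LiDAR-RGB data with large-scale unpaired monocular images, improving prediction accuracy and generalization for sidewalk robot navigation.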

Abstract

Sidewalks in the real world are crowded, cluttered, and less structured than roads, making 3D occupancy prediction a key ingredient for the safe navigation of mobile robots such as delivery bots and electric wheelchairs. Existing occupancy learning pipelines are largely designed for on-road autonomous driving and often train on large-scale paired LiDAR-RGB datasets with dense 3D supervision and multiple camera inputs, which are costly to collect and do not adequately capture sidewalk-specific characteristics. We propose WalkOCC, a hybrid Ray-marching monocular 3D occupancy perception framework for robots operating on sidewalks. WalkOCC explicitly couples geometric grounding from LiDAR-RGB paired data with scalable learning from large-scale unpaired monocular images. It bootstraps pseudo occupancy supervision from paired sequences and jointly learns image-level representations on additional 2D-only data. It yields stable optimization and improved generalization without requiring costly 3D occupancy annotations. Extensive experiments demonstrate consistent gains in prediction accuracy, fine-grained segmentation of subtle urban structures such as curbs and gutters, and robustness to environmental and cross-embodiment shifts compared with self-supervised image-based baselines. To facilitate evaluation and benchmarking, we also introduce Sidewalk3D, a large-scale sidewalk perception dataset with LiDAR-camera paired sequences collected across multiple locations and time periods, along with 3D semantic occupancy annotations for evaluation. Code and data will be made available.

2606.18824 2026-06-18 cs.CV cs.LG New submission 65%

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Yuxuan Xie, Nicolas Pugeault, Chongfeng Wei, Hubert P. H. Shum, Edmond S. L. Ho

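Affiliations: School of Computing Science, University of Glasgow; James Watt School of Engineering, University of Glasgow; Department of Computer Science, Durham University

Topic match: Multi-sensor fusion. Predicts pedestrian trajectories from ego-centric video, combining visual and motion cues.

AI summary: Proposes MMPM, which uses a behavior-aware Pedestrian Interaction Module and a CVAE-based Mode-aware Trajectory Predictor to model the crossing and non-crossing modes separately, improving multimodal trajectory prediction from an ego-centric view.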

Comments: Accepted at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026.

Abstract

Pedestrian trajectory prediction from an ego-centric camera is challenging since it depends on complex interactions with vehicles and scene context, as well as the intention of the pedestrian. By modelling correlation and intent from the historical and future trajectories of the pedestrian, it will usually result in a multimodal (i.e. multiple modes) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal 'mixed-mode' trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions into semantically meaningful modes based on the pedestrian's crossing behavior. MMPM consists of two modules: behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head and hand gesture, and a CVAE-based Mode-aware Trajectory Predictor (MTP) module to model the future trajectory distributions on two modes, crossing and non-crossing the road, separately. A query-based decoder further enforces mode consistency during decoding. Experiments on PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic, which can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.

2606.18732 2026-06-18 cs.LG cs.CV New submission 60%

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

Guillermo Rojas, Gonzalo Soto, Daniel Yunge

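Affiliations: School of Electrical Engineering, Pontificia Universidad Católica de Valparaíso, Chile

Topic match: Multi-sensor fusion. Fall detection that combines event-camera data with a CNN, though this is not a typical case of multimodal fusion.

AI summary: Proposes a hybrid SNN-CNN model and synthesizes event-camera data from smartphone videos, achieving efficient and accurate fall detection.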

Comments: 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published.

2606.19194 2026-06-18 cs.RO New submission 60%

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

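Topic match: Audio-visual / vision-language fusion. Generates actions conditioned on multimodal observations, though this is not a typical case of fusion.

AI summary: Proposes an invertible neural network adapter that generates high-dimensional actions through a one-step denoising process, reducing inference complexity while maintaining accuracy and improving efficiency in simulation and real-world experiments.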

Abstract

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
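To put the reported latency figures in perspective, the drop from 110 ms to 61 ms can be restated as a speedup factor and a relative reduction. The Python sketch below does only that arithmetic; the helper name `latency_gain` is hypothetical, and the two inputs are the average inference latencies quoted in the abstract.

```python
def latency_gain(before_ms: float, after_ms: float) -> tuple[float, float]:
    """Return (speedup factor, fractional latency reduction) for two latencies."""
    speedup = before_ms / after_ms
    reduction = (before_ms - after_ms) / before_ms
    return speedup, reduction


# Average inference latencies quoted in the abstract: 110 ms before, 61 ms after.
speedup, reduction = latency_gain(110.0, 61.0)
print(f"speedup = {speedup:.2f}x, latency reduction = {reduction:.1%}")
# prints "speedup = 1.80x, latency reduction = 44.5%"
```

In other words, the reported numbers correspond to roughly a 1.8x faster inference step, which is a useful yardstick when comparing against iterative flow-matching policies.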

1. Multi-sensor fusion (6 papers)

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

2. Audio-visual / vision-language fusion (1 paper)

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation