Multimodal Information Fusion - arXivDaily Topic

2603.00654 2026-06-19 cs.CV version update 95%

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Songkai Wang, Huiliang Shen

Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; School of Automotive Studies, Tongji University; Thrust of Artificial Intelligence, Hong Kong University of Science and Technology

Topic match (multi-sensor fusion): proposes a 4D radar-camera collaborative perception framework that fuses multi-sensor information.

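AI summary: Proposes RC-GeoCP, the first 4D radar-camera collaborative perception framework, which uses a radar-anchored geometric consensus to resolve misalignment caused by depth ambiguity and spatial dispersion, enabling efficient communication and a globally consistent representation.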

Comments: 11 pages, 6 figures, 9 tables

Abstract

Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite the dense visual semantics of cameras and the robust spatial measurements of 4D radar, the synergy between the two remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.
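The general idea of spending a limited communication budget on features where agents disagree can be illustrated with a minimal Python sketch. This is not the authors' UAC module: the per-cell object probabilities, the scoring rule (ego entropy weighted by the absolute ego-collaborator disagreement), and the names `binary_entropy` and `select_cells_to_send` are all hypothetical, chosen only to show how uncertainty and disagreement might jointly rank what to transmit.

```python
import math
from typing import List, Sequence


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli(p) belief that a BEV cell contains an object."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_cells_to_send(
    p_ego: Sequence[float], p_col: Sequence[float], budget: int
) -> List[int]:
    """Pick at most `budget` cells for a collaborator to transmit.

    Hypothetical scoring rule: a cell is worth sending when the ego agent is
    uncertain about it (high entropy) AND the collaborator disagrees with the
    ego belief. Cells with zero score are never sent.
    """
    scored = []
    for idx, (pe, pc) in enumerate(zip(p_ego, p_col)):
        score = binary_entropy(pe) * abs(pe - pc)
        if score > 0.0:
            scored.append((score, idx))
    # Highest score first; equal scores are ordered by larger index (tuple ordering).
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:budget]]


if __name__ == "__main__":
    ego = [0.5, 0.9, 0.5, 0.1]
    collaborator = [0.9, 0.9, 0.6, 0.8]
    print(select_cells_to_send(ego, collaborator, budget=2))  # [0, 3]
```

In the example, cell 1 is never sent because both agents agree, and cell 2 loses to cell 3 because the disagreement there is small even though the ego agent is maximally uncertain.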

2604.13240 2026-06-19 cs.CV cs.LG version update 85%

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

Affiliations: Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554; LTSER Zone Atelier Armorique; University of Würzburg, Center for Artificial Intelligence and Data Science

Topic match (multi-sensor fusion): fuses multispectral and LiDAR drone imagery, which counts as multi-sensor fusion.

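AI summary: Proposes the first concept-based explainable AI approach for species distribution models. A landscape concept dataset is built from high-resolution multispectral and LiDAR drone imagery, Robust TCAV quantifies how landscape concepts influence model predictions, and case studies demonstrate the approach's effectiveness.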

Abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.
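For readers unfamiliar with TCAV (Testing with Concept Activation Vectors), the score it reports can be sketched in a few lines of Python. This shows the original TCAV score, the fraction of inputs of a class whose directional derivative along the concept activation vector (CAV) is positive, and not the Robust TCAV variant used in the paper. As a simplifying assumption, the CAV here is the difference between the mean concept activation and the mean random activation rather than the normal of a trained linear classifier, and the per-input gradients are supplied directly instead of being computed from a network.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def mean_difference_cav(
    concept_acts: Sequence[Vector], random_acts: Sequence[Vector]
) -> List[float]:
    """Simplified CAV: mean concept activation minus mean random activation."""
    mc = mean_vector(concept_acts)
    mr = mean_vector(random_acts)
    return [c - r for c, r in zip(mc, mr)]


def tcav_score(class_gradients: Sequence[Vector], cav: Vector) -> float:
    """Fraction of class inputs whose class logit increases along the concept direction.

    class_gradients: gradient of the class logit w.r.t. the layer activations,
    one vector per input of the class of interest.
    """
    positive = sum(1 for g in class_gradients if dot(g, cav) > 0.0)
    return positive / len(class_gradients)


if __name__ == "__main__":
    cav = mean_difference_cav([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
    print(cav)  # [2.0, -2.0]
    # Directional derivatives are 2, -2 and 2, so two of three inputs are positive.
    print(tcav_score([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]], cav))  # 0.6666666666666666
```

A score well above what random concept directions produce suggests the concept (for example, a landscape type) pushes the model toward the class; Robust TCAV adds further safeguards on top of this basic quantity.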

2605.09383 2026-06-19 cs.RO version update 80%

Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level

Yueqi Zhu, Yan Pan, Chufan Rui, Jiasheng Luo, Shihua Li, Bo Zhou

Affiliations: School of Automation, Southeast University; Key Laboratory of Measurement and Control of CSE, Ministry of Education

Topic match (multi-sensor fusion): fuses LiDAR and inertial measurements for safety-critical odometry.

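AI summary: Proposes a safety-critical LiDAR-inertial odometry that provides deterministic protection levels via on-manifold deterministic state estimation, improving the navigation safety of mobile robots in safety-critical scenarios.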

Abstract

In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform tasks safely. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown-but-bounded assumption, we derive a concise closed-form relationship between point cloud noise and the uncertainty of the estimate produced by the iterative closest point (ICP) algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.
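The set-membership idea behind a deterministic protection level can be illustrated with the standard ellipsoidal prediction step. The Python sketch below is a generic Euclidean example, not the paper's on-manifold filter: it outer-bounds the Minkowski sum of a state ellipsoid and a bounded-noise ellipsoid with the trace-minimizing parameter, then reads off guaranteed per-axis bounds. The ellipsoid convention E(c, P) = {x : (x - c)^T P^{-1} (x - c) <= 1} and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def trace(m: Sequence[Sequence[float]]) -> float:
    return sum(m[i][i] for i in range(len(m)))


def minkowski_outer_ellipsoid(
    c1: Sequence[float], p1: Matrix, c2: Sequence[float], p2: Matrix
) -> Tuple[List[float], Matrix]:
    """Outer ellipsoid of the Minkowski sum E(c1, P1) + E(c2, P2).

    P = (1 + 1/beta) P1 + (1 + beta) P2 contains the sum for any beta > 0;
    beta = sqrt(tr P1 / tr P2) minimizes tr P.
    """
    center = [a + b for a, b in zip(c1, c2)]
    t1, t2 = trace(p1), trace(p2)
    if t1 == 0.0:  # first set is a single point (PSD matrix with zero trace)
        return center, [row[:] for row in p2]
    if t2 == 0.0:  # second set is a single point
        return center, [row[:] for row in p1]
    beta = math.sqrt(t1 / t2)
    n = len(p1)
    shape = [
        [(1.0 + 1.0 / beta) * p1[i][j] + (1.0 + beta) * p2[i][j] for j in range(n)]
        for i in range(n)
    ]
    return center, shape


def axis_bounds(center: Sequence[float], shape: Matrix) -> List[Tuple[float, float]]:
    """Tightest axis-aligned interval per coordinate: c_i +/- sqrt(P_ii)."""
    return [
        (c - math.sqrt(shape[i][i]), c + math.sqrt(shape[i][i]))
        for i, c in enumerate(center)
    ]


if __name__ == "__main__":
    # State set: semi-axes 2 (x) and 1 (y); noise set: semi-axes 1 and 0.5, shifted +1 in x.
    c, P = minkowski_outer_ellipsoid([0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]],
                                     [1.0, 0.0], [[1.0, 0.0], [0.0, 0.25]])
    print(P)                  # [[9.0, 0.0], [0.0, 2.25]]
    print(axis_bounds(c, P))  # [(-2.0, 4.0), (-1.5, 1.5)]
```

Because the two example ellipsoids have the same shape up to scale, the outer bound is exact here (semi-axes 3 and 1.5); in general it is conservative, which is what makes the resulting interval a guaranteed rather than probabilistic bound.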

2602.15707 2026-06-19 cs.MM cs.CL cs.LG version update 80%

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

Affiliations: Qualcomm Technologies, Inc.

Topic match (multi-sensor fusion): fuses multimodal audio and IMU inputs to drive a conversational assistant.

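AI summary: Proposes the first real-time conversational assistant that relies only on audio and IMU modalities; fine-tuning the language model reduces unnecessary dialogue and improves question-answering accuracy, and the assistant runs on an edge device without cloud dependence.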

Comments: 5 figures, with 5 more in the appendix

Abstract

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. Observing that an off-the-shelf language model is talkative but does not always answer questions correctly, we demonstrate how fine-tuning the model improves its ability to limit unnecessary dialogue, with a 50% increase in precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.

2507.21460 2026-06-19 cs.CV version update 75%

An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

Mianzhao Wang, Fan Shi, Xu Cheng, Feifei Zhang, Shengyong Chen

Affiliations: Engineering Research Center of Learning-Based Intelligent System (Ministry of Education); Key Laboratory of Computer Vision and System (Ministry of Education); School of Computer Science and Engineering, Tianjin University of Technology

Topic match (multi-sensor fusion): light field and temporal interaction, which counts as multi-sensor fusion.

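AI summary: Proposes a light field epipolar-plane structure image (ESI) representation and an angular-temporal interaction network that explicitly models geometric structure and supports self-supervised optimization, achieving efficient object tracking with state-of-the-art performance in low-light scenes.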

Abstract

High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.
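The proposed ESI builds on the classic epipolar-plane image (EPI). As background only, the Python sketch below extracts a horizontal EPI from a 4D light field stored as nested lists indexed `lf[u][v][s][t]` (angular `u, v`, spatial `s, t`) and recovers the disparity of a synthetic point from the slope of its line in the EPI. The array layout, the sign convention for disparity, and all helper names are assumptions; the code does not implement ESI or ATINet.

```python
from typing import List, Sequence

LightField = Sequence[Sequence[Sequence[Sequence[float]]]]


def horizontal_epi(lf: LightField, v_star: int, t_star: int) -> List[List[float]]:
    """Fix one angular row (v) and one spatial row (t): EPI rows index u, columns index s."""
    epi = []
    for u in range(len(lf)):
        view_row = lf[u][v_star]
        epi.append([view_row[s][t_star] for s in range(len(view_row))])
    return epi


def line_positions(epi: Sequence[Sequence[float]]) -> List[int]:
    """Column of the brightest pixel in each EPI row (one sample of the line per view)."""
    return [max(range(len(row)), key=row.__getitem__) for row in epi]


def disparity_from_epi(epi: Sequence[Sequence[float]]) -> float:
    """Least-squares slope of the EPI line: pixel shift per unit step in u."""
    xs = list(range(len(epi)))
    ys = line_positions(epi)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def synthetic_point_light_field(n_u: int, n_s: int, s0: int, disparity: int) -> LightField:
    """One angular row (v = 0) and one spatial row (t = 0); a bright point moves `disparity` px per view."""
    return [
        [[[1.0 if s == s0 + disparity * u else 0.0] for s in range(n_s)]]
        for u in range(n_u)
    ]


if __name__ == "__main__":
    epi = horizontal_epi(synthetic_point_light_field(3, 6, s0=1, disparity=2), 0, 0)
    print(line_positions(epi))      # [1, 3, 5]
    print(disparity_from_epi(epi))  # 2.0
```

Scene points at different depths trace lines of different slopes in the EPI, which is why EPI-style slices expose geometric structure that a single view lacks.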

2509.13972 2026-06-19 cs.RO version update 70%

BIM Informed Visual SLAM for Construction Environments

Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez

Affiliations: Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg

Topic match (multi-sensor fusion): fuses BIM and RGB-D data, which counts as multi-sensor fusion.

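AI summary: To address trajectory drift of visual SLAM in construction environments, the paper augments an RGB-D SLAM system with structural priors from the Building Information Model (BIM), adding wall correspondences as geometric constraints in the optimization to reduce drift and improve global consistency. Experiments show a 25.23% reduction in trajectory error and a 7.14% improvement in map accuracy.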

Comments: 9 pages, 7 tables, 4 figures

Abstract

Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inconsistent with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.
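A common way to turn a wall correspondence into a back-end constraint is a plane-to-plane residual. The Python sketch below is a generic illustration under stated assumptions, not the paper's formulation: planes are written in Hesse form n·x + d = 0 with a unit normal n, a detected wall is mapped from the camera frame to the world frame with a pose (R, t) such that x_w = R x_c + t, and the residual compares it with the associated BIM wall. The helper names and the simple sign alignment are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]
Plane = Tuple[List[float], float]  # (unit normal n, offset d) for n . x + d = 0


def mat_vec(r: Mat3, v: Vec3) -> List[float]:
    return [sum(r[i][k] * v[k] for k in range(3)) for i in range(3)]


def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def plane_to_world(n_c: Vec3, d_c: float, r_wc: Mat3, t_wc: Vec3) -> Plane:
    """Map a camera-frame plane to the world frame, given x_w = R x_c + t.

    Substituting x_c = R^T (x_w - t) gives n_w = R n_c and d_w = d_c - n_w . t.
    """
    n_w = mat_vec(r_wc, n_c)
    return n_w, d_c - dot(n_w, t_wc)


def plane_residual(detected: Plane, bim: Plane) -> List[float]:
    """4-vector residual (normal difference, offset difference) against a BIM wall."""
    n_det, d_det = detected
    n_bim, d_bim = bim
    # (n, d) and (-n, -d) describe the same plane; align signs before comparing.
    if dot(n_det, n_bim) < 0.0:
        n_det, d_det = [-x for x in n_det], -d_det
    return [a - b for a, b in zip(n_det, n_bim)] + [d_det - d_bim]


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Wall 2 m along the camera x-axis, camera at x = 1 m -> world plane x = 3.
    wall_w = plane_to_world([1.0, 0.0, 0.0], -2.0, identity, [1.0, 0.0, 0.0])
    print(wall_w)                                           # ([1.0, 0.0, 0.0], -3.0)
    print(plane_residual(wall_w, ([1.0, 0.0, 0.0], -3.0)))  # [0.0, 0.0, 0.0, 0.0]
```

When odometry drifts, the offset term becomes nonzero (for example, about 0.2 against a BIM wall at x = 3.2), and minimizing such residuals alongside the usual SLAM factors pulls the trajectory back toward the as-planned geometry.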
