Multimodal Information Fusion - arXivDaily Topic

2606.20103 2026-06-19 cs.CV New submission 95%

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

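Affiliations: Kyung Hee University

Topic match (multi-sensor fusion): LiDAR-camera extrinsic calibration, a typical multi-sensor fusion task.

AI summary: To address the scarcity of cross-modal features in LiDAR-camera calibration, the paper preserves the metric geometry of the 3DGS proxy through multi-view LiDAR depth supervision and by blocking photometric gradients from updating the Gaussian spatial parameters, which improves calibration accuracy.

Comments: Accepted to ECCV 2026. 15 pages (excluding references), 5 figures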

Abstract

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.
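
The gradient-blocking idea in the abstract can be illustrated with a toy update rule. This is a minimal sketch, not the authors' implementation: the parameter names, the choice of which fields count as "spatial" (`position`, `scale`, `rotation`), and the function `gaussian_update` are assumptions made for illustration. In an autodiff framework the same effect is typically obtained by applying a stop-gradient (detach) to the spatial parameters inside the photometric loss.

```python
# Hypothetical grouping: which fields count as "spatial" is an assumption.
SPATIAL_KEYS = {"position", "scale", "rotation"}

def gaussian_update(params, grad_photo, grad_depth, lr=0.1):
    """One gradient step in which photometric gradients never reach spatial parameters."""
    updated = {}
    for name, values in params.items():
        zeros = [0.0] * len(values)
        g_photo = grad_photo.get(name, zeros)
        g_depth = grad_depth.get(name, zeros)
        if name in SPATIAL_KEYS:
            grad = g_depth  # photometric gradient is blocked here
        else:
            grad = [p + d for p, d in zip(g_photo, g_depth)]
        updated[name] = [v - lr * g for v, g in zip(values, grad)]
    return updated

# Toy values: the photometric loss "wants" to move the position a lot.
params = {"position": [1.0, 2.0, 3.0], "color": [0.5, 0.5, 0.5]}
grad_photo = {"position": [9.0, 9.0, 9.0], "color": [1.0, 0.0, -1.0]}
grad_depth = {"position": [0.0, 0.0, 1.0]}
new_params = gaussian_update(params, grad_photo, grad_depth)
```

In this toy run only the LiDAR depth gradient moves `position`, while `color` still follows the photometric gradient, which is the separation of roles the abstract describes.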

2603.00654 2026-06-19 cs.CV Updated version 95%

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Songkai Wang, Huiliang Shen

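Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; School of Automotive Studies, Tongji University; Thrust of Artificial Intelligence, Hong Kong University of Science and Technology

Topic match (multi-sensor fusion): Proposes a collaborative perception framework for 4D radar and cameras that fuses multi-sensor information.

AI summary: Proposes RC-GeoCP, the first collaborative perception framework for 4D radar and cameras. A radar-anchored geometric consensus resolves the misalignment caused by depth ambiguity and spatial dispersion, yielding communication-efficient, globally consistent representations.

Comments: 11 pages, 6 figures, 9 tables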

Abstract

Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.
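
The idea of prioritizing what to transmit by uncertainty can be sketched as follows. This is only an illustration under stated assumptions: plain per-cell binary entropy of the ego agent's occupancy estimate stands in for the paper's conditional-entropy criterion, which the abstract does not spell out, and the names `binary_entropy` and `select_cells_to_request` are hypothetical.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def select_cells_to_request(ego_probs, budget):
    """Return indices of the `budget` most uncertain cells, in ascending index order."""
    ranked = sorted(range(len(ego_probs)),
                    key=lambda i: binary_entropy(ego_probs[i]),
                    reverse=True)
    return sorted(ranked[:budget])

# Toy bird's-eye-view cells: the ego agent is confident about cells 0 and 2.
ego_probs = [0.99, 0.50, 0.10, 0.45]
requested = select_cells_to_request(ego_probs, budget=2)
```

With a budget of two cells, the sketch requests the cells whose occupancy probability is closest to 0.5, so bandwidth is spent where the ego agent knows least.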

2606.20189 2026-06-19 cs.CV cs.AI cs.RO New submission 90%

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

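Affiliations: KTH Royal Institute of Technology; Linköping University; TRATON AB; Qualcomm Auto Ltd Sweden Filial

Topic match (multi-sensor fusion): Camera-to-LiDAR knowledge distillation, combining vision and LiDAR.

AI summary: Proposes HilDA, a framework that pre-trains LiDAR backbones in a self-supervised way by combining hierarchical distillation (multi-layer distillation and global context distillation) with a temporal occupancy diffusion objective, reaching state-of-the-art results on 3D detection, scene flow, and semantic occupancy prediction.

Comments: Accepted to ECCV 2026. Maciej and Jesper contributed equally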

Abstract

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, or the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pre-training framework for LiDAR backbones that better captures the semantic "what" and geometric "where" needed for driving tasks. HilDA combines hierarchical distillation, comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective that promotes spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.
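
Multi-layer distillation can be sketched as a weighted sum of per-layer feature losses. The snippet below is a minimal illustration, assuming cosine distance as the per-layer loss (a common choice that the abstract does not confirm) and assuming student and teacher features have already been projected to the same dimension. The function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multilayer_distill_loss(student_layers, teacher_layers, weights=None):
    """Weighted mean of (1 - cosine similarity) over paired layers."""
    if weights is None:
        weights = [1.0] * len(student_layers)
    total = sum(w * (1.0 - cosine_similarity(s, t))
                for w, s, t in zip(weights, student_layers, teacher_layers))
    return total / sum(weights)

# Two layers: the first is perfectly aligned, the second is orthogonal.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[3.0, 0.0], [1.0, 0.0]]
loss = multilayer_distill_loss(student, teacher)
```

The per-layer weights give one simple way to express "progressive" alignment, for example by weighting deeper layers more heavily; that weighting scheme is likewise an assumption.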

2606.20300 2026-06-19 cs.CV New submission 85%

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

Junhao Cai, Deyu Zeng, Junhao Pang, Junyu Chen, Qiwei Liang, Xiaopin Zhong, Zongze Wu

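Affiliations: Shenzhen University; Guangzhou Maritime University; Hong Kong University of Science and Technology (Guangzhou)

Topic match (multi-sensor fusion): Fuses RGB and 3D geometric information for few-shot anomaly detection.

AI summary: Proposes CMDS-AD, a cross-modal dual-stream anomaly detection framework that uses diffusion models to generate diverse samples and a low-frequency normal estimate to help decouple high-frequency defects, improving I-AUROC by 5.7% on MVTec 3D-AD in the 1-shot setting.

Comments: Accepted to ECCV 2026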

Abstract

Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.
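
The effect of a multiplicative scoring mechanism can be shown on toy per-pixel scores. This sketch assumes each modality yields one anomaly score per pixel and that scores are min-max normalized before multiplication; both choices, and the function names, are assumptions rather than details taken from the paper.

```python
def min_max_normalize(scores):
    """Scale scores to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def multiplicative_score(rgb_scores, geo_scores):
    """Fuse two modalities so a pixel scores high only if both modalities agree."""
    rgb = min_max_normalize(rgb_scores)
    geo = min_max_normalize(geo_scores)
    return [r * g for r, g in zip(rgb, geo)]

# Pixel 1 fires strongly in RGB only (e.g. a specular highlight);
# pixel 2 fires in both modalities (a plausible real defect).
rgb_scores = [0.0, 1.0, 0.8]
geo_scores = [0.0, 0.1, 1.0]
fused = multiplicative_score(rgb_scores, geo_scores)
```

Because the product is small whenever either factor is small, a response that appears in only one modality is suppressed, which is one way to read "filters modality-specific noise".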

2606.20044 2026-06-19 cs.CV New submission 85%

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

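Affiliations: School of Cyber Science and Engineering, Xi'an Jiaotong University; School of Informatics, Xiamen University; National University of Singapore

Topic match (multi-sensor fusion): Proposes the frequency-domain framework FUSE, which aligns multi-modal features to improve re-identification performance.

AI summary: Proposes FUSE, a frequency-domain framework that addresses the low-frequency bias in multi-modal re-identification through a two-stage process of spectral disentanglement and energy alignment, improving mAP by 9.1% on three datasets.

Comments: Accepted at ICML 2026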

Abstract

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low-, mid-, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
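
Partitioning a feature into low-, mid-, and high-frequency parts can be sketched with a discrete Fourier transform and three masks. The snippet is a simplified stand-in: it uses a 1D signal and fixed cut-offs (`low_cut`, `mid_cut`), whereas the abstract describes an adaptive partition, so the cut-offs and function names are assumptions.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(spectrum):
    """Inverse DFT, keeping the real part (valid for conjugate-symmetric spectra)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def split_bands(signal, low_cut, mid_cut):
    """Split a real signal into low/mid/high bands that sum back to the input."""
    n = len(signal)
    spectrum = dft(signal)
    bands = {"low": [0j] * n, "mid": [0j] * n, "high": [0j] * n}
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency
        if freq <= low_cut:
            name = "low"
        elif freq <= mid_cut:
            name = "mid"
        else:
            name = "high"
        bands[name][k] = coeff
    return {name: idft_real(spec) for name, spec in bands.items()}

signal = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 2.0, 1.0]
bands = split_bands(signal, low_cut=1, mid_cut=2)
reconstructed = [l + m + h for l, m, h in zip(bands["low"], bands["mid"], bands["high"])]
```

Since every frequency bin lands in exactly one band, the three bands add up to the original signal, so downstream modules can process each band separately without losing information.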

2604.13240 2026-06-19 cs.CV cs.LG Updated version 85%

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

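Affiliations: Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554; LTSER Zone Atelier Armorique; University of Würzburg, Center for Artificial Intelligence and Data Science

Topic match (multi-sensor fusion): Fuses multispectral and LiDAR drone imagery, which falls under multi-sensor fusion.

AI summary: Proposes the first concept-based explainable AI method for species distribution models. A landscape concept dataset is built from high-resolution multispectral and LiDAR drone imagery, Robust TCAV quantifies the influence of landscape concepts on model predictions, and case studies demonstrate the approach.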

Abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.
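
The core quantity behind TCAV can be sketched in a few lines. This is a simplified illustration, not Robust TCAV as used in the paper: the concept activation vector is taken as the normalized difference of mean activations (the original TCAV fits a linear classifier instead), and the gradients of the model output with respect to the layer activations are assumed to be given.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concept_activation_vector(concept_acts, random_acts):
    """Unit vector pointing from random-patch activations toward concept activations."""
    diff = [c - r for c, r in zip(mean_vector(concept_acts), mean_vector(random_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def tcav_score(gradients, cav):
    """Fraction of inputs whose prediction increases when moving along the concept."""
    positive = sum(1 for g in gradients
                   if sum(gi * ci for gi, ci in zip(g, cav)) > 0)
    return positive / len(gradients)

# Toy 2D activations for a concept (e.g. "hedgerow") and random reference patches.
concept_acts = [[2.0, 0.0], [4.0, 0.0]]
random_acts = [[0.0, 0.0], [0.0, 0.0]]
cav = concept_activation_vector(concept_acts, random_acts)

# Toy gradients of the species-presence logit w.r.t. the same layer, one per input.
gradients = [[1.0, 5.0], [-1.0, 2.0], [0.5, -3.0], [2.0, 0.0]]
score = tcav_score(gradients, cav)
```

A score well above 0.5 suggests the concept tends to push the model toward predicting presence; a score near 0.5 suggests no consistent influence. The concept name in the comment is hypothetical.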

2606.19929 2026-06-19 cs.RO New submission 80%

Motor Angular Speed Preintegration for Multirotor UAV State Estimation

Matěj Petrlík, Filip Novák, Robert Pěnička, Martin Saska

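Affiliations: Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague

Topic match (multi-sensor fusion): Fuses motor speeds with LiDAR to improve UAV state estimation.

AI summary: To counter the loss of IMU accuracy caused by UAV vibrations, the paper preintegrates accelerations derived from motor speeds, uses them in place of the IMU for state propagation, and builds a factor for factor graph optimization. Combined with LiDAR, this yields the MAS-LO algorithm, which improves position accuracy by 28% and velocity accuracy by 65% over LIO-SAM.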

Abstract

A precise state estimate is crucial for a tight feedback control that enables agile and near-obstacle flights of UAVs. The state-of-the-art methods fuse slow pose measurements with high-frequency inertial measurements to obtain a precise state estimate. However, the inertial measurements from the IMU onboard the UAV are degraded by vibrations from spinning propellers and the precision of the estimated state suffers. We propose a novel approach based on the preintegration of accelerations obtained from motor speeds. We show that the accelerations obtained in this manner can be used for state propagation on their own to achieve better precision without including the IMU. Further, we propose a factor composed of the preintegrated motor speeds that can be directly employed in factor graph optimization frameworks. We combine our factor with LiDAR measurements into the proposed Motor Angular Speed LiDAR Odometry (MAS-LO) algorithm for precise state estimation, which we open-source. Lastly, we evaluate the estimation precision against a state-of-the-art inertial algorithm LIO-SAM to show 28% improvement in position and 65% in velocity estimation accuracy, 14% lower measurement lag, and high robustness to wrong parameter values.
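
Preintegrating accelerations derived from motor speeds can be sketched for the translational part only. The snippet assumes the common quadratic rotor model (thrust per rotor equals `k_f` times the squared angular speed), a fixed attitude over the window, and no drag or gravity term; the constant `k_f`, the mass, and the function names are illustrative assumptions, not values from the paper.

```python
def thrust_acceleration(motor_speeds, k_f, mass):
    """Body-z acceleration (m/s^2) from rotor speeds (rad/s), thrust = k_f * w^2 per rotor."""
    return k_f * sum(w * w for w in motor_speeds) / mass

def preintegrate(samples, dt, k_f, mass):
    """Accumulate velocity and position increments along body z over a window."""
    delta_v = 0.0
    delta_p = 0.0
    for speeds in samples:
        a = thrust_acceleration(speeds, k_f, mass)
        delta_p += delta_v * dt + 0.5 * a * dt * dt  # uses velocity before this step
        delta_v += a * dt
    return delta_v, delta_p

# Ten samples at 10 Hz with four rotors spinning at a constant 100 rad/s.
samples = [[100.0, 100.0, 100.0, 100.0] for _ in range(10)]
delta_v, delta_p = preintegrate(samples, dt=0.1, k_f=2.5e-4, mass=1.0)
```

The accumulated increments summarize many high-rate motor samples between two slower pose measurements, so a factor graph can use a single relative-motion constraint instead of one per sample.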

2606.19874 2026-06-19 cs.RO cs.CV New submission 80%

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

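Affiliations: HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; Aarhus University; University of Tokyo; Beijing University of Chemical Technology; North China Electric Power University

Topic match (multi-sensor fusion): Visual SLAM that fuses point and line features; tagged as multi-sensor fusion.

AI summary: Proposes MMD-SLAM, which uses the Atlanta World assumption to guide a Multi-Meta Gaussian representation and improves the tracking accuracy and mapping quality of visual SLAM through point-line fusion, dominant-direction encoding, and a Gaussian evolution strategy.

Comments: ICRA 2026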

Abstract

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, compared with MonoGS, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica.
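
The tracking metric quoted in the abstract, ATE RMSE, and a relative reduction can be computed as below. This is a minimal sketch that assumes the estimated and ground-truth trajectories are already time-associated and aligned in a common frame (the alignment step is omitted). The trajectories and baseline numbers are made up for illustration.

```python
import math

def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error over paired 3D positions."""
    squared = [sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
               for p_est, p_gt in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Toy trajectories in metres (not from any dataset).
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.0), (2.0, 0.0, 0.4)]
rmse = ate_rmse(estimated, ground_truth)
```

A reported "48.56% reduction" corresponds to `relative_reduction(baseline_rmse, method_rmse)` returning 48.56 for the two systems' RMSE values.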

2605.09383 2026-06-19 cs.RO Updated version 80%

Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level

Yueqi Zhu, Yan Pan, Chufan Rui, Jiasheng Luo, Shihua Li, Bo Zhou

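Affiliations: School of Automation, Southeast University; Key Laboratory of Measurement and Control of CSE, Ministry of Education

Topic match (multi-sensor fusion): Fuses LiDAR and inertial measurements for safety-critical odometry.

AI summary: Proposes a safety-critical LiDAR-inertial odometry that provides deterministic protection levels through on-manifold deterministic state estimation, improving navigation safety for mobile robots in safety-critical scenarios.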

Abstract

In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform tasks safely. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown-but-bounded assumption, we derive a neat closed-form relationship between point cloud noise and the uncertainty of the estimate from the iterative closest point (ICP) algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as the deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.
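
How an ellipsoidal feasible set turns into a numeric bound can be sketched as follows. The snippet assumes the set is written as all points `c + x` with `x^T P^{-1} x <= 1` for a symmetric positive-definite shape matrix `P`; under that convention the largest deviation from the center along a unit direction `d` is `sqrt(d^T P d)`. The convention and the function names are assumptions, since the abstract does not give the paper's parameterization.

```python
import math

def extent_along(shape_matrix, direction):
    """Maximum deviation from the ellipsoid center along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    n = len(unit)
    quad = sum(unit[i] * shape_matrix[i][j] * unit[j]
               for i in range(n) for j in range(n))
    return math.sqrt(quad)

def axis_protection_levels(shape_matrix):
    """Guaranteed position bounds along each coordinate axis."""
    return [math.sqrt(shape_matrix[i][i]) for i in range(len(shape_matrix))]

# Toy shape matrix in m^2: semi-axes of 2 m, 3 m and 1 m along x, y, z.
P = [[4.0, 0.0, 0.0],
     [0.0, 9.0, 0.0],
     [0.0, 0.0, 1.0]]
levels = axis_protection_levels(P)
diagonal_extent = extent_along([[2.0, 1.0], [1.0, 2.0]], [1.0, 1.0])
```

Unlike a covariance-based confidence region, such a bound holds for every state consistent with the assumed noise bounds, which is what makes it usable as a deterministic safety reference.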

2603.27361 2026-06-19 cs.RO 80%

Online Inertia Tensor Identification for Non-Cooperative Spacecraft via Augmented UKF

Batu Candan, Simone Servadio

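Affiliations: Department of Aerospace Engineering, Iowa State University

Topic match (multi-sensor fusion): Fuses a vision CNN with LiDAR depth data to estimate spacecraft pose.

AI summary: Proposes an augmented UKF framework that jointly estimates the 6-DOF pose and the full inertia tensor of a non-cooperative target spacecraft. Fusing vision and LiDAR data enables real-time estimation of the inertial parameters and improves navigation and guidance accuracy in deep-space environments.

Journal ref: AIAA 2026 Region V Student Conference, AIAA 2026-108993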

DOI: 10.2514/6.2026-108993

Abstract

Autonomous proximity operations, such as active debris removal and on-orbit servicing, require high-fidelity relative navigation solutions that remain robust in the presence of parametric uncertainty. Standard estimation frameworks typically assume that the target spacecraft's mass properties are known a priori; however, for non-cooperative or tumbling targets, these parameters are often unknown or uncertain, leading to rapid divergence in model-based propagators. This paper presents an augmented Unscented Kalman Filter (UKF) framework designed to jointly estimate the relative 6-DOF pose and the full inertia tensor of a non-cooperative target spacecraft. The proposed architecture fuses visual measurements from monocular vision-based Convolutional Neural Networks (CNN) with depth information from LiDAR to constrain the coupled rigid-body dynamics. By augmenting the state vector to include the six independent elements of the inertia tensor, the filter dynamically recovers the target's normalized mass distribution in real-time without requiring ground-based pre-calibration. To ensure numerical stability and physical consistency during the estimation of constant parameters, the filter employs an adaptive process noise formulation that prevents covariance collapse while allowing for the gradual convergence of the inertial parameters. Numerical validation is performed via Monte Carlo simulations, demonstrating that the proposed Augmented UKF enables the simultaneous convergence of kinematic states and inertial parameters, thereby facilitating accurate long-term trajectory prediction and robust guidance in non-cooperative deep-space environments.
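
Augmenting the state with the six independent inertia elements can be sketched as below. The element ordering `(Ixx, Iyy, Izz, Ixy, Ixz, Iyz)`, the trace normalization, and the function names are assumptions for illustration; the abstract only states that six independent elements are estimated and that the recovered mass distribution is normalized.

```python
def inertia_from_params(p):
    """Build the symmetric 3x3 inertia tensor from six independent elements."""
    ixx, iyy, izz, ixy, ixz, iyz = p
    return [[ixx, ixy, ixz],
            [ixy, iyy, iyz],
            [ixz, iyz, izz]]

def is_positive_definite(m):
    """Sylvester's criterion: all leading principal minors must be positive."""
    d1 = m[0][0]
    d2 = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    d3 = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    return d1 > 0 and d2 > 0 and d3 > 0

def normalize_by_trace(p):
    """Scale the parameters so that Ixx + Iyy + Izz = 1."""
    trace = p[0] + p[1] + p[2]
    return [x / trace for x in p]

def augment_state(kinematic_state, inertia_params):
    """Append the inertia parameters to the kinematic state vector."""
    return list(kinematic_state) + list(inertia_params)

inertia_params = [2.0, 3.0, 4.0, 0.1, 0.0, 0.0]
tensor = inertia_from_params(inertia_params)
normalized = normalize_by_trace(inertia_params)
state = augment_state([0.0] * 13, normalized)  # 13 is a placeholder kinematic size
```

A positive-definiteness check of this kind is one simple way to reject physically impossible sigma points or estimates; whether the paper uses such a check is not stated in the abstract.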

2602.15707 2026-06-19 cs.MM cs.CL cs.LG Updated version 80%

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

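Affiliations: Qualcomm Technologies, Inc.

Topic match (multi-sensor fusion): Fuses audio and IMU inputs to build a conversational assistant.

AI summary: Proposes the first real-time conversational assistant that uses only audio and IMU modalities. Fine-tuning the language model reduces unnecessary dialogue and improves question-answering accuracy, and the assistant runs on an edge device without depending on the cloud.

Comments: 5 figures. 5 more in appendix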

Abstract

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. On observing that an off-the-shelf language model is a talkative assistant but is not always able to answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogues with a 50% increase in the precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.

2606.19961 2026-06-19 cs.CV New submission 75%

Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

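Affiliations: imec; imec-IPI-Ghent University; Yale University

Topic match (multi-sensor fusion): RGB-to-SWIR translation, fusing multi-modal sensor data.

AI summary: To address the loss of spatial detail in latent diffusion models for RGB-to-SWIR image translation, the paper proposes two lightweight fixes, a source-conditioned autoencoder and a learnable guidance encoder. On driving scenes they improve detection mAP by up to 2x, with up to 3.4x gains on small objects, while achieving state-of-the-art FID.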

Abstract

Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, <32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.

2507.21460 2026-06-19 cs.CV Updated version 75%

An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

Mianzhao Wang, Fan Shi, Xu Cheng, Feifei Zhang, Shengyong Chen

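Affiliations: Engineering Research Center of Learning-Based Intelligent System (Ministry of Education); Key Laboratory of Computer Vision and System (Ministry of Education); School of Computer Science and Engineering, Tianjin University of Technology

Topic match (multi-sensor fusion): Light-field and temporal interaction; tagged as multi-sensor fusion.

AI summary: Proposes a light field epipolar-plane structure image representation and an angular-temporal interaction network. By explicitly modeling geometric structure and optimizing in a self-supervised manner, the method tracks objects effectively in low-light scenes and reaches state-of-the-art performance.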

Abstract

High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.
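
The epipolar-plane view that the ESI representation builds on can be extracted from a 4D light field by slicing. This sketch shows only the standard epipolar-plane image (EPI), not the paper's ESI, and it assumes the indexing convention `light_field[u][v][y][x]` with angular coordinates `(u, v)` and spatial coordinates `(y, x)`.

```python
def epipolar_plane_image(light_field, v_fixed, y_fixed):
    """Return the (u, x) slice of a 4D light field at fixed v and y."""
    return [list(light_field[u][v_fixed][y_fixed])
            for u in range(len(light_field))]

# Toy light field: 3x3 views of 4x5 pixels, each value encodes its own index.
U, V, H, W = 3, 3, 4, 5
light_field = [[[[1000 * u + 100 * v + 10 * y + x for x in range(W)]
                 for y in range(H)]
                for v in range(V)]
               for u in range(U)]

epi = epipolar_plane_image(light_field, v_fixed=1, y_fixed=2)
```

In a real EPI, a scene point traces a line whose slope depends on its depth, so abrupt slope changes mark geometric boundaries; this is the kind of structural cue the abstract refers to.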

2509.13972 2026-06-19 cs.RO Updated version 70%

BIM Informed Visual SLAM for Construction Environments

Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez

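Affiliations: Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg

Topic match (multi-sensor fusion): Fuses BIM with RGB-D data; tagged as multi-sensor fusion.

AI summary: To address trajectory drift of visual SLAM in construction environments, the paper augments an RGB-D SLAM system with structural priors from the Building Information Model (BIM). Wall correspondences are added as geometric constraints in the optimization, which reduces drift and improves global consistency. Experiments show a 25.23% reduction in trajectory error and a 7.14% improvement in map accuracy.

Comments: 9 pages, 7 tables, 4 figures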

Abstract

Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inconsistent with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.
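
Associating a detected wall with its BIM counterpart can be sketched as a gated nearest-plane search. The snippet assumes each wall is a plane `(normal, offset)` with a unit normal of consistent orientation satisfying `normal . x = offset`, and it uses hypothetical gating thresholds; the paper's actual association criteria are not given in the abstract.

```python
import math

def associate_wall(detected, bim_walls, max_angle_deg=10.0, max_offset=0.5):
    """Return the index of the best-matching BIM wall, or None if no wall passes the gates."""
    normal_d, offset_d = detected
    best_index, best_cost = None, float("inf")
    for index, (normal_b, offset_b) in enumerate(bim_walls):
        cos_angle = sum(a * b for a, b in zip(normal_d, normal_b))
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
        angle = math.degrees(math.acos(cos_angle))
        offset_gap = abs(offset_d - offset_b)
        if angle <= max_angle_deg and offset_gap <= max_offset and offset_gap < best_cost:
            best_index, best_cost = index, offset_gap
    return best_index

# Toy BIM with three walls (offsets in metres).
bim_walls = [((1.0, 0.0, 0.0), 0.0),
             ((1.0, 0.0, 0.0), 5.0),
             ((0.0, 1.0, 0.0), 5.1)]

# A slightly drifted detection of the second wall, and a floor-like plane.
match = associate_wall(((0.998, 0.063, 0.0), 5.2), bim_walls)
no_match = associate_wall(((0.0, 0.0, 1.0), 1.0), bim_walls)
```

Each accepted pair can then contribute a plane-to-plane residual to the back-end optimization, while unmatched detections are simply left unconstrained, which is one way to tolerate incomplete BIM data.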
