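arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

Vision and Robotics

Multimodal Information Fusion

Information fusion for image, video, multi-sensor, and cross-modal perception, including image fusion, infrared-visible fusion, remote sensing, medical imaging, LiDAR/radar/camera fusion, and audio-visual fusion.

Collected for today/current date: 8 papers. Sources: cs.CV, eess.IV, eess.SP, cs.RO, cs.MM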
2606.20103 2026-06-19 cs.CV New submission 95%

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

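Affiliations Kyung Hee University

Topic match Multi-sensor fusion: LiDAR-camera extrinsic calibration, a typical multi-sensor fusion task

AI summary To address the scarcity of cross-modal features in LiDAR-camera calibration, the paper preserves the metric geometry of the 3DGS proxy through multi-view LiDAR depth supervision and by blocking photometric gradients from updating the Gaussian spatial parameters, improving calibration accuracy.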

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

Abstract

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.
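
The gradient-blocking idea described in this abstract can be illustrated with a toy sketch. It assumes a single one-dimensional "Gaussian" with a position and a color, a made-up rendering model (intensity = color × position), and hand-derived gradients. The names `masked_update` and `run_toy_optimization` and all parameters are hypothetical, and this is not the authors' implementation.

```python
def masked_update(params, depth_grads, photo_grads, lr, spatial_keys=("position",)):
    """One gradient step in which photometric gradients never reach spatial parameters."""
    updated = {}
    for name, value in params.items():
        grad = depth_grads.get(name, 0.0)
        if name not in spatial_keys:
            # Appearance parameters still receive the photometric gradient.
            grad += photo_grads.get(name, 0.0)
        updated[name] = value - lr * grad
    return updated


def run_toy_optimization(lidar_depth=4.0, target_intensity=3.0, steps=500, lr=0.01):
    """Optimize a toy (position, color) pair under depth and photometric losses."""
    params = {"position": 5.0, "color": 0.2}
    for _ in range(steps):
        pos, col = params["position"], params["color"]
        # Depth loss (pos - lidar_depth)^2 is the only signal allowed to move geometry.
        depth_grads = {"position": 2.0 * (pos - lidar_depth)}
        # Toy photometric loss (col * pos - target_intensity)^2 (assumed rendering model).
        residual = col * pos - target_intensity
        photo_grads = {
            "position": 2.0 * residual * col,  # computed, but masked out in the update
            "color": 2.0 * residual * pos,
        }
        params = masked_update(params, depth_grads, photo_grads, lr)
    return params


result = run_toy_optimization()
```

In this sketch the position settles at the LiDAR depth regardless of what the photometric term would prefer, while the color adapts to explain the target intensity.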

2606.20189 2026-06-19 cs.CV cs.AI cs.RO New submission 90%

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

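Affiliations KTH Royal Institute of Technology; Linköping University; TRATON AB; Qualcomm Auto Ltd Sweden Filial

Topic match Multi-sensor fusion: camera-to-LiDAR knowledge distillation, fusing vision and LiDAR

AI summary Proposes the HilDA framework, which combines hierarchical distillation (multi-layer distillation and global context distillation) with a temporal occupancy diffusion objective to pre-train LiDAR backbones in a self-supervised manner, reaching state-of-the-art results on 3D detection, scene flow, and semantic occupancy prediction.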

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

Abstract

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, or the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pre-training framework for LiDAR backbones that better captures the semantic "what" and geometric "where" needed for driving tasks. HilDA combines hierarchical distillation, comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

2606.20300 2026-06-19 cs.CV New submission 85%

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

Junhao Cai, Deyu Zeng, Junhao Pang, Junyu Chen, Qiwei Liang, Xiaopin Zhong, Zongze Wu

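Affiliations Shenzhen University; Guangzhou Maritime University; Hong Kong University of Science and Technology (Guangzhou)

Topic match Multi-sensor fusion: fuses RGB and 3D geometric information for few-shot anomaly detection

AI summary Proposes CMDS-AD, a cross-modal dual-stream anomaly detection framework that generates diverse samples with a diffusion model and uses a low-frequency normal estimate to help decouple high-frequency defects, improving I-AUROC on MVTec 3D-AD by 5.7% in the 1-shot setting.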

Comments Accepted to ECCV 2026!

Abstract

Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.
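
The abstract mentions a multiplicative scoring mechanism but does not give its formula. As a rough illustration only, the sketch below assumes that per-pixel anomaly scores in [0, 1] from an RGB branch and a 3D branch are fused by an element-wise product, so that a response present in only one modality is suppressed. The function `fuse_scores` and the sample values are hypothetical.

```python
def fuse_scores(rgb_scores, geometry_scores):
    """Element-wise product of two per-pixel anomaly maps with values in [0, 1]."""
    if len(rgb_scores) != len(geometry_scores):
        raise ValueError("score maps must have the same length")
    return [r * g for r, g in zip(rgb_scores, geometry_scores)]


# Illustrative values: pixel 0 is flagged by both modalities, pixel 1 only by RGB.
rgb = [0.9, 0.8, 0.1]
geometry = [0.9, 0.1, 0.1]

fused = fuse_scores(rgb, geometry)
most_anomalous = max(range(len(fused)), key=fused.__getitem__)
```

With a product, the pixel that only the RGB branch responds to ends up with a low fused score, whereas an additive rule would leave it comparatively high.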

2606.20044 2026-06-19 cs.CV New submission 85%

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

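Affiliations School of Cyber Science and Engineering, Xi'an Jiaotong University; School of Informatics, Xiamen University; National University of Singapore

Topic match Multi-sensor fusion: proposes FUSE, a frequency-domain framework that aligns multi-modal features to improve re-identification performance.

AI summary Proposes the frequency-domain framework FUSE, which addresses low-frequency bias in multi-modal re-identification through a two-stage process of spectral disentanglement and energy alignment, improving mAP by 9.1% across three datasets.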

Comments Accepted in ICML 2026

Abstract

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low-, mid-, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
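
To make the notion of low-, mid-, and high-frequency subspaces concrete, here is a minimal sketch that splits the spectral energy of a one-dimensional signal into three bands. It assumes fixed, hand-picked cutoffs and a naive DFT. The SDM described in the abstract partitions features adaptively, so this is only an illustration of the underlying quantity, and `band_energies` with its cutoff parameters is hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def band_energies(signal, low_cut, high_cut):
    """Split signal energy into low/mid/high bands using fixed frequency-index cutoffs."""
    n = len(signal)
    energies = {"low": 0.0, "mid": 0.0, "high": 0.0}
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k)  # fold negative frequencies onto positive ones
        if freq <= low_cut:
            band = "low"
        elif freq <= high_cut:
            band = "mid"
        else:
            band = "high"
        # Dividing by n makes the band energies sum to sum(|x|^2) (Parseval).
        energies[band] += abs(coeff) ** 2 / n
    return energies


# A constant offset (frequency 0) plus a cosine at frequency index 6.
n = 16
signal = [1.0 + math.cos(2 * math.pi * 6 * t / n) for t in range(n)]
energies = band_energies(signal, low_cut=1, high_cut=4)
```

Per-band energies of this kind are what an energy-alignment term could compare across modalities; how FUSE actually defines and regularizes them is not specified in the abstract.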

2606.19929 2026-06-19 cs.RO New submission 80%

Motor Angular Speed Preintegration for Multirotor UAV State Estimation

Matěj Petrlík, Filip Novák, Robert Pěnička, Martin Saska

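Affiliations Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague

Topic match Multi-sensor fusion: fuses motor speeds with LiDAR to improve UAV state estimation.

AI summary To address the loss of IMU accuracy caused by UAV vibrations, proposes preintegrating accelerations derived from motor speeds to replace the IMU for state propagation, builds a factor for factor graph optimization, and combines it with LiDAR into the MAS-LO algorithm, improving position accuracy by 28% and velocity accuracy by 65% relative to LIO-SAM.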

Abstract

A precise state estimate is crucial for a tight feedback control that enables agile and near-obstacle flights of UAVs. The state-of-the-art methods fuse slow pose measurements with high-frequency inertial measurements to obtain a precise state estimate. However, the inertial measurements from the IMU onboard the UAV are degraded by vibrations from spinning propellers and the precision of the estimated state suffers. We propose a novel approach based on the preintegration of accelerations obtained from motor speeds. We show that the accelerations obtained in this manner can be used for state propagation on their own to achieve better precision without including the IMU. Further, we propose a factor composed of the preintegrated motor speeds that can be directly employed in factor graph optimization frameworks. We combine our factor with LiDAR measurements into the proposed Motor Angular Speed LiDAR Odometry (MAS-LO) algorithm for precise state estimation, which we open-source. Lastly, we evaluate the estimation precision against a state-of-the-art inertial algorithm LIO-SAM to show 28% improvement in position and 65% in velocity estimation accuracy, 14% lower measurement lag, and high robustness to wrong parameter values.
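
The idea of preintegrating accelerations derived from motor speeds can be sketched in one dimension. The sketch below assumes a quadratic thrust model (thrust proportional to the sum of squared motor speeds), considers only the body z-axis, and ignores attitude, gravity, and drag. The thrust coefficient, mass, and function names are hypothetical, and the paper's actual formulation is not given in the abstract.

```python
def thrust_acceleration(motor_speeds, thrust_coeff, mass):
    """Body z-axis acceleration from an assumed quadratic thrust model: a = k * sum(w^2) / m."""
    return thrust_coeff * sum(w * w for w in motor_speeds) / mass


def preintegrate(speed_samples, dt, thrust_coeff, mass):
    """Accumulate velocity and position increments between two keyframes."""
    delta_v = 0.0
    delta_p = 0.0
    for motor_speeds in speed_samples:
        a = thrust_acceleration(motor_speeds, thrust_coeff, mass)
        # Position increment uses the velocity increment accumulated so far.
        delta_p += delta_v * dt + 0.5 * a * dt * dt
        delta_v += a * dt
    return delta_v, delta_p


# 100 samples at 100 Hz with four motors spinning at a constant 1000 rad/s.
samples = [[1000.0] * 4 for _ in range(100)]
delta_v, delta_p = preintegrate(samples, dt=0.01, thrust_coeff=2.0e-6, mass=1.0)
```

The pair (delta_v, delta_p) summarizes all samples between two keyframes, which is what allows a single factor to stand in for many high-rate measurements in a factor graph.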

2606.19874 2026-06-19 cs.RO cs.CV New submission 80%

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

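Affiliations HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; Aarhus University; University of Tokyo; Beijing University of Chemical Technology; North China Electric Power University

Topic match Multi-sensor fusion: visual SLAM that fuses point and line features

AI summary Proposes MMD-SLAM, which uses the Atlanta World assumption to guide a Multi-Meta Gaussian representation and improves the tracking accuracy and mapping quality of visual SLAM through point-line fusion, dominant-direction encoding, and a Gaussian evolution strategy.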

Comments ICRA 2026

Abstract

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica, compared with MonoGS.

2603.27361 2026-06-19 cs.RO 80%

Online Inertia Tensor Identification for Non-Cooperative Spacecraft via Augmented UKF

Batu Candan, Simone Servadio

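Affiliations Department of Aerospace Engineering, Iowa State University

Topic match Multi-sensor fusion: fuses vision CNN outputs and LiDAR depth data to estimate spacecraft pose

AI summary Proposes an augmented UKF framework that jointly estimates the 6-DOF pose and full inertia tensor of a non-cooperative target spacecraft, combining vision and LiDAR data to estimate inertial parameters in real time and improve navigation and guidance accuracy in deep-space environments.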

Journal ref AIAA 2026 Region V Student Conference, AIAA 2026-108993

Abstract

Autonomous proximity operations, such as active debris removal and on-orbit servicing, require high-fidelity relative navigation solutions that remain robust in the presence of parametric uncertainty. Standard estimation frameworks typically assume that the target spacecraft's mass properties are known a priori; however, for non-cooperative or tumbling targets, these parameters are often unknown or uncertain, leading to rapid divergence in model-based propagators. This paper presents an augmented Unscented Kalman Filter (UKF) framework designed to jointly estimate the relative 6-DOF pose and the full inertia tensor of a non-cooperative target spacecraft. The proposed architecture fuses visual measurements from monocular vision-based Convolutional Neural Networks (CNN) with depth information from LiDAR to constrain the coupled rigid-body dynamics. By augmenting the state vector to include the six independent elements of the inertia tensor, the filter dynamically recovers the target's normalized mass distribution in real-time without requiring ground-based pre-calibration. To ensure numerical stability and physical consistency during the estimation of constant parameters, the filter employs an adaptive process noise formulation that prevents covariance collapse while allowing for the gradual convergence of the inertial parameters. Numerical validation is performed via Monte Carlo simulations, demonstrating that the proposed Augmented UKF enables the simultaneous convergence of kinematic states and inertial parameters, thereby facilitating accurate long-term trajectory prediction and robust guidance in non-cooperative deep-space environments.
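
State augmentation with the six independent inertia elements can be sketched as a simple packing scheme. The element ordering (Ixx, Iyy, Izz, Ixy, Ixz, Iyz), the 13-element kinematic state, and all function names below are assumptions for illustration; the abstract does not specify the paper's state layout, and the UKF itself is not shown.

```python
def inertia_from_elements(elements):
    """Build a symmetric 3x3 inertia tensor from (Ixx, Iyy, Izz, Ixy, Ixz, Iyz)."""
    ixx, iyy, izz, ixy, ixz, iyz = elements
    return [
        [ixx, ixy, ixz],
        [ixy, iyy, iyz],
        [ixz, iyz, izz],
    ]


def augment_state(kinematic_state, inertia_elements):
    """Append the six inertia elements to the kinematic state vector."""
    if len(inertia_elements) != 6:
        raise ValueError("expected six independent inertia elements")
    return list(kinematic_state) + list(inertia_elements)


def split_state(augmented_state, n_kinematic):
    """Recover the kinematic part and the inertia part of an augmented state."""
    return augmented_state[:n_kinematic], augmented_state[n_kinematic:]


def propagate_inertia(inertia_elements):
    """Constant-parameter process model: inertia elements carry over unchanged."""
    return list(inertia_elements)


# Assumed kinematic layout: position (3), velocity (3), quaternion (4), angular rate (3).
kinematic = [0.0] * 6 + [1.0, 0.0, 0.0, 0.0] + [0.01, 0.02, 0.03]
inertia = [1.0, 0.8, 0.6, 0.05, 0.02, 0.01]

state = augment_state(kinematic, inertia)
kin_part, inertia_part = split_state(state, len(kinematic))
tensor = inertia_from_elements(propagate_inertia(inertia_part))
```

Because the inertia elements follow a constant process model, a filter with zero process noise on them would quickly stop updating, which is the covariance collapse that the abstract's adaptive process noise is meant to prevent.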

2606.19961 2026-06-19 cs.CV New submission 75%

Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

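Affiliations imec; imec-IPI-Ghent University; Yale University

Topic match Multi-sensor fusion: RGB-to-SWIR translation, fusing multi-modal sensor data.

AI summary To address the loss of spatial detail when latent diffusion models perform RGB-to-SWIR image translation, proposes two lightweight improvements, a source-conditioned autoencoder and a learnable guidance encoder, which on driving scenes improve detection mAP by up to 2x, with gains of up to 3.4x on small objects, while achieving state-of-the-art FID.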

Abstract

Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, <32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.