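arXivDaily: arXiv daily academic digest, updated Monday to Friday

Vision and Robotics

Multimodal Information Fusion

Information fusion for image, video, multi-sensor, and cross-modal perception, including image fusion, infrared-visible fusion, remote sensing, medical imaging, LiDAR/radar/camera fusion, and audio-visual fusion.

Papers collected for today / the current date: 46. Sources: cs.CV, eess.IV, eess.SP, cs.RO, cs.MM

1. Multi-Sensor Fusion (8 papers)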

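2606.20103 2026-06-19 cs.CV New submission 95%

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

Affiliations: Kyung Hee University

Topic match (Multi-Sensor Fusion): LiDAR-camera extrinsic calibration, a typical multi-sensor fusion task.

AI summary: To address the scarcity of cross-modal features in LiDAR-camera calibration, the paper preserves the metric geometry of the 3DGS proxy through multi-view LiDAR depth supervision and by blocking photometric gradients from updating the Gaussian spatial parameters, which improves calibration accuracy.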

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

Abstract

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.
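The gradient-blocking idea in this abstract can be illustrated with a toy update rule. This is a minimal sketch under stated assumptions, not the authors' implementation: the parameter names (`position`, `color`), the `SPATIAL_KEYS` set, and the plain gradient-descent step are all hypothetical. Spatial parameters receive only the LiDAR depth gradient, while appearance parameters still receive the photometric gradient.

```python
# Hypothetical parameter grouping; the names are illustrative assumptions.
SPATIAL_KEYS = {"position"}


def masked_update(params, grad_photo, grad_depth, lr=0.1):
    """Gradient step in which photometric gradients never reach spatial parameters."""
    updated = {}
    for name, value in params.items():
        if name in SPATIAL_KEYS:
            # Spatial parameters: depth supervision only (photometric gradient blocked).
            grad = grad_depth.get(name, 0.0)
        else:
            # Appearance parameters: photometric and depth gradients both apply.
            grad = grad_photo.get(name, 0.0) + grad_depth.get(name, 0.0)
        updated[name] = value - lr * grad
    return updated


params = {"position": 2.0, "color": 0.5}
grad_photo = {"position": 4.0, "color": 1.0}  # gradients of a photometric loss
grad_depth = {"position": 1.0}                # gradients of a LiDAR depth loss
new_params = masked_update(params, grad_photo, grad_depth)
```

In this toy case the position moves only by the depth gradient, so a large photometric gradient on `position` has no effect on the geometry. In an autodiff framework the same effect is usually obtained with a stop-gradient or detach operation on the spatial parameters in the photometric branch.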

2603.00654 2026-06-19 cs.CV Updated version 95%

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Songkai Wang, Huiliang Shen

Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; School of Automotive Studies, Tongji University; Thrust of Artificial Intelligence, Hong Kong University of Science and Technology

Topic match (Multi-Sensor Fusion): Proposes a collaborative perception framework for 4D radar and cameras that fuses multi-sensor information.

AI summary: Proposes RC-GeoCP, the first collaborative perception framework for 4D radar and cameras. A radar-anchored geometric consensus resolves the misalignment caused by depth ambiguity and spatial dispersion, yielding communication-efficient, globally consistent representations.

Comments 11 pages, 6 figures, 9 tables

Abstract

Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.

2606.20189 2026-06-19 cs.CV cs.AI cs.RO New submission 90%

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

Affiliations: KTH Royal Institute of Technology; Linköping University; TRATON AB; Qualcomm Auto Ltd Sweden Filial

Topic match (Multi-Sensor Fusion): Camera-to-LiDAR knowledge distillation that combines vision and LiDAR.

AI summary: Proposes the HilDA framework, which pre-trains LiDAR backbones in a self-supervised way by combining hierarchical distillation (multi-layer distillation and global context distillation) with a temporal occupancy diffusion objective, reaching state-of-the-art results on 3D detection, scene flow, and semantic occupancy prediction.

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

Abstract

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

2606.20300 2026-06-19 cs.CV New submission 85%

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

Junhao Cai, Deyu Zeng, Junhao Pang, Junyu Chen, Qiwei Liang, Xiaopin Zhong, Zongze Wu

Affiliations: Shenzhen University; Guangzhou Maritime University; Hong Kong University of Science and Technology (Guangzhou)

Topic match (Multi-Sensor Fusion): Fuses RGB and 3D geometric information for few-shot anomaly detection.

AI summary: Proposes CMDS-AD, a cross-modal dual-stream anomaly detection framework that generates diverse samples with a diffusion model and uses low-frequency normal estimation to help decouple high-frequency defects. In the 1-shot setting it improves I-AUROC on MVTec 3D-AD by 5.7%.

Comments Accepted to ECCV 2026!

Abstract

Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.
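The abstract mentions a multiplicative scoring mechanism without giving its form. One plausible reading, shown here purely as an assumption, is an element-wise product of per-pixel anomaly scores from the two modalities, so that a response present in only one modality is suppressed. The function name and the normalization of scores to [0, 1] are hypothetical.

```python
def multiplicative_score(rgb_scores, geo_scores):
    """Fuse per-pixel anomaly scores (each assumed in [0, 1]) by element-wise product."""
    if len(rgb_scores) != len(geo_scores):
        raise ValueError("score maps must have the same length")
    return [r * g for r, g in zip(rgb_scores, geo_scores)]


# Pixel 0: both modalities respond. Pixel 1: RGB-only response. Pixel 2: neither.
rgb = [0.9, 0.8, 0.1]
geo = [0.9, 0.1, 0.1]
fused = multiplicative_score(rgb, geo)
```

With these toy values the RGB-only response at pixel 1 drops from 0.8 to about 0.08, while the pixel supported by both modalities keeps a high score.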

2606.20044 2026-06-19 cs.CV New submission 85%

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

Affiliations: School of Cyber Science and Engineering, Xi'an Jiaotong University; School of Informatics, Xiamen University; National University of Singapore

Topic match (Multi-Sensor Fusion): Proposes the frequency-domain framework FUSE, which aligns multi-modal features to improve re-identification performance.

AI summary: Proposes the frequency-domain framework FUSE, which uses two stages, spectral disentanglement and energy alignment, to address the low-frequency bias in multi-modal re-identification, improving mAP by 9.1% across three datasets.

Comments Accepted in ICML 2026

Abstract

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low-, mid-, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.

2604.13240 2026-06-19 cs.CV cs.LG Updated version 85%

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

Affiliations: Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554; LTSER Zone Atelier Armorique; University of Würzburg, Center for Artificial Intelligence and Data Science

Topic match (Multi-Sensor Fusion): Fuses multispectral and LiDAR drone imagery, which counts as multi-sensor fusion.

AI summary: Proposes the first concept-based explainable AI approach for species distribution models, builds a landscape concept dataset from high-resolution multispectral and LiDAR drone imagery, and uses Robust TCAV to quantify the influence of landscape concepts on model predictions. A case study supports the approach.

Abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.
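For readers unfamiliar with TCAV, the score it reports can be sketched as follows. This shows the standard TCAV score (the fraction of inputs whose directional derivative along a concept activation vector is positive), not the Robust TCAV variant used in the paper, whose details the abstract does not give. The gradient vectors and the concept activation vector (`cav`) are assumed to be precomputed, and the numbers are made up for illustration.

```python
def tcav_score(gradients, cav):
    """Fraction of inputs whose directional derivative along the CAV is positive."""
    if not gradients:
        raise ValueError("need at least one gradient vector")
    positive = 0
    for grad in gradients:
        # Directional derivative of the class logit along the concept direction.
        directional_derivative = sum(g * v for g, v in zip(grad, cav))
        if directional_derivative > 0:
            positive += 1
    return positive / len(gradients)


# Hypothetical layer-activation gradients for four input patches.
cav = [1.0, 0.0]
gradients = [[0.5, 0.2], [0.1, -0.3], [-0.4, 0.9], [0.2, 0.0]]
score = tcav_score(gradients, cav)
```

A score near 1 suggests the concept pushes predictions toward the class for most inputs, and a score near 0.5 suggests no consistent influence.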

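2606.19929 2026-06-19 cs.RO New submission 80%

Motor Angular Speed Preintegration for Multirotor UAV State Estimation

Matěj Petrlík, Filip Novák, Robert Pěnička, Martin Saska

Affiliations: Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague

Topic match (Multi-Sensor Fusion): Fuses motor speeds with LiDAR to improve UAV state estimation.

AI summary: To counter the IMU accuracy loss caused by UAV vibration, the paper preintegrates accelerations derived from motor speeds, uses them in place of the IMU for state propagation, and builds a factor for factor graph optimization. Combined with LiDAR, this forms the MAS-LO algorithm, which improves position accuracy by 28% and velocity accuracy by 65% over LIO-SAM.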

Abstract

A precise state estimate is crucial for a tight feedback control that enables agile and near-obstacle flights of UAVs. The state-of-the-art methods fuse slow pose measurements with high-frequency inertial measurements to obtain a precise state estimate. However, the inertial measurements from the IMU onboard the UAV are degraded by vibrations from spinning propellers and the precision of the estimated state suffers. We propose a novel approach based on the preintegration of accelerations obtained from motor speeds. We show that the accelerations obtained in this manner can be used for state propagation on their own to achieve better precision without including the IMU. Further, we propose a factor composed of the preintegrated motor speeds that can be directly employed in factor graph optimization frameworks. We combine our factor with LiDAR measurements into the proposed Motor Angular Speed LiDAR Odometry (MAS-LO) algorithm for precise state estimation, which we open-source. Lastly, we evaluate the estimation precision against a state-of-the-art inertial algorithm LIO-SAM to show 28% improvement in position and 65% in velocity estimation accuracy, 14% lower measurement lag, and high robustness to wrong parameter values.
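The idea of deriving accelerations from motor speeds and preintegrating them can be sketched in one dimension. This is a heavily simplified illustration and not the MAS-LO formulation: it assumes a quadratic thrust model `F = k_f * w**2` per rotor, thrust along the body z-axis only, and a fixed attitude, so rotation, drag, gravity compensation, and bias terms are ignored. The thrust coefficient `k_f`, the mass, and the sample values are hypothetical.

```python
def thrust_acceleration(motor_speeds, k_f, mass):
    """Body-z specific force (m/s^2) from rotor speeds (rad/s), assuming F = k_f * w^2."""
    return k_f * sum(w * w for w in motor_speeds) / mass


def preintegrate(samples, dt, k_f, mass):
    """Accumulate body-z velocity and position increments between two keyframes."""
    delta_v, delta_p = 0.0, 0.0
    for motor_speeds in samples:
        a = thrust_acceleration(motor_speeds, k_f, mass)
        # Position increment uses the velocity accumulated before this sample.
        delta_p += delta_v * dt + 0.5 * a * dt * dt
        delta_v += a * dt
    return delta_v, delta_p


# 100 samples at 100 Hz with all four rotors at 500 rad/s (made-up values).
samples = [[500.0] * 4] * 100
delta_v, delta_p = preintegrate(samples, dt=0.01, k_f=1e-5, mass=1.0)
```

With these values the specific force is 10 m/s^2, so one second of samples accumulates a velocity increment of about 10 m/s and a position increment of about 5 m. In a factor graph, increments like these would constrain the relative motion between two consecutive states.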

2606.19874 2026-06-19 cs.RO cs.CV New submission 80%

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

Affiliations: HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; Aarhus University; University of Tokyo; Beijing University of Chemical Technology; North China Electric Power University

Topic match (Multi-Sensor Fusion): Visual SLAM that fuses point and line features.

AI summary: Proposes MMD-SLAM, which uses the Atlanta World assumption to guide a Multi-Meta Gaussian representation and improves the tracking accuracy and mapping quality of visual SLAM through point-line fusion, dominant-direction encoding, and a Gaussian evolution strategy.

Comments ICRA 2026

Abstract

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, compared with MonoGS, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica.
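The two metrics quoted in this abstract, ATE RMSE and PSNR, have standard definitions that can be sketched briefly. The code below is a generic illustration and not the paper's evaluation script: it assumes the estimated trajectory has already been aligned to the ground truth (real evaluations usually apply a rigid or similarity alignment first), and the trajectories and error values are made up.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position errors between two already-aligned trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_reduction(baseline, ours):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - ours) / baseline


ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
estimated = [(3.0, 4.0, 0.0), (1.0, 3.0, 4.0)]
rmse = ate_rmse(estimated, ground_truth)
quality_db = psnr(0.01)
reduction = relative_reduction(2.0, 1.0)
```

A statement such as "48.56% reduction in ATE RMSE" corresponds to `relative_reduction` applied to the baseline's and the method's RMSE values.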

2. Remote Sensing Fusion and Pansharpening (2 papers)

2606.20291 2026-06-19 cs.LG cs.CV New submission 90%

Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

Luke J. Zachmann, David D. Diaz, Vincent A. Landau, Chelsey Walden-Schreiner, Tony Chang, Nathan E. Rutenbeck, Katharyn A. Duffy, Kiarie Ndegwa, Andreas Gros, Scott Conway, Guy Bayes

Affiliations: Vibrant Planet Public Benefit Corporation

Topic match (Remote Sensing Fusion and Pansharpening): Fuses satellite imagery, lidar, and forest inventory data for mapping.

AI summary: Proposes the VibrantForests framework, which combines satellite imagery, lidar samples, and computer vision to map forest attributes such as canopy cover, canopy height, and biomass across the contiguous United States at 10-meter resolution, reducing saturation and regression-to-the-mean effects.

Abstract

Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.
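Two of the mapped attributes, basal area and quadratic mean diameter (QMD), follow standard forestry definitions that are easy to state in code. The sketch below computes them from a list of stem diameters at breast height. It illustrates the definitions only and is unrelated to how VibrantForests predicts these attributes from imagery; the diameters are made up.

```python
import math


def basal_area_m2(dbh_cm):
    """Cross-sectional stem area (m^2) for a diameter at breast height given in cm."""
    radius_m = dbh_cm / 200.0  # cm diameter -> m radius
    return math.pi * radius_m ** 2


def quadratic_mean_diameter(dbh_list_cm):
    """Quadratic mean diameter (cm): square root of the mean squared diameter."""
    if not dbh_list_cm:
        raise ValueError("need at least one tree")
    return math.sqrt(sum(d * d for d in dbh_list_cm) / len(dbh_list_cm))


dbh = [10.0, 20.0, 30.0]  # hypothetical stem diameters in cm
qmd = quadratic_mean_diameter(dbh)
total_basal_area = sum(basal_area_m2(d) for d in dbh)
```

QMD is the diameter of the tree with average basal area, so `basal_area_m2(qmd)` equals `total_basal_area / len(dbh)`. It weights large stems more heavily than the arithmetic mean diameter does.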

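2606.20032 2026-06-19 cs.CV New submission 90%

ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu

Affiliations: School of Computer Science and Technology, Tongji University; College of Surveying and Geo-Informatics, Tongji University

Topic match (Remote Sensing Fusion and Pansharpening): Open-vocabulary change detection for remote sensing that fuses semantic and spatial information.

AI summary: Proposes a training-free, reliability-aware open-vocabulary change detection framework. Semantic change reasoning and boundary-aware refinement address two problems: instance-level comparison overlooks fine-grained changes, and pixel-level comparison is unreliable. F1 improves by 2.13% to 9.75% across multiple datasets.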

Abstract

Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at https://github.com/Funny0101/ReA-OVCD.

3. Medical Image Fusion (6 papers)

2606.20143 2026-06-19 cs.CV New submission 90%

HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang, Moona Mazher, Abdul Qayyum, Yansong Bu, Mengye Lyu, Yue Lin, Mingyuan Meng, Chuanyi Huang, Lisheng Wang, Dalal Chamseddine, Shamimeh Ahrari, Beining Wu, Yifei Chen, Fuyou Mao, Hao Zhang, Baixiang Zhao, Surajit Ray, Muzi Guo, Lei Xiang, Jakob Dexl, Michael Ingrisch, Adrien Depeursinge, Arman Rahmim, Mathieu Hatt, Vincent Andrearczyk, Mohammad Yaqub

Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Amsterdam UMC; The Netherlands Cancer Institute; Radboud University Medical Centre; University College London; Imperial College London; Shenzhen Technology University; Shenzhen University; Newland Digital Technology; The University of Sydney; Shanghai Jiao Tong University; University Hospital, Nantes; Nantes Université, Centrale Nantes, CNRS, LS2N; Hangzhou Dianzi University; Tsinghua University; Central South University; University of Glasgow; China Mobile System Integration Co., Ltd.; Subtle Medical Inc.; University Hospital, LMU Munich; Munich Center for Machine Learning; BC Cancer Research Institute; HES-SO Valais-Wallis University of Applied Sciences and Arts; Lausanne University Hospital (CHUV); LaTIM, INSERM, UMR 1101, Univ Brest

Topic match (Medical Image Fusion): Multimodal PET/CT imaging for head and neck cancer segmentation, diagnosis, and prognosis.

AI summary: The HECKTOR 2025 challenge uses multimodal PET/CT and electronic health records to establish a benchmark for automated head and neck cancer analysis covering three tasks: tumor segmentation, recurrence prediction, and HPV classification. The best algorithms reached a Dice of 0.75, a C-index of 0.66, and a balanced accuracy of 0.56, respectively.

Comments 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: https://hecktor.grand-challenge.org/

Abstract

Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
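The three headline metrics in this abstract have standard definitions, sketched below in plain Python. These are generic textbook versions (Dice on binary masks, Harrell's concordance index, and balanced accuracy as mean per-class recall), and the challenge's official evaluation code may differ in details such as tie handling or per-structure aggregation. All inputs are toy values.

```python
def dice(pred, truth):
    """Dice similarity coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total


def concordance_index(times, events, risks):
    """Harrell's C-index: a higher risk should pair with a shorter time to event."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            # A pair is comparable when subject i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [k for k, y in enumerate(y_true) if y == label]
        recalls.append(sum(1 for k in indices if y_pred[k] == label) / len(indices))
    return sum(recalls) / len(recalls)


dsc = dice([1, 1, 0, 0], [1, 0, 1, 0])
c_index = concordance_index([2, 4, 6], [1, 1, 0], [0.9, 0.3, 0.5])
bal_acc = balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0])
```

For reference, a C-index of 0.5 and a balanced accuracy of 0.5 (in a two-class task) correspond to chance-level performance, which puts the reported 0.66 and 0.56 in context.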

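2606.20112 2026-06-19 cs.CV eess.IV New submission 85%

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

Affiliations: School of Computing and Information Systems, The University of Melbourne

Topic match (Medical Image Fusion): Generates 3D CT volumes, a medical image generation task.

AI summary: Proposes the Pixel-Level Residual Diffusion Transformer (PRDiT), which generates high-fidelity 3D CT volumes through two-stage training: a local MLP-based blind estimator separates low-frequency structures, and a global residual diffusion transformer models high-frequency residuals. It outperforms existing methods on the LIDC-IDRI and RAD-ChestCT datasets.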

Comments Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

Abstract

Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.

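2606.19966 2026-06-19 cs.CV cs.LG New submission 85%

Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

Yucheng Xing, Ling Huang, Pei Liu, Jingying Ma, Jiaqing Xu, Kai He, Mengling Feng

Affiliations: National University of Singapore; Imperial College London; Hunan University

Topic match (Medical Image Fusion): Semantic-anchored evidential fusion for whole-slide survival analysis.

AI summary: Proposes the SAEFS framework, which extracts semantic anchors via visual question answering and combines dual-stream evidence extraction with Dirichlet-based Subjective Logic for uncertainty modeling, enabling zero-shot cross-domain survival analysis and improving the average C-index by 10.2%.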

Abstract

Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods primarily focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, where SAEFS derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models both in prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.
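Dirichlet-based Subjective Logic, as commonly used in evidential deep learning, turns non-negative per-class evidence into belief masses plus an explicit uncertainty mass. The sketch below shows that standard mapping only. It is not the SAEFS model, it does not implement the cautious conjunction rule used for fusion, and the evidence values are made up.

```python
def dirichlet_opinion(evidence):
    """Map non-negative class evidence to belief masses and an uncertainty mass."""
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    # Dirichlet strength S = sum(alpha_k) with alpha_k = e_k + 1.
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


beliefs, uncertainty = dirichlet_opinion([8.0, 0.0])
vacuous_beliefs, vacuous_uncertainty = dirichlet_opinion([0.0, 0.0])
```

Belief masses and uncertainty always sum to 1. With no evidence at all the opinion is vacuous (uncertainty 1), which is how such a model can signal low confidence on inputs from an unseen domain.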

2606.19838 2026-06-19 cs.CV 新提交 85%

OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

OTCHA: 基于最优传输的置信度感知潜在中心对齐用于多视图医学图像分类

Jiwoong Yang, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University(汉阳大学) Hankuk University of Foreign Studies(韩国外国语大学)

专题命中 医学影像融合 :多视图医学图像分类,融合补丁令牌

AI总结 提出OTCHA模块,通过最优传输对齐多视图补丁令牌与共享潜在中心令牌,结合置信度门控和部分匹配,消除无关特征,提升多视图医学图像分类鲁棒性。

Comments Accepted at MICCAI 2026

AI中文摘要

多视图成像(如乳腺X线摄影和胸部X线摄影)是临床实践的标准组成部分。然而,医学图像通常未配准,且包含视图特定的伪影或无关背景线索,这些可能掩盖诊断相关发现。许多现有方法直接融合每个视图的表征,使得此类无关内容污染融合嵌入,并在不同视图配置下降低鲁棒性。我们提出OTCHA,一种基于最优传输(OT)的置信度感知潜在中心令牌对齐模块,在融合前细化补丁令牌以用于多视图分类。OTCHA引入一组跨视图共享的可学习潜在中心令牌。对于每个视图,我们计算补丁令牌与中心令牌之间的OT计划,该计划联合考虑特征相似性和几何结构,并通过令牌条件垃圾箱(dustbin)增强OT公式以实现部分匹配并丢弃无关令牌。所得传输计划提供令牌级匹配置信度,该置信度门控中心介导的消息传递,并加权一种新的基于最优传输的表征对齐损失以稳定细化。在三个多视图医学图像数据集上的实验表明,在不同解剖结构和视图配置下,相比竞争基线方法取得一致改进。我们的代码可在 https://github.com/labhai/OTCHA 获取。

英文摘要

Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at https://github.com/labhai/OTCHA.
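To make the idea of an OT plan with dustbins concrete, the sketch below runs entropic Sinkhorn iterations on a cost matrix augmented with one dustbin row and one dustbin column. It is an illustration under stated assumptions, not OTCHA itself: the dustbin cost is a fixed scalar here (the paper describes token-conditional dustbins), the marginals follow the common "each token carries unit mass" convention, and the names `sinkhorn_with_dustbin` and `match_confidence` are hypothetical.

```python
import math

def sinkhorn_with_dustbin(cost, dustbin_cost=1.0, eps=0.1, iters=200):
    """Entropic OT between n patch tokens (rows) and m hub tokens (columns).

    Returns an (n+1) x (m+1) plan; the last row and column are dustbins
    that absorb unmatched mass, which enables partial matching.
    """
    n, m = len(cost), len(cost[0])
    aug = [row[:] + [dustbin_cost] for row in cost]
    aug.append([dustbin_cost] * (m + 1))
    a = [1.0] * n + [float(m)]   # row marginals: unit mass per patch token
    b = [1.0] * m + [float(n)]   # column marginals: unit mass per hub token
    kernel = [[math.exp(-c / eps) for c in row] for row in aug]
    u = [1.0] * (n + 1)
    v = [1.0] * (m + 1)
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m + 1))
             for i in range(n + 1)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n + 1))
             for j in range(m + 1)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m + 1)]
            for i in range(n + 1)]

def match_confidence(plan):
    """Per-patch mass sent to real hubs (i.e. not to the dustbin column)."""
    return [sum(row[:-1]) for row in plan[:-1]]
```

A patch token whose cost to every hub exceeds the dustbin cost ends up with confidence near zero, which is the quantity a gating step could use to down-weight it.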

2606.19371 2026-06-19 cs.LG cs.AI cs.CV 新提交 85%

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

ProMUSE: 渐进式多模态不确定性引导的分阶段证据阿尔茨海默病分类

Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

发表机构 * Kennesaw State University(肯尼索州立大学) Michigan Technological University(密歇根理工大学) University of Iowa(爱荷华大学)

专题命中 医学影像融合 :利用多模态数据(临床、MRI、PET)进行AD分类,核心是多模态融合。

AI总结 提出ProMUSE,一种渐进式多模态不确定性引导的分阶段证据网络,通过自适应决定何时需要额外模态,在保持准确性的同时降低数据采集成本。

AI中文摘要

阿尔茨海默病(AD)是一种致命性疾病,会破坏老年人的记忆和认知能力。大多数AD治疗在早期阶段有效,导致对早期AD诊断的需求日益增加。AD诊断越来越依赖多模态数据,如临床评估、结构磁共振成像(MRI)和正电子发射断层扫描(PET)成像。然而,MRI和PET采集仍然昂贵且不易普及,使得全模态推理在现实临床工作流程中不切实际。我们提出ProMUSE,一种渐进式多模态不确定性引导的分阶段证据网络,该网络自适应地确定何时需要额外模态,有助于在保持准确性的同时降低数据采集的总体成本。ProMUSE首先使用低成本临床数据进行证据分类,并通过基于Dirichlet的主观逻辑模型量化不确定性。当不确定性超过学习阈值时,ProMUSE逐步引入MRI或PET特征,通过Dempster-Shafer理论融合模态层面的信念和不确定性,获得校准的多模态预测。这种分阶段采集策略能够在最小化对昂贵成像依赖的同时实现准确诊断。在ADNI、AIBL和OASIS数据集上针对CN-AD、CN-MCI和MCI-AD任务的实验表明,ProMUSE在减少50-90%的MRI/PET使用量的同时,实现了与全模态基线相当或更优的准确性,从而大幅节省成本。这些结果突显了ProMUSE作为现实世界AD筛查中一种实用、不确定性感知且资源高效的解决方案。

英文摘要

Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
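The staged acquisition loop described above can be sketched as follows. The combination rule is the reduced Dempster-Shafer rule often used for fusing subjective-logic opinions in multi-view evidential learning; whether ProMUSE uses exactly this form is an assumption. The threshold is a fixed argument here, whereas the abstract says it is learned, and `ds_combine` and `staged_predict` are hypothetical names.

```python
def ds_combine(b1, u1, b2, u2):
    """Fuse two opinions (beliefs, uncertainty) over the same classes."""
    k = len(b1)
    # Conflict: belief mass the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(k) for j in range(k) if i != j)
    scale = 1.0 - conflict
    if scale <= 0.0:
        raise ValueError("sources are in total conflict")
    beliefs = [(b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) / scale
               for i in range(k)]
    return beliefs, (u1 * u2) / scale

def staged_predict(stages, threshold):
    """stages: zero-argument callables ordered from cheap to expensive,
    each returning (beliefs, uncertainty). A further modality is acquired
    only while the fused uncertainty stays above the threshold."""
    beliefs, unc = stages[0]()
    used = 1
    for acquire in stages[1:]:
        if unc <= threshold:
            break
        b_new, u_new = acquire()
        beliefs, unc = ds_combine(beliefs, unc, b_new, u_new)
        used += 1
    return beliefs, unc, used
```

For example, passing `[clinical, mri, pet]` callables means MRI and PET are only requested for cases where the clinical-only opinion is too uncertain.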

2606.14957 2026-06-19 cs.CV 新提交 85%

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

学习用于多模态神经影像的稀疏潜在预测基础模型

Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

发表机构 * New York University, Center for Data Science(纽约大学数据科学中心) NYU Grossman School of Medicine, Department of Radiology(纽约大学格罗斯曼医学院放射学系) State University of New York at Binghamton, School of Computing(纽约州立大学宾汉姆顿分校计算机学院) NYU Grossman School of Medicine, Department of Neurology(纽约大学格罗斯曼医学院神经病学系) NYU Grossman School of Medicine, Department of Neurosurgery(纽约大学格罗斯曼医学院神经外科学系) NYU Grossman School of Medicine, Department of Pathology(纽约大学格罗斯曼医学院病理学系) School of Medicine, Department of Radiology, Stanford(斯坦福大学医学院放射学系) NYU Grossman School of Medicine, Department of Neuroscience(纽约大学格罗斯曼医学院神经科学系) NYU Grossman School of Medicine, Neuroscience Institute(纽约大学格罗斯曼医学院神经科学研究所)

专题命中 医学影像融合 :融合T1w、T2w和FLAIR三种MRI序列,学习统一表示

AI总结 提出Neuro-JEPA模型,结合潜在预测目标和专家混合架构,学习T1w、T2w和FLAIR三种MRI序列的统一表示,在25项临床任务和22项公开数据集任务上优于现有基础模型和CNN基线。

Comments Under Review Preprint

AI中文摘要

脑部MRI通常作为多个互补序列采集,具有独特的对比度加权,包括T1加权成像(T1w)解剖对比和液体敏感T2加权(T2w)对比。然而,在健康系统规模上,跨多种MRI对比机制学习统一表示的方法尚缺乏。在本研究中,我们引入了Neuro-JEPA,一种稀疏多模态神经影像基础模型,它结合了潜在预测目标和专家混合架构,以编码跨核心T1w、T2w和液体抑制FLAIR成像(FLAIR)的脑部MRI。我们进一步对架构、掩码、目标和稀疏性设计选择进行了系统的方法论研究,这些选择有利于稳健的神经影像多模态表示学习。Neuro-JEPA在428,647项研究的1,551,862次扫描上进行了预训练,这些扫描经过了模态特定的预处理和跨三种核心结构脑部MRI序列的数据整理。我们在临床和研究环境中评估了学习到的表示,包括来自三个健康系统(NYU Langone、NYU Long Island和Massachusetts General Hospital)的25项任务,以及来自12个公开数据集的22项任务,涵盖了单模态、多模态和跨域评估配置。在这些基准测试中,现有的神经影像基础模型相对于简单的卷积神经网络(CNN)基线显示出不一致的提升,而Neuro-JEPA在所有评估设置中实现了更强且更一致的性能。这些结果建立了一个可扩展的多模态神经影像表示学习方法论框架,并强调了基础模型评估协议需要包括简单基线、临床异质性队列和受控的多模态比较。

英文摘要

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighted imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems (NYU Langone, NYU Long Island, and Massachusetts General Hospital) and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

4. 音视频/视觉语言融合 12 篇

2606.19927 2026-06-19 cs.CV 新提交 90%

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

CARE: 面向视频多模态大语言模型的自适应推理长度的能力感知奖励塑形

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

专题命中 音视频/视觉语言融合 :视频多模态推理,涉及视觉与语言融合

AI总结 提出CARE框架,通过能力感知奖励塑形自适应优化推理长度,利用指数移动平均估计能力并分阶段调整奖励偏好,结合批次归一化和后验放大器提升效率与准确性。

AI中文摘要

在多模态视频推理中,基于强化学习的方法通常依赖简单且不灵活的推理长度控制策略,无法适应模型不断变化的能力。这种不匹配可能在早期阶段抑制必要的探索,而在模型变得更有能力后鼓励冗余推理和低效解码。本文提出CARE,一种用于多模态推理中自适应推理长度优化的能力感知奖励塑形框架。具体来说,CARE通过通过率的指数移动平均维护平滑的能力估计,并利用它将训练路由到渐进阶段,将奖励偏好从探索导向的长形式推理转向效率导向的简洁推理。为避免将冗长与内在任务复杂性混淆,CARE进一步使用批次级统计归一化推理努力,并引入后验放大器以增强对历史上困难样本上意外强性能的奖励信号。所提出的机制无缝集成到GRPO训练流程中,且不增加额外推理开销。在多个视频推理和通用视频理解基准上的大量实验表明,CARE持续提高推理准确性,稳定强化学习,并显著提升令牌效率。此外,CARE在训练过程中展现出推理长度的特征性倒U型轨迹,并在收敛时产生更短但信息更丰富的推理轨迹,表明推理预算的有效自适应分配。CARE框架和实验的源代码可在 https://github.com/1Pansy/Video-CARE 获取。

英文摘要

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
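The competence estimate and stage routing described above can be sketched in a few lines. This is an illustration only: the decay value, the stage boundaries, and the class name `CompetenceTracker` are hypothetical, and the batch-level normalization and posterior amplifier from the abstract are not modeled.

```python
class CompetenceTracker:
    """Smooth pass rates with an exponential moving average (EMA) and
    map the estimate to a training stage.

    Assumed stage meaning: 0 = exploration-oriented (long reasoning
    favored), 1 = transitional, 2 = efficiency-oriented (concise
    reasoning favored).
    """

    def __init__(self, decay=0.9, boundaries=(0.3, 0.7)):
        self.decay = decay
        self.boundaries = boundaries   # hypothetical stage thresholds
        self.estimate = None

    def update(self, pass_rate):
        if self.estimate is None:
            self.estimate = pass_rate  # initialize from the first batch
        else:
            self.estimate = (self.decay * self.estimate
                             + (1.0 - self.decay) * pass_rate)
        return self.estimate

    def stage(self):
        if self.estimate is None:
            return 0
        return sum(self.estimate >= b for b in self.boundaries)
```

Because the estimate is smoothed, a single unusually good or bad batch does not flip the stage, which is the point of using an EMA rather than the raw pass rate.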

2606.19882 2026-06-19 cs.CV cs.LG 新提交 90%

Multimodal Concept Bottleneck Models

多模态概念瓶颈模型

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

发表机构 * UC San Diego(加州大学圣地亚哥分校)

专题命中 音视频/视觉语言融合 :多模态概念瓶颈模型,对齐图像和文本嵌入

AI总结 提出多模态概念瓶颈模型(MM-CBM),利用双概念瓶颈层对齐图像和文本嵌入,实现可解释的零样本分类和图像检索,在四个基准上平均准确率提升高达51.26%。

Comments Presented at NeurIPS 2025 Mechanistic Interpretability Workshop

AI中文摘要

概念瓶颈模型(CBM)通过将图像提取的特征与自然概念对齐,增强了深度学习网络的可解释性。然而,现有的CBM在泛化到固定预定义类别集之外的能力以及非概念信息泄露的风险方面受到限制,其中预期概念之外的预测信号被无意中利用。在本文中,我们提出了多模态概念瓶颈模型(MM-CBM)来解决这些问题,并将CBM扩展到CLIP。MM-CBM利用双概念瓶颈层(CBL)将图像和文本嵌入对齐为可解释的特征。这使我们能够以可解释的方式执行新的视觉任务,如零样本分类或图像检索。与现有方法相比,MM-CBM在四个标准基准上平均准确率提升高达51.26%。我们的方法保持高准确率,在黑盒性能的约5%以内,同时提供更高的可解释性。

英文摘要

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are limited in their ability to generalize beyond a fixed set of predefined classes and are prone to non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose the Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs to CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

2603.10791 2026-06-19 eess.IV 版本更新 90%

Semantic Satellite Communications for Synchronized Audiovisual Reconstruction

面向同步视听重建的语义卫星通信

Fangyu Liu, Peiwen Jiang, Wenjin Wang, Xiao Li, Shi Jin

专题命中 音视频/视觉语言融合 :提出视听语义传输系统,实现跨模态生成与同步重建

AI总结 提出自适应多模态语义传输系统,通过双流生成架构和动态关键帧更新机制,在带宽受限的卫星场景下实现高质量同步视听重建,显著降低带宽消耗并提升鲁棒性。

AI中文摘要

卫星通信在支持高保真同步视听服务方面面临严重瓶颈,因为传统方案在信道波动、带宽有限和长传播延迟下难以处理跨模态一致性。为了解决这些问题,本文提出了一种针对卫星场景的自适应多模态语义传输系统,旨在带宽约束下实现高质量同步视听重建。与具有固定模态优先级的静态方案不同,我们的框架采用双流生成架构,可灵活切换视频驱动音频生成和音频驱动视频生成。这使得系统能够动态解耦语义,仅传输最重要的模态,同时利用跨模态生成恢复另一种模态。为了平衡重建质量和传输开销,动态关键帧更新机制根据无线场景和用户需求自适应维护共享知识库。此外,引入基于大语言模型的决策模块以增强系统适应性。通过集成卫星特定知识,该模块联合考虑任务需求和信道因素(如天气引起的衰落),主动调整传输路径和生成工作流。仿真结果表明,所提系统在实现高保真视听同步的同时显著降低带宽消耗,提高了挑战性卫星场景下的传输效率和鲁棒性。

英文摘要

Satellite communications face severe bottlenecks in supporting high-fidelity synchronized audiovisual services, as conventional schemes struggle with cross-modal coherence under fluctuating channel conditions, limited bandwidth, and long propagation delays. To address these limitations, this paper proposes an adaptive multimodal semantic transmission system tailored for satellite scenarios, aiming for high-quality synchronized audiovisual reconstruction under bandwidth constraints. Unlike static schemes with fixed modal priorities, our framework features a dual-stream generative architecture that flexibly switches between video-driven audio generation and audio-driven video generation. This allows the system to dynamically decouple semantics, transmitting only the most important modality while employing cross-modal generation to recover the other. To balance reconstruction quality and transmission overhead, a dynamic keyframe update mechanism adaptively maintains the shared knowledge base according to wireless scenarios and user requirements. Furthermore, a large language model based decision module is introduced to enhance system adaptability. By integrating satellite-specific knowledge, this module jointly considers task requirements and channel factors such as weather-induced fading to proactively adjust transmission paths and generation workflows. Simulation results demonstrate that the proposed system significantly reduces bandwidth consumption while achieving high-fidelity audiovisual synchronization, improving transmission efficiency and robustness in challenging satellite scenarios.

2606.20077 2026-06-19 cs.CV cs.AI 新提交 85%

The Hidden Evolution of Disguised Visual Context inside the VLM

VLM内部伪装视觉上下文的隐藏演化

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

发表机构 * Surrey Institute for People-Centred AI, University of Surrey(萨里大学以人为本人工智能研究所) Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心)

专题命中 音视频/视觉语言融合 :视觉语言模型中视觉令牌与语言空间的融合

AI总结 研究视觉语言模型中视觉令牌如何通过不同集成架构(上下文注入与逐层注入)转化为有意义表示,揭示其内部演化过程及对性能的影响。

AI中文摘要

视觉令牌作为原始的外部信号进入大语言模型(LLM)。它们如何被转化为有意义的表示并与语言空间交互完全取决于集成架构——无论是将视觉令牌视为输入序列中的上下文提示,还是直接注入到LLM的中间层。对于这些架构选择如何影响视觉信息及其内部转换以与LLM集成,目前仍缺乏受控比较和理解。我们通过在相同训练条件下评估上下文注入和逐层注入的VLM集成范式,在单图像、多图像和视频基准上进行公平比较。在此过程中,我们揭示了一个隐藏的演化:视觉令牌作为伪装的视觉上下文(缺乏语言结构的原始表示)进入LLM,但根据集成范式逐渐被重塑,每种范式捕捉视觉信号的不同频率特征。我们表明,LLM内部的这种演化决定了VLM能够有效利用哪些视觉特征、视觉表示如何与语言空间对齐,以及最终每种范式在不同任务上的表现。我们进一步证明,仅关注注意力分配是不够的,性能由每一层视觉表示的质量驱动。

英文摘要

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. A controlled comparison and understanding of how these architectural choices affect visual information and its internal transformation to integrate with the LLM remain underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.19944 2026-06-19 cs.CV 新提交 85%

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Timage: 一种用于微调视觉语言模型的文本嵌入图像生成范式

Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

发表机构 * Fudan University(复旦大学) Shenzhen University of Advanced Technology(深圳先进技术大学) Tencent Jarvis Lab(腾讯贾维斯实验室) Southern University of Science and Technology(南方科技大学)

专题命中 音视频/视觉语言融合 :文本嵌入图像,增强视觉语言模型空间推理。

AI总结 提出Timage范式,通过约束薛定谔桥将查询文本作为排版覆盖层嵌入图像,以显式空间锚点引导模型关注,在不侵蚀骨干能力前提下提升细粒度空间推理性能。

Comments ECCV

AI中文摘要

多模态大语言模型(MLLMs)在细粒度空间推理中常丢失正确图像区域,因为文本查询很少携带明确的几何锚点进入像素域。现有补救方法要么重新调整模型权重,要么用冗长指令填充提示,但都无法在不侵蚀骨干通用能力的情况下可靠地将语言定位到正确的视觉坐标。我们提出Timage,一种将多模态理解重新定义为输入层面对齐问题的范式:查询被绘制为排版覆盖层直接叠加在图像上。该覆盖层的放置和外观由约束薛定谔桥(cSB)生成,这是一种熵最优传输采样器,将布局合成分解为两个耦合的随机阶段。第一阶段——区域搜索,将噪声向查询对齐的图像区域传输,同时遵守硬遮挡屏障以保护显著前景内容;第二阶段——外观塑造,通过“墨水预算”正则化调整字形大小,使渲染文本保持可读和视觉平衡。生成的覆盖层作为显式注意力信标,引导模型沿空间语义聚焦。在VMCBench基准上,Timage搭配7B骨干模型明显超越更大的专有系统和参数调优基线。该研究将审慎的输入重构定位为一种强大的、架构中立的杠杆,以增强多模态推理。

英文摘要

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an "ink-budget" regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

2606.19915 2026-06-19 cs.CV 新提交 85%

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

SpatialSV: 通过任务导向的视觉监督在多模态大语言模型中内化可解释的3D空间感知

Jiayu Tang, Yuchen Zhou, Chao Gou

发表机构 * School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能工程学院)

专题命中 音视频/视觉语言融合 :将2D视觉特征提升为3D表示,多模态融合

AI总结 提出SpatialSV框架,通过任务导向的视觉监督将MLLM的2D特征提升为显式3D表示(深度图、相机姿态、点云),实现可解释的3D空间感知内化,无需外部工具,并在半监督设置中展现强泛化能力。

Comments Accepted by IJCAI 2026

AI中文摘要

解锁多模态大语言模型(MLLMs)的空间智能对于理解和与3D世界交互至关重要。当前主流方法通常通过外部工具注入空间先验,这会带来显著的推理开销,或依赖潜在特征蒸馏,后者缺乏可解释性和细粒度几何约束。为解决这些问题,我们提出SpatialSV,一个旨在将鲁棒的3D空间感知内化到MLLMs中,同时提供内在可解释性的框架。与被动特征模仿不同,SpatialSV采用任务导向的视觉监督,迫使模型主动将其2D视觉特征提升为显式3D表示,包括深度图、相机姿态和点云。关键的是,这个2D到3D的提升过程为模型的表示提供了一个透明窗口:生成的3D重建作为可视化和诊断模型内在空间知识质量的直观代理。跨多个模型和基准的广泛实验证明了SpatialSV在增强和解释MLLMs空间智能方面的有效性。此外,该框架在半监督设置中展现出强泛化能力,验证了其利用未标记视觉数据进行可扩展、可解释空间表示学习的潜力。

英文摘要

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.
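The explicit 3D representations named in the abstract are geometrically linked: a depth map plus camera intrinsics determines a camera-frame point cloud. The sketch below shows that standard pinhole back-projection. It is background geometry, not SpatialSV's learned lifting heads; the function name `unproject_depth` and the intrinsics arguments are illustrative assumptions.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows, metric depth per pixel)
    into camera-frame 3D points using a pinhole model.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, d in enumerate(row):        # u: pixel column index
            if d <= 0:
                continue
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))
    return points
```

Applying the camera pose (rotation and translation) to these points would place them in a shared world frame, which is what lets reconstructions from several views be inspected together.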

2606.19776 2026-06-19 cs.CV 新提交 85%

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Occ-VLM: 面向室内场景理解的占用接地视觉语言模型

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

发表机构 * School of Electronic Science and Engineering, Nanjing University(南京大学电子科学与工程学院)

专题命中 音视频/视觉语言融合 :占用接地视觉语言模型,融合3D与2D语义

AI总结 提出Occ-VLM,仅用姿态RGB图像和单一2D视觉编码器,通过重建3D占用作为几何先验,实现统一的3D场景理解,在占用预测、3D VQA和密集描述任务上达到领先水平。

AI中文摘要

近期,视觉语言模型(VLM)在3D场景理解方面取得了显著进展,推动了具身智能和机器人视觉等应用的发展。然而,现有方法通常要么直接依赖显式的3D输入(如点云或RGB-D序列),要么引入额外的3D几何编码器从2D图像中推导出3D感知的视觉标记。这种设计在结构上将3D几何感知与通过视觉语言预训练学到的丰富2D语义解耦,阻碍了统一3D视觉语言表示的发展。在这项工作中,我们提出了Occ-VLM,一个仅基于姿态RGB图像并采用单一2D视觉编码器的3D场景理解新框架。具体而言,Occ-VLM重建3D场景占用作为辅助几何先验,用于将前景2D标记与3D空间进行空间关联。然后,这些标记由大型语言模型(LLM)解码,实现统一的场景理解。大量实验表明,Occ-VLM实现了准确的几何感知和稳健的视觉语言推理:在多视角占用预测上达到最先进性能,同时在3D视觉问答(VQA)和3D密集描述基准上与使用3D输入的VLM表现相当。

英文摘要

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

2508.15228 2026-06-19 cs.CV 版本更新 85%

Collaborative Multi-Modal Coding for High-Quality 3D Generation

协作多模态编码用于高质量3D生成

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University, Singapore(南洋理工大学S实验室) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

专题命中 音视频/视觉语言融合 :协作多模态编码融合RGB、RGBD和点云特征。

AI总结 提出TriMM,首个前馈式3D原生生成模型,通过协作多模态编码融合RGB、RGBD和点云特征,结合辅助2D/3D监督和三平面潜在扩散模型,实现高质量3D资产生成。

AI中文摘要

3D内容本质上具有多模态特性,可投影到不同模态(如RGB图像、RGBD和点云)。每种模态在3D资产建模中表现出独特优势:RGB图像包含生动的3D纹理,而点云定义精细的3D几何。然而,现有大多数3D原生生成架构要么主要在单模态范式下运行——从而忽略了多模态数据的互补优势,要么局限于3D结构,从而限制了可用训练数据集的范围。为了全面利用多模态进行3D建模,我们提出了TriMM,这是第一个从基本多模态(如RGB、RGBD和点云)学习的前馈式3D原生生成模型。具体来说,1) TriMM首先引入协作多模态编码,该编码在保留各模态独特表示优势的同时整合模态特定特征。2) 此外,引入辅助2D和3D监督以提高多模态编码的鲁棒性和性能。3) 基于嵌入的多模态编码,TriMM采用三平面潜在扩散模型生成更高质量的3D资产,增强了纹理和几何细节。在多个知名数据集上的大量实验表明,TriMM通过有效利用多模态,尽管使用少量训练数据,仍能达到与在大规模数据集上训练的模型相竞争的性能。此外,我们在最近的RGB-D数据集上进行了额外实验,验证了将其他多模态数据集纳入3D生成的可行性。

英文摘要

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2508.04424 2026-06-19 cs.CV 版本更新 85%

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

组合对象检索:通过组合表达式进行对象级检索

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China(新一代人工智能技术与交叉应用教育部重点实验室,东南大学,江苏,中国) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE(穆罕默德·本·扎耶德人工智能大学(MBZUAI),阿布扎比,阿联酋)

专题命中 音视频/视觉语言融合 :组合对象检索结合视觉与文本,属于视觉语言融合

AI总结 提出组合对象检索(COR)任务,通过组合参考对象、掩码和检索文本进行对象级检索,并构建COR125K基准和CORE模型,显著优于现有方法。

AI中文摘要

基于用户意图检索细粒度视觉内容在多模态系统中仍然是一个挑战。尽管当前的组合图像检索(CIR)方法结合了参考图像和检索文本,但它们局限于图像级匹配,无法定位特定对象。为此,我们提出了组合对象检索(COR),一种新的对象级检索任务,从目标图像中的候选对象中检索目标对象,并用像素级掩码对检索结果进行定位。给定一个参考对象、其掩码、一个目标图像以及描述所需修改的检索文本,COR要求模型执行组合视觉-文本推理,而不是依赖显式的类别名称。这一设置带来了若干挑战,包括细粒度组合匹配、在视觉相似干扰物下的负对象过滤以及灵活的单对象或多对象检索。我们构建了COR125K,第一个大规模COR基准,包含408个类别的125,541个检索三元组,并划分基础/新类别以评估类别级泛化能力。我们还提出了CORE,一个统一的端到端模型,集成了参考区域编码、自适应视觉-文本交互和区域级对比学习,以将组合表示与目标对象对齐,同时抑制背景和干扰物。大量实验表明,CORE在基础和新类别上均显著优于现有的基于CIR的流程和强基线,为细粒度对象级多模态检索建立了一个简单而有效的基础。代码将在 https://github.com/wangtong627/COR 公开发布。

英文摘要

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

2606.20101 2026-06-19 cs.SD cs.AI cs.MM 新提交 80%

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

基于整流流的混合扩散Transformer用于指令引导音频编辑

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

发表机构 * Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Fisheries College, Ocean University of China(中国海洋大学水产学院) College of Information and Electrical Engineering, China Agricultural University(中国农业大学信息与电气工程学院)

专题命中 音视频/视觉语言融合 :指令引导音频编辑,涉及文本与音频融合

AI总结 提出混合两阶段扩散Transformer架构,通过粗到细策略平衡全局语义对齐与局部细节编辑,在重叠音频事件和复杂指令任务上提升性能与效率。

AI中文摘要

音频编辑旨在根据自然语言指令修改现有音频剪辑中的特定内容,同时保留其余声学内容。尽管扩散模型取得了显著进展,但现有的基于训练的编辑方法主要依赖于卷积U-Net骨干中的局部归纳偏差和交叉注意力交互,这通常阻碍了长程语义对齐以及对指令的精确理解和定位。相比之下,扩散Transformer提供了更强的全局建模和多模态融合,但现有的编辑架构通常采用MMDiT和DiT块的简单堆叠。在所有块中对拼接的音频和文本标记应用联合注意力会导致相对于标记长度的二次复杂度。为了平衡编辑性能和效率,我们提出了一种基于整流流匹配的混合两阶段扩散Transformer架构,用于指令引导音频编辑。它在低分辨率阶段对音频和文本标记进行联合注意力以建立粗略的语义对齐,然后在高分辨率阶段切换到交替的联合注意力和交叉注意力块以细化编辑细节。这种从粗到细的策略实现了高效且准确的指令引导音频编辑。实验表明,所提出的框架在涉及重叠音频事件和复杂指令的具有挑战性的编辑任务上取得了显著的性能提升,同时通过紧凑模型大幅提高了编辑效率。

英文摘要

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at the low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at the high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
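Rectified flow matching, which the abstract names as the training basis, regresses a velocity field along straight lines between a noise sample and a data sample. The sketch below shows the interpolation, the regression target, and Euler sampling. It assumes the common convention that `x0` is noise and `x1` is data (some implementations reverse it), works on plain lists instead of audio latents, and uses hypothetical function names; it says nothing about the paper's network architecture.

```python
def rf_training_pair(x0, x1, t):
    """Return (x_t, target_velocity) for rectified flow at time t in [0, 1].

    x_t is the linear interpolation between x0 and x1; the target
    velocity x1 - x0 is constant along the straight path.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def euler_sample(x0, velocity_fn, steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a model would be fit so that its output at `(x_t, t)` matches the target velocity; at inference, `velocity_fn` is that trained model, and straighter learned paths allow fewer Euler steps.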

2606.19985 2026-06-19 cs.CV new submission 80%

Vision-Reasoning-Guided Occlusion Removal from Light Fields

视觉推理引导的光场遮挡去除

Mohamed Youssef, Oliver Bimber

Affiliation: Johannes Kepler University

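Topic match (audio-visual / vision-language fusion): fuses light fields with vision-language models to remove occlusions and recover the scene.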

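AI summary: Proposes a framework that combines light field integration with a vision-language model, recovering occluded scenes through multi-view fusion and semantic priors; reports state-of-the-art performance on synthetic and real data.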

AI-generated Chinese abstract

遮挡鲁棒的场景恢复仍然是计算成像中的一个主要挑战,特别是在自然环境中,密集的前景植被严重限制了可见性。我们提出了一种视觉推理引导的光场遮挡去除框架,该框架结合了光场积分(LFI)的可见性恢复能力和视觉语言模型(VLM)的语义推理能力。首先通过LFI集成多视图观测以抑制前景遮挡,生成初始的可见性增强表示。然后,引入VLM作为条件语义先验,在观测测量的指导下恢复退化结构并恢复细节。为了提高恢复一致性并减少幻觉伪影,我们引入了一种多样本融合策略,将多个生成的假设聚合为统一的估计。在合成和真实世界数据集上的实验结果表明,该方法达到了最先进的性能,在四个合成光场基准场景(4-Syn)上取得了最高的平均SSIM,并在结构化和非结构化采集设置中表现出强大的泛化能力。这些结果凸显了将物理成像约束与视觉语言推理相结合在严重遮挡下实现鲁棒感知的有效性,可应用于搜索救援和探索性机器人导航。

English abstract

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.

2606.19950 2026-06-19 cs.CV cs.AI new submission 80%

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

多模态大语言模型的置信度校准:基于医学视觉问答的实证研究

Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long, Bingdi Chen, Qiang Zhu

Affiliations: College of Computer Science and Technology, Zhejiang University; School of Computer Science and Technology, Xidian University; Zhihui Medical Technology (Shanghai) Co., Ltd.

Topic match (audio-visual / vision-language fusion): confidence calibration of multimodal LLMs for medical visual question answering.

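AI summary: To address the mismatch between confidence and accuracy of multimodal LLMs on medical tasks, proposes combining Multi-Strategy Fusion-Based Interrogation with auxiliary expert LLM assessment; reduces the expected calibration error by an average of 40% across three medical VQA datasets, improving model reliability.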

Comments Accepted by MICCAI 2025

AI-generated Chinese abstract

多模态大语言模型(MLLMs)在医学任务中展现出巨大潜力,但其给出的置信度常常与实际准确性不一致,可能导致误诊或忽略正确建议。本研究首次全面分析了医学MLLMs中准确性与置信度之间的关系,并提出了一种新方法,将基于多策略融合的询问(MS-FBI)与辅助专家大语言模型评估相结合,旨在改善医学视觉问答(VQA)中的置信度校准。实验表明,我们的方法在三个医学VQA数据集上将期望校准误差(ECE)平均降低了40%,显著增强了MLLMs的可靠性。研究结果强调了领域特定校准对医疗领域MLLMs的重要性,为AI辅助诊断提供了更可信的解决方案。

English abstract

Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40% across three Medical VQA datasets, significantly enhancing MLLMs' reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.
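For readers unfamiliar with the metric, the Expected Calibration Error named in the abstract is commonly computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below follows that common definition; the function name, the choice of 10 equal-width bins, and the binning of boundary values are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of
    (bin size / n) * |accuracy(bin) - mean confidence(bin)|.

    confidences: floats in [0, 1]; correct: 0/1 (or bool) per prediction.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Per bin: [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Assumed convention: bins are [lo, hi), with 1.0 placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(bool(ok))

    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = n_correct / count
        mean_conf = conf_sum / count
        ece += (count / n) * abs(accuracy - mean_conf)
    return ece
```

As a usage example, four answers each given with confidence 0.9, of which three are correct, fall into one bin with accuracy 0.75 and mean confidence 0.9, giving an ECE of about 0.15.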

5. Fusion Architectures and Evaluation (2 papers)

2504.11171 2026-06-19 cs.CV cs.AI updated version 90%

TerraMind: Large-Scale Generative Multimodality for Earth Observation

TerraMind:面向地球观测的大规模生成式多模态模型

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

Affiliations: IBM Research – Europe; ETH Zurich; Forschungszentrum Jülich; European Space Agency; Φ-Lab; NASA IMPACT; University of Iceland

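Topic match (fusion architectures and evaluation): a multimodal Earth observation foundation model, which falls under fusion architectures.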

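AI summary: Proposes TerraMind, the first any-to-any generative multimodal foundation model for Earth observation; it is pretrained on dual-scale (token-level and pixel-level) representations, enables zero-shot and few-shot applications, introduces a "Thinking-in-Modalities" capability, and reports leading performance on benchmarks such as PANGAEA.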

Comments Accepted at ICCV'25

AI-generated Chinese abstract

我们提出了TerraMind,这是首个面向地球观测(EO)的任意到任意生成式多模态基础模型。与其他多模态模型不同,TerraMind在跨模态的双尺度表示(结合token级和像素级数据)上进行预训练。在token级别,TerraMind编码高层上下文信息以学习跨模态关系;在像素级别,TerraMind利用细粒度表示捕捉关键空间细节。我们在一个全球大规模数据集的九种地理空间模态上预训练了TerraMind。在本文中,我们证明:(i)TerraMind的双尺度早期融合方法为地球观测解锁了一系列零样本和少样本应用;(ii)TerraMind引入了“模态思考”(TiM)——在微调和推理过程中生成额外人工数据以改善模型输出的能力;(iii)TerraMind在PANGAEA等社区标准的地球观测基准上达到了超越现有最优的性能。预训练数据集、模型权重和我们的代码均在宽松许可下开源。

English abstract

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

2506.06952 2026-06-19 cs.CV updated version 85%

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

LaTtE-Flow: 基于层间时间步专家流的Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

Affiliations: University of Illinois Urbana-Champaign; University of Maryland; Nvidia; Salesforce AI Research; Intuit AI Research

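Topic match (fusion architectures and evaluation): unifies image understanding and generation, which falls under fusion architectures.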

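AI summary: Proposes LaTtE-Flow, an efficient unified architecture built on pretrained vision-language models that uses layerwise timestep experts and a timestep-conditioned residual attention mechanism to unify image understanding and generation, with around 6x faster generation.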

Comments Unified multimodal model, Flow-matching

AI-generated Chinese abstract

多模态基础模型在统一图像理解与生成方面取得了最新进展,为在单一框架内处理广泛的视觉-语言任务开辟了令人兴奋的途径。尽管取得了进展,现有的统一模型通常需要大量的预训练,并且与专门针对每项任务的模型相比,难以达到相同的性能水平。此外,许多这些模型存在图像生成速度慢的问题,限制了它们在实时或资源受限环境中的实际部署。在这项工作中,我们提出了基于层间时间步专家流的Transformer(LaTtE-Flow),一种新颖且高效的架构,可在单个多模态模型中统一图像理解与生成。LaTtE-Flow建立在强大的预训练视觉语言模型(VLM)之上,以继承强大的多模态理解能力,并通过新颖的层间时间步专家流架构扩展它们,以实现高效的图像生成。LaTtE-Flow将流匹配过程分布到专门的Transformer层组中,每组负责不同的时间步子集。这种设计通过在每个采样时间步仅激活一小部分层,显著提高了采样效率。为了进一步提升性能,我们提出了一种时间步条件残差注意力机制,用于跨层高效的信息重用。实验表明,LaTtE-Flow在多模态理解任务上取得了强劲的性能,同时与最近的统一多模态模型相比,实现了具有竞争力的图像生成质量,推理速度提高了约6倍。

English abstract

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
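The idea of activating only one group of layers per sampling timestep can be sketched as a simple routing function. This is a minimal illustration, not the LaTtE-Flow implementation: the function names are hypothetical, and both the uniform split of the timestep range [0, 1] and the use of contiguous, equally sized layer groups are assumptions, since the abstract only states that each group handles a distinct subset of timesteps.

```python
def layer_group_for_timestep(t: float, n_groups: int) -> int:
    """Map a timestep t in [0, 1] to a layer-group index,
    assuming the timestep range is split uniformly across groups."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return min(int(t * n_groups), n_groups - 1)


def active_layers(t: float, n_layers: int, n_groups: int) -> list[int]:
    """Indices of the layers that would run at timestep t,
    assuming contiguous groups of equal size."""
    if n_layers % n_groups != 0:
        raise ValueError("n_layers must be divisible by n_groups")
    per_group = n_layers // n_groups
    group = layer_group_for_timestep(t, n_groups)
    start = group * per_group
    return list(range(start, start + per_group))


if __name__ == "__main__":
    # Hypothetical configuration: 24 layers split into 4 groups.
    for t in (0.1, 0.4, 0.9):
        print(t, active_layers(t, n_layers=24, n_groups=4))
```

With this hypothetical configuration, each sampling step runs 6 of 24 layers, which illustrates why per-step cost can drop when only a subset of layers is active.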