arXivDaily arXiv每日学术速递 周一至周五更新

视觉与机器人

多模态信息融合

面向图像、视频、多传感器和跨模态感知的信息融合,包括 Image Fusion、红外可见光、遥感、医学影像、LiDAR/雷达/相机和音视频融合。

今日/当前日期收录 12 信号源:cs.CV, eess.IV, eess.SP, cs.RO, cs.MM
2606.19927 2026-06-19 cs.CV 新提交 90%

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

CARE: 面向视频多模态大语言模型的自适应推理长度的能力感知奖励塑形

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

专题命中 音视频/视觉语言融合 :视频多模态推理,涉及视觉与语言融合

AI总结 提出CARE框架,通过能力感知奖励塑形自适应优化推理长度,利用指数移动平均估计能力并分阶段调整奖励偏好,结合批次归一化和后验放大器提升效率与准确性。

详情
AI中文摘要

在多模态视频推理中,基于强化学习的方法通常依赖简单且不灵活的推理长度控制策略,无法适应模型不断变化的能力。这种不匹配可能在早期阶段抑制必要的探索,而在模型变得更有能力后鼓励冗余推理和低效解码。本文提出CARE,一种用于多模态推理中自适应推理长度优化的能力感知奖励塑形框架。具体来说,CARE通过通过率的指数移动平均维护平滑的能力估计,并利用它将训练路由到渐进阶段,将奖励偏好从探索导向的长形式推理转向效率导向的简洁推理。为避免将冗长与内在任务复杂性混淆,CARE进一步使用批次级统计归一化推理努力,并引入后验放大器以增强对历史上困难样本上意外强性能的奖励信号。所提出的机制无缝集成到GRPO训练流程中,且不增加额外推理开销。在多个视频推理和通用视频理解基准上的大量实验表明,CARE持续提高推理准确性,稳定强化学习,并显著提升令牌效率。此外,CARE在训练过程中展现出推理长度的特征性倒U型轨迹,并在收敛时产生更短但信息更丰富的推理轨迹,表明推理预算的有效自适应分配。我们在以下网址提供CARE框架和实验的源代码:此https URL。

英文摘要

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.

2606.19882 2026-06-19 cs.CV cs.LG 新提交 90%

Multimodal Concept Bottleneck Models

多模态概念瓶颈模型

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

发表机构 * UC San Diego(加州大学圣地亚哥分校)

专题命中 音视频/视觉语言融合 :多模态概念瓶颈模型,对齐图像和文本嵌入

AI总结 提出多模态概念瓶颈模型(MM-CBM),利用双概念瓶颈层对齐图像和文本嵌入,实现可解释的零样本分类和图像检索,在四个基准上平均准确率提升高达51.26%。

Comments Present at NeurIPS 2025 Mechanistic Interpretability Workshop

详情
AI中文摘要

概念瓶颈模型(CBM)通过将图像提取的特征与自然概念对齐,增强了深度学习网络的可解释性。然而,现有的CBM在泛化到固定预定义类别集之外的能力以及非概念信息泄露的风险方面受到限制,其中预期概念之外的预测信号被无意中利用。在本文中,我们提出了多模态概念瓶颈模型(MM-CBM)来解决这些问题,并将CBM扩展到CLIP。MM-CBM利用双概念瓶颈层(CBL)将图像和文本嵌入对齐为可解释的特征。这使我们能够以可解释的方式执行新的视觉任务,如零样本分类或图像检索。与现有方法相比,MM-CBM在四个标准基准上平均准确率提升高达51.26%。我们的方法保持高准确率,在黑盒性能的约5%以内,同时提供更高的可解释性。

英文摘要

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are constrained in their ability to generalize beyond a fixed set of predefined classes and the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

2606.20077 2026-06-19 cs.CV cs.AI 新提交 85%

The Hidden Evolution of Disguised Visual Context inside the VLM

VLM内部伪装视觉上下文的隐藏演化

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

发表机构 * Surrey Institute for People-Centred AI, University of Surrey(萨里大学以人为本人工智能研究所) Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心)

专题命中 音视频/视觉语言融合 :视觉语言模型中视觉令牌与语言空间的融合

AI总结 研究视觉语言模型中视觉令牌如何通过不同集成架构(上下文注入与逐层注入)转化为有意义表示,揭示其内部演化过程及对性能的影响。

详情
AI中文摘要

视觉令牌作为原始的外部信号进入大语言模型(LLM)。它们如何被转化为有意义的表示并与语言空间交互完全取决于集成架构——无论是将视觉令牌视为输入序列中的上下文提示,还是直接注入到LLM的中间层。对于这些架构选择如何影响视觉信息及其内部转换以与LLM集成,目前仍缺乏受控比较和理解。我们通过在相同训练条件下评估上下文注入和逐层注入的VLM集成范式,在单图像、多图像和视频基准上进行公平比较。在此过程中,我们揭示了一个隐藏的演化:视觉令牌作为伪装的视觉上下文(缺乏语言结构的原始表示)进入LLM,但根据集成范式逐渐被重塑,每种范式捕捉视觉信号的不同频率特征。我们表明,LLM内部的这种演化决定了VLM能够有效利用哪些视觉特征、视觉表示如何与语言空间对齐,以及最终每种范式在不同任务上的表现。我们进一步证明,仅关注注意力分配是不够的,性能由每一层视觉表示的质量驱动。

英文摘要

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture. Whether by treating visual tokens as in-context prompts within the input sequence or injecting them directly into the LLM's intermediate layers. A controlled comparison and understanding of how these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.19944 2026-06-19 cs.CV 新提交 85%

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Timage: 一种用于微调视觉语言模型的文本嵌入图像生成范式

Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

发表机构 * Fudan University(复旦大学) Shenzhen University of Advanced Technology(深圳先进技术大学) Tencent Jarvis Lab(腾讯贾维斯实验室) Southern University of Science and Technology(南方科技大学)

专题命中 音视频/视觉语言融合 :文本嵌入图像,增强视觉语言模型空间推理。

AI总结 提出Timage范式,通过约束薛定谔桥将查询文本作为排版覆盖层嵌入图像,以显式空间锚点引导模型关注,在不侵蚀骨干能力前提下提升细粒度空间推理性能。

Comments ECCV

详情
AI中文摘要

多模态大语言模型(MLLMs)在细粒度空间推理中常丢失正确图像区域,因为文本查询很少携带明确的几何锚点进入像素域。现有补救方法要么重新调整模型权重,要么用冗长指令填充提示,但都无法在不侵蚀骨干通用能力的情况下可靠地将语言定位到正确的视觉坐标。我们提出Timage,一种将多模态理解重新定义为输入层面对齐问题的范式:查询被绘制为排版覆盖层直接叠加在图像上。该覆盖层的放置和外观由约束薛定谔桥(cSB)生成,这是一种熵最优传输采样器,将布局合成分解为两个耦合的随机阶段。第一阶段——区域搜索,将噪声向查询对齐的图像区域传输,同时遵守硬遮挡屏障以保护显著前景内容;第二阶段——外观塑造,通过“墨水预算”正则化调整字形大小,使渲染文本保持可读和视觉平衡。生成的覆盖层作为显式注意力信标,引导模型沿空间语义聚焦。在VMCBench基准上,Timage搭配7B骨干模型明显超越更大的专有系统和参数调优基线。该研究将审慎的输入重构定位为一种强大的、架构中立的杠杆,以增强多模态推理。

英文摘要

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an ``ink-budget'' regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

2606.19915 2026-06-19 cs.CV 新提交 85%

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

SpatialSV: 通过任务导向的视觉监督在多模态大语言模型中内化可解释的3D空间感知

Jiayu Tang, Yuchen Zhou, Chao Gou

发表机构 * School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能工程学院)

专题命中 音视频/视觉语言融合 :将2D视觉特征提升为3D表示,多模态融合

AI总结 提出SpatialSV框架,通过任务导向的视觉监督将MLLM的2D特征提升为显式3D表示(深度图、相机姿态、点云),实现可解释的3D空间感知内化,无需外部工具,并在半监督设置中展现强泛化能力。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

解锁多模态大语言模型(MLLMs)的空间智能对于理解和与3D世界交互至关重要。当前主流方法通常通过外部工具注入空间先验,这会带来显著的推理开销,或依赖潜在特征蒸馏,后者缺乏可解释性和细粒度几何约束。为解决这些问题,我们提出SpatialSV,一个旨在将鲁棒的3D空间感知内化到MLLMs中,同时提供内在可解释性的框架。与被动特征模仿不同,SpatialSV采用任务导向的视觉监督,迫使模型主动将其2D视觉特征提升为显式3D表示,包括深度图、相机姿态和点云。关键的是,这个2D到3D的提升过程为模型的表示提供了一个透明窗口:生成的3D重建作为可视化和诊断模型内在空间知识质量的直观代理。跨多个模型和基准的广泛实验证明了SpatialSV在增强和解释MLLMs空间智能方面的有效性。此外,该框架在半监督设置中展现出强泛化能力,验证了其利用未标记视觉数据进行可扩展、可解释空间表示学习的潜力。

英文摘要

Unlocking the spatial intelligence of multimodal large language model (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.

2606.19776 2026-06-19 cs.CV 新提交 85%

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Occ-VLM: 面向室内场景理解的占用接地视觉语言模型

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

发表机构 * School of Electronic Science and Engineering, Nanjing University(南京大学电子科学与工程学院)

专题命中 音视频/视觉语言融合 :占用接地视觉语言模型,融合3D与2D语义

AI总结 提出Occ-VLM,仅用姿态RGB图像和单一2D视觉编码器,通过重建3D占用作为几何先验,实现统一的3D场景理解,在占用预测、3D VQA和密集描述任务上达到领先水平。

详情
AI中文摘要

近期,视觉语言模型(VLM)在3D场景理解方面取得了显著进展,推动了具身智能和机器人视觉等应用的发展。然而,现有方法通常要么直接依赖显式的3D输入(如点云或RGB-D序列),要么引入额外的3D几何编码器从2D图像中推导出3D感知的视觉标记。这种设计在结构上将3D几何感知与通过视觉语言预训练学到的丰富2D语义解耦,阻碍了统一3D视觉语言表示的发展。在这项工作中,我们提出了Occ-VLM,一个仅基于姿态RGB图像并采用单一2D视觉编码器的3D场景理解新框架。具体而言,Occ-VLM重建3D场景占用作为辅助几何先验,用于将前景2D标记与3D空间进行空间关联。然后,这些标记由大型语言模型(LLM)解码,实现统一的场景理解。大量实验表明,Occ-VLM实现了准确的几何感知和稳健的视觉语言推理:在多视角占用预测上达到最先进性能,同时在3D视觉问答(VQA)和3D密集描述基准上与使用3D输入的VLM表现相当。

英文摘要

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

2606.20101 2026-06-19 cs.SD cs.AI cs.MM 新提交 80%

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

基于整流流的混合扩散变压器用于指令引导音频编辑

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

发表机构 * Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Fisheries College, Ocean University of China(中国海洋大学水产学院) College of Information and Electrical Engineering, China Agricultural University(中国农业大学信息与电气工程学院)

专题命中 音视频/视觉语言融合 :指令引导音频编辑,涉及文本与音频融合

AI总结 提出混合两阶段扩散变压器架构,通过粗到细策略平衡全局语义对齐与局部细节编辑,在重叠音频事件和复杂指令任务上提升性能与效率。

详情
AI中文摘要

音频编辑旨在根据自然语言指令修改现有音频剪辑中的特定内容,同时保留其余声学内容。尽管扩散模型取得了显著进展,但现有的基于训练的编辑方法主要依赖于卷积U-Net骨干中的局部归纳偏差和交叉注意力交互,这通常阻碍了长程语义对齐以及对指令的精确理解和定位。相比之下,扩散变压器提供了更强的全局建模和多模态融合,但现有的编辑架构通常采用MMDiT和DiT块的简单堆叠。在所有块中对拼接的音频和文本标记应用联合注意力会导致相对于标记长度的二次复杂度。为了平衡编辑性能和效率,我们提出了一种基于整流流匹配的混合两阶段扩散变压器架构,用于指令引导音频编辑。它在低分辨率阶段对音频和文本标记进行联合注意力以建立粗略的语义对齐,然后在高分辨率阶段切换到交替的联合注意力和交叉注意力块以细化编辑细节。这种从粗到细的策略实现了高效且准确的指令引导音频编辑。实验表明,所提出的框架在涉及重叠音频事件和复杂指令的具有挑战性的编辑任务上取得了显著的性能提升,同时通过紧凑模型大幅提高了编辑效率。

英文摘要

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.

2606.19985 2026-06-19 cs.CV 新提交 80%

Vision-Reasoning-Guided Occlusion Removal from Light Fields

视觉推理引导的光场遮挡去除

Mohamed Youssef, Oliver Bimber

发表机构 * Johannes Kepler University(约翰·开普勒大学)

专题命中 音视频/视觉语言融合 :融合光场与视觉语言模型,去除遮挡恢复场景。

AI总结 提出结合光场积分与视觉语言模型的框架,通过多视图融合和语义先验恢复被遮挡场景,在合成和真实数据上取得最优性能。

详情
AI中文摘要

遮挡鲁棒的场景恢复仍然是计算成像中的一个主要挑战,特别是在自然环境中,密集的前景植被严重限制了可见性。我们提出了一种视觉推理引导的光场遮挡去除框架,该框架结合了光场积分(LFI)的可见性恢复能力和视觉语言模型(VLM)的语义推理能力。首先通过LFI集成多视图观测以抑制前景遮挡,生成初始的可见性增强表示。然后,引入VLM作为条件语义先验,在观测测量的指导下恢复退化结构并恢复细节。为了提高恢复一致性并减少幻觉伪影,我们引入了一种多样本融合策略,将多个生成的假设聚合为统一的估计。在合成和真实世界数据集上的实验结果表明,该方法达到了最先进的性能,在四个合成光场基准场景(4-Syn)上取得了最高的平均SSIM,并在结构化和非结构化采集设置中表现出强大的泛化能力。这些结果凸显了将物理成像约束与视觉语言推理相结合在严重遮挡下实现鲁棒感知的有效性,可应用于搜索救援和探索性机器人导航。

英文摘要

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.

2606.19950 2026-06-19 cs.CV cs.AI 新提交 80%

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

多模态大语言模型的置信度校准:基于医学视觉问答的实证研究

Yuetian Du, Yucheng Wang, Ming Kong, Tian Liang, Qiang Long, Bingdi Chen, Qiang Zhu

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Zhihui Medical Technology (Shanghai) Co., Ltd.(智汇医疗科技(上海)有限公司)

专题命中 音视频/视觉语言融合 :多模态LLM置信度校准,用于医学视觉问答。

AI总结 针对多模态大语言模型在医学任务中置信度与准确性不匹配的问题,提出结合多策略融合询问与专家大语言模型评估的方法,在三个医学VQA数据集上将期望校准误差平均降低40%,提升了模型可靠性。

Comments Accepted by MICCAI 2025

详情
AI中文摘要

多模态大语言模型(MLLMs)在医学任务中展现出巨大潜力,但其引发的置信度常常与实际准确性不一致,可能导致误诊或忽略正确建议。本研究首次全面分析了医学MLLMs中准确性与置信度之间的关系。提出了一种新方法,将多策略融合询问(MS-FBI)与辅助专家大语言模型评估相结合,旨在改善医学视觉问答(VQA)中的置信度校准。实验表明,我们的方法在三个医学VQA数据集上将期望校准误差(ECE)平均降低了40%,显著增强了MLLMs的可靠性。研究结果强调了领域特定校准对医疗领域MLLMs的重要性,为AI辅助诊断提供了更可信的解决方案。

英文摘要

Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40\% across three Medical VQA datasets, significantly enhancing MLLMs' reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.

2606.16615 2026-06-19 cs.CV 新提交 80%

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

SUP-MCRL:面向EEG视觉解码的感知主体统一伪特征编码多模态对比表示学习

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University(上海海事大学数字图像与智能计算实验室) Department of Language Science and Technology, The Hong Kong Polytechnic University(香港理工大学语言科学与技术系) Affiliated Lianyungang Hospital of Xuzhou Medical University(徐州医科大学附属连云港医院)

专题命中 音视频/视觉语言融合 :多模态对比学习融合EEG和视觉特征,用于视觉解码。

AI总结 提出SUP-MCRL框架,通过语义感知视觉编码器、统一EEG增强器和原型渐进增强器,解决多模态对比学习中语义一致性和主体选择性问题,在THINGS-EEG零样本任务上达到66.0%/91.9%的Top-1/Top-5准确率。

详情
AI中文摘要

非侵入式脑机接口在泛化到自然视觉体验时,神经视觉解码面临严重的保真度退化。传统的多模态对比表示学习仅优化几何距离对齐,忽略了语义一致性和主体选择性,导致虚假的零样本对齐。我们提出SUP-MCRL,一个统一框架,集成了三种协作机制:(1) 语义实体感知视觉编码器(SAVE),学习空间注意力以提取语义内容,无需预训练的显著性模型;(2) 统一EEG增强器(UEE),采用多尺度空洞卷积和频带间注意力实现自适应跨主体鲁棒性;(3) 基于原型的渐进增强器(PPA),维护一个EMA更新的伪特征池以防止表示崩溃。在THINGS-EEG上的零样本实验实现了66.0%/91.9%(Top-1/Top-5)的个体内准确率和24.0%/52.9%的LOSO准确率,超越了现有最先进方法。代码可在https://github.com/NZWANG/SUP-MCRL获取。

英文摘要

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on the THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.

2606.20083 2026-06-19 cs.CV 新提交 75%

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Holo-World: 视频世界模型的统一相机、物体和天气控制

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

发表机构 * University of Science and Technology of China(中国科学技术大学) Li Auto Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合国家科学中心人工智能研究院)

专题命中 音视频/视觉语言融合 :视频世界模型,联合控制相机、物体和天气

AI总结 提出Holo-World,一种从单张图像联合控制相机、物体运动和天气的统一视频世界模型,通过场景适配器和解耦CFG实现世界保持与天气迁移。

Comments Project Page: \url{https://xiangchenyin.github.io/Holo-World} Code: \url{https://github.com/XiangchenYin/Holo-World}

详情
AI中文摘要

视频世界模型正朝着在可控相机和物体运动下保持观察到的世界,同时允许其环境状态变化的方向发展。然而,这些控制仍然是孤立的,天气生成通常依赖于已经指定未来结构的源视频或重建场景。我们研究了一种基于第一帧锚定的源到状态设置,其中模型从单张图像开始,遵循明确的相机和物体控制以及可选的天气指令,然后生成一个视频,该视频要么保持源世界,要么将其转移到目标天气状态。为了解决这些挑战,我们首先构建了HoloStateData,一个状态视频数据集,将多样化的视频转换为用于相机、物体和天气监督的统一控制样本。其次,我们引入了Holo-World,一个统一的、可控制的视频世界模型,从单张图像联合控制场景。其统一场景适配器将世界保持和天气迁移分解为不同的参数子空间,使用渲染背景、几何缓冲区和物体控制来维持受控场景结构,同时建模依赖天气的外观和粒子效果。此外,场景-天气解耦CFG分别引导场景和天气残差,增强目标天气效果而不过度放大完整条件。定量和定性实验表明,Holo-World在保持精确的相机和物体控制以及一致场景结构的同时,将场景迁移到多样化的目标天气状态,在天气状态生成上优于视频到视频的天气编辑基线。我们的项目页面可在\url{this https URL}获取。

英文摘要

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather state, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at \url{https://xiangchenyin.github.io/Holo-World/}.

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM 新提交 70%

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror:在用于化妆迁移的扩散模型中改进面部属性保持

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

发表机构 * Amazon(亚马逊)

专题命中 音视频/视觉语言融合 :化妆迁移,涉及图像与文本条件融合

AI总结 提出MakeupMirror扩散模型,通过ControlNet几何条件、区域特定迁移控制、肤色调制和Langevin采样器,在保持面部特征和肤色的同时实现高质量化妆迁移,相比Stable-Makeup提升面部识别相似度60%、降低肤色差异50%。

详情
AI中文摘要

化妆迁移模型能够实现有趣的增强现实(AR)体验以及在线化妆购物的虚拟试妆(VTO)。尽管最近最先进的基于扩散的解决方案(如Stable-Makeup)显著提高了化妆迁移的准确性和逼真度,但在身份和肤色保持方面仍存在局限性,使得用于化妆购物的生产级VTO不切实际。在这项工作中,我们提出了MakeupMirror,一种基于扩散的化妆迁移方法,在保持面部特征和肤色方面取得了显著进展。我们在Stable-Makeup的基础上引入了多项技术创新:(1)将面部几何条件与ControlNets集成以保持面部保真度;(2)区域特定的化妆迁移控制,以便在面部区域(如皮肤、眼睛和嘴唇)实现精确的化妆应用;(3)基于肤色的化妆迁移调制,防止跨主体迁移场景中的肤色改变;(4)集成Levenberg-Marquardt Langevin采样器以加速推理同时保持生成质量。我们在CPM-Real、Makeup Wild以及(本文新收集的、更多样化的)MakeupSelfies数据集上的实验表明,与Stable-Makeup相比,MakeupMirror将相对面部识别相似度提高了+60%,将相对肤色差异降低了-50%,延迟为0.7秒,同时在核心面部身份保持标准上达到了94%的专家接受率。

英文摘要

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevent skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by +60%, reduces relative skin tone difference by -50% over Stable-Makeup, with a latency of 0.7s, while achieving expert acceptance rate of 94% across core facial identity preservation criteria.