Multimodal Information Fusion

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO updated version 95%

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA, :, Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun 
Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. 
Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

Affiliations: NVIDIA

Topic match (audio-visual / vision-language fusion): an omnimodal world model that jointly processes language, image, video, audio, and action.

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences. It sets a new state of the art on understanding and generation tasks and offers a scalable, general-purpose backbone for embodied agents.

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2606.19325 2026-06-18 cs.SD cs.AI cs.CV new submission 90%

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

Affiliations: Lightricks; Tel Aviv University

Topic match (audio-visual / vision-language fusion): generates multi-speaker audio scenes from multiple reference voices and a text prompt.

AI summary: Proposes ScenA, which conditions a pretrained text-to-audio flow-matching foundation model on multiple reference voices and a natural-language prompt to generate multi-speaker audio scenes. A high-noise-biased timestep distribution counters the reference shortcut, and the method outperforms existing systems on the CoVoMix2-Dialogue benchmark.

Comments: Project page at https://finmickey.github.io/scena/

Abstract

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
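
The abstract does not give the exact form of the high-noise-biased timestep distribution, so the following is only a minimal sketch of the idea under stated assumptions: timesteps lie in [0, 1], t = 1 is taken to be pure noise, and the bias is a power-law warp of a uniform draw. The function name `sample_high_noise_timestep` and the `bias` parameter are hypothetical and are not taken from the paper.

```python
import random


def sample_high_noise_timestep(rng: random.Random, bias: float = 3.0) -> float:
    """Draw t in [0, 1] with density bias * t**(bias - 1).

    Assumption: t = 1 corresponds to pure noise, so a larger `bias`
    pushes more training samples toward the high-noise end.
    """
    if bias < 1.0:
        raise ValueError("bias must be >= 1 (1.0 recovers the uniform schedule)")
    u = rng.random()
    # Inverse-CDF sampling: the CDF of the target density is t**bias.
    return u ** (1.0 / bias)


rng = random.Random(0)
uniform = [sample_high_noise_timestep(rng, bias=1.0) for _ in range(10_000)]
biased = [sample_high_noise_timestep(rng, bias=3.0) for _ in range(10_000)]
mean_uniform = sum(uniform) / len(uniform)  # close to 1/2 in expectation
mean_biased = sum(biased) / len(biased)     # close to 3/4 in expectation
```

With more draws near t = 1, the noisy target carries little acoustic information about the speaker, which is the condition under which the abstract says the model must fall back on the text prompt for speaker assignment.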

2606.19062 2026-06-18 cs.CV new submission 90%

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

Affiliations: Sejong University; Korea Advanced Institute of Science and Technology; Ulsan National Institute of Science and Technology

Topic match (audio-visual / vision-language fusion): proposes a dual-path vision-language model for cross-modal video retrieval.

AI summary: Proposes DREAM, which enhances and aligns dual-path representations by combining a hierarchical vision encoder with hybrid language modeling, setting a new state of the art on video retrieval.

Abstract

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.
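
The R1 scores quoted above are Recall@1 values for text-to-video retrieval. As a reminder of how such a metric is usually computed, here is a small sketch that assumes a square similarity matrix in which query i matches video i; the matrix values are made up for illustration and do not come from the paper.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose ground-truth video (same index) ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank video indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix: query 1 prefers video 2 over its true match.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

In this toy case two of the three queries retrieve their own video first, and all three do within the top two.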

2606.14702 2026-06-18 cs.CV new submission 90%

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

Affiliations: Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

Topic match (audio-visual / vision-language fusion): an audio-visual question answering dataset that requires reasoning over fused audio and visual modalities.

AI summary: Proposes the OmniVideo-100K dataset, built with entity-anchored video scripting and clue-guided QA generation to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA. Models fine-tuned on it improve markedly on several benchmarks.

Comments: Project page: https://github.com/MiG-NJU/OmniVideo-100K

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well (improvements of up to 12.64%) to established benchmarks such as Daily-Omni and JointAVBench.

2606.19341 2026-06-18 cs.CV cs.CL cs.SD new submission 85%

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

Topic match (audio-visual / vision-language fusion): an omni-modal agent that fuses audio-visual cues for video understanding.

AI summary: Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle. Active perception decouples reasoning complexity from video duration, and the agent achieves the best performance among open-source models on multiple benchmarks.

Comments: Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
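
The Observation-Thought-Action cycle with a persistent textual memory can be pictured with a toy loop. This is a hedged sketch only: `AgentState`, `run_agent`, `toy_policy`, and `toy_perceive` are hypothetical names, the "policy" is a hand-written rule rather than a trained model, and the canned observations stand in for real audio-visual perception.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    question: str
    # Persistent textual memory: only distilled notes are kept, not raw frames.
    memory: List[str] = field(default_factory=list)


def run_agent(question, policy, perceive, max_turns=8):
    """Iterate: the policy picks an action; observations are appended to memory."""
    state = AgentState(question)
    for _ in range(max_turns):
        action, arg = policy(state)
        if action == "answer":
            return arg, state.memory
        # action == "inspect": arg is a (start_s, end_s) span to look at on demand.
        state.memory.append(perceive(arg))
    return None, state.memory  # turn budget exhausted without an answer


def toy_perceive(span):
    # Canned text observations standing in for an audio-visual encoder.
    clips = {
        (0.0, 10.0): "0-10s: a person reads on a sofa",
        (10.0, 20.0): "10-20s: a doorbell rings; the person stands up",
    }
    return clips[span]


def toy_policy(state):
    # Hand-written rule: answer once the memory contains the relevant cue.
    if any("doorbell" in note for note in state.memory):
        return "answer", "The doorbell rang."
    start = 10.0 * len(state.memory)
    return "inspect", (start, start + 10.0)


answer, memory = run_agent("Why did the person stand up?", toy_policy, toy_perceive)
```

The context seen by the policy grows with the number of notes written, not with the length of the video, which is the decoupling the abstract describes.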

2606.18974 2026-06-18 cs.CV new submission 85%

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

Affiliations: Xi'an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering; Sun Yat-sen University

Topic match (audio-visual / vision-language fusion): cross-modal self-distillation transfers visual reasoning to a text model.

AI summary: Proposes Visual-OPSD, which uses cross-modal on-policy self-distillation to transfer the visual-thought reasoning produced by multi-step diffusion to a text-only student, yielding a 14.3× speedup and a 3.40 percentage point gain.

Abstract

Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by +3.40 pp with 14.3× speedup (10.0 s vs. 142.8 s per sample) and outperforms same-scale VLMs by +63.83 pp on VSP. A Gaussian-noise control (+0.40 pp vs. +10.28 pp for real VTs) and 58.4% closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
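
For readers unfamiliar with the loss named above, the sketch below computes a token-level Jensen-Shannon divergence between teacher and student next-token distributions. It shows only the standard JSD definition averaged over positions; the toy distributions are invented, and the paper's actual weighting, temperature, and on-policy sampling are not specified in the abstract and are not modeled here.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def token_level_jsd(teacher_probs, student_probs):
    """Mean JSD over token positions of one trajectory."""
    per_token = [jsd(t, s) for t, s in zip(teacher_probs, student_probs)]
    return sum(per_token) / len(per_token)


# Toy example: two positions over a 3-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # context includes privileged VTs
student = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]  # context is the question only
loss = token_level_jsd(teacher, student)
```

The reported speedup is consistent with the per-sample timings in the abstract: 142.8 s / 10.0 s ≈ 14.3.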

2606.18780 2026-06-18 cs.CV cs.CL cs.MM new submission 85%

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China

Topic match (audio-visual / vision-language fusion): augmentation for multimodal information extraction that fuses vision and language.

AI summary: Proposes SAMA, a semantic anchor-aligned augmentation framework. Structured semantic anchors guide a multi-expert multimodal large language model to generate high-fidelity text, an anchor-preserving diffusion mechanism synthesizes images, and a dual-constraint filtering module selects samples, substantially improving low-resource multimodal information extraction.

Comments: Accepted by IEEE Transactions on Multimedia

Abstract

Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.
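
The Dual-Constraint Filtering step can be read as keeping a synthetic sample only when it passes both checks. The sketch below illustrates that reading with a simple conjunction of thresholds; the `SyntheticSample` fields, the score ranges, and the threshold values are all assumptions made for illustration, since the abstract does not say how the two scores are computed or combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyntheticSample:
    text: str
    consistency: float      # hypothetical cross-modal consistency score in [0, 1]
    anchor_fidelity: float  # hypothetical share of semantic anchors preserved, in [0, 1]


def dual_constraint_filter(
    samples: List[SyntheticSample],
    min_consistency: float = 0.5,
    min_fidelity: float = 0.8,
) -> List[SyntheticSample]:
    """Keep samples that satisfy BOTH constraints (thresholds are illustrative)."""
    return [
        s
        for s in samples
        if s.consistency >= min_consistency and s.anchor_fidelity >= min_fidelity
    ]


candidates = [
    SyntheticSample("caption A", consistency=0.9, anchor_fidelity=0.95),
    SyntheticSample("caption B", consistency=0.9, anchor_fidelity=0.40),  # anchors lost
    SyntheticSample("caption C", consistency=0.2, anchor_fidelity=0.90),  # image/text mismatch
]
kept = dual_constraint_filter(candidates)
```

Requiring both constraints rejects samples that look well aligned but have dropped the labeled entities, as well as samples that keep the entities but pair them with an unrelated image.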

2606.18586 2026-06-18 cs.CV cs.AI new submission 85%

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

Affiliations: Northwestern University; Dolby Laboratories

Topic match (audio-visual / vision-language fusion): proposes APTs to represent causal state changes in video for video-language understanding.

AI summary: Proposes Atomic Physical Transitions (APTs) as an explicit representation of causal state changes in video, builds a mixed-source dataset, and introduces the APT-Tune fine-tuning recipe so that VLMs learn physical transitions without forgetting event-level knowledge.

Abstract

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

2606.18553 2026-06-18 cs.CV new submission 85%

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

Topic match (audio-visual / vision-language fusion): hierarchical multi-modal retrieval that augments news image captioning by fusing vision and text.

AI summary: Proposes an image captioning framework augmented by hierarchical multi-modal article retrieval. Structure-aware retrieval and contextual refinement, combined with a VLM and an LLM, produce captions rich in contextual detail; the system placed 5th in the EVENTA 2025 challenge.

Comments: SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content--visual, visual--visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.18472 2026-06-18 cs.CV new submission 85%

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Sneha Paul, Zachary Patterson, Nizar Bouguila

Affiliations: Concordia University

Topic match (audio-visual / vision-language fusion): domain generalization for 3D vision-language models that fuse point cloud, visual, and text modalities.

AI summary: Proposes ReFine3D, which improves domain generalization of 3D large multimodal models through regularization strategies including selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.

Comments: Accepted at Transactions on Machine Learning Research (TMLR)

Abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
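
One plausible reading of "test-time augmentation with confidence-based aggregation" is to weight each augmented view's class probabilities by how confident that view is. The sketch below uses the maximum class probability as the confidence weight; this choice, the function name `aggregate_by_confidence`, and the toy numbers are assumptions, not details from the paper.

```python
def aggregate_by_confidence(view_probs):
    """Combine per-view class probabilities, weighting each view by its max probability.

    view_probs[v][c] is the probability of class c predicted from augmented view v.
    Returns a single normalized probability vector.
    """
    weights = [max(p) for p in view_probs]  # assumed confidence measure
    total = sum(weights)
    n_classes = len(view_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, view_probs)) / total
        for c in range(n_classes)
    ]


# Three augmented views of one point cloud: one confident, two uncertain.
views = [
    [0.90, 0.05, 0.05],
    [0.40, 0.50, 0.10],
    [0.34, 0.33, 0.33],
]
fused = aggregate_by_confidence(views)
predicted = max(range(len(fused)), key=lambda c: fused[c])
```

Compared with a plain average, this scheme lets a confident view outweigh views degraded by an unlucky augmentation, at the cost of trusting overconfident predictions more.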

2605.26672 2026-06-18 cs.MM cs.SD updated version 85%

Can We Hear from Events? Generating Speech from Event Camera

Jingping Fang, Lin Chen, Chenyang Xu, Tong Zhao, Weidong Cai, Xiaoming Chen

Affiliations: Beijing Technology and Business University; Xidian University; Tongji University; University of Sydney

Topic match (audio-visual / vision-language fusion): generates speech from an event camera, bridging visual and auditory modalities.

AI summary: Proposes EventSpeech, which exploits the high temporal precision of neuromorphic event cameras to resolve the temporal granularity mismatch in conventional RGB-based speech generation, producing emotionally expressive speech that resists motion blur.

Abstract

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

2601.13836 2026-06-18 cs.CL cs.CV cs.MM updated version 85%

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

FutureOmni：从全模态上下文中评估多模态大语言模型的未来预测能力

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

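Affiliations: Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

Topic match (audio-visual / vision-language fusion): evaluates how well multimodal LLMs forecast future events from audio-visual cues

AI summary: Proposes the FutureOmni benchmark for evaluating how well multimodal LLMs forecast future events from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.

Comments: Accepted by ICML 2026

AI Chinese abstract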

尽管多模态大语言模型（MLLMs）展现出强大的全模态感知能力，但它们从音视频线索预测未来事件的能力仍未被充分探索，因为现有基准主要关注回顾性理解。为弥补这一差距，我们引入了FutureOmni，这是第一个旨在从音视频环境中评估全模态未来预测的基准。评估模型需要执行跨模态因果和时间推理，并有效利用内部知识预测未来事件。FutureOmni通过可扩展的LLM辅助、人在回路流水线构建，包含8个主要领域的919个视频和1034个多项选择问答对。对13个全模态和7个仅视频模型的评估表明，当前系统在音视频未来预测方面存在困难，尤其是在语音密集场景中，Gemini 3 Flash达到最佳准确率64.8%。为缓解这一局限，我们整理了一个7K样本的指令微调数据集，并提出全模态未来预测（OFF）训练策略。在FutureOmni以及流行的音视频和仅视频基准上的评估表明，OFF增强了未来预测和泛化能力。我们公开发布所有代码（https://github.com/OpenMOSS/FutureOmni）和数据集（https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni）。

English abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2606.18354 2026-06-18 eess.IV cs.LG new submission 90%

Structural MRI Synthesis for Alzheimer's Disease via Conditional Diffusion on Anatomical Masks

基于解剖掩膜条件扩散的阿尔茨海默病结构MRI合成

Muge Zhang, Muhammad Ali Khaliq, Jamal Alsakran, Byeong Kil Lee, Jeeho Ryoo

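Affiliations: Fairleigh Dickinson University; University of Colorado at Colorado Springs

Topic match (medical image fusion): a conditional diffusion model generates 3D structural MRI conditioned on anatomical masks

AI summary: To address the difficulty of capturing subtle anatomical changes when synthesizing structural MRI for Alzheimer's disease, the paper extends the Med-DDPM conditional diffusion model to generate 3D structural MRI conditioned on anatomical segmentation masks. Experiments show that segmentation models trained on synthetic data reach Dice scores comparable to those trained on real data, and that training on hybrid data improves performance significantly.

Journal ref: 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR)

DOI: 10.1109/MIPR67560.2025.00037

AI Chinese abstract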

生成式机器学习模型的最新进展显著改善了医学成像，为数据增强、隐私保护和模型泛化提供了有前景的解决方案。然而，由于神经退行性病变相关的细微、区域特异性和渐进性解剖变化，合成阿尔茨海默病（AD）的高质量结构MRI数据仍然具有挑战性。在本文中，我们将最初为脑肿瘤合成设计的Med-DDPM条件扩散模型扩展，以生成专门针对AD的3D结构MRI。我们采用Med-DDPM，因为与其他生成模型相比，它具有稳定的结构和保真度，特别适合捕捉AD特征的细微解剖变化。我们的方法以来自ADNI数据集的解剖分割掩膜为条件，将关键的AD相关脑结构纳入生成过程。我们通过在真实、合成和混合数据集上训练分割模型，系统评估了合成图像的质量和实用性。实验结果表明，仅在合成数据上训练的分割模型达到了与真实数据训练（0.6513）相当的Dice分数（0.6532），同时召回率显著提高。值得注意的是，在混合数据集（混合真实和合成图像）上训练的模型优于真实和纯合成基线，Dice分数达到0.7244。这些发现强调了条件扩散模型在生成解剖准确、AD特异性合成MRI方面的成功应用，并突出了它们在增强训练数据可用性、提高诊断准确性和促进神经影像研究可重复性方面的潜力。

English abstract

Recent advances in generative machine learning models have significantly improved medical imaging, offering promising solutions for data augmentation, privacy preservation, and improved model generalization. However, synthesizing high-quality structural MRI data for Alzheimer's Disease (AD) remains challenging due to the subtle, region-specific, and progressive anatomical changes associated with neurodegeneration. In this paper, we extend the Med-DDPM conditional diffusion model -- originally designed for brain tumor synthesis -- to generate 3D structural MRIs specifically tailored to AD. We adopted Med-DDPM due to its established stability and structural fidelity compared to other generative models, which makes it particularly suitable for capturing the subtle anatomical changes characteristic of AD. Our approach conditions the diffusion process on anatomical segmentation masks derived from the ADNI dataset, incorporating key AD-relevant brain structures into the generation process. We systematically evaluate the quality and utility of the synthetic images by training segmentation models on real, synthetic, and hybrid (mixed) datasets. Experimental results demonstrate that segmentation models trained exclusively on synthetic data achieve comparable Dice scores (0.6532) to those trained on real data (0.6513), while exhibiting significantly enhanced recall. Notably, models trained on hybrid datasets (mixing real and synthetic images) outperform both real and synthetic-only baselines, achieving a Dice score of 0.7244. These findings underscore the successful use of conditional diffusion models for generating anatomically accurate, AD-specific synthetic MRIs, and highlight their potential for enhancing training data availability, improving diagnostic accuracy, and promoting research reproducibility in neuroimaging studies.

2606.18825 2026-06-18 cs.CV new submission 90%

DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

DreamReg：基于信念驱动的世界模型用于2D-3D超声配准

Luoyao Kang, Yuelin Zhang, Jiwei Shan, Haifan Gong, Qingpeng Ding, Shing Shin Cheng

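Affiliations: T Stone Robotics Institute, The Chinese University of Hong Kong; Multi-scale Medical Robotics Center; Perelman School of Medicine, University of Pennsylvania

Topic match (medical image fusion): 2D-3D ultrasound registration that fuses intraoperative 2D slices with preoperative 3D volumes

AI summary: Proposes the DreamReg framework, which models 2D-3D ultrasound registration as belief updating, uses a world model to simulate probe motion and integrate the imagined outcomes, and achieves robust and accurate real-time registration on the CAMUS and u-RegPro datasets.

AI Chinese abstract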

超声（US）广泛应用于手术导航，但由于部分可观测性、散斑噪声以及依赖于动作的US采集，术中2D切片与术前3D体积之间的实时配准仍然具有挑战性。现有方法是一次性的或短视的，难以随时间收集证据或捕捉外科医生如何根据屏幕反馈调整探头运动。我们提出DreamReg，一个基于信念驱动的世界模型框架，将2D-3D配准形式化为对刚性变换的信念更新。DreamReg维护一个潜在信念状态，总结过去的观测和位姿信息，并在新切片到达时通过学习到的动态不断细化变换。在训练期间，DreamReg暴露于模拟临床扫描行为的探头运动轨迹，并通过将位姿细化条件于当前US观测来学习更新其信念。在推理期间，DreamReg通过内部想象来细化配准：它展开学习到的世界模型以模拟候选探头运动及其预测的观测，并整合这些想象的结果以收敛到准确的刚性变换。在CAMUS和u-RegPro数据集上的实验表明，与最先进方法相比，DreamReg在实时引导中具有改进的鲁棒性和有竞争力的配准精度。

English abstract

Ultrasound (US) is widely used for surgical navigation, yet real-time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and the action-dependent US acquisition. Existing methods are one-shot or short-horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on-screen feedback. We propose DreamReg, a belief-driven world-model framework that formulates 2D-3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and pose information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe-motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on CAMUS and u-RegPro datasets demonstrate improved robustness and competitive registration accuracy for real-time guidance compared with state-of-the-art methods.

2606.18723 2026-06-18 cs.CV cs.LG new submission 90%

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

临床对齐的几何约束用于鲁棒的IVUS血管边界分割

Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan, Deval Mehta, Andrew Lin, Derek Chew, Masasi Fujino, Julie Butters, Stephen Nicholls, Zongyuan Ge, Kyung Hoon Cho

Affiliations: AIM For Health Lab, Monash University; Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University; Monash University Victorian Heart Institute; School of Computing Technologies, RMIT University; National Cerebral and Cardiovascular Center; Department of Cardiology, Chonnam National University Hospital and Medical School

Topic match (medical image fusion): IVUS vessel boundary segmentation that combines dual encoders with geometric constraints

AI summary: Proposes the GeoCat network, which uses dual encoders and a differentiable geometry consistency loss to reduce boundary drift and topology errors in IVUS segmentation and to improve the accuracy of clinical geometric measurements.

Comments: Accepted at MICCAI 2026

AI Chinese abstract

血管内超声（IVUS）管腔和外弹性膜（EEM）分割对于定量评估冠状动脉斑块负荷至关重要。管腔或EEM勾画的误差会直接传播到斑块面积、斑块负荷和几何测量中。然而，优先考虑重叠分数的标准方法常常遭受边界漂移和拓扑错误，导致临床测量不准确。我们提出GeoCat，一个几何一致性网络，使用双笛卡尔-极坐标编码器，结合跨域注意力和时间融合，处理5帧IVUS片段。可微的几何一致性损失直接监督临床相关描述符，包括直径、方向和横截面积。该模型在来自146名患者的12,242张标注帧上训练，这些帧使用两种商用IVUS系统采集。我们使用分割准确性和斑块相关临床指标评估性能，包括Dice/IoU、边界测量（95HD（mm）、ASSD）、拓扑违规率和临床几何误差（dmax/dmin、角度和面积）。在我们的数据集上，GeoCat实现了0.93的Dice，将95HD降低到0.14 mm，并将拓扑违规率降低到1.0%。重要的是，它显著提高了几何保真度，产生0.13-0.16 mm的直径误差和约8度的角度误差，支持可靠的斑块负荷量化。

English abstract

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD (mm), ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.
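
The overlap metrics named in this abstract (Dice and IoU) can be illustrated with a minimal Python sketch. It assumes binary masks flattened into equal-length sequences of 0/1 values. The function name `dice_and_iou` and the convention of returning 1.0 for two empty masks are assumptions made for this illustration and are not taken from GeoCat.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x2 masks, flattened row by row (made-up values).
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

Both values depend only on how many pixels overlap, which is consistent with the abstract's point that overlap scores alone can hide boundary drift and topology errors.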

2606.18523 2026-06-18 q-bio.QM cs.CV new submission 85%

DART: A design-aware microfluidic chip paradigm for real-time live-cell image analysis

DART: 一种设计感知的微流控芯片范式用于实时活细胞图像分析

Johannes Seiffarth, Matthias Pesch, Lukas Scholtes, Dietrich Kohlheyer, Hanno Scharr, Katharina Nöh

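Affiliations: Institute for Bio- and Geosciences, IBG-1: Biotechnology; Computational Systems Biotechnology (AVT.CSB), RWTH Aachen University; Institute for Advanced Simulation, IAS-8: Data Analytics and Machine Learning

Topic match (medical image fusion): aligns the CAD blueprint with the physical chip to enable real-time live-cell image analysis

AI summary: Proposes the DART paradigm, which aligns the CAD blueprint with the physical chip through embedded markers and deep-learning-based detection, enabling fast localization of all regions of interest and fully automated image processing on high-throughput microfluidic chips, with support for real-time analysis.

AI Chinese abstract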

高通量微流控活细胞成像产生丰富的单细胞数据。然而，用于定位每个包含一个细胞群体的感兴趣区域（RoI）并从记录图像中移除周围微流控结构的半自动化流程随RoI数量扩展，这阻碍了实时图像分析并将洞察时间延迟数小时至数天。我们提出了用于微流控培养芯片的设计感知和实时能力（DART）范式，该范式将CAD蓝图与物理芯片对齐，从而实现了对所有RoI的通量无关定位以及跨不同RoI几何形状和芯片布局的全自动图像处理。DART通过嵌入式基准标记和基于深度学习的标记检测建立这种对齐。我们使用瑞士军刀芯片验证DART，该芯片在1164个RoI位置上组合了八种结构不同的RoI设计。DART在五分钟内定位所有RoI，在40毫秒内从原始显微镜图像中移除微流控结构，并在每张图像1.1秒内执行全自动图像分析，包括细胞分割。这些能力共同使DART成为一个端到端的硬件-软件范式，具有实时分析能力，为闭环和结果驱动的智能显微镜铺平了道路。

English abstract

High-throughput microfluidic live-cell imaging generates rich single-cell data. Yet semi-automated procedures for locating regions of interest (RoIs), each containing one cell population, and removing surrounding microfluidic structures from recorded images, scale with the number of RoIs. This prevents real-time image analysis and delays time-to-insight by hours to days. We introduce the Design-Aware and Real-Time capable (DART) paradigm for microfluidic cultivation chips, which aligns the CAD blueprint with the physical chip and thereby enables throughput-independent localization of all RoIs and fully automated image processing across diverse RoI geometries and chip layouts. DART establishes this alignment through embedded fiducial markers and deep-learning-based marker detection. We validate DART using the Swiss Army Knife chip, which combines eight structurally distinct RoI designs across 1164 RoI locations. DART localizes all RoIs in five minutes, removes microfluidic structures from raw microscopy images in 40 ms, and performs fully automated image analysis, including cell segmentation, in under 1.1 s per image. Together, these capabilities establish DART as an end-to-end hardware-software paradigm with real-time-capable analysis that paves the way toward closed-loop and outcome-driven smart microscopy.

2606.18886 2026-06-18 cs.CV new submission 85%

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

DINO-Med3D：通过渐进式适应弥合体分割中的维度与领域差距

Haoyu Hu, Xiyao Ma, Shiqi Liu, Linsen Zhang, Xiaoliang Xie, Xiaohu Zhou, Zeng-Guang Hou

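Affiliations: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

Topic match (medical image fusion): adapts DINOv3 to 3D medical segmentation

AI summary: Proposes DINO-Med3D, a two-stage progressive framework that adapts DINOv3 to 3D medical segmentation through a multi-slice embedding module, 3D adapters, and a parallel detail recovery stream, outperforming existing methods on five datasets.

Comments: Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication.

AI Chinese abstract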

尽管DINOv3在自然图像中展现了显著的语义判别能力，但其直接应用于体医学分割受到固有的维度和领域差异的阻碍。为解决这些问题，我们提出DINO-Med3D，一个两阶段渐进框架，将预训练的DINOv3编码器重新用于3D医学任务。在第一阶段，我们通过引入融合伪3D上下文的多切片嵌入模块来弥合维度差距，同时采用分割代理任务将从自然场景学到的表示适应到医学领域。随后，我们通过在冻结的主干中添加轻量级3D适配器来增强体理解，以强制执行全局切片间连续性。最后，为补偿嵌入过程中固有的空间信息损失，我们设计了一个并行细节恢复流，以显式保留高频边界线索。在五个公共数据集上的大量实验表明，我们的方法成功地将DINOv3适应到医学领域，并显著优于最先进的基线方法。

English abstract

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

2606.18860 2026-06-18 cs.CV cs.LG new submission 85%

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

医学图像分割中对抗模型的不确定性量化

Hana Jebril, Thomas Pinetz, Günter Klambauer, Hrvoje Bogunović

Affiliations: Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria; Comprehensive Center for AI in Medicine, Medical University of Vienna, Austria; ELLIS Unit Linz, LIT AI Lab and Institute for Machine Learning, Johannes Kepler University Linz, Austria; Institute for Machine Learning, Johannes Kepler University Linz, Austria; Clinical Research Center for Medical AI, Johannes Kepler University Linz, Austria

Topic match (medical image fusion): proposes the QUAM-SM framework for uncertainty quantification in medical image segmentation

AI summary: Proposes QUAM-SM, a post-hoc framework that uses targeted adversarial search to identify fragile pixels, quantifies uncertainty, and separates epistemic from aleatoric uncertainty, outperforming existing methods on public datasets.

Comments: Accepted at MICCAI 2026

AI Chinese abstract

可靠的像素级不确定性量化具有通过实现高保真纵向监测和区分真实病理变化与伪影来改变临床工作流程的潜力。理想情况下，这些模型提供关键治疗计划和手术干预所需的稳定性。然而，标准深度学习模型常常遭受校准不良，产生过度自信的预测，掩盖了微妙病理边界处的潜在脆弱性。为了解决这个问题，我们提出了QUAM-SM，一种使用针对性对抗搜索来识别“对抗脆弱”像素的后处理框架。通过主动寻找暴露预测不稳定性的扰动，我们的方法突出了决策最容易被翻转的区域。重要的是，该框架将认知不确定性与偶然不确定性分离。在两个具有多个专家标注的公开数据集上的实验表明，QUAM-SM在可靠性和边界敏感性方面优于标准和最新的不确定性估计方法。代码可在以下网址获取：https://github.com/HanaJebril/quam_sm

English abstract

Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify "adversarially fragile" pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at https://github.com/HanaJebril/quam_sm

2606.18749 2026-06-18 cs.CV new submission 85%

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

迈向3D医学图像的无训练零样本异常检测：基于批次的方法使用2D基础模型

Tai Le-Gia

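Affiliations: Chungnam National University

Topic match (medical image fusion): zero-shot anomaly detection in 3D medical images that fuses slice information from multiple axes

AI summary: Proposes the CS3F framework, which uses 2D foundation models for zero-shot anomaly detection in 3D medical images by decomposing volumes along multiple axes, encoding slices, and computing anomaly scores from cross-subject similarity, and introduces a coarse-to-fine tokenization strategy to reduce signal attenuation.

AI Chinese abstract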

零样本异常检测（ZSAD）在医学成像中具有吸引力，因为临床系统必须处理异构采集协议、变化的患者群体以及可能缺乏标注训练数据的病理。大多数现有的零样本异常检测方法是为2D图像设计的，它们直接扩展到3D医学体积受到大规模体积基础模型稀缺或利用体积上下文困难的限制。我们提出CS3F，一个无训练的基于批次的框架，用于3D医学图像中的ZSAD，使用2D基础模型。每个体积沿多个解剖轴分解，并由2D视觉变换器逐切片编码。然后通过池化相邻切片特征将其转换为局部体积令牌。异常分数通过跨主体互相似性获得：在其他主体中缺乏相似令牌的令牌被赋予更高的异常分数。为了减少深度池化引起的病灶信号衰减，我们引入了一种粗到细的分词策略，无需穷举匹配即可实现细分辨率体积评分。CS3F在脑部MRI上针对转移瘤、胶质瘤和中风进行评估，并在肺部CT上验证其泛化能力，超越标准图谱对齐的脑部MRI。结果表明，冻结的2D基础模型可以支持3D医学图像中的异常定位，且细分词化的益处很大程度上取决于病灶对比度和成像模态。

English abstract

Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. These are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, as well as validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
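
The scoring idea in this abstract, that tokens without close analogues in other subjects receive higher anomaly scores, can be sketched as follows. This is a simplified illustration and not the CS3F implementation: it assumes each subject is already represented by a list of token feature vectors, it scores a token as one minus its best cosine similarity to any token from a different subject, and the names `cosine` and `anomaly_scores` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def anomaly_scores(batch):
    """batch: list of subjects, each a list of token vectors.

    Returns one list of scores per subject. A token's score is
    1 - (highest cosine similarity to any token of a different subject).
    """
    scores = []
    for i, tokens in enumerate(batch):
        # Pool the tokens of every other subject in the batch.
        others = [t for j, subj in enumerate(batch) if j != i for t in subj]
        subject_scores = []
        for tok in tokens:
            best = max((cosine(tok, o) for o in others), default=0.0)
            subject_scores.append(1.0 - best)
        scores.append(subject_scores)
    return scores


# Three toy subjects (made-up features); only the first subject
# has a token pointing in a direction no other subject shares.
batch = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0]],
    [[2.0, 0.0]],
]
print(anomaly_scores(batch))
```

Because the score is computed against the rest of the batch rather than a trained model, the sketch needs no training step, which mirrors the training-free, batch-based framing of the abstract.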

2606.18707 2026-06-18 cs.CV new submission 85%

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

PEFT-MedSAM：面向可解释皮肤病变分割的医学基础模型高效微调

Asad Channa, Abdullah Khan, Asghar Ali Chandio, Aamir Akbar, Shahzad Memon, Aqib Hussain, Ameer Hamza

Affiliations: Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology; Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology; Department of Computer Science, Sindh Madressatul Islam University, City Campus, Karachi; Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London

Topic match (medical image fusion): skin lesion segmentation by fine-tuning a medical foundation model

AI summary: Proposes PEFT-MedSAM, a parameter-efficient fine-tuning method that freezes the pre-trained encoders and trains only the lightweight decoder, reaching a Dice coefficient of 0.9411 on ISIC 2018, and uses Grad-CAM explainability to strengthen clinical trustworthiness.

AI Chinese abstract

使用深度学习模型对皮肤镜图像进行皮肤病变自动分割，有助于比常规检测更早发现黑色素瘤。然而，大多数现有的深度学习方法性能不佳。本文旨在提出一种名为PEFT-MedSAM的参数高效微调方法，用于适配医学分割一切模型（MedSAM）以自动分割皮肤镜皮肤病变。PEFT-MedSAM方法仅使用轻量级掩码解码器训练模型，同时保持预训练图像编码器和提示编码器冻结。在ISIC 2018基准数据集上的实验表明，与完全训练的U-Net基线（0.8715 Dice系数）和零样本MedSAM推理（0.8997 Dice系数）相比，PEFT-MedSAM获得了0.9411的Dice系数和0.8918的交并比。使用PH2数据集进行的外部验证显示Dice系数为0.9467，标准差为±0.0310。这些主张的支持证据包括比较两个数据集的Wilcoxon符号秩检验p值小于0.0001，以及bootstrap估计的95%置信区间[0.9364, 0.9447]，该区间表示重复测试获得的平均Dice系数的估计范围。为了增加临床可信度，我们使用Grad-CAM可解释性以及基于指向游戏的评估方法，在验证集上评估CNN基线模型。结果表明，在包含519张图像的验证集上，准确率达到98.27%，并确认模型正确分类了包含皮肤病变的区域。

English abstract

Automated segmentation of skin lesions in dermoscopic images using deep learning models can help detect melanomas earlier than they would normally be found. However, most available deep learning methods do not perform well. This paper presents a parameter-efficient fine-tuning method called PEFT-MedSAM for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. PEFT-MedSAM trains only the lightweight mask decoder while keeping the pre-trained image encoder and prompt encoder frozen. Experiments on the ISIC 2018 benchmark dataset show that PEFT-MedSAM obtains a Dice coefficient of 0.9411 and an intersection over union of 0.8918, compared with a fully trained U-Net baseline (0.8715 Dice coefficient) and zero-shot MedSAM inference (0.8997 Dice coefficient). External validation on the PH2 dataset shows a Dice coefficient of 0.9467 with a standard deviation of +/- 0.0310. Supporting evidence for these claims includes a p-value below 0.0001 for Wilcoxon signed-rank tests comparing the two datasets and a bootstrap-estimated 95% confidence interval of [0.9364, 0.9447], which represents the estimated range of the average Dice coefficient under repeated testing. To increase clinical trustworthiness, we used Grad-CAM explainability along with a pointing-game-based evaluation methodology to evaluate the CNN baseline model on the validation set. The results showed an accuracy of 98.27% on the validation set of 519 images and confirmed that the model classified regions containing skin lesions.
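
The bootstrap-estimated confidence interval mentioned in this abstract can be illustrated with a percentile bootstrap over per-image Dice scores. This is a minimal sketch under stated assumptions: the abstract does not say which bootstrap variant or resample count the authors used, the function name `bootstrap_mean_ci` is hypothetical, and the scores below are made-up values for illustration.

```python
import random


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the resample mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lower_idx], means[upper_idx]


# Made-up per-image Dice scores, not data from the paper.
scores = [0.95, 0.91, 0.97, 0.88, 0.93, 0.96, 0.94, 0.90]
low, high = bootstrap_mean_ci(scores)
print(f"95% CI for mean Dice: [{low:.4f}, {high:.4f}]")
```

The interval describes uncertainty in the mean score across images, so it is typically much narrower than the spread of individual per-image scores.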

2606.15554 2026-06-18 cs.CV new submission 85%

RaLMPH: Reliability-aware Learning for Multi-Pathologist Harmonization in Whole-Slide Image Classification

RaLMPH：全切片图像分类中面向多病理学家协调的可靠性感知学习

Sungrae Hong, Jiwon Jeong, Soeun Cheon, Donghee Han, Sol Lee, Jisu Shin, Kyungeun Kim, Mun Yong Yi

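Affiliations: Korea Advanced Institute of Science and Technology; Seegene Medical Foundation

Topic match (medical image fusion): label harmonization for whole-slide images annotated by multiple pathologists

AI summary: Proposes the RaLMPH framework, which uses a reliability field to model local neighborhood structure and expert uncertainty, harmonizing whole-slide image labels from multiple pathologists and improving multiple instance learning performance.

Comments: Accepted by MICCAI 2026

AI Chinese abstract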

多实例学习（MIL）是全切片图像（WSI）分析的标准范式，并在计算病理学中取得了显著成果。然而，大多数MIL流程假设每张切片只有一个“金标准”标签，这与临床实践中常见的病理学家间显著差异相矛盾。现有的多标注者学习和标签细化方法通常估计全局标注者可靠性或依赖单实例假设，使其难以适应MIL以及专家意见不一致的局部诊断场景。我们提出RaLMPH（面向多病理学家协调的可靠性感知学习），一种基于MIL的标签协调框架，用于由多位病理学家标注的WSI。RaLMPH引入了一个可靠性场，该场联合建模（i）WSI特征空间中的局部邻域结构和（ii）专家不确定性（熵），从而能够识别每个样本的可信参考邻域。利用该场，RaLMPH执行样本级局部标注者排序以选择每张切片的可靠意见，并应用自适应门控机制根据局部可靠性融合标签。在由六位病理学家标注的临床WSI数据集以及受控模拟基准上的实验表明，RaLMPH始终优于现有方法。进一步分析阐明了我们的可靠性感知机制如何改进标签协调和下游MIL性能。

English abstract

Multiple Instance Learning (MIL) is a standard paradigm for Whole-Slide Image (WSI) analysis and has achieved strong results in computational pathology. However, most MIL pipelines assume a single "gold" label per slide, which conflicts with clinical practice where substantial inter-pathologist variability is common. Existing multi-annotator learning and label-refinement methods typically estimate global annotator reliability or rely on single-instance assumptions, making them poorly suited to MIL and to localized diagnostic contexts where experts disagree. We propose RaLMPH (Reliability-aware Learning for Multi-Pathologist Harmonization), a MIL-based label reconciliation framework for WSIs annotated by multiple pathologists. RaLMPH introduces a reliability field that jointly models (i) local neighborhood structure in WSI feature space and (ii) expert uncertainty (entropy), enabling per-sample identification of trustworthy reference neighborhoods. Leveraging this field, RaLMPH performs sample-wise local annotator ranking to select reliable opinions per slide and applies an adaptive gating mechanism to fuse labels conditioned on local reliability. Experiments on a clinical WSI dataset with labels from six pathologists, as well as controlled simulated benchmarks, show that RaLMPH consistently outperforms existing approaches. Further analyses clarify how our reliability-aware mechanism improves label reconciliation and downstream MIL performance.
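
The abstract names entropy as its measure of expert uncertainty. One plausible reading, assumed here and not confirmed by the abstract, is the Shannon entropy of the labels that the pathologists assign to a single slide. The function name `label_entropy` and the example labels are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Six hypothetical pathologists labelling the same slide.
unanimous = ["benign"] * 6
split = ["benign"] * 3 + ["malignant"] * 3
print(label_entropy(unanimous), label_entropy(split))
```

Under this reading, a slide on which all experts agree has zero entropy, while an even two-way split has one bit, so higher values flag slides where expert opinions are least consistent.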

2606.19190 2026-06-18 cs.RO new submission 90%

FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry

FAST-LIVGO：一种退化鲁棒的LiDAR-惯性-视觉-GNSS融合里程计

Zhiyu Chen, Chunran Zheng, Jiayu Wen, XiaoLei Zhang, Jiaming Xu, Feng Pan, Yukang Cui

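Affiliations: College of Mechatronics and Control Engineering, Shenzhen University; Department of Mechanical Engineering, The University of Hong Kong; College of Automation, Harbin Engineering University

Topic match (multi-sensor fusion): tightly coupled LiDAR-inertial-visual-GNSS fusion odometry

AI summary: Proposes a tightly coupled LiDAR-inertial-visual-GNSS fusion framework based on an error-state iterated Kalman filter. It combines a spatiotemporal alignment module using dynamic time warping, Doppler and time-differenced carrier phase observation models, and a degeneracy-aware dual-mode outlier rejection strategy to achieve accurate and robust state estimation in long-term, large-scale, dynamic environments.

Comments: Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

AI Chinese abstract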

在长期、大规模和高度动态环境中的鲁棒状态估计与建图仍然是机器人领域的关键挑战。现有的LiDAR-惯性-视觉里程计（LIVO）系统在局部精度上表现良好，但在长距离下会累积漂移，并在几何退化或无纹理场景中可能失效。同时，GNSS辅助融合框架通常依赖LiDAR或视觉里程计进行状态预测和异常值拒绝，使其在里程计退化时变得脆弱。为解决这些局限，我们提出一种基于误差状态迭代卡尔曼滤波的紧耦合LiDAR-惯性-视觉-GNSS融合框架。引入基于动态时间规整的在线时空对齐模块以应对高度动态条件。为更好利用GNSS精度，我们开发了基于多普勒频移和固定锚点时间差载波相位的观测模型，在不增加历史锚点状态的情况下提供毫米级相对约束。我们进一步设计了一种退化感知的双模式异常值拒绝策略，根据LIVO退化程度在LIVO先验引导拒绝和GNSS辅助恢复之间切换。在公开M3DGR数据集和自建20 m/s固定翼无人机数据集上的实验表明，我们的系统减少了累积漂移和地图重影，在精度和鲁棒性上优于现有方法。

English abstract

Robust state estimation and mapping in long-term, large-scale, and highly dynamic environments remains a key challenge in robotics. Existing LiDAR-Inertial-Visual Odometry (LIVO) systems achieve strong local accuracy but suffer from accumulated drift over long distances and may fail in geometrically degraded or textureless scenes. Meanwhile, GNSS-aided fusion frameworks often rely on LiDAR or visual odometry for state prediction and outlier rejection, making them vulnerable when odometry degenerates. To address these limitations, we propose a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an Error-State Iterated Kalman Filter. An online spatiotemporal alignment module using Dynamic Time Warping is introduced for highly dynamic conditions. To better exploit GNSS precision, we develop observation models based on Doppler shifts and fixed-anchor Time-Differenced Carrier Phase, providing millimeter-level relative constraints without augmenting historical anchor states. We further design a degeneracy-aware dual-mode outlier rejection strategy that switches between LIVO-prior-guided rejection and GNSS-aided recovery according to the LIVO degeneracy level. Experiments on the public M3DGR dataset and a custom 20 m/s fixed-wing UAV dataset demonstrate that our system reduces accumulated drift and map ghosting, outperforming state-of-the-art methods in accuracy and robustness.
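
Dynamic Time Warping, which this abstract uses for online spatiotemporal alignment, can be illustrated with the textbook dynamic-programming recurrence on two 1-D sequences. This is a generic sketch and not the FAST-LIVGO module: the abstract does not describe which signals are aligned or how, and the function name `dtw_distance` and the absolute-difference cost are assumptions.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance with absolute-difference cost."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The second sequence is the first with one sample repeated,
# as if it had been recorded with a small timing stretch.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

The warping path lets one sample match several samples in the other sequence, which is what makes the measure tolerant of timing offsets between two streams.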

2606.19154 2026-06-18 cs.RO new submission 90%

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Viking Hill数据集：用于森林场景检测与分割的激光雷达-雷达-相机数据集

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

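Affiliations: Örebro University; AASS research centre; Robot Navigation and Perception Lab

Topic match (multi-sensor fusion): provides a multi-sensor LiDAR-radar-camera forest dataset

AI summary: Presents the first multi-sensor forest dataset that includes 4D imaging radar, provides MinkowskiUNet semantic segmentation baselines on the radar and lidar point clouds, and evaluates how trunk segmentation quality relates to tree size.

Comments: 33 pages, 11 figures

AI Chinese abstract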

在森林冠层下运行的自主机器人需要对树木及周围植被在不同季节条件下进行稳健感知。现有的林业数据集提供带有单棵树标注的激光雷达或相机数据，但均未包含共配准的4D成像雷达——这一模态因其对视觉退化、表面污染和植被遮挡的鲁棒性而日益受到关注。我们介绍了一个由移动机器人收集的多传感器森林数据集，该机器人配备了高分辨率FMCW成像雷达、激光雷达、RGB相机、IMU和RTK-GNSS。该场地在两个不同植被状态的会话中记录，3D立方体标注（包括每棵树的直径估计）为所有三种感知模态提供了共享语义标签。此外，我们提供了使用MinkowskiUNet对雷达和激光雷达点云进行语义分割的基线结果。雷达在主要类别（地面91%，冠层86%）上取得了与激光雷达竞争性的IoU分数，但在几何精细结构（如树干）上落后（56%对74%）。跨模态分析进一步比较了激光雷达和雷达的树干分割与RGB检测模型，而按直径分层的评估揭示了树干分割质量如何随树木尺寸变化。除了分割，共配准的多模态数据和RTK-GNSS辅助参考定位支持冠层下地图构建、定位和传感器融合的研究。数据集和标注工具已公开。

English abstract

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.

2606.18583 2026-06-18 cs.CV cs.RO new submission 90%

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

空地激光雷达地点识别：基于块级自监督学习和扩展互逆重排序

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

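Affiliations: University of Calgary; Nanchang University; Nanyang Technological University; Wuhan University

Topic match (multi-sensor fusion): fuses aerial and ground LiDAR point clouds for place recognition

AI summary: Proposes an aerial-ground LiDAR place recognition framework that narrows the domain gap through multi-scale patch-level self-supervised learning and reduces false positives with an expanded reciprocal re-ranking algorithm, substantially improving retrieval accuracy on multiple datasets.

AI Chinese abstract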

激光雷达地点识别用于确定在预先采集的点云地图上的位置。最常研究的基于地面激光雷达的地点识别存在预访问要求、覆盖不完整和视角有限等缺点。使用预先采集的全覆盖机载激光扫描（ALS）数据作为空中先验地图可以克服这些缺点，使得跨视角地点识别变得必要且有利。然而，空地激光雷达地点识别面临重大挑战，包括空中和地面点云之间的域差距以及初始检索中的误检。为了解决这些问题，我们提出了一种用于空地激光雷达地点识别的新型检索和重排序框架。基于相邻点云块与锚点块共享相似语义的先验知识，我们的检索网络在多个尺度上引入了块级自监督学习模块，并与场景级学习相结合，以提高空中和地面点云之间全局特征的判别性。此外，利用ALS点云的结构化空间分布，我们引入了一种扩展互逆（ER）重排序算法，以最大化利用邻域信息，并根据邻域特征优化每个特征，然后用于更新相似度矩阵以进行最终排序。大量实验表明，我们的检索网络优于现有最先进（SOTA）方法，在CS-Urban-Scenes数据集上平均Recall@1提高了9.8%，平均Recall@1%提高了3.2%，同时在CS-Campus3D数据集上也展示了最佳性能。此外，我们的ER重排序算法在无需额外训练的情况下，进一步将CS-Campus3D上的平均Recall@1提高了4.9%，CS-Urban-Scenes上提高了10.2%。

English abstract

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm to exploit neighborhood information maximally and refine each feature based on neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on the CS-Urban-Scenes dataset, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.
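To make the retrieval metric and the idea of training-free re-ranking concrete, here is a minimal sketch. It is not the paper's ER algorithm: `refine_with_neighbors` is a generic stand-in that averages each database feature with its nearest database neighbours before the similarity matrix is recomputed. All function names, the neighbour count, and the toy data are assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def recall_at_k(sim, positives, k):
    """Fraction of queries with at least one true match among the top-k results.

    sim: (num_queries, num_db) similarity matrix.
    positives: one set of correct database indices per query.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [len(set(row) & pos) > 0 for row, pos in zip(topk.tolist(), positives)]
    return float(np.mean(hits))


def refine_with_neighbors(db, k_nn=2):
    """Average each database feature with its k_nn most similar database features."""
    sim = db @ db.T
    nn = np.argsort(-sim, axis=1)[:, : k_nn + 1]  # the feature itself ranks first
    return l2_normalize(db[nn].mean(axis=1))


# Toy similarity matrix: 2 queries against 3 database entries.
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8]])
positives = [{0}, {1}]
print(recall_at_k(sim, positives, k=1), recall_at_k(sim, positives, k=2))

# Re-ranking step: refine database features, then rebuild the similarity matrix.
rng = np.random.default_rng(0)
queries = l2_normalize(rng.normal(size=(2, 8)))
database = l2_normalize(rng.normal(size=(5, 8)))
reranked_sim = queries @ refine_with_neighbors(database).T
print(reranked_sim.shape)
```

Because the refinement only touches stored features, it needs no additional training, which matches the training-free property the abstract claims for ER re-ranking.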

2606.19307 2026-06-18 cs.RO New submission 85%

Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations

Mitchell Cohen, Vassili Korotkine, James Richard Forbes

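Affiliations: Department of Mechanical Engineering, McGill University

Topic match: Multi-sensor fusion. Visual-inertial navigation systems fuse visual and inertial measurements.

AI summary: Analyzes the observability and consistency of filtering-based visual-inertial navigation systems (VINS) that use anchored feature representations. It shows that the unobservable subspace is independent of the estimated landmark state, which improves consistency, but still depends on the navigation state, so additional consistency-enforcing techniques are needed.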

Comments Accepted to IEEE/RSJ IROS. 8 pages, 3 figures, 4 tables

English abstract

This paper presents an analysis of the observability and consistency properties of filtering-based visual-inertial navigation systems (VINS) that utilize anchored feature representations. The unobservable subspace of VINS with anchored landmark parameterizations is shown to be independent of the estimated landmark state, which leads to improved estimator consistency properties without any additional modifications. However, the unobservable subspace is still found to depend on the estimated navigation state, necessitating additional consistency-enforcing techniques. Two methods to improve the consistency of VINS with anchored feature representations are presented. Simulation results show that all estimators employing anchored feature parameterizations exhibit improved consistency properties compared to algorithms that estimate features resolved in a global reference frame, especially in scenarios where feature initialization may be poor. Real-world experiments on the TUM-VI dataset show that the use of anchored feature representations alone can yield comparable performance to consistency-improved estimators employing a global feature representation, demonstrating the benefit of using anchored feature parameterizations for VINS.
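For readers unfamiliar with the term, an anchored parameterization stores a landmark relative to a pose (the anchor) instead of directly in the global frame. The sketch below assumes one common variant, a unit bearing plus inverse range in the anchor frame; the abstract does not say which anchored form the paper uses, and all names here are hypothetical.

```python
import numpy as np


def rot_z(yaw):
    """Rotation matrix for a rotation of `yaw` radians about the z axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def anchored_to_global(C_ga, r_a, bearing, inv_range):
    """Global landmark position from an anchored (bearing, inverse range) pair.

    C_ga: rotation from the anchor frame to the global frame.
    r_a: anchor position in the global frame.
    bearing: unit vector to the landmark, resolved in the anchor frame.
    inv_range: 1 / distance from the anchor to the landmark.
    """
    p_a = bearing / inv_range  # landmark resolved in the anchor frame
    return r_a + C_ga @ p_a


def global_to_anchored(C_ga, r_a, p_g):
    """Inverse mapping: recover (bearing, inverse range) from a global position."""
    p_a = C_ga.T @ (p_g - r_a)
    rng = np.linalg.norm(p_a)
    return p_a / rng, 1.0 / rng


C_ga = rot_z(0.3)
r_a = np.array([1.0, 2.0, 0.5])
bearing = np.array([0.0, 0.6, 0.8])
p_g = anchored_to_global(C_ga, r_a, bearing, inv_range=0.25)
print(p_g, global_to_anchored(C_ga, r_a, p_g))
```

The landmark state in the filter is then the pair (bearing, inverse range), and the global position is only formed through the anchor pose when it is needed.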

2606.19067 2026-06-18 cs.RO cs.CV New submission 85%

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

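Affiliations: Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT); Institute for Intelligent Systems, Esslingen University of Applied Sciences; Department of Computer Science, University of Freiburg

Topic match: Multi-sensor fusion. Evaluates visual, inertial, and LiDAR multimodal SLAM, which involves multi-sensor fusion.

AI summary: Systematically evaluates visual, visual-inertial, and LiDAR-visual-inertial SLAM methods to study sensor configuration under quadruped locomotion. It finds that stereo cameras and global shutters substantially improve localization robustness, while standard inertial integration can degrade primarily vision-based frameworks under harsh legged locomotion.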

English abstract

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

2606.18952 2026-06-18 cs.CV New submission 85%

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

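Affiliations: Shanghai University; Southern University of Science and Technology; The University of Sydney

Topic match: Multi-sensor fusion. A multi-task single-photon LiDAR benchmark involving multimodal perception.

AI summary: Addresses the perception challenges that noise and multi-return transient phenomena pose for single-photon LiDAR in real scenes by proposing STB, a real-captured multi-task benchmark with 10 scenes and 10,297 views that supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.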

English abstract

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at 256×192 resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.
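As a rough illustration of what a time-of-flight histogram with multi-return behavior carries, the sketch below converts histogram peaks to depths with d = c·t/2. The bin width, the count threshold, and the toy histogram are assumptions and do not describe the benchmark's sensor or file format; real pipelines typically also model noise and the pulse shape.

```python
C_LIGHT = 299_792_458.0  # speed of light in m/s


def bin_to_depth(bin_index, bin_width_s):
    """Depth in metres for a histogram bin, using the bin centre as arrival time."""
    round_trip_time = (bin_index + 0.5) * bin_width_s
    return 0.5 * C_LIGHT * round_trip_time  # halve: light travels out and back


def depth_from_histogram(hist, bin_width_s):
    """Single-return estimate: depth of the bin with the most photon counts."""
    peak = max(range(len(hist)), key=hist.__getitem__)
    return bin_to_depth(peak, bin_width_s)


def multi_return_depths(hist, bin_width_s, min_count):
    """Depths of all local maxima with at least `min_count` photons."""
    depths = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else float("-inf")
        right = hist[i + 1] if i + 1 < len(hist) else float("-inf")
        if count >= min_count and count > left and count >= right:
            depths.append(bin_to_depth(i, bin_width_s))
    return depths


# Hypothetical histogram with two returns, e.g. a semi-transparent surface
# in front of a wall, and an assumed bin width of 1 ns.
hist = [1, 0, 2, 9, 3, 1, 0, 6, 2, 1]
print(depth_from_histogram(hist, bin_width_s=1e-9))
print(multi_return_depths(hist, bin_width_s=1e-9, min_count=5))
```

A plain argmax keeps only the strongest return, which is why multi-return scenes complicate depth estimation: the second peak here is a real surface that the single-return estimate discards.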

2606.18566 2026-06-18 cs.CV cs.AI cs.GR New submission 85%

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

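Affiliations: School of Computer Science and Technology, Soochow University

Topic match: Multi-sensor fusion. Fuses RGB, depth, and edge information for low-light crowd counting.

AI summary: Addresses crowd counting in low-light environments by building three new benchmark datasets and proposing a multi-modal hyper-graph fusion module and a deformable rectangular sparse attention module, which together form the low-light counting network LCNet; it achieves the best overall performance on the three benchmarks.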

English abstract

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.
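Hypergraph message passing can be sketched in two steps: nodes send features to the hyperedges they belong to, and hyperedges send the aggregated result back. The example below uses a fixed incidence matrix and plain mean aggregation; the paper's dynamic hyperedge construction and learned weights are not reproduced, and the three nodes, two hyperedges, and feature values are assumptions.

```python
import numpy as np


def hypergraph_message_passing(x, incidence):
    """One round of node -> hyperedge -> node mean aggregation.

    x: (num_nodes, dim) node features.
    incidence: (num_nodes, num_edges) 0/1 matrix, incidence[v, e] = 1 when
        node v belongs to hyperedge e. Every node and every hyperedge must
        have at least one membership, otherwise a degree would be zero.
    """
    edge_degree = incidence.sum(axis=0)  # nodes per hyperedge
    node_degree = incidence.sum(axis=1)  # hyperedges per node
    edge_feat = (incidence.T @ x) / edge_degree[:, None]  # mean over member nodes
    return (incidence @ edge_feat) / node_degree[:, None]  # mean over joined edges


# Hypothetical nodes: 0 = RGB appearance, 1 = depth geometry, 2 = edge structure.
# Hyperedge 0 groups RGB with depth, hyperedge 1 groups RGB with edges.
incidence = np.array([[1.0, 1.0],
                      [1.0, 0.0],
                      [0.0, 1.0]])
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
print(hypergraph_message_passing(features, incidence))
```

A hyperedge can connect any number of nodes at once, which is what lets the fusion capture relationships among more than two modalities in one step rather than through pairwise links.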

2606.19277 2026-06-18 cs.CV New submission 85%

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

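Affiliations: Computational Data Science and Engineering; College of Science and Technology

Topic match: Remote sensing fusion and pansharpening. Adaptation strategies for multimodal fusion in remote sensing visual question answering.

AI summary: Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures and achieves remote sensing VQA with fewer than 5% trainable parameters; the hybrid FLAVA architecture offers the best balance between multimodal reasoning and retrieval.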

Comments 4 pages, 2 figures, accepted and to be presented at 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington D.C

English abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of trainable parameters. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
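A bottleneck adapter is a small residual block: project down, apply a nonlinearity, project back up, and add the input. The sketch below is a generic version in NumPy, not the RS Adapter implementation; the hidden size of 768, the bottleneck width of 64, the ReLU, the zero-initialised up-projection, and the frozen-layer parameter estimate are all assumptions.

```python
import numpy as np


class BottleneckAdapter:
    """Residual down-project / ReLU / up-project block with few parameters."""

    def __init__(self, d_model, d_bottleneck, rng):
        self.w_down = rng.normal(0.0, 0.02, size=(d_model, d_bottleneck))
        self.b_down = np.zeros(d_bottleneck)
        # Zero-initialised up-projection: the adapter starts as the identity,
        # so inserting it does not change the frozen backbone's output.
        self.w_up = np.zeros((d_bottleneck, d_model))
        self.b_up = np.zeros(d_model)

    def __call__(self, h):
        z = np.maximum(h @ self.w_down + self.b_down, 0.0)  # ReLU in the bottleneck
        return h + z @ self.w_up + self.b_up  # residual connection

    def num_params(self):
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


d_model, d_bottleneck = 768, 64  # assumed sizes
adapter = BottleneckAdapter(d_model, d_bottleneck, np.random.default_rng(0))

# Rough weight count of one frozen transformer block (biases ignored):
# four attention projections plus two MLP matrices with 4x expansion.
frozen_params = 12 * d_model ** 2
trainable_params = 2 * adapter.num_params()  # one adapter after attention, one after MLP
fraction = trainable_params / (trainable_params + frozen_params)
print(adapter.num_params(), round(fraction, 4))
```

Only the adapter weights would receive gradients, which is why the trainable fraction stays small. The exact percentage depends on the bottleneck width and on where the adapters are placed.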

2606.19204 2026-06-18 cs.CV New submission 85%

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

Nengbo Zhang, Chang sheng

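Affiliations: Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS)

Topic match: Remote sensing fusion and pansharpening. Fuses radar and optical time-series data for forest classification.

AI summary: Proposes ROSA-TFormer, which integrates SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling, and uses Sentinel-1/2 time series to classify Pinus sylvestris var. mongolica plantations with high accuracy, reaching 99.67% overall accuracy.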

Comments journal in tree classification

English abstract

Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.
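The two fusion components named in the abstract can be sketched generically. Below, a sigmoid gate mixes SAR and optical embeddings per time step and channel, and attention pooling collapses the time axis into one vector. This is not the ROSA-TFormer code: the shapes, weight names, and the exact gating form are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)


def gated_fusion(sar, opt, w_gate, b_gate):
    """Mix two (T, D) embedding sequences with a learned gate in (0, 1).

    w_gate: (2 * D, D) weights applied to the concatenated embeddings.
    A gate near 1 trusts SAR, a gate near 0 trusts the optical branch,
    for example on cloudy dates.
    """
    gate = sigmoid(np.concatenate([sar, opt], axis=-1) @ w_gate + b_gate)
    return gate * sar + (1.0 - gate) * opt


def temporal_attention_pool(x, w_att):
    """Collapse a (T, D) sequence to (D,) with softmax weights over time."""
    weights = softmax(x @ w_att, axis=0)  # (T,), sums to 1
    return weights @ x, weights


T, D = 24, 16  # e.g. 24 half-month steps, assumed embedding width
rng = np.random.default_rng(0)
sar_emb = rng.normal(size=(T, D))
opt_emb = rng.normal(size=(T, D))
fused = gated_fusion(sar_emb, opt_emb, rng.normal(size=(2 * D, D)), np.zeros(D))
pooled, weights = temporal_attention_pool(fused, rng.normal(size=D))
print(fused.shape, pooled.shape, float(weights.sum()))
```

The attention weights also indicate which dates drive a prediction, which can help when checking whether the model relies on seasonally meaningful acquisitions.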

1. Audio-visual / vision-language fusion (12 papers)

Cosmos 3: Omnimodal World Models for Physical AI

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Native Active Perception as Reasoning for Omni-Modal Understanding

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Can We Hear from Events? Generating Speech from Event Camera

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

2. Medical image fusion (9 papers)

Structural MRI Synthesis for Alzheimer's Disease via Conditional Diffusion on Anatomical Masks

DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

DART: A design-aware microfluidic chip paradigm for real-time live-cell image analysis

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

RaLMPH: Reliability-aware Learning for Multi-Pathologist Harmonization in Whole-Slide Image Classification

3. Multi-sensor fusion (7 papers)

FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

4. Remote sensing fusion and pansharpening (2 papers)

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series