Multimodal Information Fusion

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO version update 95%

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

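Affiliations: NVIDIA

Topic match (audio-visual / vision-language fusion): an omnimodal world model that jointly processes language, image, video, audio, and action.

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable general-purpose backbone for embodied agents.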

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2605.26672 2026-06-18 cs.MM cs.SD version update 85%

Can We Hear from Events? Generating Speech from Event Camera

Jingping Fang, Lin Chen, Chenyang Xu, Tong Zhao, Weidong Cai, Xiaoming Chen

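Affiliations: Beijing Technology and Business University; Xidian University; Tongji University; University of Sydney

Topic match (audio-visual / vision-language fusion): generates speech from an event camera, bridging the visual and auditory modalities.

AI summary: Proposes EventSpeech, a framework that uses the high temporal precision of neuromorphic event cameras to address the temporal granularity mismatch in conventional RGB-based speech generation, producing emotionally expressive speech that is robust to motion blur.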

Abstract

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.
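
The temporal-granularity argument above can be illustrated with a toy sketch. It is not EventSpeech's Event Encoder; the `bin_events` helper, the `(timestamp_us, polarity)` tuple layout, and the roughly 30 fps frame interval are all assumptions made for illustration. The point is only that timestamped events can be binned far more finely than one frame interval, so two nearby transients stay separable.

```python
from collections import Counter

def bin_events(events, bin_us):
    """Count events per time bin of width bin_us (microseconds)."""
    counts = Counter()
    for t_us, _polarity in events:
        counts[t_us // bin_us] += 1
    return dict(counts)

# Hypothetical data: two short bursts of motion about 10 ms apart.
events = [(1_000, 1), (1_200, -1), (1_400, 1),
          (11_000, 1), (11_300, -1)]

frame_bins = bin_events(events, 33_333)  # one ~30 fps frame interval
fine_bins = bin_events(events, 1_000)    # 1 ms bins
```

With the frame-sized bin both bursts fall into a single bin, while the 1 ms bins keep them apart (bins 1 and 11).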

2601.13836 2026-06-18 cs.CL cs.CV cs.MM version update 85%

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

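Affiliations: Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

Topic match (audio-visual / vision-language fusion): evaluates how well multimodal LLMs forecast the future from audio-visual cues.

AI summary: Proposes the FutureOmni benchmark for evaluating how well multimodal LLMs forecast future events from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.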

Comments Accepted by ICML 2026

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2602.08355 2026-06-18 cs.CV version update 80%

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

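Affiliations: Alimama Tech, Taobao & Tmall Group of Alibaba; Huazhong University of Science and Technology; Vin University

Topic match (audio-visual / vision-language fusion): an e-commerce short-video understanding benchmark involving multimodal information fusion.

AI summary: Proposes E-VAds, a benchmark for e-commerce short-video understanding. It quantifies domain complexity with a multimodal information density assessment framework, builds a Q&A dataset generated by a multi-agent system, and develops E-VAds-R1, an RL-based reasoning model that achieves a 109.2% performance gain on commercial intent reasoning.

Comments Accepted by ICML 2026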

Abstract

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

2606.05368 2026-06-18 cs.CV version update 80%

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

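Affiliations: Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich; School of Engineering and Natural Sciences (SENS), University of Iceland; Global Land Monitoring Group, GFZ Helmholtz Centre for Geosciences

Topic match (remote sensing fusion and pansharpening): fuses multi-sensor predictors for forest structure modeling.

AI summary: Existing methods do not learn forest vertical structure as an ordered profile. Biomazon is a multimodal benchmark dataset that pairs GEDI RH and AGBD targets with multi-sensor predictors, runs ablations with a shared encoder-decoder framework, and establishes a reference benchmark for structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.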

Comments 32 pages, 21 figures, 8 tables

Abstract

Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks a ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
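
The abstract does not say how a method would enforce a physically consistent ordering across RH percentiles. One common construction, shown here purely as an assumption and not as Biomazon's baseline, is to predict a base height plus non-negative increments and accumulate them. The names `ordered_profile` and `is_ordered` and the sample values are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def ordered_profile(base, raw_increments):
    """Map unconstrained outputs to a non-decreasing height profile."""
    heights = [base]
    for r in raw_increments:
        heights.append(heights[-1] + softplus(r))
    return heights

def is_ordered(profile):
    """True if each percentile height is at least the previous one."""
    return all(a <= b for a, b in zip(profile, profile[1:]))

raw = [-2.0, 0.5, -1.0, 3.0]          # hypothetical unconstrained network outputs
profile = ordered_profile(1.5, raw)   # lowest percentile first, then higher ones
```

Because every increment passes through `softplus`, the resulting profile is ordered by construction, whatever the raw outputs are.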

2511.20302 2026-06-18 cs.CV version update 80%

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

Affiliations: Sun Yat-sen University; The Chinese University of Hong Kong; Shanghai Jiao Tong University; National Supercomputing Center in Shenzhen; The Hong Kong University of Science and Technology; Beijing Institute of Technology; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

Topic match (remote sensing fusion and pansharpening): adaptive tuning for cross-domain remote sensing semantic segmentation.

AI summary: Proposes CrossEarth-Gate, which uses a Fisher-information-guided adaptive module selection mechanism to dynamically activate the most critical cross-domain modules, achieving state-of-the-art performance on 16 of 18 cross-domain benchmarks.

Abstract

In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance on 16 out of 18 cross-domain benchmarks for RS semantic segmentation.
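
The Fisher-guided selection described above could look roughly like the sketch below. This is an assumption-laden toy, not the paper's implementation: it scores each module by an empirical Fisher proxy (the mean squared gradient norm over samples) and keeps the top k. The helper names, the gradient values, and the choice of k are all hypothetical.

```python
def fisher_score(grad_samples):
    """Empirical Fisher proxy: mean over samples of the squared gradient norm."""
    return sum(sum(g * g for g in grads) for grads in grad_samples) / len(grad_samples)

def select_modules(module_grads, k):
    """Return the names of the k modules with the largest Fisher score."""
    scores = {name: fisher_score(samples) for name, samples in module_grads.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical per-sample gradients for three candidate modules at one layer.
module_grads = {
    "spatial":   [[0.1, -0.2], [0.0, 0.1]],
    "semantic":  [[0.9, 0.4], [-0.7, 0.5]],
    "frequency": [[0.3, 0.3], [0.2, -0.4]],
}
active, scores = select_modules(module_grads, k=2)
```

On this toy input the semantic and frequency modules are activated and the spatial module, whose gradients are smallest, is left off.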

2606.03827 2026-06-18 cs.CV cs.AI version update 75%

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi

Affiliations: Centre for Computational Imaging and Modelling in Medicine (CIMIM); University of Manchester; Christabel Pankhurst Institute; Department of Computer Science; Division of Informatics, Imaging & Data Sciences; Department of Electrical & Electronic Engineering; NIHR Manchester Biomedical Research Centre, Manchester Academic Health Sciences Centre, University of Manchester

Topic match (medical image fusion): a conditional diffusion model that generates cardiac mesh sequences, which falls under medical image generation.

AI summary: Proposes the 4D F-MeshLDM framework, which combines a convolutional mesh VAE, a truncated-Fourier-series motion parameterisation, and a conditional diffusion prior for controllable generation of 3D+t cardiac mesh sequences, outperforming baselines on UK Biobank data.

Comments This work has been early accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

Abstract

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
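
Why a truncated Fourier parameterisation gives near-zero cycle closure error can be seen in a one-dimensional sketch. It assumes a single scalar latent coordinate and made-up coefficients; the function name `fourier_trajectory` is hypothetical and none of this is the 4D F-MeshLDM code.

```python
import math

def fourier_trajectory(a0, coeffs, t):
    """Evaluate a truncated Fourier series at phase t in [0, 1].

    coeffs is a list of (a_k, b_k) pairs for harmonics k = 1..K.
    """
    value = a0
    for k, (a_k, b_k) in enumerate(coeffs, start=1):
        angle = 2.0 * math.pi * k * t
        value += a_k * math.cos(angle) + b_k * math.sin(angle)
    return value

coeffs = [(0.8, -0.3), (0.1, 0.05)]   # hypothetical K = 2 coefficients
start = fourier_trajectory(1.0, coeffs, 0.0)
end = fourier_trajectory(1.0, coeffs, 1.0)
closure_error = abs(end - start)      # zero up to floating-point rounding
```

Every harmonic has an integer number of periods per cycle, so the trajectory returns to its starting value whatever coefficients are sampled.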

2606.00491 2026-06-18 cs.CV cs.AI version update 70%

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

CholMin Kanga, Jonghyun Chung, Amanpreet Kaur, Nagesh Gulkotwar, Aarthi Sivasankaran

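Affiliations: Seoul National University; Google Inc.

Topic match (medical image fusion): multi-corruption augmentation for CT segmentation systems, which falls under medical image processing.

AI summary: Proposes the RAMP framework, which uses multi-corruption augmentation to make CT segmentation models more robust to heterogeneous clinical imaging conditions, substantially narrowing the performance gap between clean and corrupted images.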

Abstract

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.
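
The abstract reports corrupted Dice and robustness gap but not the clean Dice. If the gap is defined as clean Dice minus mean corrupted Dice (an assumption; the abstract does not give the definition), the clean scores can be backed out as a quick consistency check. The helper `implied_clean_dice` is hypothetical.

```python
def implied_clean_dice(corrupted, gap):
    """Clean Dice implied by gap = clean - corrupted (assumed definition)."""
    return round(corrupted + gap, 3)

# (mean corrupted Dice, robustness gap) as reported in the abstract.
reported = {
    "five-organ": {"nnU-Net": (0.610, 0.264), "RAMP": (0.753, 0.064)},
    "Abdomen1K":  {"nnU-Net": (0.633, 0.290), "RAMP": (0.789, 0.070)},
}
clean = {
    bench: {model: implied_clean_dice(*vals) for model, vals in models.items()}
    for bench, models in reported.items()
}
```

Under that assumed definition the implied clean Dice is about 0.874 (nnU-Net) versus 0.817 (RAMP) on the five-organ benchmark and 0.923 versus 0.859 on Abdomen1K, which agrees with the statement that RAMP does not reach the highest clean-image Dice.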

2512.10353 2026-06-18 cs.CV version update 70%

Hybrid Transformer-Mamba for Weakly Supervised Volumetric Medical Segmentation

Yiheng Lyu, Lian Xu, Coen Arrow, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi

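Affiliations: University of Western Australia; Harry Perkins Institute of Medical Research; National Imaging Facility; Fiona Stanley Hospital; Victor Chang Cardiac Research Institute

Topic match (medical image fusion): a hybrid architecture for weakly supervised volumetric medical segmentation.

AI summary: Proposes TranSamba, a hybrid architecture that captures 3D context through cross-plane modeling, enables efficient volumetric segmentation under weak supervision, and achieves state-of-the-art performance on three datasets.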

2507.16859 2026-06-18 cs.RO cs.AI version update 70%

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

Luobin Cui, Yanlai Wu, Tang Ying, Weikai Li

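Topic match (multi-sensor fusion): heterogeneous multi-source data integration for fatigue detection.

AI summary: Because high-quality sensors are often unavailable in real deployment environments, the paper proposes a heterogeneous multi-source fatigue detection framework that performs cross-domain modality imputation through shared modalities and transfers source-domain knowledge to improve fatigue detection in the target domain.

Comments 4 figures, 14 pages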

Abstract

Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.
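
Cross-domain modality imputation through a shared modality can be sketched in its simplest possible form: fit a mapping from the shared modality to the high-fidelity modality on source data, then apply it in the target domain where only the shared modality exists. The linear model, the `fit_linear` helper, and the example signals below are all assumptions for illustration; the abstract does not specify the paper's imputation model or sensors.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = w * x + b with scalar x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return w, my - w * mx

# Source domain (lab): shared and high-fidelity modalities both recorded.
source_shared = [0.1, 0.4, 0.5, 0.9]   # hypothetical shared-sensor feature
source_hifi = [1.2, 1.8, 2.0, 2.8]     # hypothetical high-fidelity feature

w, b = fit_linear(source_shared, source_hifi)

# Target domain (field): only the shared modality is available.
target_shared = [0.2, 0.7]
imputed_hifi = [w * x + b for x in target_shared]
```

A downstream fatigue classifier in the target domain could then consume both the observed shared feature and the imputed high-fidelity feature.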

2606.01605 2026-06-18 cs.RO version update 65%

Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control

Dawei Zhang, Nuo Chen, Shuo Liu, Roberto Tron, Zhiwen Fan

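Affiliations: Division of Systems Engineering, Boston University; Department of Mechanical Engineering, Boston University; Department of Electrical and Computer Engineering, Texas A&M University

Topic match (multi-sensor fusion): monocular perception with semantic risk embedded in a distance field, combining visual and semantic information.

AI summary: Proposes an online monocular perception-to-control framework that embeds semantic risk directly into the Euclidean Signed Distance Field (ESDF), encoding risk before control optimization to enable semantics-aware safe navigation and teleoperation based on Control Barrier Functions (CBFs).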

Abstract

We propose an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. Many perception-based safety filters assign the same distance-based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class-dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high-risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation-model-based SLAM front end reconstructs dense 3-D geometry from monocular RGB video, while per-frame semantic segmentation provides pixel-level class labels that are fused into the reconstructed geometry. The resulting geometric-semantic representation is then converted into an ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation. The semantic-aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class-dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10-20 Hz and semantic-aware safe behavior in both teleoperation and autonomous navigation.
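
A heavily simplified 2-D sketch of the idea: class-dependent inflation is applied when computing the distance value, and a CBF condition then filters the commanded velocity. It assumes point obstacles, a single-integrator robot, a brute-force distance query in place of a real ESDF, and made-up inflation radii; `semantic_distance` and `cbf_filter` are hypothetical names, not the authors' code.

```python
import math

INFLATION = {"chair": 0.2, "person": 0.8}   # hypothetical class-dependent margins (m)

def semantic_distance(x, obstacles):
    """Return (h, grad): inflated distance to the nearest obstacle and its gradient.

    Assumes x does not coincide with an obstacle centre (dist > 0).
    """
    best = None
    for (px, py, label) in obstacles:
        dx, dy = x[0] - px, x[1] - py
        dist = math.hypot(dx, dy)
        h = dist - INFLATION[label]
        if best is None or h < best[0]:
            best = (h, (dx / dist, dy / dist))
    return best

def cbf_filter(u, h, grad, alpha=1.0):
    """Minimally modify u so that grad . u + alpha * h >= 0 (single-integrator model)."""
    slack = grad[0] * u[0] + grad[1] * u[1] + alpha * h
    if slack >= 0.0:
        return u
    norm_sq = grad[0] ** 2 + grad[1] ** 2
    return (u[0] - slack * grad[0] / norm_sq, u[1] - slack * grad[1] / norm_sq)

obstacles = [(2.0, 0.0, "person"), (0.0, 2.0, "chair")]
h, grad = semantic_distance((1.0, 0.0), obstacles)
u_safe = cbf_filter((1.0, 0.0), h, grad)   # nominal command drives toward the person
```

Here the robot is 1 m from the person, so the inflated distance is 0.2 m and the forward command of 1.0 m/s is scaled back to 0.2 m/s; with the smaller chair margin the same geometry would leave far more room before the filter intervenes.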

2512.14428 2026-06-18 cs.RO version update 60%

Odyssey: An Automotive Lidar-Inertial Odometry Dataset with GNSS-denied situations

Aaron Kurda, Simon Steuernagel, Lukas Jung, Marcus Baum

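Affiliations: University of Göttingen; iMAR Navigation

Topic match (multi-sensor fusion): a lidar-inertial odometry dataset involving multiple sensors.

AI summary: Presents the Odyssey dataset, whose high-accuracy ground truth comes from a navigation-grade ring-laser-gyroscope RTK/INS. It contains 36 sequences and prolonged GNSS-denied environments (tunnels, indoor parking garages) for evaluating LIO/SLAM systems.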

Comments 10 pages, 4 figures, 3 tables, submitted to International Journal of Robotics Research (IJRR)

Abstract

The development and evaluation of Lidar-Inertial Odometry (LIO) and Simultaneous Localization and Mapping (SLAM) systems requires a precise ground truth. The Global Navigation Satellite System (GNSS) is often used as a foundation for this, but its signals can be unreliable in obstructed environments due to multi-path effects or loss-of-signal. While existing datasets compensate for sporadic GNSS loss by incorporating Inertial Measurement Unit (IMU) measurements, the commonly used systems do not permit prolonged study of GNSS-denied environments due to accumulated drift. Therefore, the diversity of such datasets is limited. To close this gap, we present Odyssey, an automotive LIO dataset featuring: (1) a ground truth derived from a navigation-grade Ring Laser Gyroscope (RLG)-based RTK/INS, offering bias stability one to four orders of magnitude better than existing automotive datasets; (2) a comprehensive collection of 36 sequences across diverse environments, enabling robust and comprehensive evaluation and (3) prolonged GNSS-denied environments, including tunnels and, previously unseen in the context of automotive benchmarks, indoor parking garages. Here, our RLG-based system enables accurate evaluation in scenarios where commonly employed systems would drift excessively. Besides providing data for LIO, Odyssey also supports place recognition tasks through threefold trajectory repetition and integration of external mapping data via precise geodetic coordinates. All data, dataloader and supplementary material are available online at https://odyssey.uni-goettingen.de/ .

2204.14224 2026-06-18 cs.CV cs.LG eess.IV Version update 60%

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information

A Study of Neural Network Methods for Texture Image Reconstruction and Classification Under Incomplete Information

Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Darkhan Kurmangaliyev, Daniyar Nurseitov, Tatyana Dedova, Larissa Balakay, Serik Nurakynov

Affiliations: Satbayev University; Institute of Ionosphere LLP; Information Technology Department; Assiut University

Topic match: Image Fusion. Involves image inpainting and classification, but is not a typical fusion task; relevance is moderate.

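AI summary: Proposes an end-to-end framework that combines object detection, GAN (CRA) inpainting, and Transformer/CNN classification. Reconstruction quality is high (PSNR 28.7 dB), yet classification accuracy reaches only 53%. A confidence-based hybrid ensemble raises MCA from 48% to 58%. The results show that generative models can produce semantically ambiguous features.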

Comments IEEE ACCESS

Details

DOI: 10.1109/ACCESS.2026.3705029

AI Chinese abstract

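Automated analysis of heterogeneous natural textures is often hindered by physical damage and data loss, which poses a major challenge for computer vision. Although deep learning has proven successful in controlled environments, its application to complex geological materials under incomplete information remains underexplored. This study presents an integrated framework for inpainting and classifying high-resolution core sample images. We design an end-to-end pipeline that uses object detection for sample segmentation, followed by a generative adversarial network (GAN) with Contextual Residual Aggregation (CRA) for image inpainting to reconstruct missing high-frequency details. We then evaluate modern Transformer-based (Swin, ViT) and CNN architectures on the reconstructed data. The experiments reveal a critical divergence between reconstruction quality and downstream utility: despite high structural fidelity (PSNR 28.7 dB, FID 74.01), classification accuracy stalls at 53%. To improve minority-class detection, we propose a confidence-based hybrid ensemble that raises MCA from 48% to 58%. These results highlight a limitation of current state-of-the-art generative models, which can produce visually plausible but semantically ambiguous features ("hallucinations") that confuse classifiers. This work examines the dependence between image reconstruction quality and classification performance and provides a reproducible baseline for future research in non-destructive testing and materials science. Because cross-well accuracy remains in the 49-53% range, we position the resulting system as a decision-support and screening tool for lithofacies interpretation rather than a fully autonomous classifier. The code is available at https://github.com/GalymzhanAbdimanap/Lithology_recognition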

English abstract

The automated analysis of heterogeneous natural textures is frequently hindered by physical damage and data loss, presenting a significant challenge to computer vision. While deep learning has shown success in controlled environments, its application to complex geological materials under conditions of incomplete information remains underexplored. This study presents an integrated framework for the inpainting and classification of high-resolution core sample images. We propose an end-to-end pipeline that utilizes object detection for sample segmentation, followed by image inpainting using Generative Adversarial Networks (GANs) with Contextual Residual Aggregation (CRA) to reconstruct missing high-frequency details. Subsequently, we evaluate the performance of modern Transformer-based (Swin, ViT) and CNN architectures on the reconstructed data. Our experiments revealed a critical divergence between reconstruction quality and downstream utility: despite high structural fidelity (PSNR 28.7 dB, FID 74.01), classification accuracy plateaued at 53%. To improve minority-class detection, we propose a confidence-based hybrid ensemble that raises MCA from 48% to 58%. These results highlight the limitations of current state-of-the-art generative models, which may produce visually plausible but semantically ambiguous features ("hallucinations") that confound classifiers. This work provides insights into the dependencies between image reconstruction quality and classification performance, offering a reproducible baseline for future research in non-destructive testing and material science. Given that cross-well accuracy remains in the 49-53% range, we position the resulting system as a decision-support and screening tool for lithofacies interpretation rather than as a fully autonomous classifier. The code is available at https://github.com/GalymzhanAbdimanap/Lithology_recognition
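The abstract does not spell out how the confidence-based hybrid ensemble works, nor does it expand the acronym MCA. The sketch below is therefore only a generic illustration under two assumptions: that MCA means mean class accuracy (the average of per-class recalls), and that the ensemble falls back to averaging two models whenever the primary model's top probability is below a threshold. The function names, the threshold, and the toy data are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def mean_class_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                        num_classes: int) -> float:
    """Average of per-class recalls; classes absent from y_true are skipped."""
    recalls = []
    for c in range(num_classes):
        idx = [i for i, t in enumerate(y_true) if t == c]
        if idx:
            recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def confidence_gated_ensemble(probs_a: Sequence[Sequence[float]],
                              probs_b: Sequence[Sequence[float]],
                              threshold: float = 0.6) -> List[int]:
    """Hypothetical rule: trust model A when confident, else average A and B."""
    preds = []
    for pa, pb in zip(probs_a, probs_b):
        if max(pa) >= threshold:
            preds.append(argmax(pa))
        else:
            blended = [(a + b) / 2.0 for a, b in zip(pa, pb)]
            preds.append(argmax(blended))
    return preds


# Toy two-class probabilities for three samples (hypothetical values).
probs_a = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
probs_b = [[0.6, 0.4], [0.1, 0.9], [0.3, 0.7]]
ensemble_preds = confidence_gated_ensemble(probs_a, probs_b, threshold=0.6)

# Toy labels showing how MCA differs from overall accuracy on imbalanced data.
toy_true = [0, 1, 1, 1]
toy_pred = [0, 1, 0, 1]
toy_mca = mean_class_accuracy(toy_true, toy_pred, num_classes=2)
```

On imbalanced data, mean class accuracy weights every class equally regardless of its sample count, which may be why a minority-class improvement is reported with it rather than with overall accuracy.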

1. Audio-Visual / Vision-Language Fusion 4 papers

Cosmos 3: Omnimodal World Models for Physical AI

Can We Hear from Events? Generating Speech from Event Camera

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

2. Remote Sensing Fusion and Pansharpening 2 papers

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

3. Medical Image Fusion 3 papers

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

Hybrid Transformer-Mamba for Weakly Supervised Volumetric Medical Segmentation

4. Multi-Sensor Fusion 3 papers

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control

Odyssey: An Automotive Lidar-Inertial Odometry Dataset with GNSS-denied situations

5. Image Fusion 1 paper

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information