arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

多模态大模型

跨文本、图像、视频、音频等模态的大模型与学习方法。

今日/当前日期收录 33 信号源:cs.CV, cs.CL, cs.AI, cs.MM, eess.AS

1. 音视频多模态 14 篇

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO 版本更新 95%

Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3:面向物理AI的全模态世界模型

NVIDIA, :, Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

发表机构 * NVIDIA

专题命中 音视频多模态 :全模态世界模型,统一语言图像视频音频动作

AI总结 提出基于统一混合Transformer架构的全模态世界模型Cosmos 3,联合处理语言、图像、视频、音频和动作序列,在理解和生成任务上达到新最优,为具身智能体提供可扩展的通用骨干。

详情
AI中文摘要

我们介绍了Cosmos 3,一个全模态世界模型家族,设计用于在统一的混合Transformer架构中联合处理和生成语言、图像、视频、音频和动作序列。通过支持高度灵活的输入输出配置,Cosmos 3无缝统一了物理AI的关键模态——有效地将视觉语言模型、视频生成器、世界模拟器和世界动作模型整合到一个框架中。我们的评估表明,Cosmos 3在一系列多样化的理解和生成任务中确立了新的最优水平,展示了全模态世界模型作为具身智能体可扩展、通用骨干的能力。我们的后训练Cosmos 3模型在技术报告撰写时被Artificial Analysis评为最佳开源文本到图像和图像到视频模型,并被RoboArena评为最佳策略模型。为了加速物理AI领域的开放研究和部署,我们在Linux基金会的OpenMDW-1.1许可证下提供我们的代码、模型检查点、策划的合成数据集和评估基准,网址为https://this https URL License at this https URL }{ this http URL and this https URL。项目网站位于https://this https URL。

英文摘要

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2606.14702 2026-06-18 cs.CV 新提交 90%

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

OmniVideo-100K:通过结构化脚本和证据链进行音视频推理的数据集

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

发表机构 * Nanjing University(南京大学) CASIA(中国科学院自动化研究所)

专题命中 音视频多模态 :音视频推理数据集与问答

AI总结 提出OmniVideo-100K数据集,通过实体锚定视频脚本和线索引导的QA生成机制,解决音视频问答中跨段实体不一致和长时推理不足的问题,微调模型在多个基准上取得显著提升。

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

详情
AI中文摘要

当前的音视频问答(QA)自动化流水线通常采用“视频-字幕-QA”范式。然而,这些方法通常将视频分割成短片段,并为音频和视觉模态生成独立的描述。这种解耦处理切断了声音与其视觉来源之间的固有关联,而独立的片段处理常常导致同一实体在不同片段中的描述不一致。此外,将长文本理解和QA合成耦合到单一步骤中,往往将模型限制在局部事件上,生成的问答缺乏长期时间连接和深度跨模态推理。为了解决这些问题,我们提出了一种自动化数据引擎,包含两种机制:(1)**实体锚定视频脚本**将视频转换为结构化脚本,包括摘要、主要实体列表和逐片段的音视频描述。实体列表作为全局先验,确保跨片段引用一致性并重建音视频关联。(2)**线索引导的QA生成**提示模型首先从脚本中挖掘跨片段、多模态线索,然后基于这些高价值线索生成QA对。利用这一流水线,我们构建了指令微调数据集**OmniVideo-100K**和人工验证的测试集**OmniVideo-Test**。在OmniVideo-100K上微调VITA-1.5、Qwen2.5-Omni-7B和Qwen3-Omni-30B,在OmniVideo-Test上获得了高达20.59%的性能提升,并在Daily-Omni和JointAVBench等现有基准上表现出强大的泛化能力(提升高达12.64%)。

英文摘要

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a ``video-caption-QA'' paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) \textbf{Entity-Anchored Video Scripting} transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) \textbf{Clue-Guided QA Generation} prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset \textbf{OmniVideo-100K} and a human-verified test set, \textbf{OmniVideo-Test}. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

2602.04796 2026-06-18 eess.AS cs.SD 版本更新 90%

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

LALM-as-a-Judge:用于多轮口语对话安全评估的大型音频语言模型基准测试

Amir Ivry, Shinji Watanabe

发表机构 * Computer Engineering, Technion--Israel Institute of Technology, Haifa, Israel(技术学院电子工程系,技术离子技术研究所,以色列海法) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA(语言技术研究所,卡内基梅隆大学,美国匹兹堡)

专题命中 音视频多模态 :音频语言模型安全评估基准

AI总结 针对口语对话中社会不安全内容评估仍以文本为中心、忽略韵律和转录失败的问题,提出包含24000个多轮口语对话的开放基准,评估6种大型音频语言模型在文本、音频和多模态设置下的敏感性、严重性顺序特异性和轮次位置偏差,发现音频提供非词汇证据,多模态增益非普遍且存在多种模式。

Comments Accepted to ICML 2026

详情
AI中文摘要

对口语对话中社会不安全内容的评估仍然以文本为中心,忽略了韵律和转录失败。我们提出了LALM-as-a-Judge,其中包括一个包含24000个多轮口语对话的开放基准,每个对话包含一个局部不安全轮次,这些对话基于8个社会不安全类别和5个严重级别生成。我们评估了6种大型音频语言模型(LALMs)作为评判者,包括开源和闭源模型,在纯文本、纯音频和多模态设置下,针对对话中社会有害内容的敏感性、严重性顺序特异性和轮次位置偏差。结果表明,音频提供了超越转录语义的非词汇证据,并且多模态增益并非普遍存在,而是可以表现为文本锚定、平衡、保守和干扰,我们将这些归因于音频路径瓶颈和融合限制。我们将该基准定位为诊断工具,并为模型、模态和提示选择提供实践者指导。

英文摘要

Evaluation of socially unsafe content in spoken dialogues remains text-centric, missing prosody and transcription failures. We present LALM-as-a-Judge, which includes an open benchmark of 24,000 multi-turn spoken dialogues with one localized unsafe turn, generated out of 8 socially unsafe categories and 5 severity levels. We evaluate 6 large audio-language models (LALMs) as judges, open and closed-source, in text-only, audio-only, and multimodal setups by their sensitivity, severity-order specificity, and turn-position bias for socially harmful content in the dialogue. Results show that audio contributes non-lexical evidence beyond transcript semantics and that multimodal gains are not universal but can be text-anchored, balanced, conservative, and interfering, which we link to the audio pathway bottlenecks and fusion limits. We position the benchmark as diagnostic and derive practitioner guidance for model, modality, and prompts choices.

2601.13836 2026-06-18 cs.CL cs.CV cs.MM 版本更新 90%

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

FutureOmni:从全模态上下文中评估多模态大语言模型的未来预测能力

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳分校) National University of Singapore(新加坡国立大学)

专题命中 音视频多模态 :评估多模态大模型从音视频预测未来的能力

AI总结 提出FutureOmni基准,评估多模态大模型从音视频线索预测未来的能力,发现现有模型在语音密集场景下表现差,并设计OFF训练策略提升性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)展现出强大的全模态感知能力,但它们从音视频线索预测未来事件的能力仍未被充分探索,因为现有基准主要关注回顾性理解。为弥补这一差距,我们引入了FutureOmni,这是第一个旨在从音视频环境中评估全模态未来预测的基准。评估模型需要执行跨模态因果和时间推理,并有效利用内部知识预测未来事件。FutureOmni通过可扩展的LLM辅助、人在回路流水线构建,包含8个主要领域的919个视频和1034个多项选择问答对。对13个全模态和7个仅视频模型的评估表明,当前系统在音视频未来预测方面存在困难,尤其是在语音密集场景中,Gemini 3 Flash达到最佳准确率64.8%。为缓解这一局限,我们整理了一个7K样本的指令微调数据集,并提出全模态未来预测(OFF)训练策略。在FutureOmni以及流行的音视频和仅视频基准上的评估表明,OFF增强了未来预测和泛化能力。我们公开发布所有代码(此 https URL )和数据集(此 https URL )。

英文摘要

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2606.19157 2026-06-18 eess.AS cs.CL 新提交 85%

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

IndicContextEval:评估8种印度语言音频大语言模型上下文利用能力的基准

Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

发表机构 * AI4Bharat, Indian Institute of Technology Madras, India(AI4Bharat,印度理工学院马德拉斯分校) Sarvam AI, India(Sarvam AI,印度)

专题命中 音视频多模态 :评估音频大语言模型的上下文利用能力

AI总结 提出IndicContextEval基准,包含8种印度语言555位说话人的56小时自然语音,通过7级提示框架评估音频大语言模型是否真正利用上下文而非依赖参数化知识。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

音频大语言模型(AudioLLMs)能够基于文本提示(如领域描述或实体列表)进行语音识别。然而,尚不清楚这些模型是真正利用此类上下文,还是依赖预训练期间学到的参数化知识。现有基准无法回答这个问题,因为它们仅在固定提示条件下评估转录,且很少包含明确的上下文输入。我们引入IndicContextEval,这是一个56小时的多语言基准,包含来自8种印度语言和23个专业领域的555位说话人的自然语音。我们设计了一个7级提示框架,逐步引入上下文信号,包括元数据、自然语言描述、英语和本地文字的实体列表,以及包含错误实体的对抗性提示。评估五个模型揭示了上下文利用行为的显著差异,凸显了对音频大语言模型中上下文基础进行显式评估的必要性。

英文摘要

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

2606.18924 2026-06-18 cs.SD 新提交 85%

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

谁赢得冲突?音频大模型中文本偏差的机制可解释性

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电子工程学院)

专题命中 音视频多模态 :分析音频大模型中文本偏差机制

AI总结 本文通过机制分析揭示音频大模型中的文本主导偏差,发现文本路径主动抑制完整音频表征,并提出无训练干预方法back-patching以增强音频表征,缓解文本主导。

Comments Preprint

详情
AI中文摘要

虽然音频大模型在多模态理解方面表现出色,但它们存在文本主导偏差,即模型盲目偏向文本而忽视声学证据,导致幻觉。然而,当音频和文本输入相互矛盾时,这些模型内部行为的底层机制尚未被探索。在这项工作中,我们通过追踪内部表征在层间的传播,首次对这一现象进行了机制分析。我们的研究揭示了三个关键发现:(i)文本主导在模型中系统性地且经验性地存在;(ii)虽然文本和音频依赖功能不同的路径,但它们最终在后期层中汇聚到一个共享语义空间;(iii)文本路径不会擦除音频信息,而是主动抑制完整的音频表征。基于这些见解,我们利用back-patching,一种无训练干预方法,将后期层的音频激活路由回早期层。这放大了音频表征,使其能够克服文本抑制。我们的评估表明,back-patching持续减少文本主导,为冲突下的机制性多模态对齐铺平了道路。

英文摘要

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS 新提交 85%

Continuous Audio Thinking for Large Audio Language Models

面向大型音频语言模型的连续音频思考

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

发表机构 * KAIST(韩国科学技术院)

专题命中 音视频多模态 :提出CoAT框架,增强音频语言模型的连续音频思考能力。

AI总结 提出连续音频思考(CoAT)框架,通过专家蒸馏在连续潜在空间中组织声学信息,使音频语言模型在生成响应前利用丰富声学特征,无需额外自回归解码成本,在多个音频任务上提升性能。

Comments Preprint

详情
AI中文摘要

大型音频语言模型(LALMs)在从语音转录到音乐分析等多种音频理解任务中展现了令人印象深刻的能力。然而,由于LALMs通常被训练生成与文本对齐的响应,其隐藏状态逐渐为文本生成而塑造,而非保留声学信息。因此,音频携带的多样化声学内容,如语音细节、韵律、声音事件、情感和音调,在过程中丢失,难以在响应中利用。我们引入了连续音频思考(CoAT),这是一个框架,为音频语言模型配备一个连续的潜在工作空间,用于在响应生成之前组织声学信息,并通过音频专家的蒸馏进行基础化。在思考空间内,模型可以在生成响应时利用专家蒸馏提供的丰富声学信息。此外,所提出的连续思考块可以在单个预填充中处理,因此CoAT不需要比基线额外的自回归解码成本。在三个LALM上,Qwen2-Audio、Qwen2.5-Omni-7B和Audio Flamingo~3,在涵盖音频推理、音频理解、音乐分类、语音情感和语音转录的广泛基准套件上的性能提升证明了CoAT的有效性。进一步分析证实,辅助监督从思考位置传播到模型的文本响应。

英文摘要

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo~3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

2606.06170 2026-06-18 eess.AS 版本更新 85%

CoSTA: Cognitive-State-Conditioned TTS Data Augmentation Using ASR Transcripts for Alzheimer's Disease Detection

CoSTA: 基于认知状态条件的TTS数据增强,使用ASR转录文本用于阿尔茨海默病检测

Yin-Long Liu, Yuanchao Li, Yiming Wang, Yue Li, Rui Feng, Jiaxin Chen, Shaobo Liu, Liu He, Yuang Chen, Jiahong Yuan, Zhen-Hua Ling

专题命中 音视频多模态 :TTS与ASR结合的多模态数据增强用于AD检测

AI总结 提出CoSTA框架,通过认知状态条件TTS模型合成语音,结合ASR转录文本进行数据增强,在ADReSS数据集上实现85.83%的音频检测准确率。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

基于语音的阿尔茨海默病(AD)检测受限于稀缺的病理语音数据。为此,我们提出CoSTA,一种基于文本转语音(TTS)的数据增强框架。具体而言,我们首先通过适配CosyVoice2和F5-TTS开发了两个认知状态条件(CS-Cond)TTS模型,以合成具有不同AD和健康对照特征的语音。此外,通过构建包含人工转录(MT)和36个自动语音识别(ASR)转录的转录池,我们研究了文本来源对基于TTS的数据增强的影响。我们还进行了增强因子分析和测试时增强。在ADReSS数据集上的实验表明,CS-Cond TTS显著提升了合成语音的效用,且ASR驱动的增强通常优于MT驱动的增强。最后,CoSTA相比基线获得了4.16%的提升,在ADReSS测试集上实现了85.83%的纯音频准确率,并超越了先前的方法。

英文摘要

Speech-based Alzheimer's Disease (AD) detection is constrained by scarce pathological speech data. To address this, we propose CoSTA, a Text-to-Speech (TTS)-based data augmentation framework. Specifically, we first develop two Cognitive-State-Conditioned (CS-Cond) TTS models by adapting CosyVoice2 and F5-TTS to synthesize speech with distinct AD and Healthy Control characteristics. Furthermore, by constructing a transcript pool comprising Manual Transcripts (MT) and 36 Automatic Speech Recognition (ASR) transcripts, we investigate the impact of text sources on TTS-based augmentation. We also perform augmentation-factor analysis and test-time augmentation. Experiments on the ADReSS dataset show that CS-Cond TTS significantly improves synthetic speech utility, and ASR-driven augmentation frequently outperforms MT-driven augmentation. Finally, CoSTA yields a 4.16% gain over the baseline, achieving an audio-only accuracy of 85.83% on the ADReSS test set and outperforming prior methods.

2605.26672 2026-06-18 cs.MM cs.SD 版本更新 85%

Can We Hear from Events? Generating Speech from Event Camera

我们能从事件中听到声音吗?从事件相机生成语音

Jingping Fang, Lin Chen, Chenyang Xu, Tong Zhao, Weidong Cai, Xiaoming Chen

发表机构 * Beijing Technology and Business University(北京技术与商业大学) Xidian University(西安电子科技大学) Tongji University(同济大学) University of Sydney(悉尼大学)

专题命中 音视频多模态 :事件相机生成语音,跨模态语音生成

AI总结 提出EventSpeech框架,利用神经形态事件相机的高时间精度解决传统RGB语音生成中的时间粒度不匹配问题,实现情感丰富且抗运动模糊的语音生成。

详情
AI中文摘要

传统的基于RGB的语音生成面临时间粒度不匹配问题,因为固定的相机曝光时间不可避免地模糊了渲染情感语音所需的高频发音瞬态。为了打破这一限制,我们提出EventSpeech,这是一个新颖的文本条件框架,率先利用神经形态事件进行表达性语音生成,因为这些微秒级精确的事件自然与声学波形动态对齐。我们的架构集成了一个专用的事件编码器来建模稀疏的神经形态事件,以及一个多尺度音频编码器,其中包含分层小波上下文器(HWC)。双向对齐机制无缝地将语言内容和视觉动态与密集的声学特征同步。此外,我们构建了EVT-SPK作为第一个基准,包括大规模合成数据和来自专用神经形态硬件的真实世界记录。大量评估表明,EventSpeech通过保留细粒度情感和抵抗运动模糊,显著优于当前基线,为多模态语音生成建立了新范式。代码和演示可在https://xrfang-0102.github.io/EventSpeechWeb/获取。

英文摘要

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

2509.22363 2026-06-18 cs.LG eess.AS 版本更新 85%

Investigating Faithfulness in Large Audio Language Models

大型音频语言模型中的忠实性研究

Pooneh Mousavi, Lovenya Jain, Mirco Ravanelli, Cem Subakan

发表机构 * Concordia University(康科迪亚大学) Mila - Quebec AI Institute(魁北克人工智能研究院) Université Laval(拉瓦尔大学) Birla Institute of Technology and Science, Pilani(比拉理工学院和科学学院,皮兰尼)

专题命中 音视频多模态 :评估大型音频语言模型的推理忠实性

AI总结 提出系统框架评估大型音频语言模型在推理链忠实性上的表现,定义三个音频忠实性标准,并通过基准测试发现模型推理与音频输入存在脱节。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

大型音频语言模型(LALMs)将音频编码器与预训练的大型语言模型集成,以执行复杂的多模态推理任务。虽然这些模型可以生成思维链(CoT)解释,但这些推理链的忠实性仍不清楚。在这项工作中,我们提出了一个系统框架来评估LALMs中CoT在输入音频和最终模型预测方面的忠实性。我们定义了音频忠实性的三个标准:无幻觉、整体性和专注聆听。我们还引入了一个基于音频和CoT干预的基准来评估忠实性\footnote{基准测试界面和评估结果可在以下网址获取:https://this https URL。}。在Audio Flamingo 3和Qwen2.5-Omni上的实验表明存在潜在的多模态脱节:推理通常与最终预测一致,但并不总是强烈基于音频,并且可能容易受到幻觉或对抗性扰动的影响。

英文摘要

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness\footnote{The benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/. Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

2603.10827 2026-06-18 cs.SD cs.AI 版本更新 85%

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

语音感知大语言模型的说话人验证:评估与增强

Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, Najim Dehak

发表机构 * Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA(约翰霍普金斯大学电气与计算机工程系) Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA(约翰霍普金斯大学人机语言技术中心卓越中心)

专题命中 音视频多模态 :语音感知大模型用于说话人验证

AI总结 提出模型无关的评分协议评估语音感知LLM的说话人区分能力(EER>20%),并通过注入冻结的ECAPA-TDNN说话人嵌入和LoRA微调,实现接近专用系统的性能(EER 1.03%)。

Comments 3 Tables, 1 Figure, Published in Interspeech 2026

详情
AI中文摘要

语音感知大语言模型(LLMs)可以接受语音输入,但其训练目标主要强调语言内容或特定领域(如情感或说话人性别),尚不清楚它们是否编码了说话人身份。首先,我们提出了一种模型无关的评分协议,该协议利用Yes/No令牌概率的置信度分数或对数似然比,为仅API模型和开放权重模型生成连续验证分数。使用该协议,我们评估了最近的语音感知LLMs,观察到较弱的说话人区分能力(在VoxCeleb1上EER高于20%)。其次,我们引入了一种轻量级增强方法,通过可学习的投影注入冻结的ECAPA-TDNN说话人嵌入,并仅训练LoRA适配器,使LLM具备自动说话人验证(ASV)能力。在TinyLLaMA-1.1B上,得到的ECAPA-LLM在VoxCeleb1-E上实现了1.03%的EER,接近专用说话人验证系统,同时保留了自然语言接口。

英文摘要

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.

2603.05128 2026-06-18 eess.AS cs.SD 版本更新 85%

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

PolyBench:多声部音频中组合推理的基准测试

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

发表机构 * Harbin University of Science and Technology(哈尔滨理工大学) The University of Melbourne(墨尔本大学) KAIST(韩国成均馆大学) University of Surrey(萨里大学)

专题命中 音视频多模态 :多声部音频组合推理基准测试

AI总结 针对多声部音频中组合推理评估缺失的问题,提出PolyBench基准,包含计数、分类、检测、并发和时长估计五个子集,评估发现现有大音频语言模型在多声部场景下性能持续下降。

Comments Accepted by INTERSPEECH 2026

详情
AI中文摘要

大型音频语言模型(LALMs)在音频推理方面能力日益增强,然而现有基准对多声部音频(多个声音事件同时发生并产生组合结构)中的推理覆盖有限。为弥补这一空白,我们引入了PolyBench,这是一个旨在评估多声部音频中组合推理的基准,包含五个评估子集,涵盖计数、分类、检测、并发和时长估计,所有这些都需要对多个并发事件及其关系进行推理。我们对最先进的LALMs的评估揭示了在多声部设置中性能持续下降,表明当前LALMs存在根本性瓶颈。

英文摘要

Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio, yet existing benchmarks offer limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. To address this gap, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio, comprising five evaluation subsets that cover counting, classification, detection, concurrency, and duration estimation, all of which require reasoning over multiple concurrent events and their relations. Our evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic settings, indicating a fundamental bottleneck in current LALMs.

2606.19203 2026-06-18 eess.AS 新提交 80%

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

DASH: 基于多层隐藏表示的双视角自蒸馏用于鲁棒语音识别

Jaeeun Baik, Ui-Hyeop Shin, Jiwoon Lee, Woocheol Jeong, Hyung-Min Park

专题命中 音视频多模态 :提出自蒸馏框架提升语音识别鲁棒性,属于音频处理

AI总结 提出DASH自蒸馏框架,通过双视角学习干净-噪声一致性,从多层编码器蒸馏隐藏表示并最小化原型分配分布的KL散度,在保持干净准确率的同时提升噪声鲁棒性,额外开销仅约微调时间的4%。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

自动语音识别(ASR)在现实噪声环境中常常性能下降,因此噪声鲁棒性对于部署至关重要。有监督的噪声增强微调是一种常见的补救措施,但它可能引入鲁棒性与干净性能之间的权衡,并过度拟合特定噪声,导致干净条件下的识别性能下降。我们提出了DASH,一种自蒸馏框架,通过从配对视图中学习干净-噪声一致性来提高鲁棒性。DASH从多个编码器层蒸馏隐藏表示,以捕获从低级声学到高级语义的特征,并通过最小化干净视图和噪声视图的原型分配分布之间的KL散度来稳定训练。在LibriSpeech上的实验表明,DASH在保持干净准确率的同时,在各种噪声条件下持续提高识别性能,这是通过在标准微调之外增加一个无标签的预训练阶段实现的,额外开销极小(约为微调时间的4%)。

英文摘要

Automatic Speech Recognition (ASR) often degrades in real-world noisy environments, making noise robustness essential for deployment. Supervised noise-augmented fine-tuning is a common remedy, but it can introduce a robustness-clean trade-off and overfit to specific corruptions, degrading recognition in clean conditions. We propose DASH, a self-distillation framework that improves robustness by learning clean--noisy consistency from paired views. DASH distills hidden representations from multiple encoder layers to capture features from low-level acoustics to high-level semantics, and stabilizes training by minimizing KL divergence between prototype assignment distributions of clean and noisy views. Experiments on LibriSpeech show that DASH consistently improves recognition under diverse noisy conditions while preserving clean accuracy, achieved by a label-free pre-training stage with minimal additional overhead (about 4% of fine-tuning time) beyond standard fine-tuning.

2606.05739 2026-06-18 cs.SD eess.AS 版本更新 80%

Do speech foundation models perceive speaker similarity as humans do?

语音基础模型是否像人类一样感知说话人相似性?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

发表机构 * Keio University, Japan(庆应大学,日本) The University of Tokyo, Japan(东京大学,日本)

专题命中 音视频多模态 :语音基础模型说话人嵌入与人类感知比较

AI总结 本研究通过比较40多个语音基础模型的说话人嵌入与人类主观相似性评分,探究模型距离是否与人类感知一致,并识别影响模型与人类感知一致性的关键配置因素。

Comments Accepted by INTERSPEECH 2026. Camera-ready version

详情
AI中文摘要

本研究对语音基础模型的说话人嵌入与人类对说话人相似性的主观感知进行了比较分析。人类听众能够在一个连续尺度上判断说话人的相似性,辨别两个声音的相似程度。相比之下,语音基础模型将说话人特征嵌入到数值表示中。然而,一个问题仍然存在:这些模型中说话人嵌入之间的数值距离是否真正与人类感知的相似性一致?为了解决这个问题,我们使用超过40个模型进行了全面调查,将模型导出的距离与人类感知的相似性评分进行比较。此外,我们确定了模型配置中的哪些因素对产生反映人类感知的说话人嵌入贡献最大。我们的发现为开发更具感知基础的语音基础模型提供了见解。

英文摘要

This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representation. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.

2. 图文多模态 13 篇

2606.01711 2026-06-18 cs.CV 版本更新 90%

Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

通过纠正失真改进视觉令牌减少以实现高效多模态大语言模型推理

Hyeonwoo Cho, Donghyeon Baek, Yewon Kim, Bumsub Ham

发表机构 * KAIST(韩国科学技术院)

专题命中 图文多模态 :多模态大模型视觉令牌减少,提升推理效率

AI总结 提出RESTORE框架,通过校准位置和注意力失真来改进视觉令牌减少,在保持效率的同时提升多模态大语言模型性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务中取得了显著成功,但大量视觉令牌带来的二次计算复杂度导致了严重的内存和延迟瓶颈。虽然已经探索了视觉令牌减少(VTR)策略来缓解这一负担,但现有方法忽略了完整序列与减少序列之间的位置和注意力一致性,导致表示失真。为此,我们提出RESTORE,一种新颖的VTR框架,在保持效率的同时纠正位置和注意力失真。具体来说,我们提出一种简单而有效的校准方法,通过基于相对距离增强注意力权重来恢复丢失的视觉注意力。我们还引入了一种独特的锚点选择用于令牌合并,以减轻特征平均过程中的信息损失。在多个基准上的实验结果表明,我们的方法持续提高了各种减少方法的准确性,在保持计算效率的同时实现了最先进的性能。

英文摘要

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency. Project page is available at https://cvlab.yonsei.ac.kr/projects/RESTORE

2606.19120 2026-06-18 cs.LG cs.CV 新提交 85%

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

先看后思:解耦感知与推理以实现抗捷径的多模态在策略自蒸馏

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

发表机构 * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences(机器人与智能系统国家重点实验室,沈阳自动化研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学)

专题命中 图文多模态 :MLLM后训练框架,解耦感知与推理

AI总结 提出ViGOS框架,通过解耦感知和推理,在MLLM后训练中避免文本捷径,提升图像依赖行为。

Comments 29 pages, 5 figures, 8 tables

详情
AI中文摘要

在策略自蒸馏(OPSD)训练模型在其自身rollouts上,并使用冻结副本提供基于参考目标的密集token级目标。这对于LLM推理效果良好,但直接扩展到多模态大语言模型(MLLMs)可能产生捷径:特权目标可能主要基于文本参考目标而非图像来引导token。我们提出ViGOS,一种视觉引导的OPSD框架用于MLLM后训练。学生首先编写视觉描述,然后推理出最终答案。对于有效rollouts,仅图像的感知教师监督描述,而特权推理教师监督同一学生前缀上的推理和最终答案。仅对无效rollouts使用参考教师以恢复输出格式。在通用视觉-语言、专家推理、视觉数学、空间定位和视觉-语言先验基准测试中,ViGOS保持了OPSD的主要优势,并在易产生捷径的设置中改善了图像引导行为。

英文摘要

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2606.18988 2026-06-18 cs.AI 新提交 85%

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception: 一种用于可解释多模态欺骗检测的渐进式强化学习框架

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

发表机构 * Xi'an Jiaotong-Liverpool University(西安交通大学利物浦大学)

专题命中 图文多模态 :引入多模态大模型进行可解释欺骗检测,结合视觉和音频。

AI总结 提出ThinkDeception框架,将多模态大语言模型引入欺骗检测,通过逐步推理和视觉-音频一致性组相对策略优化(VAC-GRPO)实现可解释的认知推理,在主流基准上达到新SOTA。

Comments 10pages,4figures

详情
AI中文摘要

多模态欺骗检测对于识别欺诈意图至关重要,然而现有方法主要依赖于端到端的黑箱范式。这些方法严重缺乏可解释性,无法提供透明的推理轨迹,也难以明确捕捉欺骗行为中固有的细微跨模态不一致性。为了超越这些限制,我们提出了ThinkDeception,一个新颖且可解释的多模态欺骗检测框架。作为开创性工作,它将多模态大语言模型(MLLMs)引入该领域,将欺骗检测从传统的二分类任务转变为显式的认知推理过程。借助首个精心标注的逐步多模态思维链(CoT)数据集,我们开发了基础模型ThinkDeception Base,实证验证了模态不一致性在解码欺骗中的关键作用。在此基础之上,我们的核心创新在于提出了配备渐进式训练策略的视觉-音频一致性组相对策略优化(VAC-GRPO)。与标准GRPO不同,我们将训练数据分为四个渐进难度等级,引导模型经历基于心理学的从易到难的认知转变。通过创新地将这一动态课程调度器与多维度的过程感知奖励机制及反思学习范式相结合,我们显著提升了模型的整体推理质量。在主流基准上的大量实验表明,ThinkDeception建立了新的SOTA,在检测准确性和推理质量上均显著优于现有方法。最终,这项工作成功地将欺骗检测领域推向可解释的多模态认知推理。

英文摘要

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end to end black--box paradigms. These methods suffer from a severe lack of interpretability failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle, cross modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step--by--step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization(VAC--GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy--to--hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi dimensional, process aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

2606.18780 2026-06-18 cs.CV cs.CL cs.MM 新提交 85%

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

SAMA:面向统一低资源多模态信息抽取的语义锚定对齐增强

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

专题命中 图文多模态 :多模态信息抽取,利用多专家MLLM增强数据。

AI总结 提出语义锚定对齐增强框架SAMA,通过构建结构化语义锚引导多专家多模态大模型生成高保真文本,并利用锚保留扩散机制合成图像,结合双约束过滤模块,在低资源多模态信息抽取任务中显著提升性能。

Comments Accepted by IEEE Transactions on Multimedia

详情
AI中文摘要

多模态信息抽取(MIE)——涵盖多模态命名实体识别(MNER)、关系抽取(MRE)和事件抽取(MEE)等任务——对于理解多媒体内容至关重要,但受到严重数据稀缺的限制。尽管数据增强是一种有前景的补救措施,但现有方法受到粗粒度跨模态对齐和碎片化、任务特定设计的阻碍,未能利用共享语义知识。为克服这些限制,我们引入了语义锚定对齐多模态增强(SAMA),一个用于生成高保真、任务感知合成数据的统一框架。SAMA从真实标签构建结构化语义锚,以指导协作多专家多模态大语言模型(CME-MLLM),该模型集成了用于共享语义的通用适配器和任务特定适配器,以生成多样且符合约束的文本样本。对于图像合成,SAMA采用锚保留扩散机制,使用锚加权提示和潜在条件来维持关键语义锚,同时多样化视觉上下文。为消除人工验证需求,SAMA进一步引入双约束过滤模块,基于跨模态一致性和锚保真度选择合成样本。在MNER、MRE和MEE基准数据集上的大量实验表明,SAMA在全监督和低资源设置下均一致优于最先进的增强基线,突显了其通用性、鲁棒性和有效性。

英文摘要

Multimodal Information Extraction (MIE)-covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE)-is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

2606.17030 2026-06-18 cs.CV 新提交 85%

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Qwen-RobotWorld技术报告:通过语言条件视频生成统一具身世界模型

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

发表机构 * Qwen Team(Qwen团队)

专题命中 图文多模态 :融合视觉与语言的多模态世界模型

AI总结 提出Qwen-RobotWorld,一种以自然语言为统一动作接口的语言条件视频世界模型,通过双流MMDiT、大规模具身世界知识语料和渐进式课程训练,在机器人操作、自动驾驶等任务中实现物理一致的未来视觉轨迹预测,在多个基准上取得最优结果。

详情
AI中文摘要

我们介绍Qwen-RobotWorld,一种用于具身智能的语言条件视频世界模型。以自然语言作为统一动作接口,它从当前观测预测物理上合理的未来视觉轨迹,涵盖机器人操作、自动驾驶、室内导航和人到机器人迁移。这种统一公式提供了三个有前景的应用方向:用于策略训练增强的合成数据生成、用于策略评估的可扩展虚拟环境,以及用于下游机器人控制的语言引导规划信号。这是通过三部分设计实现的:a) 双流MMDiT与MLLM动作编码,其中60层双流扩散变压器通过逐层联合注意力将冻结的Qwen2.5-VL语义与视频VAE潜变量耦合;b) 具身世界知识(EWK),一个860万视频-文本语料库(2亿+帧),包含20+种具身形态和500+动作类别的动作-语言映射;c) 通用+专家渐进式课程,一种两阶段训练策略,首先学习通用视觉先验,然后在共享语言接口下注入具身专门化。广泛的结果显示出强竞争力:在EWMBench和DreamGen Bench上总体排名第一,在WorldModelBench和PBench上优于所有开源模型。在RoboTwin-IF基准上的额外零样本分析进一步支持了鲁棒泛化和多视图一致性。

英文摘要

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.15088 2026-06-18 cs.SD cs.CL eess.AS 新提交 85%

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

当相同的音乐知识以不同方式遗忘:路径依赖遗忘的干净探测

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

发表机构 * Institute of Information Engineering, CAS(中国科学院信息工程研究所) School of Cyber Security, UCAS(中国科学院大学网络空间安全学院) The University of Western Australia(西澳大利亚大学) Beihang University(北京航空航天大学)

专题命中 图文多模态 :研究多模态模型中知识遗忘路径依赖

AI总结 提出配对路径控制协议(PPCP),发现多模态模型中通过文本路径获取的知识比音频路径更易遗忘,且该效应不受架构深度影响,主要源于输入表示差异。

详情
AI中文摘要

一个模型可以通过听音频或阅读文本描述来学习钢琴曲《致爱丽丝》是平静而沉思的,但当这些知识后来面临遗忘风险时,获取路径是否重要?多模态模型中的遗忘研究衡量了在适应过程中丢失了哪些知识,但尚未探究获取路径是否影响知识被遗忘的难易程度。我们将这个未经检验的前提称为路径不变假设。音乐理解提供了一个干净的测试,因为一段音乐剪辑和一段规范的文本描述可以对齐到相同的感知内容,使得相同的知识单元可以通过听或读进入模型,而目标保持不变。在多个架构不同的音频-语言模型中,我们观察到一致的不对称性:在相同的适应压力下,文本路径知识比匹配的音频路径知识更容易被遗忘。为了将这种效应归因于路径而非混淆因素,我们引入了配对路径控制协议(PPCP),这是一个三阶段设计,建立匹配的路径基线,在相同的知识池上以对称监督激活两条路径,并对两条路径施加相同的遗忘压力。这种差距在模型间和增益控制分析中稳定存在,当矛盾覆盖被替换为正确标签的跨域学习时仍然存在,在单模态压力下仍然存在,并且不会被轻量级重放消除。两个独立的路径深度控制证实,该效应不能由架构深度解释,表明输入表示是主导因素。在PPCP下,我们的结果表明遗忘高度依赖于路径,将获取路径确立为遗忘研究和多模态系统设计的一个新的分析维度。

英文摘要

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.

2606.18974 2026-06-18 cs.CV 新提交 80%

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Visual-OPSD:用于高效统一多模态推理的跨模态在策略自蒸馏

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

发表机构 * Xi’an Jiaotong University(西安交通大学) MOE KLINNS Lab(MOE KLINNS实验室) Shaanxi Province Key Laboratory of Big Data Knowledge Engineering(陕西省大数据知识工程重点实验室) Sun Yat-sen University(中山大学)

专题命中 图文多模态 :跨模态自蒸馏将视觉推理能力转移到纯文本模型。

AI总结 提出Visual-OPSD方法,通过跨模态在策略自蒸馏,将多步扩散生成的可视化思维推理能力转移到纯文本学生模型,实现14.3倍加速且性能提升3.40个百分点。

详情
AI中文摘要

统一多模态模型(UMMs)将生成的“可视化思维”(VTs)与文本推理交错以改进空间任务。这导致多步扩散带来大约一个数量级的推理成本。我们发现这种成本带来的直接收益有限。在ThinkMorph上,移除或噪声化VTs在九个基准上几乎不改变准确率。一旦渲染,注意力集中在VT上,无论其内容如何。然而,KL诊断表明,以特权VT轨迹为条件会改变模型的完成分布。这表明生成路径编码了超出渲染像素的有用推理。受此差距启发,我们提出了Visual On-Policy Self-Distillation(Visual-OPSD)。教师和学生共享相同权重,但上下文不同:教师看到特权VTs,而学生只看到问题。在策略学生轨迹上的token级JSD蒸馏将教师的推理转移到纯文本学生。在九个基准上,Visual-OPSD相比其生成教师提高了$+3.40$个百分点,加速$14.3\times$(每个样本10.0秒 vs. 142.8秒),并在VSP上比同规模VLM提高了$+63.83$个百分点。高斯噪声控制(真实VT为$+0.40$pp vs. $+10.28$pp)和$58.4\%$的KL差距闭合证实,收益来自生成路径的语义内容。

英文摘要

Unified multimodal models (UMMs) interleave generated ''visual thoughts'' (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation(Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.

2606.18893 2026-06-18 cs.CL 新提交 80%

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

学习鲁棒的成对置信度用于多模态情感-原因对提取

Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

发表机构 * Institute for Advanced Studies(先进研究院) Universiti Malaya(马来大学) School of Information Engineering(信息工程学院) Suqian University(宿州学院) Digitization Department(数字化部门)

专题命中 图文多模态 :多模态情感-原因对提取,学习鲁棒置信度

AI总结 提出RPCL框架,通过置信度差异边界约束和对抗性扰动,增强多模态情感-原因对提取中成对置信度的判别性和稳定性,在三个数据集上提升Pair F1约2.6-2.8个百分点。

Comments 11 pages, 3 figures, 5 tables

详情
AI中文摘要

多模态情感-原因对提取(MECPE)需要候选对上的可靠成对置信度。现有的成对评分器通常对有效候选使用成对级别的交叉熵,这大多独立地处理链接。这使得竞争原因之间的相对置信度几何结构约束不足,允许黄金对接近硬负例或依赖偶然的非黄金上下文。我们将这种脆弱性研究为成对置信度脆弱性,并提出RPCL(鲁棒成对置信度学习),一种仅用于训练的成对置信度学习框架。RPCL鼓励成对置信度既具有判别性又具有稳定性:通过置信度差异边界约束将黄金对与行方向硬负例分离,并将干净成对预测与来自损坏视图的预测对齐,其中非黄金上下文话语表示被部分损坏。在推理时,原始的干净成对评分器和解码流水线保持不变。在ECF、MECAD和MEC4上,RPCL在全文本-音频-视频设置下将三种子平均Pair F1相对于匹配基线模型提高了2.58到2.83个百分点,并在所有三个数据集上提高了平均Pair AUPRC。诊断分析进一步显示更大的黄金-负例置信度差距和更低的边界违反严重性。这些结果表明,显式塑造成对置信度是MECPE的一种有效训练策略。

英文摘要

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.

2606.18710 2026-06-18 cs.CR 新提交 80%

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

分布式多模态大模型推理框架上的图像提示重建攻击

Xinjian Luo, Hongyan Chang, Jianxin Wei, Yuncheng Wu, Xiaofeng Gao, Meikang Qiu, Ting Yu, Xue Liu

专题命中 图文多模态 :分布式MLLM图像提示重建攻击。

AI总结 研究分布式MLLM推理中中间嵌入泄露图像提示的风险,提出两种被动黑盒攻击方法MPAA和IEDA,实现像素级和语义级图像重建。

详情
AI中文摘要

分布式大语言模型(LLM)推理框架将孤立的消费级设备连接起来进行大规模模型推理,大幅降低了硬件限制。然而,最近的研究表明,参与者之间传输的中间嵌入可能会泄露私有提示。随着LLM演变为多模态LLM(MLLM),这种风险已扩展到文本之外:图像提示包含丰富的视觉和语义信息,使其中间嵌入高度隐私敏感。然而,分布式MLLM推理中的图像提示泄露问题在很大程度上尚未被探索。在本文中,我们研究了分布式MLLM框架中由中间嵌入引起的输入图像隐私风险。我们首先分析了从图像像素到中间表示的信息流。由于图像和文本嵌入通常在MLLM各层中交织,我们设计了一种图像嵌入提取算法作为重建攻击的前提,在我们的实验中,该算法在几乎所有MLLM层上实现了100%的提取准确率。在此基础上,我们开发了两种被动的黑盒图像重建攻击:MPAA和IEDA,反映了来自知识有限、能力有限的正常参与者的现实威胁。MPAA通过逐块信息提取和组装进行细粒度像素级重建,而IEDA通过嵌入引导的扩散生成进行粗粒度语义重建。我们在四个代表性的MLLM系列上评估了我们的攻击:Gemma 3、Phi 4 Multimodal、Qwen 2.5 VL和Llama 4 Scout。结果显示在各种设置下均具有一致优越的重建性能。我们进一步分析了MoE架构、图像预处理、模型大小和文本-图像依赖关系对攻击性能的影响。据我们所知,这是对MLLM图像重建攻击的首次研究。

英文摘要

Distributed large language model (LLM) inference frameworks connect isolated consumer-grade devices for large-scale model inference, substantially reducing hardware constraints. However, recent studies show that intermediate embeddings transmitted among participants can leak private prompts. As LLMs evolve into multimodal LLMs (MLLMs), this risk extends beyond text: image prompts contain rich visual and semantic information, making their intermediate embeddings highly privacy-sensitive. Yet, image-prompt leakage in distributed MLLM inference remains largely unexplored. In this paper, we investigate privacy risks to input images caused by intermediate embeddings in distributed MLLM frameworks. We first analyze the information flow from image pixels to intermediate representations. Since image and text embeddings are often intertwined across MLLM layers, we design an image embedding extraction algorithm as a prerequisite for reconstruction attacks, achieving 100% extraction accuracy across almost all MLLM layers in our experiments. Building on this, we develop two passive black-box image reconstruction attacks, MPAA and IEDA, reflecting realistic threats from normal participants with limited knowledge and capability. MPAA performs fine-grained pixel-level reconstruction via patch-wise information extraction and assembly, while IEDA performs coarse-grained semantic reconstruction through embedding-guided diffusion generation. We evaluate our attacks on four representative MLLM families: Gemma 3, Phi 4 Multimodal, Qwen 2.5 VL, and Llama 4 Scout. Results show consistently superior reconstruction performance in various settings. We further analyze the effects of MoE architecture, image preprocessing, model size, and text-image dependency on attack performance. To our knowledge, this is the first study of image reconstruction attacks on MLLMs.

2606.18262 2026-06-18 cs.HC 新提交 75%

When Prompts Mislead: Textual Dominance and Diagnostic Bias in MLLMs

当提示误导:多模态大语言模型中的文本主导与诊断偏差

Inhyuk Park, Doohyun Park

专题命中 图文多模态 :研究多模态LLM在医学诊断中的文本主导偏差。

AI总结 研究揭示在医学多模态大语言模型中,文本提示会主导视觉线索,导致诊断偏差,即使模型具备空间定位能力,提示策略仍可能不安全。

Comments Accepted to the CVPR 2026 MMFM-BIOMED Workshop

详情
AI中文摘要

多模态大语言模型(MLLMs)正越来越多地被评估用于医疗应用,其中计算约束通常使提示策略成为微调之外唯一实用的替代方案。这类策略通常被认为支持诊断推理,但其在医学MLLMs中的潜在故障模式仍缺乏特征描述。我们分析了开源眼科MLLM FundusExpert-1B,在公共BRSET数据集上执行出血与玻璃膜疣的鉴别任务,该数据集被用作我们分析的受控测试平台。(i) 通过人工注入标记的受控探针证实,模型保留了粗粒度的区域级空间定位能力。(ii) 与零样本推理相比,单样本文本提示使预测偏向于提示的发现。(iii) 当叠加的病灶轮廓与不一致的文本声明配对时,文本提示覆盖了正确的视觉线索:整体准确率从仅视觉条件下的75%下降到46%,而思维链(CoT)推理与进一步退化而非自我纠正相关。尽管仅限于单个模型和数据集,我们的发现表明,仅靠提示策略可能不足以实现医学MLLMs的安全临床部署。

英文摘要

Multimodal large language models (MLLMs) are increasingly being evaluated for medical applications, where computational constraints often make prompting strategies the only practical alternative to fine-tuning. Such strategies are generally assumed to support diagnostic reasoning, yet their potential failure modes in medical MLLMs remain poorly characterized. We analyze FundusExpert-1B, an open-source ophthalmology MLLM, on a hemorrhage versus drusen discrimination task using the public BRSET dataset, adopted here as a controlled testbed for our analysis. (i) A controlled probe with artificially injected markers confirms that the model retains coarse, region-level spatial grounding. (ii) Compared with zero-shot inference, one-shot textual prompts bias predictions toward the prompted finding. (iii) When an overlaid lesion contour is paired with an inconsistent textual claim, the textual prompt overrides the correct visual cue: overall accuracy drops from 75% to 46% relative to the visual-only condition, and Chain-of-Thought (CoT) reasoning is associated with further degradation rather than self-correction. Although limited to a single model and dataset, our findings suggest that prompting strategies alone may be insufficient for the safe clinical deployment of medical MLLMs.

2606.18661 2026-06-18 cs.CV cs.AI 新提交 70%

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench:一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

发表机构 * Central South University(中南大学)

专题命中 图文多模态 :多模态数据集包含图像、掩码和文本描述

AI总结 提出指令驱动智能体框架,包含多模态数据集LandslideBench、滑坡专用视觉语言模型LandslideVLM及领域规则增强智能体LandslideAgent,实现自主滑坡识别与分析。

详情
AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要,然而当前范式难以同时提取视觉特征和高层次地球科学语义,而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战,我们提出一个指令驱动的智能体框架,包含三个组成部分。首先,通过多VLM交叉验证和交互式标注构建LandslideBench,这是一个多模态细粒度数据集,包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后,通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM,以增强地质语义理解。最后,以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent,采用双规则控制器,结合结构化报告元数据约束和交叉验证识别约束,来调控自动化工具调用。实验表明,LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理,实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18441 2026-06-18 cs.CV 新提交 70%

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集:视频多模态大语言模型中视觉焦点的一致性帧对齐

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) Beijing University of Posts and Telecommunications(北京邮电大学) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

专题命中 图文多模态 :涉及视频多模态大语言模型推理优化

AI总结 提出无时间标注的过程级奖励框架CF-GRPO,通过视频内在线索构建一致性帧先验,并利用一致性帧奖励优化模型帧使用与先验的对齐,提升视频推理性能。

详情
AI中文摘要

强化学习提升了大型语言模型的推理能力,但将仅结果奖励应用于视频多模态大语言模型(Video-MLLMs)时,对哪些视觉证据应支持答案提供的指导有限。受多感官整合启发(其中一致的线索可以增强感知估计的显著性和可靠性),我们引入了一致性帧GRPO(CF-GRPO),一种无需时间标注的过程级奖励框架,用于证据感知的视频推理。CF-GRPO从内在视频线索中构建一致性帧先验,包括时间覆盖、场景转换线索和查询条件化的视觉相关性。然后,它从视觉和响应表示中计算模型侧的帧使用分数,并通过一致性帧奖励(CFR)优化它们的一致性。通过显著性感知的稀疏聚合和分布锐化,CFR提供了高对比度的奖励信号,无需人工时间标注。实验表明,VideoCFR在复杂视频推理基准上取得了有竞争力的性能,并在多个指标上优于代表性的Video-MLLM和RL基线,同时一致性先验提供了训练中强调的证据帧的可解释视图。实现代码见:https://this https URL。

英文摘要

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2604.18109 2026-06-18 cs.CL cs.SD 版本更新 70%

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

FLiP:理解和解释多模态多语句子嵌入

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz

发表机构 * Brno University of Technology(布拉格技术大学)

专题命中 图文多模态 :多模态多语句子嵌入的理解与解释

AI总结 提出因子化线性投影(FLiP)模型,从多语言、多模态句子嵌入中恢复词汇内容,揭示编码器的模态和语言偏差。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

本文提出了因子化线性投影(FLiP)模型,用于理解预训练句子嵌入空间。我们训练FLiP模型从多语言(LaBSE)、多模态(SONAR)和基于API(Gemini)的句子嵌入空间中恢复多种高资源和中等资源语言的词汇内容。我们表明,FLiP可以从嵌入中召回超过75%的词汇内容,显著优于现有的非因子化基线。使用此作为诊断工具,我们揭示了所选句子编码器的模态和语言偏差,并为从业者提供了关于编码器的内在见解,而无需依赖传统的下游评估任务。我们的实现已公开,链接见此:https://this URL。

英文摘要

This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is public https://github.com/BUTSpeechFIT/FLiP.

3. 多模态评测 1 篇

2606.19338 2026-06-18 cs.CV 新提交 85%

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

超越当前观测:评估多模态大语言模型在可控非马尔可夫博弈中的表现

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学)

专题命中 多模态评测 :非马尔可夫博弈评估多模态模型记忆

AI总结 提出RNG-Bench基准套件,通过配对记忆和3D迷宫两个博弈,评估多模态大模型在非马尔可夫环境中重建历史观测并据此行动的能力,发现主要错误源于遗忘而非决策,微调可提升性能。

详情
AI中文摘要

将多模态基础模型部署为闭环策略时,越来越需要基于不再可见的观测来调节动作。然而,现有基准要么暴露完整状态,将隐藏状态重建与其他智能体技能混为一谈,要么仅在回合结束后测试记忆。我们引入了RNG-Bench(重建性非马尔可夫博弈),这是一个基准套件,旨在隔离基础模型在多步交互中重建过去观测并据此行动的能力。RNG-Bench包含两个互补的博弈:配对记忆,其中卡片身份在特定位置短暂显示后需被回忆;以及3D迷宫,其中自我中心视图需整合为空间地图。两个博弈都在统一的测试框架下评估,具有三个可控难度轴:网格大小、视觉模式和观测模态。该基准进一步引入了头对头对决协议以控制实例级方差,以及记忆差距指标,将遗忘与不良动作选择区分开来。最难的配置需要大约128K个token和每回合350个图像输入,前沿MLLMs远未饱和。记忆差距分析表明,大多数残余错误源于遗忘较早的观测,而非次优决策。最后,在最优策略轨迹和过滤后的模型演示上微调Qwen3.5-9B,提高了RNG-Bench的性能,并迁移到现有基准,而不降低通用多模态能力。

英文摘要

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

4. 跨模态检索 2 篇

2606.19062 2026-06-18 cs.CV 新提交 85%

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

DREAM: 通过双目标编码扩展视觉-语言模型用于跨模态检索

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

发表机构 * Sejong University(世宗大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) Ulsan National Institute of Science and Technology(乌山国立科学研究院)

专题命中 跨模态检索 :跨模态检索,双目标编码。

AI总结 提出DREAM模型,通过双路径表示增强与对齐,结合层级视觉编码器和混合语言建模,在视频检索任务中实现新SOTA。

详情
AI中文摘要

在当今媒体驱动的世界中,视频内容在监控、教育和娱乐等领域的指数级增长使得通过自然语言查询检索语义相关视频变得日益关键。早期的视频检索系统依赖于手工特征或浅层跨模态映射,限制了其捕捉复杂语义和时间动态的能力。虽然大规模视觉-语言模型改进了跨模态对齐,但在建模细粒度时间依赖和微妙语言结构方面仍存在挑战。本文介绍DREAM:双路径表示增强与对齐模型,一种通过增强视觉和文本编码来解决这些局限性的新型多模态框架。DREAM采用混合语言建模策略,结合掩码和排列语言建模目标,以捕捉局部和全局语言语义。在视觉方面,我们设计了一个具有级联组注意力的层级视觉编码器,通过多阶段令牌交互和从粗到细的注意力细化来整合空间和时间信息。我们通过在广泛使用的MSRVTT、MSVD和LSMDC基准数据集上进行全面评估来验证DREAM,分别取得了49.4%、49.7%和27.3%的新SOTA R1分数。定性分析进一步展示了模型在帧间保持连贯注意力以及将复杂查询与动态视频内容对齐的能力。这些发现强调了层级注意力和双目标文本建模在实现鲁棒、上下文感知视频检索中的有效性,并为推进跨模态表示学习的未来研究铺平了道路。

英文摘要

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.

2606.18885 2026-06-18 cs.CV cs.IR 新提交 75%

LARE: Low-Attention Region Encoding for Text-Image Retrieval

LARE: 低注意力区域编码用于文本-图像检索

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

发表机构 * Saudi Data and Artificial Intelligence Authority (SDAIA)(沙特数据与人工智能局)

专题命中 跨模态检索 :文本-图像跨模态检索

AI总结 提出LARE框架,通过并行编码低注意力区域和完整图像,解决拥挤场景下视觉编码器忽视关键细节的问题,在密集场景子集上提升检索性能。

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

详情
AI中文摘要

拥挤场景中的图像检索尤其具有挑战性,因为传统视觉编码器存在显著性偏差,倾向于关注主要对象而忽略低注意力区域,而这些区域通常对细粒度检索至关重要。我们提出了LARE(低注意力区域编码),一个显式建模这些被忽略区域的框架。LARE采用双编码策略,并行编码图像的低注意力区域和完整图像,从而产生更多样化和信息丰富的图像嵌入。为了评估拥挤场景下的图像检索性能,我们引入了Dense-Set,一个源自COCO和Flickr30K的具有挑战性的子集。在该子集中,图像被重新标注,以提供对低注意力或先前被忽略区域的更丰富描述。该数据集突显了现有检索模型的局限性,并能够在密集拥挤场景条件下进行更严格的评估。实验结果表明,所提出的框架通过在共享潜在空间中保留微妙的非主导视觉线索来提高检索性能。

英文摘要

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.