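arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Multimodal Large Models

Large models and learning methods spanning text, image, video, audio, and other modalities.

Entries collected for today/current date: 10. Sources: cs.CV, cs.CL, cs.AI, cs.MM, eess.AS
2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Version update 95%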

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

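Affiliations: NVIDIA

Topic match: Audio-visual multimodal (omnimodal world model unifying language, image, video, audio, and action)

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable, general-purpose backbone for embodied agents.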

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2602.04796 2026-06-18 eess.AS cs.SD Version update 90%

LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

Amir Ivry, Shinji Watanabe

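Affiliations: Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel; Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

Topic match: Audio-visual multimodal (safety-evaluation benchmark for audio-language models)

AI summary: Evaluation of socially unsafe content in spoken dialogue remains text-centric and overlooks prosody and transcription failures. The paper presents an open benchmark of 24,000 multi-turn spoken dialogues and evaluates 6 large audio-language models in text-only, audio-only, and multimodal setups on sensitivity, severity-order specificity, and turn-position bias. It finds that audio contributes non-lexical evidence and that multimodal gains are not universal but fall into several distinct patterns.

Comments: Accepted to ICML 2026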

Abstract

Evaluation of socially unsafe content in spoken dialogues remains text-centric, missing prosody and transcription failures. We present LALM-as-a-Judge, which includes an open benchmark of 24,000 multi-turn spoken dialogues with one localized unsafe turn, generated out of 8 socially unsafe categories and 5 severity levels. We evaluate 6 large audio-language models (LALMs) as judges, open and closed-source, in text-only, audio-only, and multimodal setups by their sensitivity, severity-order specificity, and turn-position bias for socially harmful content in the dialogue. Results show that audio contributes non-lexical evidence beyond transcript semantics and that multimodal gains are not universal but can be text-anchored, balanced, conservative, and interfering, which we link to the audio pathway bottlenecks and fusion limits. We position the benchmark as diagnostic and derive practitioner guidance for model, modality, and prompts choices.

2601.13836 2026-06-18 cs.CL cs.CV cs.MM Version update 90%

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

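Affiliations: Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

Topic match: Audio-visual multimodal (evaluating how well multimodal large models forecast the future from audio-visual input)

AI summary: Proposes the FutureOmni benchmark for evaluating how well multimodal large models forecast future events from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.

Comments: Accepted by ICML 2026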

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2606.06170 2026-06-18 eess.AS Version update 85%

CoSTA: Cognitive-State-Conditioned TTS Data Augmentation Using ASR Transcripts for Alzheimer's Disease Detection

Yin-Long Liu, Yuanchao Li, Yiming Wang, Yue Li, Rui Feng, Jiaxin Chen, Shaobo Liu, Liu He, Yuang Chen, Jiahong Yuan, Zhen-Hua Ling

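Topic match: Audio-visual multimodal (multimodal data augmentation combining TTS and ASR for Alzheimer's disease detection)

AI summary: Proposes CoSTA, a framework that synthesizes speech with cognitive-state-conditioned TTS models and uses ASR transcripts for data augmentation, reaching 85.83% audio-only detection accuracy on the ADReSS test set.

Comments: Accepted by Interspeech 2026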

Abstract

Speech-based Alzheimer's Disease (AD) detection is constrained by scarce pathological speech data. To address this, we propose CoSTA, a Text-to-Speech (TTS)-based data augmentation framework. Specifically, we first develop two Cognitive-State-Conditioned (CS-Cond) TTS models by adapting CosyVoice2 and F5-TTS to synthesize speech with distinct AD and Healthy Control characteristics. Furthermore, by constructing a transcript pool comprising Manual Transcripts (MT) and 36 Automatic Speech Recognition (ASR) transcripts, we investigate the impact of text sources on TTS-based augmentation. We also perform augmentation-factor analysis and test-time augmentation. Experiments on the ADReSS dataset show that CS-Cond TTS significantly improves synthetic speech utility, and ASR-driven augmentation frequently outperforms MT-driven augmentation. Finally, CoSTA yields a 4.16% gain over the baseline, achieving an audio-only accuracy of 85.83% on the ADReSS test set and outperforming prior methods.

2605.26672 2026-06-18 cs.MM cs.SD Version update 85%

Can We Hear from Events? Generating Speech from Event Camera

Jingping Fang, Lin Chen, Chenyang Xu, Tong Zhao, Weidong Cai, Xiaoming Chen

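Affiliations: Beijing Technology and Business University; Xidian University; Tongji University; University of Sydney

Topic match: Audio-visual multimodal (speech generation from event cameras, cross-modal speech generation)

AI summary: Proposes EventSpeech, a framework that uses the high temporal precision of neuromorphic event cameras to address the temporal granularity mismatch in conventional RGB-based speech generation, producing emotionally expressive speech that resists motion blur.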

Abstract

Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.

2509.22363 2026-06-18 cs.LG eess.AS Version update 85%

Investigating Faithfulness in Large Audio Language Models

Pooneh Mousavi, Lovenya Jain, Mirco Ravanelli, Cem Subakan

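Affiliations: Concordia University; Mila - Quebec AI Institute; Université Laval; Birla Institute of Technology and Science, Pilani

Topic match: Audio-visual multimodal (evaluating reasoning faithfulness in large audio language models)

AI summary: Proposes a systematic framework for evaluating the faithfulness of reasoning chains in large audio language models, defines three criteria for audio faithfulness, and finds through benchmarking that model reasoning can be disconnected from the audio input.

Comments: Accepted to Interspeech 2026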

Abstract

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness; the benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/. Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

2603.10827 2026-06-18 cs.SD cs.AI Version update 85%

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, Najim Dehak

Affiliations: Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

Topic match: Audio-visual multimodal (speech-aware large models for speaker verification)

AI summary: Proposes a model-agnostic scoring protocol to assess the speaker discrimination of speech-aware LLMs (EER above 20%), then injects frozen ECAPA-TDNN speaker embeddings and fine-tunes with LoRA to approach the performance of a dedicated system (1.03% EER).

Comments: 3 Tables, 1 Figure, Published in Interspeech 2026

Abstract

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
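
The scoring idea in the abstract above can be illustrated roughly as follows. This is a minimal sketch, not the authors' protocol: it assumes a model that exposes probabilities for the "Yes" and "No" tokens on a same-speaker question, turns them into a log-likelihood ratio, and estimates the equal error rate (EER) with a simple threshold sweep. The function names and the toy trial probabilities are hypothetical.

```python
import math

def llr_score(p_yes: float, p_no: float, eps: float = 1e-12) -> float:
    """Log-likelihood ratio between the 'Yes' and 'No' token probabilities."""
    return math.log(p_yes + eps) - math.log(p_no + eps)

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over every observed score."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Hypothetical (p_yes, p_no) pairs, invented purely for illustration.
same_speaker = [(0.9, 0.1), (0.8, 0.2), (0.4, 0.6), (0.7, 0.3)]
diff_speaker = [(0.2, 0.8), (0.1, 0.9), (0.6, 0.4), (0.3, 0.7)]

target = [llr_score(y, n) for y, n in same_speaker]
nontarget = [llr_score(y, n) for y, n in diff_speaker]
print(f"EER on toy trials: {equal_error_rate(target, nontarget):.2%}")
```

A continuous score such as this ratio is what makes a threshold sweep possible; a bare Yes/No answer would give only a single operating point rather than an EER.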

2603.05128 2026-06-18 eess.AS cs.SD Version update 85%

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

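Affiliations: Harbin University of Science and Technology; The University of Melbourne; KAIST; University of Surrey

Topic match: Audio-visual multimodal (benchmark for compositional reasoning in polyphonic audio)

AI summary: Addresses the lack of evaluation for compositional reasoning in polyphonic audio by proposing the PolyBench benchmark, with five subsets covering counting, classification, detection, concurrency, and duration estimation. Evaluation shows that existing large audio language models degrade consistently in polyphonic settings.

Comments: Accepted by INTERSPEECH 2026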

Abstract

Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio, yet existing benchmarks offer limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. To address this gap, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio, comprising five evaluation subsets that cover counting, classification, detection, concurrency, and duration estimation, all of which require reasoning over multiple concurrent events and their relations. Our evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic settings, indicating a fundamental bottleneck in current LALMs.

2606.05739 2026-06-18 cs.SD eess.AS Version update 80%

Do speech foundation models perceive speaker similarity as humans do?

Minoru Kishi, Hayato Yagi, Shinnosuke Takamichi, Yuki Saito

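Affiliations: Keio University, Japan; The University of Tokyo, Japan

Topic match: Audio-visual multimodal (comparing speaker embeddings of speech foundation models with human perception)

AI summary: Compares the speaker embeddings of more than 40 speech foundation models with human subjective similarity ratings to test whether model distances agree with human perception, and identifies the configuration factors that most affect that agreement.

Comments: Accepted by INTERSPEECH 2026. Camera-ready version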

Abstract

This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners have the ability to judge speaker similarity on a continuous scale discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representation. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
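
One common way to compare model-derived distances with human similarity ratings is a rank correlation. The sketch below is an assumption about how such a comparison could look, not the procedure used in the paper: it computes cosine distance between hypothetical two-dimensional speaker embeddings and correlates those distances with invented listener ratings using Spearman's coefficient. All names and numbers are illustrative.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical embedding pairs and listener ratings (higher = more similar).
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]),
         ([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [-1.0, 0.2])]
human_similarity = [4.5, 3.0, 1.5, 1.0]

distances = [cosine_distance(u, v) for u, v in pairs]
print(f"Spearman rho: {spearman(distances, human_similarity):.3f}")
```

A strongly negative coefficient would indicate that pairs the model places far apart are also the pairs listeners rate as dissimilar. Rank correlation is used here because human ratings and embedding distances live on different scales.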

2603.09234 2026-06-18 eess.AS Version update 70%

StuPASE: Towards Low-Hallucination Studio-Quality Generative Speech Enhancement

Xiaobin Rong, Jun Gao, Zheng Wang, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu

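Topic match: Audio-visual multimodal (generative speech enhancement, an audio-processing topic)

AI summary: Proposes StuPASE, built on the PASE framework, which fine-tunes with dry targets and replaces the GAN with a flow-matching module to reach studio-level speech quality while keeping hallucination low, outperforming existing methods.

Comments: Accepted to Interspeech 2026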

Abstract

Achieving high perceptual quality without hallucination remains a challenge in generative speech enhancement (SE). A representative approach, PASE, is robust to hallucination but has limited perceptual quality under adverse conditions. We propose StuPASE, built upon PASE to achieve studio-level quality while retaining its low-hallucination property. First, we show that finetuning PASE with dry targets rather than targets containing simulated early reflections substantially improves dereverberation. Second, to address performance limitations under strong additive noise, we replace the GAN-based generative module in PASE with a flow-matching module, enabling studio-quality generation even under highly challenging conditions. Experiments demonstrate that StuPASE consistently produces perceptually high-quality speech while maintaining low hallucination, outperforming state-of-the-art SE methods. Audio demos are available at: https://xiaobin-rong.github.io/stupase_demo/.
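
For readers unfamiliar with flow matching, the toy sketch below shows the general idea in its common linear-interpolation form. The abstract does not say which formulation StuPASE uses, so this is a generic illustration rather than the paper's module: a network would be trained to predict the velocity that carries a noise sample `x0` to a data sample `x1`, and generation integrates that velocity with Euler steps. The vectors and the oracle velocity function are hypothetical stand-ins for real features and a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for linear interpolation."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

# Toy vectors standing in for a noise sample and a clean-speech feature.
noise = [0.0, 1.0]
clean = [2.0, -1.0]

def oracle(x, t):
    # Stand-in for a trained network: returns the exact target velocity.
    return target_velocity(noise, clean)

print(flow_matching_loss(oracle, noise, clean, t=0.3))
print(euler_sample(oracle, noise, steps=10))
```

With the oracle in place of a network the loss is zero and the Euler integration lands on the clean vector; a real model only approximates this velocity field, and in speech enhancement it would additionally be conditioned on the noisy input.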