arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Speech Recognition and Keyword Spotting (2 papers)

2606.18094 2026-06-17 cs.SD New submission

Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction

Tristan Tsoi, Jiajun Deng, Yingke Zhu, Huu Quyen Dang, Tianxiang Cao, Nikita Kuzmin, Tao Zhong, Simon Lui

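Affiliations: Central Media Technology Institute, Huawei; The Chinese University of Hong Kong; Nanyang Technological University

AI summary: Proposes Next-Turn, which trains on the time to the next speech onset. Targets come directly from speech timestamps and need no extra annotation. Endpoint accuracy within 320 ms improves by 25.9% absolute over the strongest baseline, and joint training with the duration-aware objective improves performance further.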

Comments Interspeech 2026

Abstract

Endpoint detection (EPD) is essential for natural turn-taking in streaming speech systems. However, reliably determining the endpoint of an utterance is challenging because speakers often pause mid-utterance due to hesitations and disfluencies. Semantic EPD has emerged as a promising direction to address this issue but is hindered by ambiguous supervision and strict streaming constraints. We propose Next-Turn that uses the time-to-next-speech-onset as the training objective, where targets are derived directly from speech timestamps and require no additional annotation. Experiments show that the proposed method outperforms conventional acoustic and recent semantic EPD baselines, achieving a 25.9% absolute improvement in endpoint accuracy within 320 ms over the strongest baseline. In addition, joint training with the duration-aware objective complements standard binary EPD, with gains that increase monotonically with increasing pauses.
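
The abstract says the training targets come directly from speech timestamps. A minimal sketch of how such per-frame targets could be built is given below. The function name `time_to_next_onset`, the 40 ms hop, the 2 s cap, and the choice of 0 for frames inside speech are all assumptions made for illustration, not details taken from the paper.

```python
from bisect import bisect_right


def time_to_next_onset(segments, num_frames, hop_s=0.04, cap_s=2.0):
    """Per-frame time (seconds) until the next speech onset.

    segments: sorted, non-overlapping (start, end) speech spans in seconds.
    Frames inside speech get 0.0 (an assumption); frames with no later
    speech get the cap value.
    """
    starts = [start for start, _ in segments]
    targets = []
    for i in range(num_frames):
        t = i * hop_s
        k = bisect_right(starts, t)  # segments that start at or before t
        if k > 0 and t < segments[k - 1][1]:
            targets.append(0.0)  # frame lies inside a speech segment
        elif k < len(starts):
            targets.append(min(starts[k] - t, cap_s))  # pause before next onset
        else:
            targets.append(cap_s)  # no further speech in this recording
    return targets


if __name__ == "__main__":
    print(time_to_next_onset([(0.0, 0.5), (1.0, 1.5)], num_frames=8, hop_s=0.25))
```

Short targets during a pause suggest a mid-utterance hesitation, while targets at the cap suggest a true endpoint, which is the intuition behind a duration-aware objective.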

2606.17281 2026-06-17 cs.CL cs.SD eess.AS Cross-list

Are you speaking my languages? On spoken language adherence in multimodal LLMs

Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

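Affiliations: Google DeepMind

AI summary: Addresses multimodal LLMs that produce the wrong output language during automatic speech recognition. Proposes a soft prompting approach that hints at likely spoken languages, introduces a new metric for language-adherence violations, and compares three mitigation strategies (zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning) on how well they reduce violations while preserving ASR performance.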

Comments 7 pages, 3 tables in the main body

Abstract

While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

2. Speech Synthesis and Voice Generation (3 papers)

2606.17126 2026-06-17 cs.SD cs.AI New submission

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

Joon-Seung Choi, Dong-Min Byun, Seong-Whan Lee

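Affiliations: Korea University

AI summary: Proposes the VibE-SVC2 framework, which combines an Energy Style Converter, a Zero-shot Pitch Style Converter, vibrato rate scaling, and a Subharmonic Correction algorithm to give fine-grained, independent control over pitch style and timbre style in singing voice conversion, outperforming existing methods.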

Comments Accepted to IEEE Transactions on Audio, Speech, and Language Processing (TASLP)

Abstract

Singing style is a crucial aspect of a natural and expressive singing voice. Singers utilize singing styles to convey the feeling or emotion of the songs. Several works have been proposed to control singing style for making the more expressive singing voice. Recently, VibE-SVC successfully controls vibrato by predicting high-frequency F0 contour. In this paper, we introduce a singing voice conversion framework, called VibE-SVC2, to improve singing style conversion performance and controllability. The model offers control over two types of singing styles: a pitch style and a timbre style. For the pitch style, to resolve the pitch-energy entanglement issue that is unresolved in our previous work, we introduce a novel Energy Style Converter to address remaining style information in the energy contour. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of reference audio. To expand the controllability of the model, we propose vibrato rate scaling that is an independent control of vibrato extent, which is unavailable in VibE-SVC. For the timbre style, we extend the model to handle a variety of phonation styles. However, addressing specific styles such as vocal fry poses a challenge, as conventional F0 extraction often fails due to their inherent subharmonic characteristics, which degrades the conversion quality. To address this, we propose a novel Subharmonic Correction algorithm to refine the F0 contour for more natural timbre conversion. Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over two types of singing styles, outperforming existing methods.
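
The abstract does not describe the Subharmonic Correction algorithm itself. To illustrate the kind of failure it targets, the sketch below uses a much simpler stand-in: a median-based octave-drop heuristic that doubles F0 frames sitting near half of a reference pitch. The function name `correct_octave_drops` and the 20% tolerance are hypothetical, and this is not the paper's method.

```python
from statistics import median


def correct_octave_drops(f0, tolerance=0.2):
    """Double voiced F0 frames that sit near half of the median voiced F0.

    f0: list of per-frame F0 values in Hz, with 0 marking unvoiced frames.
    """
    voiced = [f for f in f0 if f > 0]
    if not voiced:
        return list(f0)
    ref = median(voiced)
    out = []
    for f in f0:
        if f > 0 and abs(2 * f - ref) <= tolerance * ref:
            out.append(2 * f)  # likely a subharmonic (half-frequency) estimate
        else:
            out.append(f)
    return out


if __name__ == "__main__":
    print(correct_octave_drops([220, 222, 110, 111, 218, 0, 221]))
```

A global median is a crude reference for singing, where the melody itself can span an octave, so a practical version would need a local or note-aware reference.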

2606.17669 2026-06-17 cs.SD New submission

DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention

Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide

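Affiliations: Nagoya University; National Institute of Informatics

AI summary: Proposes DeSRPA, a framework that applies inference-time intervention to frozen backbones and uses a dual-level control vector mechanism to decouple cognitive reasoning from paralinguistic expression. It outperforms end-to-end fine-tuned baselines in personality and emotional consistency for speech role-playing.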

Comments Accepted to INTERSPEECH 2026

Abstract

While Large Language Models (LLMs) have revolutionized text-based role-playing, creating immersive Speech Role-Playing Agents (SRPAs) requires a seamless bridge between cognitive reasoning and paralinguistic nuances. Current SRPAs primarily rely on end-to-end (E2E) fine-tuning. However, this paradigm suffers from poor generalization to unseen characters due to its reliance on role-specific data, while imposing a "modality alignment tax" that degrades intrinsic LLM reasoning capabilities. We propose DeSRPA, an agentic framework for character role play via inference-time intervention on frozen backbones. DeSRPA employs a dual-level control vector mechanism, Internal Cognitive Steering and External Expressive Rendering, to synchronize "mind" and "voice". Experiments on SpeechRole and OmniCharacter benchmarks demonstrate that DeSRPA significantly outperforms E2E baselines in personality and emotional consistency. It achieves high speech naturalness, narrowing the gap with proprietary models like GPT-4o Audio, while remaining a scalable and training-free paradigm.

2602.03420 2026-06-17 cs.SD cs.LG Updated version

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

Siyi Wang, Shihong Tan, Siyi Liu, Hong Jia, Gongping Huang, James Bailey, Ting Dang

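AI summary: Proposes an activation-steering framework for hybrid TTS models that enables composable mixed-emotion synthesis and text-emotion mismatch synthesis, and finds that emotional prosody is produced mainly by the language module rather than the flow-matching module.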

Abstract

Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.
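
Activation steering via latent direction vectors can be sketched generically as adding weighted unit directions to a hidden state. The example below assumes a difference-of-means recipe for the directions and uses hypothetical names (`direction`, `steer`). The abstract does not say how the paper extracts or applies its vectors, so treat this as an illustration of the general idea only.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def direction(emotional, neutral):
    """Unit-length difference between mean emotional and mean neutral activations."""
    diff = [e - n for e, n in zip(mean_vector(emotional), mean_vector(neutral))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def steer(hidden, directions, weights):
    """Add each direction, scaled by its weight, to one hidden-state vector."""
    out = list(hidden)
    for vec, alpha in zip(directions, weights):
        out = [h + alpha * v for h, v in zip(out, vec)]
    return out


if __name__ == "__main__":
    neutral = [[0.0, 0.0], [0.0, 0.0]]
    happy = direction([[2.0, 0.0], [4.0, 0.0]], neutral)
    sad = direction([[0.0, -1.0], [0.0, -3.0]], neutral)
    # Mixed emotion: mostly happy with a smaller sad component.
    print(steer([1.0, 1.0], [happy, sad], [0.5, 0.25]))
```

Because the update is a weighted sum, several emotions compose by passing more than one direction, which is what makes mixed-emotion control possible in this framing.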

3. Speaker Recognition, Verification, and Separation (1 paper)

2606.17416 2026-06-17 cs.SD cs.AI New submission

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

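Affiliations: Department of Artificial Intelligence, Korea University

AI summary: Language-dependent acoustic variability entangles speaker identity with linguistic characteristics in multilingual speaker verification. L-Proto, a language-aware episodic prototypical training strategy, builds language-consistent training episodes to reduce language-driven variation and improve cross-lingual generalization.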

Comments Accepted by INTERSPEECH 2026

Abstract

Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual training, embeddings often encode language cues with speaker identity, causing speakers to form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling speakers from a single language per episode, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity. Experiments on the TidyVoice Challenge benchmark demonstrate consistent performance improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
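
Sampling speakers from a single language per episode can be sketched as a small episode sampler. The data layout (`utts_by_lang`), the function name `sample_episode`, and the support/query split are assumptions for illustration. The abstract does not give the paper's episode sizes or its loss.

```python
import random


def sample_episode(utts_by_lang, n_speakers, n_support, n_query, rng):
    """Draw one language-consistent episode.

    utts_by_lang: {language: {speaker: [utterance ids]}}
    Returns (language, {speaker: (support ids, query ids)}).
    """
    need = n_support + n_query
    # Keep only languages that have enough speakers with enough utterances.
    eligible = {}
    for lang, speakers in utts_by_lang.items():
        usable = [spk for spk, utts in speakers.items() if len(utts) >= need]
        if len(usable) >= n_speakers:
            eligible[lang] = usable
    lang = rng.choice(sorted(eligible))  # one language for the whole episode
    episode = {}
    for spk in rng.sample(sorted(eligible[lang]), n_speakers):
        picked = rng.sample(utts_by_lang[lang][spk], need)
        episode[spk] = (picked[:n_support], picked[n_support:])
    return lang, episode


if __name__ == "__main__":
    data = {
        "en": {"e1": ["a", "b", "c"], "e2": ["d", "e", "f"], "e3": ["g", "h", "i"]},
        "ko": {"k1": ["j", "k", "l"], "k2": ["m", "n", "o"]},
    }
    print(sample_episode(data, n_speakers=2, n_support=2, n_query=1, rng=random.Random(0)))
```

Since every speaker in an episode shares a language, the prototypes in that episode cannot be told apart by language cues, so the embedding has to rely on speaker identity.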

4. Speech Enhancement, Denoising, and Audio Restoration (3 papers)

2606.17259 2026-06-17 eess.AS cs.SD Cross-list

Intelligibility of Speech in Noise: Investigating Contribution of Magnitude and Phase Spectra

Bhanu Teja Nellore, Sudarsana Reddy Kadiri, Rohit Kumar, Karan Nathwani, Suryakanth V Gangashetty

Affiliations: Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA; National Institute of Technology, Patna, India; Indian Institute of Technology, Jammu, India; Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur District, Andhra Pradesh, India

AI summary: Three experiments assess how the magnitude and phase spectra contribute to consonant intelligibility in noise. The magnitude spectrum contributes more in clean conditions, phase-spectrum information is more robust in noise, and nasals are more susceptible to noise than fricatives and approximants.

Abstract

It is well known that intelligibility of speech reduces in the presence of ambient noise. However, studies show that all sounds are not affected uniformly (or equally) and that vowels are more robust to noise than consonants. In this study, intelligibility of various consonants is assessed and analyzed in stationary white noise and non-stationary babble noise conditions. Specifically, this study investigates the individual contribution of magnitude and phase spectra of a given speech signal on human speech recognition of consonants in noisy conditions. In this regard, three experiments are carried out. In experiment 1, clean signal, signal reconstructed with only magnitude spectrum information (magnitude only signal) and signal reconstructed with only phase spectrum information (phase only signal) are assessed for intelligibility. In experiment 2, noise is added to clean speech. From noisy speech, phase only signal and magnitude only signal are reconstructed and intelligibility tests are performed for all these three signals. In experiment 3, noise is added directly to the magnitude only and phase only signals reconstructed from clean speech and their intelligibility is assessed. Results of these experiments show that magnitude spectrum contributes more to intelligibility in clean condition than phase spectrum, while information from phase spectrum is more robust in noisy conditions. It is also observed that, among consonants, nasals are more susceptible to noise whereas fricatives and approximants were observed to be comparatively more robust.
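
A magnitude-only or phase-only signal can be illustrated on a toy sequence with a plain DFT. The sketch below assumes whole-signal processing with zero phase for the magnitude-only case and unit magnitude for the phase-only case. The study presumably works on short-time frames of speech, and its exact reconstruction settings are not given in the abstract.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def magnitude_only(x):
    """Keep the magnitude spectrum and set every phase to zero."""
    return idft([abs(c) for c in dft(x)])


def phase_only(x, eps=1e-12):
    """Keep the phase spectrum and set every non-zero magnitude to one."""
    return idft([c / abs(c) if abs(c) > eps else 0.0 for c in dft(x)])


if __name__ == "__main__":
    delayed_impulse = [0.0, 1.0, 0.0, 0.0]
    print([round(v, 6) for v in magnitude_only(delayed_impulse)])
    print([round(v, 6) for v in phase_only(delayed_impulse)])
```

For a delayed impulse, the magnitude-only version moves the impulse back to sample 0 while the phase-only version keeps it at sample 1. This shows in miniature that timing information lives in the phase spectrum.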

2506.13127 2026-06-17 cs.SD eess.AS Updated version

Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

Affiliations: School of Computer Science, Nanjing Audit University; School of Communication Engineering, Nanjing Institute of Technology; School of Information Science and Engineering, Southeast University; Cardiff University; Inner Mongolia University; CHI – the Chair of Health Informatics, TUM University Hospital; GLAM – the Group on Language, Audio, & Music, Imperial College London; Xiaomi EV

AI summary: Proposes a fusion framework with time-frequency calibrated knowledge distillation for speech enhancement. It combines local information focusing with global knowledge circulation and improves the performance of a low-complexity student model.

Comments submitted to IEEE Transactions on Cognitive and Developmental Systems

Abstract

In this paper, we propose an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation (I$^2$SRF-TFCKD) for SE. Different from previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while facilitating both local information focusing and global knowledge circulation. Firstly, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through recursive fusion to form the fused feature set that enables inter-set knowledge interaction. Secondly, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$SRF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.

2512.16420 2026-06-17 cs.SD Updated version

DPDFNet: Boosting DeepFilterNet2 via Dual-Path RNN

Daniel Rika, Nino Sapir, Ido Gus

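AI summary: Proposes DPDFNet, which adds dual-path blocks to the DeepFilterNet2 encoder to strengthen long-range temporal and cross-band modeling. Combined with a loss term that mitigates over-attenuation and a fine-tuning stage, it improves on DeepFilterNet2 across several benchmarks and achieves real-time performance on edge NPUs.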

Comments Accepted manuscript version. Accepted for publication in Speech Communication

Abstract

We present DPDFNet, a causal single-channel speech enhancement model that extends DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. In addition, we demonstrate that adding a loss component to mitigate over-attenuation in the enhanced speech, combined with a fine-tuning phase tailored for "always-on" applications, leads to substantial improvements in overall model performance. We evaluate DPDFNet on the standard VoiceBank+DEMAND and DNS4 blind test benchmarks, where it shows consistent gains over DeepFilterNet2 and strong overall performance against other causal open-source models. In addition, we introduce a supplementary multilingual low-SNR evaluation set comprising long recordings in 12 languages across everyday noise scenarios, on which DPDFNet delivers superior performance to other causal open-source models, including some that are substantially larger and more computationally demanding. We also propose an holistic metric named PRISM, a composite, scale-normalized aggregate of intrusive and non-intrusive metrics, which demonstrates clear scalability with the number of dual-path blocks. We further demonstrate on-device feasibility by deploying DPDFNet on Ceva-NeuPro-Nano edge NPUs. Results indicate that DPDFNet-4, our second-largest model, achieves real-time performance on NPN32 and runs even faster on NPN64, confirming that state-of-the-art quality can be sustained within strict embedded power and latency constraints.

5. Audio Event Detection and Scene Understanding (2 papers)

2606.17160 2026-06-17 cs.SD New submission

Transductive Zero-Shot Audio Classification with Audio-Language Models

Jingwen Zhou, Mingzhe Wang

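Affiliations: Xidian University, Xi'an, China

AI summary: Proposes a text-anchored spherical Gaussian-mixture EM that refines zero-shot posteriors using the audio-embedding statistics of the test batch, with no labels and no gradients. Top-1 accuracy improves by 4.6 to 9.2 points on three datasets.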

Abstract

Contrastive language-audio pretraining (CLAP) enables zero-shot audio classification, but standard inference classifies each clip in isolation and ignores the structure of the unlabeled test set. We present the first systematic study of TransCLIP-style transductive inference for CLAP: a text-anchored spherical Gaussian-mixture EM that refines zero-shot posteriors using the audio-embedding statistics of the test batch, with no labels, no gradients, and negligible compute (about 15 ms on one CPU core for 2,000 clips). Across ESC-50, UrbanSound8K, and VocalSound, this consistently improves top-1 accuracy by +4.6 to +9.2 points over the zero-shot baseline (e.g., 89.1 -> 94.8% on ESC-50, 73.8 -> 81.8% on UrbanSound8K). We further show that the gain (i) is governed by a simple operating boundary -- roughly 2.5 test samples per class per batch are required, with diminishing returns beyond ~5; (ii) is complementary to entropy-guided prompt weighting, with the combination reaching 96.2% on ESC-50; and (iii) attenuates but remains positive under long-tailed batches (+4.9 -> +3.1 points at a 20:1 imbalance), which we report as an explicit limitation. We also document a negative result: on TUT Urban Acoustic Scenes 2018, where zero-shot CLAP is near chance, transduction has no signal to amplify.
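
The idea of refining zero-shot posteriors with batch statistics can be sketched as a small EM loop over unit-normalised embeddings. This is a simplified illustration, not the paper's objective. The function name `transductive_em`, the temperature, the anchor weight, the shared variance `sigma2`, and the use of the zero-shot posterior as a prior in the E-step are all assumptions.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [x / z for x in e]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def transductive_em(audio, text, temp=10.0, anchor=1.0, sigma2=0.1, iters=10):
    """Refine zero-shot class posteriors for a batch of audio embeddings."""
    audio = [normalize(a) for a in audio]
    text = [normalize(t) for t in text]
    k = len(text)
    # Zero-shot posteriors from scaled cosine similarity to the text embeddings.
    zero_shot = [softmax([temp * dot(a, t) for t in text]) for a in audio]
    post = zero_shot
    for _ in range(iters):
        # M-step: class means are posterior-weighted audio sums, anchored to text.
        means = []
        for c in range(k):
            acc = [anchor * x for x in text[c]]
            for a, p in zip(audio, post):
                acc = [m + p[c] * x for m, x in zip(acc, a)]
            means.append(normalize(acc))
        # E-step: spherical Gaussian log-likelihood plus log zero-shot prior.
        post = []
        for a, z in zip(audio, zero_shot):
            logits = [
                math.log(z[c] + 1e-12)
                - sum((x - m) ** 2 for x, m in zip(a, means[c])) / (2 * sigma2)
                for c in range(k)
            ]
            post.append(softmax(logits))
    return zero_shot, post


if __name__ == "__main__":
    audio = [[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    _, refined = transductive_em(audio, text)
    print([p.index(max(p)) for p in refined])
```

The loop needs no labels and no gradients, only sums over the batch, which is consistent with the negligible compute the abstract reports. It also makes clear why very few samples per class give the class means little to work with.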

2606.17775 2026-06-17 cs.SD cs.AI cs.NE New submission

A Neuromorphic Trigger for Efficient Audio Event Detection

Benjamin Hatton, Oliver Rhodes, Luca Peres

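Affiliations: ICNS, University of Manchester

AI summary: Proposes a low-cost front-end trigger based on a spiking neural network (SNN) that selectively gates audio segments. It reaches an F1 score of 0.97 on anomalous sound detection and shows a potential 42.6x reduction in FLOPs on sound event detection.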

Comments 9 pages, 4 figures, 6 tables

Abstract

Efficient processing of continuous audio streams remains a key challenge for real-time and resource-constrained systems. This paper introduces a neuromorphic trigger for audio event detection, based on a spiking neural network (SNN) that selectively gates input to downstream models. The proposed trigger acts as a low-cost front-end, identifying salient audio segments and forwarding only these to a more computationally intensive model for tasks such as classification. The trigger is implemented as a lightweight fully connected SNN and evaluated on two representative tasks: Anomalous Sound Detection (ASD) and Sound Event Detection (SED). For ASD, the trigger achieves a one-second segment-based F1 score of 0.97 on a class-agnostic form of the URBAN-SED dataset, demonstrating high reliability in identifying relevant audio regions. For SED, the trigger is combined with the Dang classifier on the DCASE 2017 Challenge Task 2 dataset, showing a potential $42.6\times$ reduction in FLOPs while reducing the lower bound of the event-based error rate from 0.41 to 0.25. These results highlight the potential of neuromorphic triggers as real-time, energy-efficient front-end filters, enabling substantial reductions in computational cost.
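
The gating idea can be illustrated with a single leaky integrate-and-fire neuron driven by frame energy. The paper's trigger is a fully connected SNN, so the one-neuron model, the names `lif_trigger` and `gate`, and the leak and threshold values below are simplifying assumptions.

```python
def lif_trigger(frame_energy, leak=0.9, threshold=1.0):
    """Return one boolean per frame: True where the neuron spikes."""
    v = 0.0
    spikes = []
    for e in frame_energy:
        v = leak * v + e  # leaky integration of the input energy
        if v >= threshold:
            spikes.append(True)
            v = 0.0  # reset the membrane potential after a spike
        else:
            spikes.append(False)
    return spikes


def gate(frames, spikes):
    """Forward only the frames on which the trigger fired."""
    return [f for f, s in zip(frames, spikes) if s]


if __name__ == "__main__":
    energy = [0.1, 0.1, 0.1, 0.9, 0.9, 0.05]
    spikes = lif_trigger(energy, leak=0.5)
    print(spikes)
    print(gate(["f0", "f1", "f2", "f3", "f4", "f5"], spikes))
```

The downstream classifier runs only on forwarded frames, so its cost scales with the fraction of frames that fire. That is the mechanism behind a FLOPs reduction of the kind the abstract reports.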

6. Music Information Retrieval and Music Generation (1 paper)

2606.17301 2026-06-17 cs.SD cs.LG New submission

Turning music identification into a neural forward pass

Muhammad Taimoor Haseeb, Ahmad Hammoudeh, Gus Xia

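Affiliations: Music X Lab; Mohamed Bin Zayed University of Artificial Intelligence

AI summary: Performs music identification in a single neural forward pass with a generative transformer. It surpasses state-of-the-art acoustic fingerprinting on short audio excerpts, reduces external storage to 0.33% of the baseline footprint, and improves inference latency by 2.3x (p95).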

Abstract

Search, a foundational operation in computer science, maps a query to a matching item in a collection. It is typically implemented as a System-2 like, rule-based pipeline in which a key is computed, an index is probed, and candidates are verified. By contrast, human recognition resembles a System-1 like, associative model of identity recovery, in which even partial cues can trigger a recall without explicitly enumerating, ranking, or even accessing discrete candidates. Here, we show that music sound identification, a difficult search problem, can be performed in a single neural feed-forward pass by a generative transformer. Trained on an audio dataset, the model predicts the corresponding track identifier from a short audio excerpt. This approach surpasses state-of-the-art acoustic fingerprinting, with the largest gains for short audio segments (1 second), demonstrating the method is not only viable but advantageous. Moreover, it reduces external storage to 0.33% of the baseline footprint and improves inference latency by 2.3x (p95). Furthermore, the model can reject queries for unseen tracks, supporting open-set operation while reducing misattribution risk. Using music track identification as an example, this work reframes search, bringing it closer in spirit to human associative recognition and away from algorithmic database lookup.

7. Speech Translation and Speech Language Models (1 paper)

2606.17417 2026-06-17 cs.SD cs.LG New submission

A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models

Apoorva Kulkarni, Kaousheik Jayakumar, Sreyan Ghosh, Sarah Wiegreffe, Dinesh Manocha, Ramani Duraiswami

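Affiliations: University of Maryland, College Park

AI summary: Behavioral and causal mechanistic analyses of temporal reasoning failures in large audio-language models show that models under-use audio when textual cues are available, but that modality imbalance alone does not explain the failures. Attention scaling at bottleneck layers raises accuracy from 55.9% to 59.1% without fine-tuning.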

Comments Accepted to Interspeech 2026

Abstract

Large Audio Language Models (LALMs) achieve strong performance on a variety of audio understanding tasks but continue to struggle with temporal reasoning, a fundamental capability central to human auditory perception. Understanding the causes of these failures remains challenging as existing benchmarks report performance gaps without probing underlying mechanisms. To address this, we introduce a benchmark with 1,657 questions across three foundational tasks designed specifically for mechanistic analysis. Examining model outputs across varying input settings (behavioral analysis) reveals that models often under-utilize audio when textual cues are available. We also provide the first causal mechanistic analysis of temporal reasoning failures in LALMs. Comparing attention upweighting against scaling, we find that redistributing attention across audio tokens is more effective than increasing audio attention. Targeting task-relevant tokens yields further gains. These findings suggest that modality imbalance alone cannot explain failures. Attention scaling at bottleneck layers improves accuracy from 55.9% to 59.1% without fine-tuning, demonstrating a promising direction for future work.
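
One plausible reading of an attention intervention on audio tokens is to scale the attention weights at audio positions and renormalise the row. The sketch below shows only that generic operation on a single attention row. The function name `scale_audio_attention`, the mask layout, and the factor are assumptions, and the abstract does not specify how the paper implements upweighting or scaling.

```python
def scale_audio_attention(weights, audio_mask, factor):
    """Scale attention weights on audio tokens, then renormalise to sum to 1.

    weights: one row of post-softmax attention weights.
    audio_mask: True where the attended token is an audio token.
    """
    scaled = [w * factor if is_audio else w
              for w, is_audio in zip(weights, audio_mask)]
    total = sum(scaled)
    return [w / total for w in scaled]


if __name__ == "__main__":
    row = [0.1, 0.1, 0.4, 0.4]          # two audio tokens, two text tokens
    mask = [True, True, False, False]
    print(scale_audio_attention(row, mask, factor=3.0))
```

In this toy row the attention mass on audio rises from 0.2 to about 0.43. A targeted variant would restrict the mask to task-relevant audio tokens, which matches the abstract's remark that targeting such tokens yields further gains.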

8. Multimodal Audio and Audio-Visual Learning (1 paper)

2603.19697 2026-06-17 eess.AS cs.MM cs.SD Updated version

Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction

Doyeop Kwak, Suyeon Lee, Joon Son Chung

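AI summary: Proposes Plug-and-Steer, which decouples separation from target selection: a frozen audio-only backbone handles high-fidelity separation, and a Latent Steering Matrix performs target selection for audio-visual target speaker extraction.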

Comments Accepted by Interspeech 2026; demo available https://plugandsteer.github.io

Abstract

The goal of this paper is to provide a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation and target selection. Conventional AV-TSE systems typically integrate audio and visual features deeply to re-learn the entire separation process, which can act as a fidelity ceiling due to the noisy nature of in-the-wild audio-visual datasets. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method effectively preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to that of the original backbones. Audio samples are available at: https://plugandsteer.github.io
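
A linear transformation that re-routes latent features to a designated channel can be pictured as a small matrix applied across speaker streams. In the sketch below the latent is assumed to hold one feature vector per stream, and the steering matrix is hand-written as a swap. The function name `apply_steering` is hypothetical, and how the paper derives its matrix from the visual input is not stated in the abstract.

```python
def apply_steering(latents, matrix):
    """Mix C latent streams with a C x C steering matrix.

    latents: list of C streams, each a list of D feature values.
    matrix: C rows of C mixing coefficients.
    """
    n_streams = len(latents)
    n_dims = len(latents[0])
    return [
        [sum(matrix[i][j] * latents[j][d] for j in range(n_streams))
         for d in range(n_dims)]
        for i in range(len(matrix))
    ]


if __name__ == "__main__":
    latents = [[1, 2], [3, 4]]   # stream 1 is assumed to hold the target speaker
    swap = [[0, 1], [1, 0]]      # route stream 1 to output channel 0
    print(apply_steering(latents, swap))
```

Because the backbone weights stay frozen and only this small matrix changes, the separation quality of the audio-only model is left intact while the target always appears on the same output channel.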

9. Datasets, Benchmarks, and Evaluation (6 papers)

2606.18135 2026-06-17 cs.SD cs.AI New submission

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)

Sinclair Gurny, Ryan Quinn

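Affiliations: Certus Innovations

AI summary: Introduces C3GD, a publicly accessible gunshot dataset with more than 8000 field-collected data points from 28 firearms across 16 calibers. It targets caliber classification and also supports gunshot detection, with detailed metadata for generalization and academic analysis.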

Abstract

In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible data set developed for the analysis of firearm muzzle blast sounds. The dataset aims to provide a wide variety of firearms, calibers, cartridges, microphones, and microphone locations with metadata detailed beyond what is currently otherwise available. It comprises more than 8000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified and real-world reference. The dataset aims to provide enough diversity to be able to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.

2606.17339 2026-06-17 cs.AI cs.CL cs.SD Cross-list

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

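Affiliations: University of Toronto

AI summary: Proposes the SpeechDx benchmark, which spans 12 datasets and 27 tasks organized by the stage of speech production they disrupt (conceptualization, formulation, articulation). An evaluation of 12 audio encoders finds that large-scale speech models are the strongest overall baselines, but no current representation generalizes reliably.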

英文摘要

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

2606.17404 2026-06-17 eess.AS cs.SD 交叉投稿

ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

ELSA: 面向细粒度无参考文本到音频评估的声学事件级语义对齐

Shuntaro Suzuki, Kento Tokura, Daichi Yashima, Kanon Amemiya, Komei Sugiura, Shinnosuke Takamichi

发表机构 * Keio University(庆应义塾大学)

AI总结 提出ELSA指标,依据文本查询中的声学事件分解生成音频并评估事件级对齐,实现细粒度无参考文本到音频评估,在四个基准上比现有指标更符合人类评分。

Comments Accepted for presentation at Interspeech2026

详情
AI中文摘要

文本到音频(TTA)生成,即从自然语言合成音频,因其能够捕捉精确的用户意图而被广泛研究。为了有效推进TTA模型,必须在不依赖昂贵的人类主观评分的情况下可靠地评估生成的音频,这促使开发与人类判断高度相关的自动评估指标。虽然最近的基于CLAP的指标提供了实用的无参考解决方案,但其粗粒度的文本-音频相似度匹配往往与人类评分的相关性较差。为了解决这个问题,我们提出了ELSA,一种用于细粒度文本-音频对齐的无参考评估指标。ELSA在从文本查询中提取的不同声学事件的引导下对生成的音频进行分解,并评估事件级对齐。在四个TTA基准上的实验表明,ELSA与人类主观评分的相关性高于先前的指标,突显了其在可靠TTA评估中的有效性。

英文摘要

Text-to-audio (TTA) generation, synthesizing audio from natural language, has been widely studied for its ability to capture precise user intent. To effectively advance TTA models, it is essential to reliably evaluate generated audio without relying on costly human subjective ratings, motivating the development of automatic evaluation metrics that correlate well with human judgments. While recent CLAP-based metrics provide practical reference-free solutions, their coarse-grained text-audio similarity matching often correlates poorly with human ratings. To address this, we propose ELSA, a reference-free evaluation metric for fine-grained text-audio alignment. ELSA decomposes generated audio guided by distinct acoustic events derived from the text query and assesses event-level alignment. Experiments across four TTA benchmarks show that ELSA reveals a higher correlation with human subjective ratings than prior metrics, highlighting its effectiveness for reliable TTA evaluation.

2509.15626 2026-06-17 cs.SD eess.AS 版本更新

LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

LibriTTS-VI:用于高效语音印象控制的公开语料库与新方法

Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura

AI总结 针对数值语音印象控制中缺乏公开语料库和印象泄漏问题,构建首个公开语料库LibriTTS-VI,并提出解耦训练和无参考方法,显著提升控制精度。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

数值语音印象(VI)控制(例如,缩放明亮度)能够在文本到语音(TTS)中实现细粒度控制。然而,它面临两个挑战:缺乏公开语料库和印象泄漏,其中参考音频会使合成语音偏离目标VI。针对第一个挑战,我们引入了LibriTTS-VI,这是基于LibriTTS-R构建的首个公开VI语料库。针对第二个挑战,我们假设单个参考通过纠缠说话人身份和VI导致泄漏。为了缓解这一问题,我们提出:1)使用同一说话人的两个话语进行解耦训练,分别用于说话人和VI条件化;2)一种无参考方法,仅通过目标VI控制印象。实验表明,我们的最佳方法提高了可控性:11维VI均方误差从0.61降至0.41(客观)和从1.15降至0.92(主观)。与基于提示的TTS比较显示,后者存在数值控制不精确以及VI与文本语义纠缠的问题,而我们的方法克服了这些缺陷。

英文摘要

Numerical voice impression (VI) control (e.g., scaling brightness) enables fine-grained control in text-to-speech (TTS). However, it faces two challenges: no public corpus and impression leakage, where reference audio biases synthesized voice away from the target VI. To address the first challenge, we introduce LibriTTS-VI, the first public VI corpus built on LibriTTS-R. For the second, we hypothesize a single reference causes leakage by entangling speaker identity and VI. To mitigate this, we propose 1) disentangled training with two utterances from the same speaker for speaker and VI conditioning, and 2) a reference-free method controlling the impression solely via target VI. Experimentally, our best method improves controllability: 11-dimensional VI mean squared error drops from 0.61 to 0.41 objectively and 1.15 to 0.92 subjectively. A comparison with a prompt-based TTS reveals imprecise numerical control and entanglement between VI and text semantics, which our methods overcome.

2208.03023 2026-06-17 eess.AS cs.SD 版本更新

AID: Open-source Anechoic Interferer Dataset

AID:开源消声干扰源数据集

Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

发表机构 * International Audio Laboratories Erlangen(埃尔朗根国际音频实验室) Fraunhofer Institute for Integrated Circuits IIS(弗劳恩霍夫集成电路研究所IIS)

AI总结 提出一个家庭环境中各种声源的消声录音数据集,用于模拟复杂声学场景的非平稳环境噪声信号,并提供Python库生成随机混合干扰信号。

Comments Accepted for publication at IWAENC 2022

详情
AI中文摘要

本文提出了一个数据集,包含家庭环境中遇到的各种声源的消声录音。该数据集旨在作为非平稳环境噪声信号的资源,这些信号与声学脉冲响应卷积后可用于模拟复杂的声学场景。此外,还提供了一个Python库,用于生成数据集中录音的随机混合,这些混合可用作非平稳干扰信号。

英文摘要

A dataset of anechoic recordings of various sound sources encountered in domestic environments is presented. The dataset is intended to be a resource of non-stationary, environmental noise signals that, when convolved with acoustic impulse responses, can be used to simulate complex acoustic scenes. Additionally, a Python library is provided to generate random mixtures of the recordings in the dataset, which can be used as non-stationary interference signals.
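
The simulation step described in this abstract, convolving a dry recording with an acoustic impulse response and using the result as interference, can be illustrated with a minimal pure-Python sketch. It does not use the Python library shipped with the dataset, whose API is not described here; the function names `convolve` and `mix`, the toy signals, and the gain value are all assumptions made for illustration.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two 1-D sequences (pure Python)."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def mix(target, interferer, interferer_gain):
    """Add a scaled interferer to a target, zero-padding the shorter one."""
    length = max(len(target), len(interferer))
    padded_t = list(target) + [0.0] * (length - len(target))
    padded_i = list(interferer) + [0.0] * (length - len(interferer))
    return [t + interferer_gain * i for t, i in zip(padded_t, padded_i)]


# Toy data (hypothetical): a dry "interferer" and a sparse impulse response
# made of a direct path plus one delayed, attenuated echo.
dry_interferer = [1.0, 0.0, -1.0]
toy_rir = [1.0, 0.0, 0.5]

wet_interferer = convolve(dry_interferer, toy_rir)
noisy = mix([0.2, 0.2, 0.2, 0.2, 0.2], wet_interferer, interferer_gain=0.5)
```

For real recordings an FFT-based convolution would be the usual choice; the nested loop is kept here only because it makes the operation explicit.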

2505.19937 2026-06-17 cs.CL cs.SD eess.AS 版本更新

ALAS: An Automatic Latent Alignment Score for Audio Language Models

ALAS:音频语言模型的自动潜在对齐分数

Pooneh Mousavi, Yingzhi Wang, Mirco Ravanelli, Cem Subakan

AI总结 提出ALAS指标,通过计算音频与文本表示的跨模态余弦相似度,无需训练即可评估语音-LLM的音频-文本对齐质量,揭示模型对齐深度与任务需求的关系。

详情
AI中文摘要

大型语言模型(LLM)被扩展为语音-LLM,它们学习的音频-文本对齐质量影响大多数下游口语理解(SLU)行为。然而,尽管融合策略不断增长,但没有标准方法来衡量语音-LLM内部如何将音频帧与文本标记绑定。我们引入ALAS(自动潜在对齐分数),一种模型和任务无关的度量,探测LLM的逐层隐藏状态,将音频和文本表示之间的跨模态余弦相似度与Whisper导出的参考进行评分。ALAS仅需要冻结的前向传递和现成的ASR参考,无需训练或拟合分类器,并校准到可解释的均匀基线,可在任务间比较。将ALAS应用于四个开源语音-LLM(AF3、Qwen2-Audio、Qwen-Omni、SALMONN),在情感识别(IEMOCAP)、开放式SQA(LibriSQA)和多选音频理解(MMAU-speech)上,我们发现对齐的深度和强度反映了每个模型的音频编码器设计以及任务的声学与语义需求,并且ALAS跟踪但不重复任务准确性,暴露了那些得分高但未真正基于音频的模型。我们将ALAS作为开源库发布,以便从业者探测自己的语音-LLM或在新任务上尝试。

英文摘要

Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion strategies, there is no standard way to measure how well a Speech-LLM internally binds audio frames to text tokens. We introduce ALAS (Automatic Latent Alignment Score), a model and task-agnostic metric that probes the LLM's per-layer hidden states, scoring the cross-modal cosine similarity between audio and text representations against a Whisper-derived reference. ALAS needs only a frozen forward pass and an off-the-shelf ASR reference, with no training or fitted classifier, and is calibrated to an interpretable uniform baseline comparable across tasks. Applying ALAS to four open-source Speech-LLMs (AF3, Qwen2-Audio, Qwen-Omni, SALMONN) across emotion recognition (IEMOCAP), open-ended SQA (LibriSQA), and multi-choice audio understanding (MMAU-speech), we find that the depth and strength of alignment reflect each model's audio-encoder design and the acoustic-versus-semantic demands of the task, and that ALAS tracks but does not duplicate task accuracy, exposing models that score well without genuinely grounding in the audio. We release ALAS as an open-source library so that practitioners can probe their own Speech-LLMs or try it on new tasks.
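
The building block named in this abstract, cross-modal cosine similarity between audio and text hidden states at one layer, can be sketched as follows. This is not the ALAS implementation: the comparison against a Whisper-derived reference and the calibration to a uniform baseline are omitted, and the function names and toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(audio_states, text_states):
    """Rows index audio frames, columns index text tokens."""
    return [[cosine_similarity(a, t) for t in text_states] for a in audio_states]


# Toy hidden states (hypothetical): 3 audio frames and 2 text tokens, 2-D each.
audio_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
text_states = [[2.0, 0.0], [0.0, 1.0]]

sim = similarity_matrix(audio_states, text_states)
# For each audio frame, the text token it is most similar to.
best_token_per_frame = [row.index(max(row)) for row in sim]
```

In practice the hidden states would come from a frozen forward pass of the Speech-LLM, one matrix per layer, so that the similarity pattern can be compared across depth.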

10. 安全、隐私与深度伪造音频 1 篇

2603.28378 2026-06-17 cs.SD cs.AI 版本更新

Membership Inference Attacks against Large Audio Language Models

针对大型音频语言模型的成员推断攻击

Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee

AI总结 首次系统评估大型音频语言模型的成员推断攻击,提出盲基线协议控制分布偏移,发现跨模态记忆仅源于说话人声纹与文本绑定。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

我们首次对大型音频语言模型(LALMs)进行了系统的成员推断攻击(MIA)评估。利用基于文本、频谱和韵律特征的多模态盲基线,我们证明即使没有模型推理,常见音频数据集也表现出近乎完美的训练/测试可分离性(AUC ~ 1.0),因此MIA可能主要检测分布偏移。因此,我们引入了一个盲基线协议来控制这一混杂因素。在该协议下,我们发现分布匹配的数据集能够实现可靠的MIA评估,而不会产生分布偏移伪影。我们基准测试了多种MIA方法,并在这些数据集上进行了模态解缠实验。结果表明,LALM的记忆是跨模态的,仅源于将说话人的声纹与其文本绑定。这些发现为审计LALMs建立了超越虚假相关性的原则性标准。我们的代码库可在 https://github.com/snooow1029/ALM_MIA 获取。

英文摘要

We present the first systematic Membership Inference Attack (MIA) evaluation of LALMs. Using Multi-modal Blind Baselines based on textual, spectral and prosodic features, we demonstrate that common audio datasets exhibit near-perfect train/test separability (AUC ~ 1.0) even without model inference, thus MIA may primarily detect distribution shift. We therefore introduce a blind-baseline protocol to control for this confound. Under this protocol, we identify that the distribution-matched datasets enable reliable MIA evaluation without distribution-shift artifacts. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations. Our codebase is available at https://github.com/snooow1029/ALM_MIA.
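
The idea behind a blind baseline is to score samples with a feature that never touches the model and check whether it already separates members from non-members. The sketch below is not taken from the authors' codebase: the feature (clip duration), the toy values, and the function name `auc` are assumptions made for illustration. It computes AUC as the probability that a random member outscores a random non-member.

```python
def auc(member_scores, nonmember_scores):
    """Pairwise AUC: P(member score > non-member score), ties count 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical blind feature: clip duration in seconds (no model is queried).
# Case 1: train and test splits come from visibly different distributions.
shifted_members = [4.1, 4.5, 5.0, 5.2]
shifted_nonmembers = [1.0, 1.3, 2.2, 2.9]

# Case 2: splits are distribution-matched, so the feature carries no signal.
matched_members = [2.0, 3.0, 4.0, 5.0]
matched_nonmembers = [1.5, 2.5, 4.5, 5.5]

shifted_auc = auc(shifted_members, shifted_nonmembers)
matched_auc = auc(matched_members, matched_nonmembers)
```

A blind AUC near 1.0 suggests that an attack on that split may be detecting distribution shift rather than memorization, while a value near 0.5 is what a distribution-matched split should produce.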

11. 其他/综合语音音频 9 篇

2606.18019 2026-06-17 eess.AS cs.CL cs.SD 交叉投稿

Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews

字里行间:利用大型语言模型从临床访谈中进行痴呆和抑郁的整体评估

Franziska Braun, Alea Rüggeberg, Thomas Ranzenberger, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

发表机构 * TH Nürnberg(纽伦堡应用技术大学) FAU Erlangen(埃尔朗根-纽伦堡大学) PMU Klinikum Nürnberg(帕拉塞尔苏斯医科大学纽伦堡医院)

AI总结 本研究利用开放权重大型语言模型,从154名德语受试者的临床访谈录音中预测痴呆和抑郁严重程度,引入与总体衰退量表(GDS)对齐的总体抑郁量表(GDS-D),发现零样本预测对抑郁有效,而结构化特征提取显著提升痴呆评估性能,误差降低最高达35%,且停顿增强转录本表现与人工转录相当。

Comments Accepted for publication in Text, Speech and Dialogue (TSD 2026). The final authenticated publication will be available online via Springer LNCS/LNAI

详情
AI中文摘要

痴呆和抑郁是老年人群中最常见的神经精神障碍,其重叠症状对鉴别诊断构成重大挑战。在本研究中,我们研究了如何利用开放权重的大型语言模型(LLMs),从154名德语受试者的标准化病史采集访谈语音样本中预测痴呆和抑郁严重程度。我们引入了一个与已有的总体衰退量表(GDS)对齐、基于观察者评定的总体抑郁量表(GDS-D),从而能够对情感和认知症状进行并行的整体分期。我们在两种设置下比较了三种LLMs(Mistral 3.1、DeepHermes、Qwen3):(1) 零样本预测和(2) 基于LLM的特征提取用于支持向量回归,使用人工转录和停顿增强转录。结果显示,LLMs在零样本设置中有效预测抑郁严重程度(最佳MAE为0.60),而痴呆评估显著受益于结构化特征提取(最佳MAE为0.78),相比零样本基线误差降低高达35%。停顿增强转录本在性能上与人工转录相当,证明了全自动筛查流程在神经精神鉴别评估中的可行性。

英文摘要

Dementia and depression are the most prevalent neuropsychiatric disorders in geriatric populations, and their overlapping symptoms pose major challenges for differential diagnosis. In this study, we investigate open-weights Large Language Models (LLMs) for predicting dementia and depression severity from speech samples collected during standardized history taking interviews with 154 German-speaking subjects. We introduce an observer-based Global Depression Scale (GDS-D) aligned with the established Global Deterioration Scale (GDS), enabling parallel global staging of affective and cognitive symptoms. We compare three LLMs (Mistral 3.1, DeepHermes, Qwen3) in two settings: (1) zero-shot prediction and (2) LLM-based feature extraction for Support Vector Regression, using human and pause-enriched transcripts. Results show that LLMs effectively predict depression severity in zero-shot settings (best MAE of 0.60), while dementia assessment benefits substantially from structured feature extraction (best MAE of 0.78), reducing errors by up to 35% over zero-shot baselines. Pause-enriched transcripts achieve competitive performance with human transcriptions, demonstrating the viability of fully automatic screening pipelines for differential neuropsychiatric assessment.

2509.15210 2026-06-17 cs.SD cs.AI cs.LG 版本更新

Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generation

显式上下文驱动的神经声学建模用于高保真RIR生成

Chen Si, Qianyi Wu, Chaitanya Amballa, Romit Roy Choudhury

AI总结 提出MiNAF模型,通过查询房间网格并提取距离分布作为显式局部几何特征,引导神经隐式模型生成更准确的房间脉冲响应(RIR),在多项指标上达到竞争性能。

详情
AI中文摘要

逼真的声音模拟在许多应用中起着关键作用。声音模拟的一个关键要素是房间脉冲响应(RIR),它描述了声音在给定空间中的传播方式。最近的研究应用神经隐式方法,利用从环境中收集的上下文信息(如场景图像)来学习RIR。然而,这些方法没有有效利用环境中的显式几何信息。为了进一步利用具有直接几何特征的神经隐式模型,我们提出了MiNAF,它在给定位置查询粗略的房间网格,并提取距离分布作为局部上下文的显式表示。我们的方法表明,结合显式的局部几何特征可以更好地引导模型生成更准确的RIR预测。通过与常规和最先进方法的比较,我们展示了MiNAF在各种评估指标上具有竞争力的性能。

英文摘要

Realistic sound simulation plays a critical role in many applications. A key element in sound simulation is the room impulse response (RIR), which characterizes how sound propagates within a given space. Recent studies have applied neural implicit methods to learn RIR using context information collected from the environment, such as scene images. However, these approaches do not effectively leverage explicit geometric information from the environment. To further exploit neural implicit models with direct geometric features, we present MiNAF, which queries a rough room mesh at given locations and extracts distance distributions as an explicit representation of local context. Our approach demonstrates that incorporating explicit local geometric features can better guide the model in generating more accurate RIR predictions. Through comparisons with conventional and state-of-the-art methods, we show that MiNAF performs competitively across various evaluation metrics.

2606.11766 2026-06-17 eess.AS cs.AI cs.CL cs.SD 版本更新

Fast Speech Foundation Model Distillation Using Interleaved Stacking

基于交错堆叠的快速语音基础模型蒸馏

Eungbeom Kim, Kyogu Lee

发表机构 * IPAI AIIS Dept. of Intelligence and Information(智能与信息系)

AI总结 提出交错堆叠方法加速语音基础模型蒸馏训练,通过保持层位置一致性解决性能下降问题,在SUPERB上验证有效性。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

将大型语音基础模型(SFM)蒸馏为高效的学生模型已成功应用于低资源环境。尽管蒸馏减少了推理延迟,但它需要额外的学生模型训练。然而,SFM蒸馏的训练效率仍未得到充分探索。在这项工作中,我们探索了SFM蒸馏的训练加速以加快模型部署。我们研究了堆叠的潜力,其中模型深度通过训练逐步增加,直到达到目标模型深度。虽然现有的堆叠方法提高了训练速度,但它们遭受性能下降。为了解决这一限制,我们提出了交错堆叠,一种新颖的堆叠方法,在整个堆叠过程中始终保持层位置。这一特性在SFM中尤为关键,因为每一层编码了不同的层特定知识。我们在SUPERB上验证了所提方法的有效性。

英文摘要

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires an additional student model training. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

2506.10207 2026-06-17 cs.SD cs.DC eess.AS 版本更新

FedMLAC: Mutual Learning Driven Heterogeneous Federated Audio Classification

FedMLAC:基于互学习的异构联邦音频分类

Jun Bai, Rajib Rana, Di Wu, Youyang Qu, Xiaohui Tao, Ji Zhang, Carlos Busso, Shivakumara Palaiahnakote

发表机构 * School of Computer Science, McGill University(麦吉尔大学计算机科学学院) Mila - Quebec AI Institute(魁北克AI研究所) School of Mathematics, Physics and Computing, University of Southern Queensland(南昆士兰大学数学、物理与计算学院) Language Technologies Institute, Carnegie Mellon University(卡内基梅隆大学语言技术研究所) School of Science, Engineering and Environment, University of Salford(萨尔福德大学科学、工程与环境学院)

AI总结 FedMLAC通过双向知识蒸馏解决联邦音频分类中的数据和模型异质性问题,并引入分层剪枝聚合策略对抗数据投毒,实验表明其在分类准确性和对噪声数据的鲁棒性上优于现有方法。

Comments updated version for the first submission

详情
Journal ref
Pattern Recognition, vol. 180, Article 114250, 2026
AI中文摘要

联邦学习(FL)提供了一个隐私保护的框架,用于在去中心化的客户端上训练音频分类(AC)模型,而无需共享原始数据。然而,联邦音频分类(FedAC)面临三大主要挑战:数据异质性、模型异质性以及数据投毒,这些会降低实际应用中的性能。尽管现有方法通常分别解决这些问题,但统一且稳健的解决方案仍未得到充分探索。我们提出了FedMLAC,一种基于互学习的FL框架,同时解决这三个挑战。每个客户端维护一个个性化本地AC模型和一个轻量级、全局共享的Plug-in模型。这些模型通过双向知识蒸馏交互,实现全局知识共享的同时适应本地数据分布,从而解决数据和模型异质性问题。为对抗数据投毒,我们引入了分层剪枝聚合(LPA)策略,在聚合过程中根据参数偏差过滤异常的Plug-in更新。在四个多样化的音频分类基准上进行了广泛的实验,包括语音和非语音任务,结果表明FedMLAC在分类准确性和对噪声数据的鲁棒性上始终优于最先进的基线方法。

英文摘要

Federated Learning (FL) offers a privacy-preserving framework for training audio classification (AC) models across decentralized clients without sharing raw data. However, Federated Audio Classification (FedAC) faces three major challenges: data heterogeneity, model heterogeneity, and data poisoning, which degrade performance in real-world settings. While existing methods often address these issues separately, a unified and robust solution remains underexplored. We propose FedMLAC, a mutual learning-based FL framework that tackles all three challenges simultaneously. Each client maintains a personalized local AC model and a lightweight, globally shared Plug-in model. These models interact via bidirectional knowledge distillation, enabling global knowledge sharing while adapting to local data distributions, thus addressing both data and model heterogeneity. To counter data poisoning, we introduce a Layer-wise Pruning Aggregation (LPA) strategy that filters anomalous Plug-in updates based on parameter deviations during aggregation. Extensive experiments on four diverse audio classification benchmarks, including both speech and non-speech tasks, show that FedMLAC consistently outperforms state-of-the-art baselines in classification accuracy and robustness to noisy data.
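
The abstract states only that LPA filters anomalous Plug-in updates based on parameter deviations during aggregation. One way such a layer-wise filter could look is sketched below. The specific rule (distance to the coordinate-wise median, with a cutoff at a multiple of the median distance) is an assumption chosen for illustration and is not claimed to be the authors' algorithm; the names `aggregate_layer`, `aggregate`, and `tolerance` are hypothetical.

```python
import math
import statistics


def aggregate_layer(client_layers, tolerance=3.0):
    """Average one layer across clients after dropping outlying updates.

    Assumed rule: a client's layer is dropped when its Euclidean distance to
    the coordinate-wise median exceeds `tolerance` times the median distance.
    """
    dims = range(len(client_layers[0]))
    center = [statistics.median(layer[d] for layer in client_layers) for d in dims]
    deviations = [math.dist(layer, center) for layer in client_layers]
    cutoff = tolerance * statistics.median(deviations)
    kept = [layer for layer, dev in zip(client_layers, deviations) if dev <= cutoff]
    return [sum(layer[d] for layer in kept) / len(kept) for d in dims]


def aggregate(client_updates, tolerance=3.0):
    """Apply the filter independently to every named layer."""
    return {
        name: aggregate_layer([update[name] for update in client_updates], tolerance)
        for name in client_updates[0]
    }


# Toy updates (hypothetical): the fourth client sends a corrupted "w" layer
# but an ordinary "b" layer.
client_updates = [
    {"w": [1.0, 1.0], "b": [0.0]},
    {"w": [1.2, 0.8], "b": [0.1]},
    {"w": [0.8, 1.2], "b": [-0.1]},
    {"w": [50.0, -50.0], "b": [0.05]},
]
global_update = aggregate(client_updates)
```

Because the filter runs per layer, the fourth client's corrupted `w` is excluded while its ordinary `b` still contributes, which is the practical difference between layer-wise filtering and rejecting a whole client.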

2408.15188 2026-06-17 eess.AS cs.CL cs.SD 版本更新

Infusing Acoustic Pause Context into Text-Based Dementia Assessment

将声学停顿上下文注入基于文本的痴呆评估

Franziska Braun, Sebastian P. Bayerl, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

发表机构 * Technische Hochschule Nürnberg(纽伦堡应用技术大学) Technische Hochschule Rosenheim(罗森海姆应用技术大学) Klinik für Psychiatrie und Psychotherapie, Universitätsklinik der Paracelsus Medizinischen Privatuniversität, Klinikum Nürnberg, Germany(德国纽伦堡医院,帕拉塞尔苏斯私立医科大学附属医院精神病学与心理治疗科) KST Institut GmbH, Bad Emstal, Germany(德国巴德埃姆斯塔尔KST研究所)

AI总结 本文研究利用停顿增强的转录文本,通过Transformer语言模型区分无认知障碍、轻度认知障碍和阿尔茨海默病患者,探讨停顿信息和声学上下文对不同任务的影响。

Comments Accepted at INTERSPEECH 2024

详情
Journal ref
Proceedings of Interspeech 2024
AI中文摘要

语音停顿,与内容和结构相结合,提供了一种有价值的、非侵入性的生物标志物,用于检测痴呆症。本工作探讨了在基于Transformer的语言模型中使用包含停顿的转录文本,以区分无认知障碍、轻度认知障碍和阿尔茨海默病患者在临床评估中的语音特征。我们处理了三个二元分类任务:起始、监测和痴呆排除。通过在德语口头流畅性测试和图片描述测试上的实验,比较模型在不同语音生成上下文中的有效性。从文本基线开始,我们探讨了停顿信息和声学上下文的整合效果。我们展示了测试应根据任务选择,并且词汇停顿信息和声学交叉注意力对不同任务贡献不同。

英文摘要

Speech pauses, alongside content and structure, offer a valuable and non-invasive biomarker for detecting dementia. This work investigates the use of pause-enriched transcripts in transformer-based language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia based on their speech from a clinical assessment. We address three binary classification tasks: Onset, monitoring, and dementia exclusion. The performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model's effectiveness across different speech production contexts. Starting from a textual baseline, we investigate the effect of incorporation of pause information and acoustic context. We show the test should be chosen depending on the task, and similarly, lexical pause information and acoustic cross-attention contribute differently.

2308.08306 2026-06-17 eess.AS cs.SD 版本更新

Classifying Dementia in the Presence of Depression: A Cross-Corpus Study

在抑郁存在下的痴呆分类:一项跨语料库研究

Franziska Braun, Sebastian P. Bayerl, Paula A. Pérez-Toro, Florian Hönig, Hartmut Lehfeld, Thomas Hillemacher, Elmar Nöth, Tobias Bocklet, Korbinian Riedhammer

发表机构 * Technische Hochschule Nürnberg(纽伦堡应用技术大学) Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡弗里德里希-亚历山大大学) Klinik für Psychiatrie und Psychotherapie, Universitätsklinik der Paracelsus Medizinischen Privatuniversität, Klinikum Nürnberg, Germany(德国纽伦堡医院,帕拉塞尔苏斯私立医科大学附属医院精神病学与心理治疗科) KST Institut GmbH, Bad Emstal, Germany(德国巴德埃姆斯塔尔KST研究所)

AI总结 本文通过跨语料库实验,利用文本、音频和情感嵌入对语音进行三类分类(HC vs. MCI vs. DEM),探讨抑郁作为次级诊断对分类器的影响。

Comments Accepted at INTERSPEECH 2023

详情
Journal ref
Proceedings of Interspeech 2023
AI中文摘要

自动痴呆筛查有助于早期检测和干预,降低医疗系统的成本,提高患者的生活质量。抑郁症与痴呆有共同症状,增加了诊断的复杂性。迄今为止,研究重点是使用单个数据集的图片描述测试语音对痴呆(DEM)和健康对照(HC)进行二分类。在本工作中,我们应用已建立的基线系统,利用语义词汇流畅度测试和波士顿命名测试的语音,通过文本、音频和情感嵌入进行三类分类(HC vs. MCI vs. DEM)。我们在两个独立录制的德语数据集上进行跨语料库和混合语料库实验,以研究在更大人群和不同录音条件下的泛化能力。在详细的错误分析中,我们研究抑郁症作为次级诊断,以了解分类器实际上学到了什么。

英文摘要

Automated dementia screening enables early detection and intervention, reducing costs to healthcare systems and increasing quality of life for those affected. Depression has shared symptoms with dementia, adding complexity to diagnoses. The research focus so far has been on binary classification of dementia (DEM) and healthy controls (HC) using speech from picture description tests from a single dataset. In this work, we apply established baseline systems to discriminate cognitive impairment in speech from the semantic Verbal Fluency Test and the Boston Naming Test using text, audio and emotion embeddings in a 3-class classification problem (HC vs. MCI vs. DEM). We perform cross-corpus and mixed-corpus experiments on two independently recorded German datasets to investigate generalization to larger populations and different recording conditions. In a detailed error analysis, we look at depression as a secondary diagnosis to understand what our classifiers actually learn.

2206.10188 2026-06-17 cs.LG cs.SD eess.AS 版本更新

Analysis of Self-Supervised Learning and Dimensionality Reduction Methods in Clustering-Based Active Learning for Speech Emotion Recognition

面向语音情感识别的基于聚类的主动学习中自监督学习与降维方法分析

Einari Vaaras, Manu Airaksinen, Okko Räsänen

发表机构 * Unit of Computing Sciences, Tampere University, Finland(芬兰坦佩雷大学计算科学系) Helsinki University Hospital, Helsinki, Finland(芬兰赫尔辛基大学医院)

AI总结 本文研究了在语音情感识别中,利用自监督学习和降维方法提升基于聚类的主动学习性能,探讨了特征空间局部和全局拓扑结构对主动学习的影响,发现降维对性能影响不大,且在标注数量不太少时二维特征表现与高维特征相近。

Comments To be published in Proc. Interspeech 2022, Incheon, South Korea

详情
AI中文摘要

当领域专家需要进行数据标注时,减少标注工作量以节省时间和成本至关重要。在无标注情况下,可以利用特征空间结构进行基于聚类的主动学习(AL)方法。然而,这些方法高度依赖于样本在特征空间中的组织方式和距离度量。无监督方法如对比预测编码(CPC)可以用于学习有序的特征空间,但这些方法通常会产生高维特征,这可能对估计数据密度构成挑战。本文结合CPC和多种降维方法,探索基于聚类的AL的实用方法。我们的实验表明,特征空间的局部和全局拓扑结构可以成功用于AL,并且CPC可以提高基于传统信号特征的聚类AL性能。此外,我们观察到压缩数据维度对AL性能影响不大,当标注数量不低时,二维特征表示与高维特征表示在AL性能上相似。

英文摘要

When domain experts are needed to perform data annotation for complex machine-learning tasks, reducing annotation effort is crucial in order to cut down time and expenses. For cases when there are no annotations available, one approach is to utilize the structure of the feature space for clustering-based active learning (AL) methods. However, these methods are heavily dependent on how the samples are organized in the feature space and what distance metric is used. Unsupervised methods such as contrastive predictive coding (CPC) can potentially be used to learn organized feature spaces, but these methods typically create high-dimensional features which might be challenging for estimating data density. In this paper, we combine CPC and multiple dimensionality reduction methods in search of functioning practices for clustering-based AL. Our experiments for simulating speech emotion recognition system deployment show that both the local and global topology of the feature space can be successfully used for AL, and that CPC can be used to improve clustering-based AL performance over traditional signal features. Additionally, we observe that compressing data dimensionality does not harm AL performance substantially, and that 2-D feature representations achieved similar AL performance as higher-dimensional representations when the number of annotations is not very low.

2206.06208 2026-06-17 eess.AS cs.CL cs.SD 版本更新

Automated Evaluation of Standardized Dementia Screening Tests

标准化痴呆筛查测试的自动化评估

Franziska Braun, Markus Förstel, Bastian Oppermann, Andreas Erzigkeit, Thomas Hillemacher, Hartmut Lehfeld, Korbinian Riedhammer

AI总结 本文研究了标准化痴呆筛查测试的自动化评分方法,通过分析手动和自动转录本的评分相关性,发现自动评分在某些任务上比人工评分更严格,但整体仍保持高相关性。

Comments Submitted to Interspeech 2022. arXiv admin note: text overlap with arXiv:2206.05018

详情
Journal ref
Proceedings of Interspeech 2022
AI中文摘要

在痴呆筛查和监测中,标准化测试在临床实践中起关键作用,因为它们旨在通过测量多种认知任务的表现来最小化主观性。本文报告了一项研究,该研究包括一个半标准化的病史采集,随后是两种标准化的神经心理学测试,即SKT和CERAD-NB。这些测试包括命名物体、学习词列表等基本任务,以及广泛使用的工具如MMSE。大多数任务是口头进行的,因此应适合基于转录文本的自动化评分。对于首批30名患者,我们分析了专家手动评分与基于手动和自动转录的自动评分之间的相关性。对于SKT和CERAD-NB,我们观察到使用手动转录本时的高到完美相关性;对于某些相关性较低的任务,自动评分比人类参考更严格,因为其仅限于音频。使用自动转录本时,相关性如预期下降,且与识别准确率相关;然而,我们仍观察到高达0.98(SKT)和0.85(CERAD-NB)的高相关性。我们证明使用候选词(word alternatives)有助于缓解识别错误,从而提高与专家评分的相关性。

英文摘要

For dementia screening and monitoring, standardized tests play a key role in clinical routine since they aim at minimizing subjectivity by measuring performance on a variety of cognitive tasks. In this paper, we report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests, namely the SKT and the CERAD-NB. The tests include basic tasks such as naming objects, learning word lists, but also widely used tools such as the MMSE. Most of the tasks are performed verbally and should thus be suitable for automated scoring based on transcripts. For the first batch of 30 patients, we analyze the correlation between expert manual evaluations and automatic evaluations based on manual and automatic transcriptions. For both SKT and CERAD-NB, we observe high to perfect correlations using manual transcripts; for certain tasks with lower correlation, the automatic scoring is stricter than the human reference since it is limited to the audio. Using automatic transcriptions, correlations drop as expected and are related to recognition accuracy; however, we still observe high correlations of up to 0.98 (SKT) and 0.85 (CERAD-NB). We show that using word alternatives helps to mitigate recognition errors and subsequently improves correlation with expert scores.

2106.09539 2026-06-17 eess.AS cs.LG cs.SD 版本更新

Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit

对新生儿重症监护病房中以儿童为中心的全天候录音中语音情感内容的自动分析

Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Okko Räsänen

发表机构 * Unit of Computing Sciences, Tampere University, Finland(芬兰坦佩雷大学计算科学系) Department of Clinical Medicine, University of Turku, Finland(芬兰图尔库大学临床医学系) Department of Signal Processing and Acoustics, Aalto University, Finland(芬兰阿尔托大学信号处理与声学系)

AI总结 本文研究了如何通过自动语音情感识别系统分析新生儿录音中的情感内容,探讨了跨语料泛化、WGAN域适应和主动学习在新领域部署中的有效性,实现了73.4%的UAR分类性能。

详情
AI中文摘要

研究人员最近开始研究小婴儿听到的情感语音如何影响其发育结果。作为这项研究的一部分,在所谓的APPLE研究中,从芬兰和爱沙尼亚的两家医院收集了数百小时早产儿音频环境的全天录音。为了分析此类大规模数据集中的语音情感内容,需要一个自动语音情感识别(SER)系统。然而,目前没有情感标签或现成的领域内SER系统可用。本文介绍了这个最初未标注的大规模真实世界音频数据集,并描述了针对芬兰子集数据开发的可用SER系统。我们探讨了几种最先进技术在新领域部署SER系统的有效性,比较了跨语料泛化、基于WGAN的域适应和主动学习在该任务中的效果。结果表明,表现最好的模型在效价(valence)和唤醒度(arousal)的二元分类中分别达到73.4%和73.2%的未加权平均召回率(UAR)。结果还显示,与另外两种方法相比,主动学习的表现最为一致。

英文摘要

Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques to deploy a SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for a binary classification for valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.
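
Unweighted average recall (UAR), the metric reported in this abstract, is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. A minimal sketch follows; the labels and predictions are toy values chosen for illustration, not data from the study.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced binary task: 8 "neg" samples and 2 "pos" samples.
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 8 + ["neg", "pos"]

uar = unweighted_average_recall(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

On this toy split accuracy is 0.9 while UAR is 0.75, because the missed minority-class sample halves that class's recall. This is why UAR is commonly preferred for imbalanced emotion data.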