arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Speech Recognition and Keyword Spotting (4 papers)

2606.07608 2026-06-09 cs.CL cs.AI cs.LG cs.SD cross-list

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

针对瑞士德语语音识别的Whisper字幕对齐微调:基准污染、惯例不匹配以及25.6% WER(13.8% cWER)的诚实基线

Felix Akeret

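Affiliations: Independent Researcher, Zurich, Switzerland; ETH Zürich; University of Bern; FHNW; CeTIM Leiden/Munich

AI summary: Systematically fine-tunes Whisper large-v3 for Swiss German speech recognition using 1,367 hours of broadcast speech weakly supervised by Standard German subtitles, finds that published results are overestimated because of benchmark contamination, and releases two honestly evaluated models.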

Comments 15 pages, 21 tables. Models available at https://huggingface.co/Flix-AI

AI-generated Chinese abstract

我们提出了一项系统研究,针对OpenAI的Whisper large-v3进行微调,用于瑞士德语语音识别,使用1,367小时的广播语音与标准德语字幕作为弱监督。通过在NVIDIA DGX Spark(Grace Blackwell,128 GB统一内存,最高1 PFLOP FP4)上进行16次迭代训练,我们比较了LoRA和全微调(1.55B参数模型),研究了幻觉的根本原因,并量化了数据质量、字幕对齐和训练策略的影响。我们的最佳模型在严格不相交数据上的诚实评估中,在All Swiss German Dialects Test Set (ASGDTS)上实现了25.6%的测量WER。通过将真实错误与有效的风格变异(时态、词序、瑞士正字法)分离的协调错误分析,得到内容WER (cWER)为13.8%,仅计算实际识别失败。偏差校正估计将其降至8.5%,表明真实错误率约为测量WER的三分之一。我们证明,已发表的瑞士德语ASR最先进结果(17.1-17.5% WER)因基准污染而被夸大:一个在ASGDTS测试集上自训练的普通Whisper模型(零瑞士德语数据)实现了13.88% WER,超过了所有已发表系统。使用Phi-4-multimodal的实验显示出更强的记忆效应(3.9% WER),揭示该基准主要衡量惯例匹配而非方言理解。我们发布了两个模型,一个LoRA适配器(25.32% WER,13.9% cWER)和一个全微调模型(25.60% WER,13.8% cWER),这是少数公开可用、经过诚实评估的瑞士德语Whisper模型之一,采用Apache 2.0许可,完全可复现,无需机构数据协议。

English abstract

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
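The WER figures quoted in this entry follow the usual definition: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal illustration, assuming whitespace tokenization and no text normalization. The function name `word_error_rate` is illustrative and not taken from the paper, and the paper's cWER harmonization step is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a valid change of tense or word order counts as an error just like a misrecognized word, which is the kind of effect the entry's separate cWER figure is meant to factor out.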

2606.08210 2026-06-09 eess.AS cs.CL cs.SD cross-list

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Paediatric-HGNN:一种通过多尺度声学融合检测儿童言语不流畅的混合异构图神经网络

Rashini Liyanarachchi, Rachael Mackay, Alison Short, Aditya Joshi, Erik Meijering

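Affiliations: University of New South Wales; Western Sydney University; Resourced Music Therapy

AI summary: To address the high acoustic variability of children's speech and the difficulty of distinguishing pathological stuttering from developmental disfluency, proposes the Paediatric-HGNN framework, which builds a heterogeneous graph capturing hierarchical relationships between lexical units and acoustic segments, reaching 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386 on paediatric corpora.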

Comments Accepted at INTERSPEECH 2026 (Main)

AI-generated Chinese abstract

自动口吃检测(ASD)系统在处理儿童言语时面临挑战,因为发育中的声音具有高声学变异性,且病理性口吃与典型发育性不流畅之间存在细微差别。我们提出了Paediatric-HGNN,一个使用上下文感知部分-整体交互网络(CaPIN)的框架,专门针对儿童数据定制。与传统的1D信号建模不同,我们的方法构建了一个异构图,捕获词汇单元(词节点)和细粒度声学片段(帧节点)之间的层次关系。在精选的儿童语料库(UCLASS和FluencyBank)上训练后,Paediatric-HGNN实现了82.4%的加权准确率和0.386的典型不流畅F1分数。对层次化词汇-声学交互的建模捕获了发育中的“搜索”行为,为早期临床干预提供了更稳健和可解释的工具。

English abstract

Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.

2606.09535 2026-06-09 cs.CL cs.SD cross-list

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

克服Whisper在达罗毗荼语系和低资源语言中的解码器不一致性

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

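Affiliations: Sony Research India

AI summary: To address Whisper's high word error rate on Dravidian languages, linguistic and dataset analysis reveals sparse token distributions and character-level substitution errors; the paper proposes two decoder enhancements, Weighted-Attention and Self-Conditioning, that consistently lower WER for low-resource and agglutinative languages.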

Comments Accepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tables

AI-generated Chinese abstract

多语言ASR模型如Whisper在高资源语言上表现良好,但在达罗毗荼语系上的词错误率(WER)显著高于印度-雅利安语系。通过语言学和数据集分析,我们发现达罗毗荼语系具有更长的单词、更高的词汇多样性和更低的重复率,导致标记分布稀疏和频繁的字符级替换错误。基线微调进一步揭示了自注意力(语言上下文)和交叉注意力(声学线索)之间的解码器不平衡。尽管合成标记重复实验表明潜在收益,但实际不可行。受这些观察启发,我们引入了两种解码器级增强:加权注意力(自适应平衡注意力来源)和自条件化(重新注入中间预测以提高标记一致性)。实验表明,对于低资源和黏着语言,WER持续降低。

English abstract

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.

2604.24278 2026-06-09 cs.SD cs.AI updated version

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

RAS:一种面向可靠性的自动语音识别度量标准

Wenbin Huang, Yuhang Qiu, Bohan Li, Yiwei Guo, Jing Peng, Hankun Wang, Xie Chen, Kai Yu

Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China

AI summary: Proposes RAS, a reliability-oriented metric for evaluating how reliably ASR systems transcribe uncertain segments, together with an abstention-aware transcription framework whose trade-off parameter is calibrated by human preference, improving transcription reliability while maintaining accuracy.

Comments 5 pages, 4 figures; Accepted at InterSpeech 2026

AI-generated Chinese abstract

自动语音识别系统在嘈杂或模糊条件下常常会产生自信但错误的转录,这对用户和下游应用都是误导性的。基于词错误率的标准评估仅关注准确性,未能捕捉转录的可靠性。我们引入了具有退避意识的转录框架,使ASR模型能够显式地避免不确定的段落。为了评估在退避情况下的可靠性,我们提出了RAS,一种面向可靠性的度量标准,平衡转录的信息量和错误回避,其权衡参数通过人类偏好进行校准。然后,我们通过监督自举后接强化学习训练了一个具有退避意识的ASR模型。我们的实验表明,在保持有竞争力的准确性的同时,转录可靠性有显著的提高。

English abstract

Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.

2. Speech Synthesis and Sound Generation (10 papers)

2606.08843 2026-06-09 cs.SD cs.LG new submission

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

从A到B再回到A:基于非平行数据的回文零样本语音转换

Moshe Mandel, Shlomo E. Chazan

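Affiliations: Independent, Israel; OriginAI, Israel

AI summary: Uses K-nearest-neighbor retrieval over WavLM representations to align non-parallel speech and construct synthetic training pairs, adds a speaker loss to achieve zero-shot voice conversion, and performs strongly across languages despite training only on English data.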

AI-generated Chinese abstract

我们提出一个语音转换(VC)框架,利用WavLM表示上的K近邻(KNN)检索来对齐非平行的源语音和目标语音,从而为监督学习构建合成训练对。检索到的片段作为合成输入,而真实目标音频提供真实输出,形成一种合成到真实的训练范式,该范式自然支持多语言数据,无需平行语料库或显式对齐。为了确保一致的目标说话人身份,我们引入了一个来自预训练说话人验证模型的说话人损失。跨多种语言的实验表明,尽管仅使用英语数据训练,所提出的方法实现了高自然度和强说话人相似性,优于有竞争力的VC基线。样本可在https://palindromic-vc.github.io获取。

English abstract

We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.
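The retrieval step described in this abstract can be pictured as a nearest-neighbor lookup in feature space. The sketch below is a rough illustration only: random arrays stand in for WavLM frame features, and the frame-level cosine matching, the averaging of the `k` matches, and the name `knn_convert` are assumptions that the abstract does not confirm. Which utterance acts as query and which as retrieval pool in the paper's pair construction is also not specified there.

```python
import numpy as np

def knn_convert(source: np.ndarray, target_pool: np.ndarray, k: int = 4) -> np.ndarray:
    """Replace each source frame with the mean of its k most similar pool frames."""
    # Normalize rows so that a dot product equals cosine similarity.
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    tgt = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)
    similarity = src @ tgt.T                          # shape (n_source, n_pool)
    nearest = np.argsort(-similarity, axis=1)[:, :k]  # indices of the top-k matches
    return target_pool[nearest].mean(axis=1)          # shape (n_source, feature_dim)

rng = np.random.default_rng(0)
source_frames = rng.normal(size=(50, 16))   # stand-in for one speaker's features
target_frames = rng.normal(size=(200, 16))  # stand-in for another speaker's features
synthetic_input = knn_convert(source_frames, target_frames)
```

The output keeps the time axis of the query utterance but is built entirely from frames of the pool speaker, which is what lets a retrieved sequence be paired with real audio for supervised training.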

2606.09019 2026-06-09 cs.SD cs.AI new submission

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

TLDR:压缩音频令牌以实现高效自回归文本到语音

Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim

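Affiliations: Sungkyunkwan University; University of Seoul

AI summary: Proposes TLDR, a framework that shifts causal modeling from the token level to the patch level using a lightweight compressor and a frozen pretrained backbone adapted by LoRA, achieving a 1.8x inference speedup and a KV-cache reduction of up to 75%.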

AI-generated Chinese abstract

基于编解码器的自回归(AR)语音语言模型通过将语音建模为离散音频令牌序列,并使用大型预训练骨干网络,实现了强大的文本到语音(TTS)质量。然而,这种令牌级公式造成了结构效率瓶颈:语音令牌序列比文本序列长得多,要求AR骨干在每个令牌位置执行因果计算,并维护随序列长度增长的KV缓存。我们引入TLDR,一种基于补丁的自回归框架,通过将因果建模从令牌级语音序列转移到补丁级序列,加速基于编解码器的AR-TTS。TLDR使用轻量级压缩器将连续的编解码器令牌分组为紧凑的潜在补丁,使用通过LoRA适配的冻结预训练AR-TTS骨干对生成的较短补丁序列进行建模,并使用说话人条件提取器在每个补丁内重建细粒度语音令牌。在补丁大小为4的情况下,TLDR比基线AR-TTS模型实现了1.8倍的推理加速,并将全局KV缓存内存减少了高达75%。实验结果表明,补丁级全局因果建模可以成为降低预训练基于编解码器的AR-TTS系统推理成本的一种实用方法,而无需替换现有模块。

English abstract

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
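The "up to 75%" figure is consistent with simple counting: if the global backbone caches one entry per patch rather than one per codec token, a patch size of P shortens the cached speech sequence by a factor of P, saving a fraction 1 - 1/P of the entries. The sketch below works through that arithmetic. It assumes KV-cache memory scales linearly with the number of cached positions and ignores text tokens and any memory used by the compressor and extractor; the function name is hypothetical.

```python
def patch_level_cache(num_codec_tokens: int, patch_size: int) -> tuple[int, float]:
    """Return the number of patch positions and the fraction of cache entries saved."""
    if num_codec_tokens < 1 or patch_size < 1:
        raise ValueError("num_codec_tokens and patch_size must be at least 1")
    # Ceiling division: a trailing partial patch still occupies one position.
    num_patches = -(-num_codec_tokens // patch_size)
    saved = 1.0 - num_patches / num_codec_tokens
    return num_patches, saved

patches, saved = patch_level_cache(num_codec_tokens=1000, patch_size=4)
```

With 1000 codec tokens and a patch size of 4, the backbone sees 250 positions, a 75% reduction. Sequences whose length is not a multiple of the patch size save slightly less, which fits the "up to" wording.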

2606.09234 2026-06-09 cs.SD cs.AI new submission

End-to-End Training for Discrete Token LLM based TTS System

基于离散令牌LLM的文本转语音系统的端到端训练

Changfeng Gao, Yong Ren, Jun Yuan, Ye Bai, Zhao You, ShiDong Shang

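Affiliations: National University of Singapore

AI summary: Proposes an end-to-end framework that unifies training of the speech tokenizer, LLM, flow-matching model, and reward model, improving discrete-token TTS through multi-task joint optimization and reaching a new SOTA on Seed-TTS-Eval.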

AI-generated Chinese abstract

最近的先进文本转语音系统通常采用级联流水线,包括语音分词器、自回归大语言模型和基于扩散的流匹配模型,这些组件独立训练。本文提出一个完全端到端的优化框架,统一了语音分词器、LLM、FM模型和额外奖励模型的训练。具体来说,我们首先通过来自FM重建、LLM下一令牌预测和RM多识别任务的多任务目标联合优化分词器。这种联合训练鼓励离散语音令牌空间捕获更适合TTS的声学和语义显著信息。然后,我们通过FM和RM的下游重建和识别进一步优化LLM,这减少了推理时的不匹配,并引导LLM生成更优的结果。实验结果表明,我们的端到端框架始终优于级联基线。在Seed-TTS-Eval基准上,我们的系统实现了0.78%和1.56%的词错误率,使用0.6B参数的LLM和0.5B参数的FM模型取得了新的SOTA结果。这些结果验证了整体端到端优化对于改进基于离散令牌的TTS系统至关重要,且训练流水线更简单。

English abstract

Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for FM, next-token prediction for LLM, and multiple recognition tasks for RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by FM and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves word error rates (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.

2606.09048 2026-06-09 eess.AS cs.AI cs.SD cross-list

BareWave: Waveform-Native Flow-Matching Text-to-Speech

BareWave: 波形原生流匹配文本转语音

Wei Fan, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li, Kejiang Chen, Weiming Zhang, Nenghai Yu

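Affiliations: Anhui Province Key Laboratory of Digital Security; Tongyi Fun Team, Alibaba Group

AI summary: Proposes BareWave, a fully waveform-native flow-matching TTS framework that tackles the difficulty of direct waveform training with training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment, achieving high-quality synthesis in zero-shot voice cloning.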

Comments Under Review

AI-generated Chinese abstract

去除中间表示和单独训练的解码阶段已成为生成建模的重要方向。然而,在文本转语音中,高质量系统通常仍通过中间声学表示构建,再进行波形合成。本文提出BareWave,一种完全波形原生的框架,用于流匹配TTS中的直接文本到波形生成。我们认为该设置引发了三个训练挑战:原始波形建模缺乏强大的预训练表示支架;不同训练阶段受益于不同的噪声调度;数据空间感知目标不会自动共享速度空间流目标的时间结构。因此,直接波形训练难以高效优化,难以通过固定配方推向强最终工作点,也难以整合有效的感知细化。基于此观点,我们开发了一个直接文本到波形训练框架,结合训练时表示对齐、分阶段噪声调度和速度感知感知对齐(VAPA),同时在测试时保持单一波形原生推理路径,无需预训练组件。零样本语音克隆实验表明,在完全波形原生推理路径下,可以实现强可懂度、说话人相似度和自然度,支持波形原生流匹配TTS作为实用方向。带有音频示例的项目页面可在https://barewave.github.io/获取。

English abstract

Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.

2606.09050 2026-06-09 eess.AS cs.SD cross-list

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

MeanVC 2:鲁棒的低延迟流式零样本语音转换

Guobin Ma, Yuxuan Xia, Yuepeng Jiang, Dake Guo, Hanke Xie, Jingbin Hu, Yanbo Wang, Lei Xie, Pengcheng Zhu

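Affiliations: The University of New South Wales, Australia; WeNet Open Source Community, China

AI summary: Proposes MeanVC 2, which uses future-receptive chunking (FRC) and a universal timbre token encoder to achieve stable conversion at a 40 ms chunk size, cuts latency from 211 ms to 110 ms, and markedly improves zero-shot speaker similarity.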

Comments Accepted by Interspeech 2026

AI-generated Chinese abstract

流式零样本语音转换(VC)因其在实时应用中的潜力而日益流行。最近提出的MeanVC实现了轻量级流式零样本VC,但存在若干局限性:其逐块自回归去噪使有效训练序列长度加倍,小分块设置下转换质量下降,且其音色编码器直接依赖参考梅尔频谱图,对参考音频质量敏感。为解决这些问题,我们提出MeanVC 2。我们引入未来感知分块(FRC),该技术明确地在扩散变换器解码器层之间调度过去和未来的感受野,并消除了干净分块教师强制。通过结合有界未来上下文,FRC在40ms分块大小下实现稳定转换。我们进一步引入通用音色令牌编码器,该编码器从全局说话人嵌入构建音色表示,并通过交叉注意力检索细粒度音色线索,提高了对低质量参考的鲁棒性并增强了零样本说话人相似度。实验结果表明,MeanVC 2显著优于MeanVC,同时将延迟从211ms降低至110ms。音频样本已公开。源代码将公开发布。

English abstract

Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.

2606.09667 2026-06-09 eess.AS cs.CL cs.SD cross-list

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

基于sEMG和唇读的鲁棒无声语音合成的跨模态掩蔽

Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez

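Affiliations: Aholab research group within the HiTZ Center at University of the Basque Country (UPV/EHU); PRHLT research center, Universitat Politècnica de València (UPV)

AI summary: Proposes a masked multimodal speech synthesis framework that combines surface electromyography and lipreading signals, uses modality masking during training to improve robustness, and reduces word error rate by up to 14 absolute percentage points in multispeaker settings.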

Comments 12 pages, 7 figures and 6 tables. Submitted to Transactions on Audio, Speech and Language Processing

AI-generated Chinese abstract

通过无声语音接口进行语音恢复已成为针对喉部发声受损或缺失个体的有前景的辅助技术。在非侵入式无声语音接口模态中,表面肌电图和基于视频的唇读提供了互补的发音信息,然而它们用于连续语音合成的集成仍未被充分探索。此外,现有的多模态方法很少考虑对模态退化或临时传感器故障的鲁棒性,限制了它们在现实场景中的适用性。在这项工作中,我们提出了一种掩蔽多模态语音合成框架,通过在训练期间进行模态掩蔽来联合利用表面肌电图和唇读信号。在多说话人设置下,与最强的单模态基线相比,所提出的方法将词错误率降低了多达14个绝对百分点。实验结果不仅表明掩蔽策略对于这些性能提升和低比特率条件下的鲁棒性至关重要,而且表明在模态缺失情况下,它们比针对退化的数据增强具有更好的泛化能力。音素级分析进一步揭示了跨模态的互补贡献,对元音和特定辅音组尤其有益。总体而言,这些发现证明了掩蔽多模态集成用于无声语音合成的有效性和鲁棒性,尽管适应喉切除说话者仍是一个开放的研究挑战。

English abstract

Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.

2510.04593 2026-06-09 eess.AS cs.SD updated version

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

UniVoice: 统一自回归ASR与基于流匹配的TTS的大语言模型框架

Wenhao Guan, Zhikang Niu, Ziyue Jiang, Kaidi Wang, Peijie Chen, Qingyang Hong, Lin Li, Xie Chen

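Affiliations: Xiamen University, China; Shanghai Innovation Institute, China; Shanghai Jiao Tong University, China; Zhejiang University, China

AI summary: Proposes UniVoice, which unifies speech recognition and synthesis through continuous representations, combines autoregressive modeling with flow matching, and designs a dual attention mechanism to reconcile the two modeling styles, enabling high-quality zero-shot voice cloning.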

Comments Accepted at Interspeech 2026

AI-generated Chinese abstract

大语言模型在自动语音识别和文本转语音系统中展现出有前景的性能,逐渐成为主流方法。然而,当前大多数方法分别处理这两个任务,而非通过统一框架。本工作旨在将这两个任务集成到一个统一模型中。尽管离散语音标记化能够实现联合建模,但其固有的信息损失限制了识别和生成的性能。在本工作中,我们提出了UniVoice,一个通过连续表示的统一大语言模型框架,无缝地将语音识别和合成集成在单个模型中。我们的方法结合了自回归建模在语音识别中的优势与流匹配在高品质生成中的优势。为了缓解自回归模型和流匹配模型之间的固有差异,我们进一步设计了一种双重注意力机制,在因果掩码(用于识别)和双向注意力掩码(用于合成)之间切换。此外,所提出的文本前缀条件语音填充方法实现了高保真度的零样本语音克隆。实验结果表明,我们的方法在ASR和零样本TTS任务中能够达到或超越当前单任务建模方法。本工作探索了端到端语音理解和生成的新可能性。代码可在 https://github.com/gwh22/UniVoice 获取。

English abstract

Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation. Code is available at https://github.com/gwh22/UniVoice.
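The two mask types named in this abstract differ only in which positions each token may attend to. The sketch below builds both as boolean matrices. It is a simplified illustration, assuming one mask is applied to the whole sequence; the mode names and the function `attention_mask` are hypothetical, and the paper may restrict the bidirectional mask to the speech span while keeping the text prefix causal.

```python
import numpy as np

def attention_mask(seq_len: int, mode: str) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if position i may attend to position j."""
    if mode == "recognition":
        # Causal: each position sees itself and earlier positions only.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if mode == "synthesis":
        # Bidirectional: every position sees every position.
        return np.ones((seq_len, seq_len), dtype=bool)
    raise ValueError(f"unknown mode: {mode}")
```

Switching the mask per task lets one set of transformer weights serve both token-by-token decoding for ASR and whole-sequence refinement for flow-matching synthesis.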

2603.08977 2026-06-09 eess.AS cs.SD updated version

Universal Speech Content Factorization

通用语音内容分解

Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

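Affiliations: Center for Language and Speech Processing, Johns Hopkins University, USA; Human Language Technology Center of Excellence (COE), Johns Hopkins University, USA

AI summary: Proposes USCF, an invertible linear method that extracts a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. It extends Speech Content Factorization by learning a universal speech-to-content mapping via least-squares optimization and derives speaker-specific transformations from a small amount of target speech.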

Comments Accepted to Interspeech 2026

AI-generated Chinese abstract

我们提出通用语音内容分解(USCF),一种简单且可逆的线性方法,用于提取低秩语音表示,在其中抑制说话人音色同时保留语音内容。USCF通过最小二乘优化学习通用语音到内容映射,扩展了语音内容分解,一种封闭集语音转换(VC)方法,到开放集设置。我们通过嵌入分析显示USCF有效去除说话人依赖性变化。作为零样本语音转换系统,USCF在可懂度、自然度和说话人相似性方面与需要大量目标说话人数据或额外神经网络训练的方法相媲美。最后,我们证明作为训练高效的音色分离语音特征,USCF特征可作为训练音色提示文本到语音模型的声学表示。语音样本和代码已公开提供。

English abstract

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.

2605.06582 2026-06-09 cs.LG cs.CL cs.SD updated version

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

PairAlign:一种通过自对齐的序列标记化框架及其在音频标记化中的应用

Adhiraj Banerjee, Vipul Arora

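Affiliations: Department of Electrical Engineering, Indian Institute of Technology, Kanpur

AI summary: PairAlign achieves compact audio tokenization through sequence-level self-alignment, treating tokenization as conditional sequence generation to improve token consistency, length control, and edit similarity.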

Comments 57 pages main content, 109 total pages, 9 Figures, pre-print, Under Review

AI-generated Chinese abstract

许多感官数据的操作——比较、记忆、检索和推理——自然地在离散符号结构上表达。在语言中,这种接口由标记提供;在音频中,必须学习。现有音频标记器依赖于量化、聚类或编解码器重建,将标记局部分配,因此序列一致性、紧凑性、长度控制、终止和编辑相似性很少被直接优化。我们引入PairAlign,一种通过序列级自对齐实现紧凑音频标记化的框架。PairAlign将标记化视为条件序列生成:编码器将语音映射为连续条件,自回归解码器从BOS开始生成标记,学习标记身份、顺序、长度和EOS位置。给定两个保持内容的视图,每个视图的序列在另一个视图的表示下被训练为可能,而无关示例提供竞争序列。这为可扩展的编辑距离保留代理,同时抑制许多对一的坍缩。PairAlign从VQ式标记化开始,并通过EMA教师目标、交叉配对教师强制、前缀损坏、似然对比和长度控制进行优化。在3秒语音上,PairAlign学习紧凑、非退化的序列,具有广泛的词汇使用和强跨视图一致性。在检索测试中,它保留编辑距离搜索,同时将存档标记数量减少55%。连续扫频探针显示其局部重叠低于密集几何标记器,但具有更强的长度控制和在100毫秒移位下的受约束编辑轨迹。PairAlign是一种序列符号预测学习者:像JEPA式目标一样,它从另一个视图预测一个抽象目标作为学习的可变长度符号序列,而不是连续潜在变量。

English abstract

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.

2601.09239 2026-06-09 cs.SD cs.AI eess.AS updated version

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

DSA-Tokenizer:基于流匹配层次化融合的解耦语义-声学分词器

Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Linqi Song

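Affiliations: Department of Computer Science, City University of Hong Kong; AI Lab, Leibniz Research Center, Huawei

AI summary: Proposes DSA-Tokenizer, which achieves disentanglement by supervising semantic tokens with ASR and acoustic tokens with mel-spectrogram reconstruction, and introduces a hierarchical flow-matching decoder and a joint reconstruction-context inpainting training strategy to enable high-fidelity reconstruction and cross-utterance voice cloning.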

Comments Submitted to ACL ARR 2026 May

AI-generated Chinese abstract

语音分词器是全离散语音大语言模型的关键构建模块。现有的分词器要么优先考虑语义编码,将语义内容与声学风格不可分离地融合,要么实现不完全的语义-声学解耦。为了实现更好的解耦,我们提出了DSA-Tokenizer,它通过不同的优化约束将语音显式解耦为离散的语义和声学令牌。具体来说,语义令牌由ASR监督以捕获语言内容,而声学令牌专注于mel谱重构以编码风格。我们进一步引入了层次化流匹配解码器和联合重构-上下文修补训练策略,使模型能够支持高保真重构和跨语句语音克隆。为了加速推理,我们蒸馏了DiT解码器,将推理采样步数减少到4步,并通过GAN微调提高合成质量。实验表明,DSA-Tokenizer提供了强大的语义-声学解耦、可靠的可控语音克隆以及低WER/CER的高效高保真生成。此外,我们的结果表明,解耦分词为下游大模型语音生成提供了更有效的接口。音频样本可在https://anonymous.4open.science/w/DSA_Tokenizer_demo/获取。

English abstract

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram reconstruction to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/

3. Speaker Recognition, Verification, and Separation (4 papers)

2606.08078 2026-06-09 cs.SD cs.CL new submission

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

说话人验证中的低位量化误差:诊断与缓解

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

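Affiliations: LIA, UPR 4128, Avignon University; Aday

AI summary: Diagnoses the effect of low-bit quantization on speaker verification through layer-wise and score-level analyses, identifies 2 bits as the critical knee point, and proposes a calibrated multi-precision cascade that approaches full-precision performance while keeping the efficiency of low-bit inference.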

Comments Accepted at Speaker Odyssey 2026 Lisbon

AI-generated Chinese abstract

尽管低比特量化为在资源受限设备上部署说话人验证提供了实用手段,但其对说话人验证性能的影响仍知之甚少。本文通过联合逐层和得分级分析,研究了ResNet-36和ResNet-200的均匀K-means量化感知训练。我们的逐层分析突出了脆弱组件,并表明得分退化不能仅由权重失真完全解释。我们在2比特处识别出一个明显的拐点,较大的得分漂移和有害决策翻转集中在FP32阈值附近。我们的得分级分析揭示了在极端量化下得分误差产生的位置和方式。基于这些发现,我们提出了一种校准的多精度级联方法,该方法在2比特下解决大多数试验,仅升级模糊情况,实现了接近FP32的性能,同时以显著降低的计算和内存成本保留了低位推理的效率优势。

English abstract

Although low-bit quantization provides practical means to deploy speaker verification on resource-constrained devices, its effects on speaker verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion alone. We identify a clear knee point at 2 bits, with larger score drift and harmful decision flips concentrated near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 while preserving the efficiency benefits of low-bit inference with substantially lower compute and memory costs.
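The cascade idea in this abstract, deciding most trials with the cheap model and escalating only the ambiguous ones, can be sketched as a two-stage decision rule. The version below is a loose illustration: it assumes both models produce scores on the same scale with a single shared threshold, and the `margin` parameter, the function name, and the returned labels are all hypothetical. The paper's calibration procedure is not described in the abstract and is not reproduced here.

```python
from typing import Any, Callable

def cascade_decision(
    trial: Any,
    low_bit_score: Callable[[Any], float],
    full_score: Callable[[Any], float],
    threshold: float,
    margin: float,
) -> tuple[bool, str]:
    """Accept or reject a trial, escalating to full precision near the threshold."""
    score = low_bit_score(trial)
    if abs(score - threshold) > margin:
        # Confident region: the low-bit score is far enough from the threshold.
        return score >= threshold, "2-bit"
    # Ambiguous region: rescore with the full-precision model.
    return full_score(trial) >= threshold, "fp32"
```

A wider margin escalates more trials and moves accuracy toward the full-precision model at higher average cost, so the margin is the knob that trades efficiency against decision flips.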

2606.08087 2026-06-09 cs.SD cs.CL 新提交

Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference

评估神经说话人验证模型在训练和推理中的能耗与碳排放

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

发表机构 * LIA, UPR 4128 Aday Avignon University(阿维尼翁大学)

AI总结 本研究通过测量不同ResNet架构在VoxCeleb2上的能耗与碳排放,发现模型加深或加宽带来边际精度提升但能耗剧增,而中等规模网络(如ResNet-50)能实现性能与环境影响的良好平衡。

Comments Accepted to Speaker Odyssey 2026 Lisbon

详情
AI中文摘要

深度学习说话人验证(SV)越来越依赖于深度神经网络骨干,但其环境影响仍缺乏记录。本文对在VoxCeleb2上训练的ResNet架构进行了评估,变化深度、通道宽度和阶段分布,并使用节点级传感器测量能耗和碳足迹。结果显示明显的收益递减点:更深或更宽的模型仅带来边际精度提升,而能耗急剧增长。相比之下,中等规模网络如ResNet-50和阶段集中变体在性能与环境影响之间实现了有利的权衡。这些发现为设计节能的SV系统提供了可操作的指导方针。

英文摘要

Deep-learning speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we conduct an evaluation of ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure energy consumption and carbon footprint using node-level sensors. Results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.

2606.08505 2026-06-09 eess.AS cs.SD 交叉投稿

Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines

快速且鲁棒的设备端说话人日志:步长加速管道的相对最小聚类大小

Fumiaki Yamaguchi

发表机构 * University of Tokyo(东京大学)

AI总结 针对设备端说话人日志的推理成本问题,提出相对最小聚类大小(mcs=round(f*n), f=0.01)以自适应嵌入预算,在保持AMI上DER不变的同时,将VoxConverse的DER从0.113恢复至0.079,加速比达12.2倍。

详情
AI中文摘要

诸如会议转录和语音助手等语音应用将受益于设备端说话人日志,但实际采用受限于推理成本。我们研究了基于Pyannote 3.1的管道在消费级硬件(RTX 5070 Ti GPU和Apple M4笔记本)上能在多大程度上加速,同时保持说话人日志错误率(DER)。一个简单的方案:更粗的分割步长和逐块嵌入,在AMI上实现了多倍加速且DER不变,但在野外数据上急剧退化:在VoxConverse上,DER从0.075上升到0.113。我们将失败归因于聚类阶段说话人计数不足,这是由于固定的最小聚类大小与每个说话人嵌入数量减少相互作用所致。我们提出相对最小聚类大小,mcs = round(f * n),其中f = 0.01,它自适应于每个录音的嵌入预算。单个f值将VoxConverse DER恢复至0.079(约恢复丢失准确率的89%),同时保持AMI不变,加速后的管道在AMI(MPS)上相对于我们的CAM++基线达到12.2倍加速。

英文摘要

Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe, combining a coarser segmentation stride with per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.
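
The rule mcs = round(f * n) and the "about 89%" recovery figure can be checked with a few lines of Python. This is only a sketch: the function names and the `floor` clamp (which keeps the result from dropping to zero on very short recordings) are assumptions added here, not details from the abstract.

```python
def relative_min_cluster_size(n_embeddings: int, f: float = 0.01, floor: int = 1) -> int:
    """Relative minimum cluster size: mcs = round(f * n).

    `floor` is an assumed safeguard, not part of the stated rule. Note that
    Python's round() rounds exact ties to the nearest even integer.
    """
    return max(floor, round(f * n_embeddings))


def recovered_fraction(baseline: float, degraded: float, recovered: float) -> float:
    """Share of the lost accuracy that was won back (DER: lower is better)."""
    return (degraded - recovered) / (degraded - baseline)


# DER values quoted in the abstract for VoxConverse.
fraction = recovered_fraction(baseline=0.075, degraded=0.113, recovered=0.079)
print(f"mcs for 1200 embeddings: {relative_min_cluster_size(1200)}")
print(f"recovered fraction: {fraction:.3f}")
```

Because mcs scales with the number of embeddings n, a recording that yields fewer embeddings after stride acceleration gets a proportionally smaller threshold, which is the adaptation the abstract describes.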

2602.15519 2026-06-09 eess.AS cs.SD 版本更新

Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

Enroll-on-Wakeup:真实噪声人机对话场景中无缝交互的目标语音提取首次比较研究

Yiming Yang, Guangyong Wang, Haixin Guan, Yanhua Long

发表机构 * Shanghai Normal University(上海师范大学) Unisound AI Technology Co., Ltd.(Unisound人工智能技术有限公司)

AI总结 提出Enroll-on-Wakeup框架,利用唤醒词片段作为注册参考,无需预录语音,实现无缝交互;首次系统比较了判别式和生成式模型在真实噪声条件下的性能,并探索了基于LLM的TTS注册增强。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

目标语音提取(TSE)通常依赖于预先录制的高质量注册语音,这破坏了用户体验并限制了在自发交互中的可行性。在本文中,我们提出了Enroll-on-Wakeup(EoW),一种新颖的框架,其中在人机交互过程中自然捕获的唤醒词片段被自动用作注册参考。这消除了对预收集语音的需求,以实现无缝体验。我们首次对EoW-TSE进行了系统研究,评估了在真实多样声学条件下的先进判别式和生成式模型。鉴于唤醒词片段的短时和噪声特性,我们研究了使用基于LLM的TTS进行注册增强。结果表明,虽然当前的TSE模型在EoW-TSE中面临性能下降,但基于TTS的辅助显著增强了听觉体验,尽管在语音识别准确性方面仍存在差距。

英文摘要

Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.

4. 语音增强、降噪与音频修复 5 篇

2606.08580 2026-06-09 eess.AS cs.SD 交叉投稿

G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching

G-MaP-SE: 基于GMM先验匹配的引导式语音增强

Yike Zhu, Ziqian Wang, Zikai Liu, Xingchen Li, Zhuangqi Chen, Xianjun Xia, Chuanzeng Huang, Lei Xie

AI总结 提出G-MaP-SE框架,利用高斯混合模型构建干净语音嵌入先验,通过匹配噪声条件嵌入来提升语音增强性能,无需注册音频即可接近理想干净条件上限。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

使用说话人嵌入作为条件可以增强语音增强,但大多数方法要么需要干净的注册音频,要么依赖于从带噪语音中提取的嵌入,这些嵌入在噪声和域偏移下脆弱。我们提出G-MaP-SE,一个引导式增强框架,它使用高斯混合模型(GMM)构建干净语音嵌入先验,并通过将该先验与带噪条件嵌入匹配来细化它。然后,通过轻量级门控融合模块将匹配的先验嵌入注入到时频增强主干中。在VoiceBank+DEMAND和DNS Challenge 2020数据集上的实验表明,所提出的先验匹配始终优于带噪条件,并显著缩小了与理想干净条件上限的差距,同时在推理时不需要注册音频。代码、音频样本和检查点均已公开。

英文摘要

Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.

2603.04862 2026-06-09 cs.SD 版本更新

Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

先聚焦后聆听:面向噪声鲁棒大型音频语言模型的即插即用音频增强器实证研究

Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 提出即插即用的音频增强器FTL,通过分离语音与非语音并利用模态路由器预测目标模态,生成任务自适应增强信号,无需微调即可提升LALMs在噪声环境下的性能。

Comments Accepted by ICML 2026 Workshop (Machine Learning for Audio)

详情
AI中文摘要

大规模音频语言模型(LALMs)是一类用于音频理解的基础模型。现有的LALMs在现实世界的噪声声学条件下,当语音和非语音声音干扰时,性能往往会显著下降。虽然噪声感知微调可以提高鲁棒性,但它需要特定任务的噪声数据和昂贵的重新训练,限制了可扩展性。为了解决这个问题,我们提出了先聚焦后聆听(FTL),一种即插即用的音频增强器,可提高LALMs的噪声鲁棒性。具体来说,FTL首先将输入波形分离为语音和非语音,并应用模态路由器根据用户指令预测目标音频模态(例如,语音)。最后,一个模态感知融合模块生成任务自适应的增强信号,以改善下游感知和推理。跨多个LALMs和任务的实验表明,FTL在不同噪声水平下都能提升性能,而无需对LALMs进行微调。

英文摘要

Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning the LALMs.

2604.24199 2026-06-09 cs.SD cs.AI eess.AS eess.SP 版本更新

Speech Enhancement Based on Drifting Models

基于漂移模型的语音增强

Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

发表机构 * Victoria University of Wellington(惠灵顿维多利亚大学) Lincoln University(林肯大学) GN Advanced Science(GN先进科学)

AI总结 本文提出了一种基于漂移模型的语音增强框架DriftSE,通过将去噪问题建模为平衡问题,实现单步推理,从而在无需配对数据的情况下实现高质量语音增强。

Comments 6 pages, 2 figures

详情
AI中文摘要

我们提出了一种基于漂移模型的语音增强(DriftSE),一种新颖的生成框架,将去噪建模为一个平衡问题。与依赖迭代采样的方法不同,DriftSE通过演化映射函数的推动分布来实现单步推理,直接匹配干净语音分布。这种演化由漂移场驱动,这是一种学习到的修正向量,引导样本向干净分布的高密度区域发展,这自然促进了在未配对数据上的训练,通过匹配分布而非配对样本。我们在两种形式下研究了该框架:一种是从带噪观测出发的直接映射,另一种是从高斯先验出发的随机条件生成模型。在VoiceBank-DEMAND基准测试中,DriftSE在单步中实现了高保真度的增强,优于多步扩散基线,并建立了语音增强的新范式。

英文摘要

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

2602.20967 2026-06-09 eess.AS cs.AI cs.SD 版本更新

Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

无训练的可懂度引导的噪声ASR观测添加

Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng

发表机构 * Nanyang Technological University(南洋理工大学) Nara Institute of Science and Technology(奈良先端科学技术大学院大学)

AI总结 提出一种无训练的可懂度引导观测添加方法,通过后端ASR的可懂度估计推导融合权重,提升噪声环境下ASR鲁棒性,无需修改SE或ASR模型参数。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

自动语音识别(ASR)在噪声环境中严重退化。尽管语音增强(SE)前端有效抑制背景噪声,但它们常常引入损害识别的伪影。观测添加(OA)通过融合噪声和SE增强语音解决了这一问题,无需修改SE或ASR模型的参数。本文提出了一种可懂度引导的OA方法,其中融合权重从后端ASR直接获得的可懂度估计中推导。与基于训练好的神经预测器的先前OA方法不同,所提出的方法无需训练,降低了复杂度并增强了泛化能力。在多种SE-ASR组合和数据集上的大量实验表明,该方法相比现有OA基线具有强大的鲁棒性和改进。对可懂度引导的基于切换的替代方案以及帧级与话语级OA的进一步分析也验证了所提出的设计。

英文摘要

Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and frame-level versus utterance-level OA further validate the proposed design.
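
Observation addition is, at its core, a weighted sum of the noisy and the enhanced waveform. The sketch below shows that fusion at the utterance level. The weight rule in `oa_weight` is purely hypothetical: the abstract says only that weights come from intelligibility estimates of the backend ASR, so the normalisation used here, the function names, and the convention that `w` multiplies the noisy signal are all assumptions.

```python
from typing import List, Sequence


def oa_weight(intel_noisy: float, intel_enhanced: float) -> float:
    """Hypothetical weight rule: the noisy signal's share of total intelligibility."""
    total = intel_noisy + intel_enhanced
    if total <= 0.0:
        # Assumption: with no evidence either way, mix the signals evenly.
        return 0.5
    return intel_noisy / total


def observation_addition(
    noisy: Sequence[float], enhanced: Sequence[float], w: float
) -> List[float]:
    """Fuse two time-aligned waveforms: y[t] = w * noisy[t] + (1 - w) * enhanced[t]."""
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced signals must have the same length")
    if not 0.0 <= w <= 1.0:
        raise ValueError("fusion weight must lie in [0, 1]")
    return [w * n + (1.0 - w) * e for n, e in zip(noisy, enhanced)]
```

Since only the input to the ASR changes, neither the SE nor the ASR parameters are touched, which matches the training-free setting described above. A frame-level variant would apply a separate `w` per frame instead of one per utterance.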

2603.11669 2026-06-09 eess.AS cs.SD 版本更新

SEMamba++: A General Speech Restoration Framework Leveraging Global, Local, and Periodic Spectral Patterns

SEMamba++:一种利用全局、局部和周期性频谱模式的通用语音修复框架

Yongjoon Lee, Jung-Woo Choi

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 本文提出SEMamba++框架,通过整合全局、局部和周期性频谱特征,提升语音修复性能,同时保持计算效率。

Comments Accepted to Interspeech 2026 Long paper track. Project page: https://sites.google.com/view/semambapp

详情
AI中文摘要

通用语音修复需要能够解释复杂语音结构并在各种失真下工作的技术。虽然状态空间模型如SEMamba在语音去噪方面取得了进展,但它们并未针对关键语音特性如频谱周期性或多分辨率频率分析进行优化。在本文中,我们引入了一种架构,旨在整合语音特定的特征作为归纳偏置。特别是,我们提出了全局、局部和周期性(GLP)模块,一个有效的频率特征提取块,能够有效利用频率桶的属性。然后,我们设计了一个多分辨率并行时频双处理块以捕捉多样的频谱模式,并设计了一个可学习的映射以进一步提高模型性能。通过整合所有想法,所提出的SEMamba++在多个基线模型中表现最佳,同时保持计算效率。

英文摘要

General speech restoration demands techniques that can interpret complex speech structures under various distortions. While State-Space Models like SEMamba have advanced the state of the art in speech denoising, they are not inherently optimized for critical speech characteristics, such as spectral periodicity or multi-resolution frequency analysis. In this work, we introduce an architecture tailored to incorporate speech-specific features as inductive biases. In particular, we propose the Global, Local, and Periodic (GLP) module, a frequency feature extraction block that effectively and efficiently leverages the properties of frequency bins. Then, we design a multi-resolution parallel time-frequency dual-processing block to capture diverse spectral patterns, and a learnable mapping to further enhance model performance. With all our ideas combined, the proposed SEMamba++ achieves the best performance among multiple baseline models while remaining computationally efficient.

5. 音频事件检测与场景理解 3 篇

2606.02341 2026-06-09 cs.SD cs.LG 版本更新

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

参数高效的双编码器架构与可微Choquet积分融合用于水下声学分类

Amirmohammad Mohammadi, Joshua Peeples, Alexandra Van Dine

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出一种双编码器神经网络架构,同时处理波形和频谱图,利用预训练骨干和参数高效微调模块,并通过基于Choquet积分的可微模糊聚合机制融合时域和频域表示,提高分类准确性和可解释性。

Comments 9 pages, 7 figures

详情
AI中文摘要

水下声学分类具有广泛的海事应用,但由于日益复杂的声学环境而面临挑战。波形和频谱图表示已被主要用作该领域分类任务的声学数据特征。频谱图建模谐波依赖性,但这些降维表示可能过滤掉与判别相关的声学特征。虽然波形的相位信息允许对信号进行完整表征,但原始波形可能嘈杂且复杂,使得模型难以直接处理该表示。本文提出一种双编码器神经网络架构,同时处理声学波形和频谱图,利用预训练骨干和参数高效微调模块,实现领域自适应。为了结合这些自适应分支,引入了一种基于Choquet积分的可微模糊聚合机制,以平衡时域和频谱表示。这种融合策略不仅提高了分类准确性,还提供了可解释性。具体来说,通过分析学习到的模糊测度,揭示了网络表示依赖性的类别特定变化。通过动态将注意力转移到受潜在非对称信道失真影响最小的表示上,所提出的门控机制缓解了水下环境的非平稳挑战。在DeepShip和ShipsEar数据集上的评估表明,所提出的架构相对于独立的单编码器基线实现了分类改进,同时限制了可训练参数空间。这减轻了在有限声学数据集上过拟合的风险,同时降低了与完全微调基础模型相关的计算成本。

英文摘要

Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules to enable domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network's representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.
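
For readers unfamiliar with the Choquet integral, the discrete form used for fusion sorts the source values in descending order and weights each one by the increment of a fuzzy measure g over the growing set of top-ranked sources. The sketch below computes it in plain Python for two sources. It is not the paper's differentiable layer (where g would be learned); the source names `wave` and `spec` and the measure values are made up for illustration.

```python
from typing import Dict, FrozenSet


def choquet_integral(
    values: Dict[str, float], measure: Dict[FrozenSet[str], float]
) -> float:
    """Discrete Choquet integral of `values` with respect to a fuzzy measure.

    `measure` maps each non-empty subset of sources to g(subset), with
    g(all sources) = 1 and g monotone under set inclusion (assumed, not checked).
    """
    ordered = sorted(values, key=values.get, reverse=True)  # largest value first
    total, previous_g, subset = 0.0, 0.0, frozenset()
    for name in ordered:
        subset = subset | {name}  # set of the top-ranked sources so far
        g = measure[subset]
        total += values[name] * (g - previous_g)
        previous_g = g
    return total


# Illustrative, made-up numbers: per-branch confidence for one class.
scores = {"wave": 0.8, "spec": 0.4}
g = {
    frozenset({"wave"}): 0.3,
    frozenset({"spec"}): 0.6,
    frozenset({"wave", "spec"}): 1.0,
}
fused = choquet_integral(scores, g)  # 0.8 * 0.3 + 0.4 * (1.0 - 0.3)
```

When g is additive (for example 0.5 for each single source), the integral reduces to a weighted average; non-additive measures let the fusion reward or penalise agreement between branches, which is where the interpretability mentioned in the abstract comes from.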

2601.04178 2026-06-09 eess.AS cs.SD 版本更新

Sound Event Detection with Boundary-Aware Optimization and Inference

基于边界感知优化与推理的声音事件检测

Florian Schmid, Chi Ian Tang, Sanjeel Parekh, Vamsi Krishna Ithapu, Juan Azcarreta Ortiz, Giacomo Ferroni, Yijun Qian, Arnoldas Jasonas, Cosmin Frateanu, Camilla Clark, Gerhard Widmer, Çağdaş Bilen

发表机构 * Meta Institute of Computational Perception(计算感知研究所) Linz Institute of Technology (LIT)(林茨技术研究所) Meta Reality Labs Research(Meta现实实验室研究)

AI总结 提出边界感知优化与推理策略,通过显式建模事件起始和偏移,结合循环事件检测与事件提议网络,在AudioSet强标注子集上实现无需后处理调参的SOTA性能。

Comments Accepted for publication in IEEE Signal Processing Letters, 2026

详情
AI中文摘要

时间检测问题出现在许多领域,包括时间序列估计、活动识别和声音事件检测(SED)。在这项工作中,我们提出了一种新的时间事件建模方法,通过显式建模事件起始和偏移,并引入边界感知优化和推理策略,显著增强了时间事件检测。所提出的方法包含了新的时间建模层——循环事件检测(RED)和事件提议网络(EPN),它们与定制的损失函数一起,实现了更有效和精确的时间事件检测。我们在SED领域使用AudioSet中时间强标注部分的一个子集评估了所提出的方法。实验结果表明,我们的方法不仅优于具有最先进后处理的传统逐帧SED模型,而且消除了后处理超参数调优的需要,并扩展以在所有AudioSet强类别上实现新的最先进性能。

英文摘要

Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.

2602.18777 2026-06-09 eess.AS cs.SD 版本更新

Mind the Gap: Detecting Cluster Exits for Robust Local Density-Based Score Normalization in Anomalous Sound Detection

注意差距:检测聚类出口以实现异常声音检测中鲁棒的局部密度分数归一化

Kevin Wilkinghoff, Gordon Wichern, Jonathan Le Roux, Zheng-Hua Tan

发表机构 * Department of Electronic Systems, Aalborg University(电子系统系,奥尔堡大学) Pioneer Centre for Artificial Intelligence(先锋人工智能中心) Mitsubishi Electric Research Laboratories (MERL)(三菱电机研究实验室(MERL))

AI总结 针对异常声音检测中局部密度分数归一化对邻域大小敏感的问题,提出聚类出口检测机制,通过识别距离不连续性自适应选择邻域大小,提升鲁棒性和性能。

详情
AI中文摘要

局部密度分数归一化是异常声音检测中基于距离的嵌入方法的有效组成部分,尤其是在数据密度随条件或领域变化时。然而,在实践中,性能强烈依赖于邻域大小。当邻域扩展跨越聚类边界时,增加邻域大小会降低检测精度,违反了局部密度估计的局部性假设。这一观察促使我们基于局部性保持而不是预先固定来调整邻域大小。我们通过提出聚类出口检测来实现这一点,这是一种轻量级机制,用于识别距离不连续性并相应地选择邻域大小。在多个嵌入模型和数据集上的实验表明,该方法对邻域大小选择具有更好的鲁棒性,并带来一致的性能提升。

英文摘要

Local density-based score normalization is an effective component of distance-based embedding methods for anomalous sound detection, particularly when data densities vary across conditions or domains. In practice, however, performance depends strongly on neighborhood size. Increasing it can degrade detection accuracy when neighborhood expansion crosses cluster boundaries, violating the locality assumption of local density estimation. This observation motivates adapting the neighborhood size based on locality preservation rather than fixing it in advance. We realize this by proposing cluster exit detection, a lightweight mechanism that identifies distance discontinuities and selects neighborhood sizes accordingly. Experiments across multiple embedding models and datasets show improved robustness to neighborhood-size selection and consistent performance gains.
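
One simple way to picture "identifying distance discontinuities" is to scan a query's sorted nearest-neighbour distances and stop at the first large jump. The sketch below does exactly that. It is an illustration under stated assumptions, not the method from the paper: the ratio test, the `jump_ratio` default, and the function name are all hypothetical.

```python
from typing import Sequence


def select_neighborhood_size(
    sorted_dists: Sequence[float], k_max: int, jump_ratio: float = 2.0
) -> int:
    """Pick the largest k whose neighbours all precede the first distance jump.

    `sorted_dists` holds distances to the nearest neighbours in ascending
    order. A "jump" is assumed to be a ratio between consecutive distances
    above `jump_ratio`; if none is found, k_max (capped by the list length)
    is returned.
    """
    k_max = min(k_max, len(sorted_dists))
    for k in range(1, k_max):
        prev_d, next_d = sorted_dists[k - 1], sorted_dists[k]
        if prev_d > 0 and next_d / prev_d > jump_ratio:
            # The (k+1)-th neighbour lies beyond a gap: the cluster was exited.
            return k
    return k_max
```

With distances `[1.0, 1.1, 1.2, 5.0, 5.1]` the scan stops before the jump to 5.0, so only the three neighbours inside the local cluster would feed the density estimate.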

6. 音乐信息检索与音乐生成 3 篇

2606.08722 2026-06-09 cs.SD cs.CL 新提交

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

LLM 能否理解 LilyPond?一个用于符号音乐生成与理解的基准

Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rodà

发表机构 * University of Padova(帕多瓦大学) Universitat Pompeu Fabra(庞培法布拉大学)

AI总结 提出 LilyBench,基于 LilyPond 的基准,联合评估开源 LLM 的符号音乐生成与理解能力,实验表明零样本可生成可执行 LilyPond,但结构理解任务仍有挑战,且指标间存在系统性分歧。

Comments Accepted at Ital-IA 2026

详情
AI中文摘要

大型语言模型的符号音乐评估在表示、数据集和指标上仍然碎片化。我们引入了 LilyBench,一个基于 LilyPond 的基准,用于在同一系列开源权重 LLM 上联合评估符号音乐生成和音乐理解。该基准包括一个 200 个提示的生成套件和十个从 ABC-Eval 改编的理解任务,涵盖语法、元数据预测、结构排序和音乐识别。生成质量通过编译率、基于 Jensen-Shannon 相似度的 MusPy 描述符分布以及基于 LilyBERT 的 Fréchet 音乐距离 (FMD) 进行评估。在四个开源模型上的实验表明,在零样本设置下可以实现可执行的 LilyPond 生成,而结构理解任务尽管在作曲家和流派识别上表现强劲,但仍然具有挑战性。我们的实验还揭示了基于描述符和基于嵌入的指标之间的系统性分歧,表明符号音乐评估受益于指标三角测量而非单一分数排名。我们发布了基准、提示库和评估代码,以支持未来在符号音乐生成和理解方面的研究,地址为 https://github.com/CSCPadova/lilybench。

英文摘要

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
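
As a rough illustration of comparing descriptor distributions, the snippet below computes a Jensen-Shannon similarity between two histograms. The definition used here, one minus the base-2 Jensen-Shannon divergence, is an assumption: the abstract does not say which variant LilyBench uses, and the function names are hypothetical.

```python
import math
from typing import Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Base-2 Jensen-Shannon divergence between two discrete distributions.

    Both inputs must be non-negative, have the same length, and sum to 1.
    The result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")

    def kl_to_mixture(a: Sequence[float], b: Sequence[float]) -> float:
        total = 0.0
        for a_i, b_i in zip(a, b):
            if a_i > 0.0:  # bins with zero mass contribute nothing
                total += a_i * math.log2(a_i / (0.5 * (a_i + b_i)))
        return total

    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


def js_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed similarity score: 1 - JSD, so identical histograms score 1."""
    return 1.0 - js_divergence(p, q)
```

A descriptor-based score of this kind compares summary statistics only, which is one reason it can disagree with an embedding-based distance such as FMD, as the abstract reports.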

2312.15946 2026-06-09 cs.SD cs.GR eess.AS 版本更新

EnchantDance: Unveiling the Potential of Music-Driven Dance Movement

EnchantDance: 揭示音乐驱动舞蹈动作的潜力

Bo Han, Teng Zhang, Zeyu Ling, Feilin Han

发表机构 * Zhejiang University(浙江大学) Tongji University(同济大学)

AI总结 提出EnchantDance框架,通过构建舞蹈潜在空间和扩散模型,结合大规模数据集ChoreoSpectrum3D和音乐流派预测网络,提升舞蹈生成的质量、多样性和一致性。

Comments Project Page: https://fluide1022.github.io/EnchantDance/

详情
AI中文摘要

音乐驱动的舞蹈生成任务涉及创建与给定音乐相对应的连贯舞蹈动作。现有方法虽然能生成物理上合理的舞蹈,但往往难以泛化到未见数据。挑战来自三个方面:1)舞蹈动作的高度多样性和音乐模态分布的显著差异,使得生成与音乐对齐的舞蹈动作困难;2)缺乏大规模音乐-舞蹈数据集,阻碍了从音乐生成泛化舞蹈动作;3)舞蹈动作的持续性对保持一致的舞蹈风格构成挑战。在这项工作中,我们引入了EnchantDance框架,一种最先进的舞蹈生成方法。由于原始舞蹈序列在时间轴上的冗余性,EnchantDance首先构建一个强大的舞蹈潜在空间,然后在舞蹈潜在空间上训练舞蹈扩散模型。为了解决数据缺口,我们构建了一个大规模音乐-舞蹈数据集ChoreoSpectrum3D Dataset,包含四种舞蹈风格,总时长70.32小时,是迄今为止报道的最大音乐-舞蹈数据集。为了增强音乐流派与舞蹈风格之间的一致性,我们使用迁移学习预训练了一个音乐流派预测网络,并在舞蹈扩散模型的训练中将音乐流派作为额外的条件信息。大量实验表明,我们提出的框架在舞蹈质量、多样性和一致性方面达到了最先进的性能。

英文摘要

The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements; 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music; and 3) the protracted nature of dance movements, which makes it challenging to maintain a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.

2605.03395 2026-06-09 cs.SD cs.AI cs.LG cs.MM 版本更新

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

APEX:面向AI生成音乐的大规模多任务美学感知流行度预测

Jaavid Aktar Husain, Dorien Herremans

发表机构 * AMAAI Lab, Singapore University of Technology and Design(新加坡科技设计大学AMAAI实验室)

AI总结 提出APEX框架,利用MERT音频嵌入联合预测AI生成音乐的流行度指标与五维美学质量,在Music Arena数据集上验证了美学特征对偏好预测的泛化能力。

详情
AI中文摘要

音乐流行度预测因其对艺术家、平台和推荐系统的重要性而吸引了越来越多的研究兴趣。然而,AI生成音乐平台的爆炸式增长创造了一个全新且很大程度上未被探索的领域,每天都有大量歌曲被生产和消费,而没有传统的艺术家声誉或唱片公司支持。在这一探索中,美学质量是关键但尚未被研究的因素。我们提出了APEX,这是首个面向AI生成音乐的大规模多任务学习框架,在来自Suno和Udio的超过21.1万首歌曲(1万小时音频)上训练,该框架联合预测基于参与度的流行度信号——流媒体播放量和点赞分数——以及从MERT(一个自监督音乐理解模型)提取的冻结音频嵌入中的五个感知美学质量维度。美学质量和流行度捕捉了音乐的互补方面,两者结合被证明是有价值的:在Music Arena数据集上的分布外评估中,该数据集包含训练期间未见过的十一个生成音乐系统之间的成对人类偏好对决,引入美学特征持续改进了偏好预测,展示了所学表示在生成架构上的强大泛化能力。

英文摘要

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. A key yet unexplored factor in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals - streams and likes scores - alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

7. 语音翻译与语音语言模型 2 篇

2606.08425 2026-06-09 cs.SD cs.CL eess.AS 新提交

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

TinyGiantALM:面向资源约束下意图感知推理的紧凑型音频-语言模型

Vinh-Thuan Ly

发表机构 * University of Science, VNU-HCM(胡志明市国立大学下属理科大学) Vietnam National University, Ho Chi Minh City(胡志明市国立大学)

AI总结 提出紧凑型1.5B参数音频-语言模型TinyGiantALM,通过指令感知特征精炼框架(查询引导投影器+语义门控)过滤用户意图相关声学信号,在MMAR基准上零样本准确率46.4%,超越7B-13B基线,并优于8倍大模型。

Comments Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app

详情
AI中文摘要

当前音频推理的进展依赖于大规模音频-语言模型(LALMs),阻碍了在资源受限环境中的部署。我们提出了TinyGiantALM,一个紧凑的1.5B参数效率导向替代方案。不同于暴力扩展规模,我们提出了一种指令感知特征精炼框架,使用查询引导投影器和语义门控,基于用户意图过滤声学信号。在MMAR基准上,TinyGiantALM实现了46.4%的零样本准确率,显著优于7B-13B基线。虽然在逻辑叙事推理方面与30B+模型存在差距,且在过于密集或空间场景中存在某些权衡,但我们的方法在解耦混合模态环境方面显著优于高达8倍大小的模型。这些发现表明,架构精度为在边缘友好规模上获得稳健感知能力提供了一条切实可行的路径。

英文摘要

Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.

2606.07547 2026-06-09 cs.CL cs.AI cs.SD 交叉投稿

Liberating LLM Capabilities in Full-Duplex Speech Models

在全双工语音模型中释放LLM能力

Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao

发表机构 * Royal Zhang

AI总结 提出Listen-Write-Speak (LWS)三通道范式,使LLM在共享因果注意力上下文中同时监听、书写可见文本并实时口语回应,无需架构修改,实现全双工交互。

详情
AI中文摘要

基于语音的大型语言模型通常局限于口语回复,这将其面向用户的输出限制在可口头表达的内容上,并抑制了文本原生能力,如代码生成、结构化分析和实时交互中的多步推理,对于需要持久、结构化且可检查的中间输出的任务。现有工作改进了口语推理或全双工轮流发言,但仍将文本视为隐藏的中间状态或从属模态,而非第一类输出通道。我们提出Listen-Write-Speak (LWS),一种文本优先的三通道范式,其中单个自回归LLM持续监听用户音频,写出可见的自由形式文本作为其主要输出,并在共享因果注意力上下文中并行生成实时口语回应。该行为完全通过Token Schema实现,无需架构修改,并通过两阶段数据流水线学习,该流水线合成与揭示的输入时间线一致的每秒认知注释。实验上,LWS在Full-Duplex-Bench上展示了强大的全双工交互,在VoiceBench AlpacaEval上达到4.72,写作-口语一致性达92.6%,并在URO-Bench上持续优于其内部消融版本。这些结果表明,可见书写可以作为语音交互的第一类输出通道,而不会牺牲实时响应性。代码和数据集可在项目页面获取:https://royalzhang.com/project/lws-page/。

英文摘要

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.

8. 多模态音频与视听学习 3 篇

2606.07533 2026-06-09 cs.CL cs.AI cs.SD 交叉投稿

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

桥接传统可解释性方法与多模态多语言模型:基于XAI的分析

Paweł Pozorski, Jakub Muszyński, Maria Ganzha

AI总结 提出多模态Shapley值框架,结合频谱图引导的音素对齐(SGPA)预处理方法,实现文本与音频特征的可解释性归因,并开源计算包与可视化工具。

Comments Bachelor's thesis

详情
AI中文摘要

多模态大语言模型(MLLMs)有效整合文本和音频以理解复杂交互对话中的上下文。然而,异质模态影响模型行为的内部机制仍然不透明。虽然Shapley值(SV)为基于文本的NLP提供了鲁棒的、模型无关的局部可解释性框架,但其扩展到多模态数据受到跨通道依赖、复杂对话结构以及密集音频表示的高计算复杂性的阻碍。在这项工作中,我们形式化了Shapley值框架的多模态扩展,将离散文本标记和对齐的音频片段视为协作特征。为确保计算可行性,我们部署了一套高效的估计策略:低维输入的精确SV计算和基于采样的近似(包括蒙特卡洛排列和具有Neyman最优分配的分层抽样),以在有限计算预算下最小化方差。为解决模态间的粒度不匹配问题,我们提出了频谱图引导的音素对齐(SGPA),一种新颖的预处理方法,将高频音频流映射到可解释的、单词对齐的片段。我们的贡献有两方面:首先,我们提供了一个开源的、模型无关的Python包和配套的GUI,用于多模态归因的计算和交互式可视化。其次,我们使用VoiceBench和Infinity Instruct数据集的精选子集,在多种多语言场景下评估我们的框架。实验结果表明,输入模态是归因波动的主要驱动因素,并证明标准句法重要性代理在多模态跨语言上下文中通常无法预测模型注意力。

英文摘要

Multimodal Large Language Models (MLLMs) effectively integrate text and audio to interpret context in complex interactive dialogues. However, the internal mechanisms by which heterogeneous modalities influence model behavior remain opaque. While Shapley Values (SV) provide a robust, model-agnostic framework for local explainability in text-based NLP, their extension to multimodal data is hindered by cross-channel dependencies, intricate dialogue structures, and the prohibitive computational complexity of dense audio representations. In this work, we formalize a multimodal extension of the Shapley Value framework, treating discrete text tokens and aligned audio segments as cooperative features. To ensure computational feasibility, we deploy a suite of efficient estimation strategies: exact SV computation for low-dimensional inputs and sampling-based approximations (including Monte Carlo permutations and stratified sampling with Neyman-optimal allocation) to minimize variance under constrained computational budgets. To resolve the granularity mismatch between modalities, we propose Spectrogram-Guided Phonetic Alignment (SGPA), a novel preprocessing method that maps high-frequency audio streams to interpretable, word-aligned segments. Our contribution is twofold: first, we provide an open-source, model-agnostic Python package and a companion GUI for the computation and interactive visualization of multimodal attributions. Second, we evaluate our framework using curated subsets of the VoiceBench and Infinity Instruct datasets across diverse multilingual scenarios. Our experimental results reveal that input modality is a primary driver of attribution volatility and demonstrate that standard syntactic importance proxies often fail to predict model attention in multimodal, cross-lingual contexts.
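
For readers unfamiliar with the two estimation modes named in the abstract, the following is a minimal sketch of exact Shapley values versus a Monte Carlo permutation estimate. It is not the authors' package: the function names and the toy value function `toy_value` are hypothetical, and in the paper's setting the "features" would be text tokens and aligned audio segments, with the value function being a model output computed with some features masked.

```python
import random
from itertools import permutations


def _marginal_gains(value_fn, order, phi):
    """Add each feature's marginal gain along one ordering into phi."""
    coalition = set()
    prev = value_fn(frozenset(coalition))
    for i in order:
        coalition.add(i)
        cur = value_fn(frozenset(coalition))
        phi[i] += cur - prev
        prev = cur


def shapley_exact(value_fn, n_features):
    """Exact Shapley values: average marginal gain over all n! orderings."""
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        _marginal_gains(value_fn, order, phi)
    return [p / len(orderings) for p in phi]


def shapley_monte_carlo(value_fn, n_features, n_permutations, seed=0):
    """Monte Carlo estimate: average marginal gain over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    for _ in range(n_permutations):
        order = list(range(n_features))
        rng.shuffle(order)
        _marginal_gains(value_fn, order, phi)
    return [p / n_permutations for p in phi]


# Hypothetical toy game: additive weights plus an interaction bonus
# when features 0 and 1 are both present.
WEIGHTS = [1.0, 2.0, 3.0]


def toy_value(coalition):
    total = sum(WEIGHTS[i] for i in coalition)
    if 0 in coalition and 1 in coalition:
        total += 1.5
    return total


if __name__ == "__main__":
    print(shapley_exact(toy_value, 3))
    print(shapley_monte_carlo(toy_value, 3, n_permutations=200))
```

The exact version costs n! evaluations per feature set, which is why it is only practical for low-dimensional inputs; the sampled version trades that cost for variance, which stratified schemes then try to reduce.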

2606.07577 2026-06-09 cs.AI cs.CV cs.SD eess.AS 交叉投稿

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

OmniMem: 面向流式音视频大语言模型的扰动感知记忆压缩

Guangzhi Sun, Yixuan Li, Yudong Yang, Chao Zhang

发表机构 * Tsinghua University(清华大学) ByteDance(字节跳动) Department of Engineering, University of Cambridge(剑桥大学工程系)

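AI summary: Proposes OmniMem, a streaming memory compression framework for audio-visual LLMs that compresses the KV cache through modality-aware allocation and perturbation-aware selection, reducing memory while preserving long-video understanding and improving absolute accuracy by 2-4% over training-free compression baselines on several benchmarks.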

Comments Code: https://github.com/bytedance/SALMONN/tree/omni_mem

详情
AI中文摘要

音视频大语言模型(LLMs)在长视频理解方面具有强大潜力,但其长视频推理从根本上受到视频令牌和键值(KV)缓存线性增长的制约。我们提出OmniMem,一种专为音视频LLMs设计的内存高效流式框架。与将所有令牌统一处理的现有压缩方法不同,OmniMem引入了一种模态感知的内存分配策略,分别管理视觉和音频上下文,解决了两种模态之间的严重令牌不平衡问题。OmniMem进一步通过扰动感知的内存选择保留信息丰富且非冗余的KV状态,实现紧凑内存而不牺牲长程理解。为了在现实部署约束下加强压缩,我们还探索了预算感知微调,鼓励模型将有用信息整合到保留内存中。在VideoMME Long、LVBench和LVOmniBench上使用video-SALMONN 2+和Qwen-2.5-Omni的实验表明,在相同内存预算下,OmniMem始终比强训练无关压缩基线提高2-4%的绝对准确率,微调后额外提高1-2%。

英文摘要

Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.

2603.12046 2026-06-09 eess.AS cs.CV cs.SD 版本更新

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Dr. SHAP-AV:通过Shapley归因解码音频-视觉语音识别中的相对模态贡献

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

发表机构 * Imperial College London, UK(伦敦帝国学院,英国) NatWest AI Research, UK(英国NatWest人工智能研究)

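AI summary: Proposes Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in audio-visual speech recognition. It shows that models shift toward visual reliance under noise while audio contributions remain high, and argues for modality-weighting mechanisms and Shapley attribution as a standard diagnostic.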

Comments Accepted to INTERSPEECH 2026 [Long Paper track]. Project website: https://umbertocappellazzo.github.io/Dr-SHAP-AV

详情
AI中文摘要

音频-视觉语音识别(AVSR)利用音频和视觉信息在噪声环境下实现鲁棒识别。然而,模型如何平衡这些模态仍不清楚。我们提出了Dr.SHAP-AV框架,利用Shapley值分析AVSR中的模态贡献。通过在两个基准测试中六个模型上进行实验,不同SNR水平下,我们引入三种分析:全局Shapley用于整体模态平衡,生成Shapley用于解码过程中的贡献动态,时间对齐Shapley用于输入-输出对应性。我们的发现表明,在噪声下模型倾向于依赖视觉,但在严重退化下仍保持高音频贡献。模态平衡在生成过程中演变,时间对齐在噪声下保持稳定,SNR是驱动模态权重的主要因素。这些发现揭示了持续的音频偏见,推动了定制化的模态加权机制和基于Shapley的归因作为标准AVSR诊断工具。

英文摘要

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.

9. Low-Resource, Multilingual, and Dialectal Speech (1 paper)

2604.27273 2026-06-09 cs.SD 版本更新

Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?

少样本合成口音语音用于ASR微调:什么有帮助,何时有帮助?

Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson, Dilek Hakkani-Tür, Volodymyr Kindratenko

发表机构 * University of Washington(华盛顿大学)

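AI summary: Compares kinds of synthetic accented speech for ASR fine-tuning and finds that random phoneme perturbations recover most of the gain, with target-accent phoneme edits improving over them by only a small margin, and that mixing real with synthetic speech stabilizes low-resource fine-tuning.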

Comments Accepted as a contributed talk and poster at the ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

合成口音语音是在真实口音录音稀缺时提升自动语音识别(ASR)性能的一种有前景的方法。我们探讨是什么使此类数据对ASR微调有用:是让识别器接触口音特定发音的目标口音音素编辑,还是在音素空间中充当数据增强的随机音素扰动。在少样本TTS流程中,我们将LLM生成的口音编辑与匹配速率的随机替换以及使用真实口音音素和韵律的oracle对照进行比较。随机替换恢复了大部分ASR增益:LLM目标口音编辑仅比随机替换略好,真实音素的结果接近随机基线,并随着合成ASR微调集增大而几乎与之趋同,添加真实韵律仅带来小幅的进一步增益。混合合成与真实口音语音也能稳定低资源微调,但固定的合成预算在后期会稀释真实数据中的信息,表明真实与合成数据的比例很重要。

英文摘要

Synthetic accented speech is a promising way to improve automatic speech recognition (ASR) when real accented recordings are scarce. We ask what makes such data useful for ASR fine-tuning: target-accent phoneme edits that expose the recognizer to accent-specific pronunciations, or random phoneme perturbations that act as augmentation in phoneme space. In a few-shot TTS pipeline, we compare LLM-generated accent edits with matched-rate random substitutions and oracle controls using ground-truth accented phonemes and prosody. Random substitutions recover much of the ASR gain: LLM target-accent edits improve over random by only a small margin, ground-truth phonemes stay close to the random baseline and nearly converge with it as the synthetic ASR fine-tuning set grows larger, and adding ground-truth prosody yields only a modest further gain. Mixing synthetic with real accented speech also stabilizes low-resource fine-tuning, but a fixed synthetic budget can later dilute the information in real data, showing that the real-to-synthetic ratio matters.

10. Datasets, Benchmarks, and Evaluation (6 papers)

2606.08038 2026-06-09 cs.SD 新提交

Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis

探索语音反欺骗数据集的规模与多样性:实验与分析

Zhuolin Yi, Jun Xue, Yanzhen Ren, Yihuan Huang, Yi Chai, Daixian Li, Guanxiang Feng, Jiajun Liu

发表机构 * School of Cyber Science and Engineering, Wuhan University(武汉大学网络空间安全学院)

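AI summary: By decoupling training data scale from diversity, the study finds that diversity matters more than scale: excessive scaling under fixed generation methods can cause overfitting, while smaller, more diverse training sets perform better in cross-dataset evaluation.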

Comments Accepted by Interspeech 2026

详情
AI中文摘要

过去十年中,语音反欺骗数据集的规模呈指数级增长,其背后假设是更大的数据能带来更好的性能。然而,无差别地扩大规模是否能够相应地提升模型泛化能力尚不清楚。本研究通过解耦训练数据规模与多样性的影响,挑战了“规模优先”的范式。通过对代表性数据集的实验,我们报告了两个关键发现:(1)更大并不总是更好。在固定生成方法下过度扩大数据规模会带来微不足道的收益,甚至可能因过拟合而降低跨域泛化能力。(2)多样性优于规模。在跨数据集评估中,一个包含多种攻击的较小复合训练集显著优于规模更大但多样性有限的数据集。我们得出结论,未来的数据集构建应优先考虑生成方法的多样性而非规模,以有效提升模型泛化能力。

英文摘要

The scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the "scale-first" paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross-domain generalization due to overfitting. (2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger-scale datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.

2606.07643 2026-06-09 cs.CV cs.AI cs.SD eess.AS 交叉投稿

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

AVI-Bench:迈向全模态大语言模型的人类级视听智能

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

发表机构 * Nanyang Technological University(南洋理工大学)

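AI summary: Proposes AVI-Bench, a benchmark that evaluates the audio-visual intelligence of Omni-MLLMs through cross-modal tasks across three stages (perception, understanding, reasoning), adds AVI-Bench-PriSe to test primitive audio-visual sensation, reveals limitations of current models, and presents a four-level AVI taxonomy.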

Comments 31 pages, 8 figures, ICML 2026

详情
AI中文摘要

近期全模态大语言模型(Omni-MLLMs)的进展实现了视觉、音频和语言的强集成。然而,由于缺乏系统全面的基准,其视听智能(AVI)仍未被充分评估。我们提出AVI-Bench,一个受认知启发的基准,通过需要联合视听解释的跨模态任务,在感知、理解和推理三个阶段评估Omni-MLLMs。该设计能够细粒度诊断模型能力和失败模式。为进一步评估超出熟悉领域的鲁棒性,我们提出AVI-Bench-PriSe,一个扩展版本,使用不熟悉的、低语义刺激探测模型的原始视听感知,测试超出常见训练分布的泛化能力。对开源和闭源模型的大量实验揭示了当前Omni-MLLMs的显著局限性。基于这些发现,我们提出了一个四级AVI分类体系。总体而言,AVI-Bench提供了一个原则性的评估框架,以指导更鲁棒和可泛化AVI的发展。项目网站:https://fudancvl.github.io/AVI-Bench/

英文摘要

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

2502.16584 2026-06-09 cs.SD cs.AI cs.CL cs.MM eess.AS 版本更新

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

Audio-FLAN:面向语音、音乐和声音的统一音频理解与生成的指令跟随数据集

Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Xingjian Du, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

发表机构 * The Hong Kong University of Science(香港科技大学) Inner Mongolia University(内蒙古大学) Beihang University(北京航空航天大学) Queen Mary University of London(伦敦玛丽女王大学) The Chinese University of Hong Kong(香港中文大学) National University of Singapore(新加坡国立大学) University of Surrey(萨里大学) University of Rochester(罗切斯特大学) Independent Researcher(独立研究者)

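AI summary: Introduces Audio-FLAN, an instruction-tuning dataset covering 80 tasks and over 100 million instances, intended as a foundation for unified audio understanding and generation in a zero-shot manner.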

详情
AI中文摘要

最近音频标记化的进展显著增强了将音频能力集成到大语言模型(LLM)中的能力。然而,音频理解和生成通常被视为不同的任务,阻碍了真正统一的音频-语言模型的发展。虽然指令调优在文本和视觉领域已显示出在改善泛化和零样本学习方面的显著成功,但其在音频领域的应用仍基本未被探索。一个主要障碍是缺乏统一音频理解和生成的全面数据集。为解决这一问题,我们引入了Audio-FLAN,这是一个大规模指令调优数据集,涵盖语音、音乐和声音领域的80种不同任务,包含超过1亿个实例。Audio-FLAN为统一的音频-语言模型奠定了基础,这些模型能够以零样本方式无缝处理跨多种音频领域的理解(如转录、理解)和生成(如语音、音乐、声音)任务。Audio-FLAN数据集可在HuggingFace和GitHub上获取。

英文摘要

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.

2511.18421 2026-06-09 cs.SD cs.LG 版本更新

DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

DHAuDS:用于测试时自适应的动态异构音频基准

Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa

发表机构 * School of Computer and Mathematical Sciences, University of Nottingham Malaysia(诺丁汉马来西亚大学计算机与数学科学学院)

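AI summary: Existing test-time adaptation (TTA) evaluation relies on static, homogeneous corruption protocols; the DHAuDS benchmark exposes robustness weaknesses in audio classification through dynamic corruption severity and heterogeneous noise mixtures.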

详情
AI中文摘要

现有的测试时自适应(TTA)研究严重依赖静态和同质的损坏协议,例如ImageNet-C和CIFAR-10-C/100-C,导致评估设置不一致,并且可能高估与实际情况相比的鲁棒性估计。TTA缺乏能够模拟现实异构声学退化的标准化评估基础设施。我们引入了DHAuDS,这是一个标准化的基准套件,用于评估在动态损坏严重性和异构噪声混合下的音频分类TTA鲁棒性。DHAuDS并非提出新的TTA算法,而是专注于暴露在传统固定噪声评估协议下仍然隐藏的鲁棒性限制。

英文摘要

Existing Test-time Adaptation (TTA) studies rely heavily on static and homogeneous corruption protocols, such as ImageNet-C and CIFAR-10-C/100-C, leading to inconsistent evaluation settings and robustness estimates that may be inflated relative to real-world conditions. TTA lacks a standardized evaluation infrastructure capable of modeling realistic heterogeneous acoustic degradation. We introduce DHAuDS, a standardized benchmark suite for evaluating audio classification TTA robustness under dynamic corruption severity and heterogeneous noise mixtures. Rather than proposing a new TTA algorithm, DHAuDS focuses on exposing robustness limitations that remain hidden under conventional fixed-noise evaluation protocols.

2604.10628 2026-06-09 cs.SD cs.CL cs.IR 版本更新

BMdataset: A Musicologically Curated LilyPond Dataset

BMdataset:一个音乐学精心编纂的LilyPond数据集

Matteo Spanio, Ilay Guler, Antonio Rodà

发表机构 * Department of Information Engineering, University of Padua(帕多瓦大学信息工程系) Boston University(波士顿大学)

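AI summary: Presents BMdataset, 393 LilyPond scores for music understanding research, and introduces the LilyBERT model, showing that a small expert-curated dataset can outperform a large noisy corpus on composer and style classification.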

Comments Submitted to SMC2026

详情
AI中文摘要

符号音乐研究几乎仅依赖MIDI数据集,而基于文本的乐谱格式如LilyPond尚未被探索。我们提出了BMdataset,包含393个LilyPond乐谱(2,646个乐章),由专家直接从原巴洛克手稿转录,涵盖作曲家、音乐形式、乐器和乐章属性的元数据。基于此资源,我们引入LilyBERT(权重可在https://huggingface.co/csc-unipd/lilybert获取),一种基于CodeBERT的编码器,通过扩展词汇表加入115个LilyPond特定标记并进行掩码语言模型预训练。在非领域数据集Mutopia上的线性探测显示,尽管其规模较小(约90M tokens),仅在BMdataset上微调的表现优于在完整PDMX数据集(约15B tokens)上的连续预训练,证明小规模专家编纂数据集在音乐理解任务中更有效。结合广泛预训练与领域特定微调获得最佳结果(84.3%作曲家准确率),证实了两种数据制度的互补性。我们发布数据集、分词器和模型,以建立LilyPond的表示学习基准。

英文摘要

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.

2602.23958 2026-06-09 eess.AS cs.SD 版本更新

An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

Fréchet音频距离中任务诱导编码器偏差的实证分析

Wonwoo Jeong

发表机构 * Dept. of Computer Science and Engineering, Sogang University, South Korea(韩国西江大学计算机科学与工程系)

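AI summary: Decomposes evaluation into recall, precision, and alignment (semantic and structural dimensions) to analyze task-induced bias in FAD across six encoders, finds that reconstruction-, ASR-, and classification-trained encoders each have distinct strengths and weaknesses, and calls for evaluation-native encoders.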

Comments Accepted to Interspeech 2026. Source code and evaluation pipeline are available at: https://github.com/wonwoo-jeong/fad-encoder-bias

详情
AI中文摘要

Fréchet音频距离(FAD)是评估文本到音频生成的事实标准,但其分数依赖于底层编码器的嵌入空间。编码器的训练任务决定了哪些声学特征被保留或丢弃,导致FAD继承系统性的任务诱导偏差。我们将评估分解为召回率、精度和对齐(分为语义和结构维度),并使用对数尺度归一化以实现公平的跨编码器比较。在两个数据集上对六种编码器进行的受控实验揭示了四轴权衡:基于重建的AudioMAE主导精度敏感性;ASR训练的Whisper在结构检测中占优,但对信号退化视而不见;分类训练的VGGish最大化语义检测,但惩罚合法的类内变异。由于没有单个编码器是通用评估器,未来的指标必须转向与人类感知内在一致的评估原生编码器。

英文摘要

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
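
As background for why the encoder matters: the Fréchet distance between two Gaussians fitted to reference and generated embeddings is ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}), so every term is computed in the encoder's embedding space. The sketch below is an illustration only, not the paper's pipeline: it assumes diagonal covariances (per-dimension variances), under which the trace term reduces to a sum of squared differences of standard deviations, and the function name is hypothetical. A full FAD implementation uses full covariance matrices and a matrix square root.

```python
import math


def frechet_distance_diag(mu_r, var_r, mu_g, var_g):
    """Fréchet distance between two Gaussians with DIAGONAL covariances.

    mu_* are per-dimension means, var_* are per-dimension variances.
    Simplifying assumption: off-diagonal covariance terms are ignored.
    """
    # Squared Euclidean distance between the mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    # equals the sum over dimensions of (sigma_r - sigma_g)^2.
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term


if __name__ == "__main__":
    # Hypothetical 2-D embedding statistics for reference and generated audio.
    print(frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

Swapping the encoder changes the means and variances fed into this function, which is the route by which a training task's notion of "relevant" acoustic features enters the score.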

11. Security, Privacy, and Deepfake Audio (3 papers)

2606.08669 2026-06-09 cs.SD cs.LG 新提交

A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis

基于SSL的特征提取器与后端分类器在欺骗检测中的比较:多语料库训练与跨语言分析

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

发表机构 * Avignon Universite(阿维尼翁大学) EURECOM

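AI summary: Using multi-corpus training and cross-linguistic analysis, compares four self-supervised feature extractors paired with four back-end classifiers for spoofing detection, exposes a domain bias in the ASVspoof 5 dataset, and finds that fine-tuning with just 8 hours of target-language data improves detection robustness.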

详情
AI中文摘要

语音生物识别系统面临来自欺骗攻击的日益增长的威胁,然而检测模型的评估在不同数据集上仍然不一致。为了研究这些不可预测的波动,我们对四种自监督学习特征提取器与四种后端分类器的组合进行了全面基准测试。我们比较了ResNet的层次化局部特征提取与基于注意力和图的后端的全局序列和关系建模。通过三种场景下的多语料库训练和六个评估数据集,我们的实证分析得出了两个关键发现。首先,我们揭示了ASVspoof 5数据集中的领域偏差,表明简单的数据缩放会主动降低性能。其次,我们的跨语言分析表明,仅用8小时的目标语言数据微调即可增强检测鲁棒性。这些发现共同强调了在欺骗检测中需要领域感知和语言特定适应的关键需求。

英文摘要

Voice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self-supervised learning feature extractors paired with four back-end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph-based back-ends. Through multi-corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross-linguistic analysis reveals that fine-tuning with just 8 hours of target-language data enhances detection robustness. Together, these findings emphasize the critical need for domain-aware and language-specific adaptation in spoofing detection.

2606.08678 2026-06-09 cs.SD cs.LG 新提交

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

基于梯度反转和变分信息瓶颈的说话人不变表示学习用于欺骗检测

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

发表机构 * Avignon Universite(阿维尼翁大学) EURECOM

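AI summary: To address poor generalization caused by speaker bias in spoofing detection, proposes a teacher-student framework that disentangles identity information with a gradient reversal layer and a variational information bottleneck, achieving a 25.7% relative EER reduction over the MHFA baseline across nine datasets.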

详情
AI中文摘要

先进的生成语音技术可能破坏语音生物识别的可靠性。虽然欺骗检测系统在域内条件下评估时表现出色,但对域外设置的泛化能力通常较差。在本文中,我们表明此类问题可能由说话人偏差引起,即模型学习个体声音特征而非操作或生成的标记。我们提出了一种用于说话人不变欺骗检测的教师-学生框架,该框架无需说话人标签即可解耦身份。我们利用预训练的说话人识别教师通过梯度反转层指导学生模型。为了控制抑制与语音身份相关线索和保留与欺骗检测相关线索之间的平衡,我们集成了变分信息瓶颈。在九个数据集上的评估表明,与MHFA基线相比,我们的模型实现了EER相对降低25.7%。

英文摘要

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.
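
To make the reported metric concrete, here is a minimal sketch of how an equal error rate (EER) and a relative reduction can be computed. It is not the authors' evaluation code: the function names are hypothetical, the scores are made up, and it assumes that a higher score means "more likely bona fide". The abstract gives only the relative figure, so any baseline EER plugged into `relative_reduction` is an assumption.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted as bona fide when its score >= threshold.
    Returns the mean of the two error rates at the threshold where
    they are closest to each other.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline_eer, new_eer):
    """Relative reduction of new_eer with respect to baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Made-up scores for illustration only.
    bonafide = [0.9, 0.8, 0.4, 0.7]
    spoof = [0.1, 0.2, 0.6, 0.3]
    print(compute_eer(bonafide, spoof))
    # A hypothetical baseline EER of 10.0% falling to 7.43%.
    print(relative_reduction(10.0, 7.43))
```

A relative reduction is measured against the baseline's own EER, so a 25.7% relative reduction is a much smaller change in absolute percentage points when the baseline EER is already low.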

2501.08238 2026-06-09 cs.SD eess.AS 版本更新

CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech

CodecFake+: 基于编解码器的重合成数据作为检测CodecFake语音的代理

Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

发表机构 * Graduate Institute of Communication Engineering, National Taiwan University(国立台湾大学电信工程学研究所) Department of Computer Science and Information Engineering, National Taiwan University(国立台湾大学资讯工程学系) Center for Language and Speech Processing at Johns Hopkins University(约翰霍普金斯大学语言与语音处理中心) Department of Electrical Engineering, National Taiwan University(国立台湾大学电机工程学系) Research Center for Information Technology Innovation, Academia Sinica(中央研究院资讯科技创新研究中心) NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)(国立台湾大学人工智能研究中心)

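AI summary: To address the emerging challenge of detecting CodecFake deepfake speech, introduces CodecFake+, a large-scale dataset with training data re-synthesized by 31 open-source codecs and web-sourced evaluation data from 17 advanced CoSG models, builds a codec taxonomy, and validates re-synthesized speech as training data.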

Comments Accepted by TASLP 2026

详情
AI中文摘要

随着神经音频编解码器的快速发展,基于编解码器的语音生成(CoSG)系统变得非常强大。不幸的是,CoSG也使得创建高度逼真的深度伪造语音成为可能,更容易模仿个人声音并传播错误信息。我们将这种由CoSG系统生成的新兴深度伪造语音称为CodecFake。检测这种CodecFake是一个紧迫的挑战,然而现有系统大多主要关注检测由传统语音合成模型生成的伪造语音。在本文中,我们介绍了CodecFake+,一个旨在推进CodecFake检测的大规模数据集。据我们所知,CodecFake+是包含最多样化编解码器架构的最大数据集。训练集通过使用31个公开可用的开源编解码器模型进行重合成生成,而评估集包括来自17个先进CoSG模型的网络数据。我们还提出了一个全面的分类体系,根据编解码器的根组件:向量量化器、辅助目标和解码器类型对其进行分类。我们提出的数据集和分类体系使得能够在多个层面进行详细分析,以辨别成功检测CodecFake的关键因素。在单个编解码器层面,我们验证了使用编解码器重合成语音(CoRS)作为训练数据用于大规模CodecFake检测的有效性。在分类体系层面,我们表明当重合成模型包含解缠辅助目标或频域解码器时,检测性能最强。此外,从使用所有CoRS训练数据的角度,我们表明我们提出的分类体系可用于选择更好的训练数据以提高检测性能。总体而言,我们期望CodecFake+将成为通用和细粒度探索的重要资源,以开发更好的针对CodecFake的反欺骗模型。

英文摘要

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.

12. Other / General Speech and Audio (9 papers)

2606.07673 2026-06-09 cs.SD cs.AI cs.LG 新提交

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

声带创伤性与非声带创伤性声音亢进的自动分类的分层特征工程框架

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

发表机构 * Department of Electronic Engineering, Wonkwang University(圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University(圆光大学人工智能融合研究院) GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology(光州科学技术院GIST InnoCORE AI-Nano神经退行性疾病早期检测融合研究所) School of Electrical Engineering, KAIST(韩国科学技术院电气工程学院) Department of AI Convergence, Gwangju Institute of Science and Technology(光州科学技术院人工智能融合系)

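AI summary: Proposes a hierarchical feature engineering framework (static, dynamic, ratio-based, and coupling features) to distinguish phonotraumatic and non-phonotraumatic vocal hyperfunction from healthy controls, finds coupling features crucial for both tasks, and reports an AUC of 0.891 for PVH and 0.728 for NPVH.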

Comments Interspeech 2026

详情
AI中文摘要

动态颈部表面加速度能够实现声音亢进的无创监测,但其亚型的稳健生物标志物仍然有限。本研究利用NeckVibe Challenge数据集区分声带创伤性(PVH)和非声带创伤性(NPVH)声音亢进与健康对照组。我们提出一个分层特征工程框架,包括:(i)静态特征,(ii)动态特征,(iii)基于比率的特征,(iv)捕捉源-滤波器交互的耦合特征。单变量统计分析显示PVH具有强可分性,但NPVH显著性有限,而我们针对高维特征集成优化的机器学习流程发现,耦合特征对两项任务都至关重要。我们实现了PVH的AUC为0.891,NPVH的AUC为0.728,表明虽然PVH近似线性可分,但NPVH的区分受益于非线性特征交互建模。

英文摘要

Ambulatory neck-surface acceleration enables non-invasive monitoring of vocal hyperfunction, yet robust biomarkers for its subtypes remain limited. This study investigates the NeckVibe Challenge dataset to distinguish phonotraumatic (PVH) and non-phonotraumatic (NPVH) vocal hyperfunction from healthy controls. We propose a hierarchical feature engineering framework comprising: (i) static, (ii) dynamic, (iii) ratio-based, and (iv) coupling features capturing source-filter interactions. While univariate statistical analysis shows strong separability for PVH but limited significance for NPVH, our machine learning pipeline, tailored for high-dimensional feature integration, identifies coupling features as crucial for both tasks. We achieve an AUC of 0.891 for PVH and 0.728 for NPVH, suggesting that while PVH is near-linearly separable, NPVH discrimination benefits from modeling non-linear feature interactions.

2606.08286 2026-06-09 cs.SD 新提交

FXplorer: A Map-Based Interface for Exploratory Audio Effect Design

FXplorer: 一种基于地图的探索性音频效果设计界面

Annie Chu, Jason Brent Smith, Bryan Pardo

发表机构 * Northwestern University(西北大学)

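AI summary: Presents FXplorer, an interface that organizes audio effects in a perceptually informed 2D space and combines spatial interaction with embedding-based methods to unify continuous browsing and parameter refinement, with interactive preset editing and interpolation.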

Comments Accepted to NIME 2026. Project page: https://anniejchu.github.io/fxplorer/

详情
AI中文摘要

音频效果(FX)在当代音乐实践中塑造声音。然而,大多数界面将它们呈现为离散模块和参数,这有利于针对性调整而非探索性聆听。这种分离使得难以建立关于可能变换的更广阔空间的直觉,也难以在搜索和精调之间流畅移动。我们提出FXplorer,一个将音频效果组织在感知信息丰富的二维空间中的界面,允许将声音变换作为连续景观而非孤立预设进行浏览。通过结合既定的空间交互方法和可解释的DAW风格控制,以及基于嵌入的相似性和语义搜索的机器学习方法,该系统将探索和参数精调整合到单个工作空间中。FXplorer通过允许用户交互式编辑和插值效果预设,支持作曲、制作或表演。

英文摘要

Audio effects (FX) shape sound in contemporary music practice. However, most interfaces present them as discrete modules and parameters that favor targeted adjustment over exploratory listening. This separation can make it difficult to build intuition about the broader space of possible transformations or to move fluidly between searching and refinement. We present FXplorer, an interface that organizes audio effects within a perceptually informed 2D space, allowing sound transformations to be browsed as a continuous landscape rather than as isolated presets. By combining established spatial interaction approaches and interpretable DAW-style controls with recent embedding-based machine learning methods for similarity and semantic search, the system brings exploration and parameter refinement into a single workspace. FXplorer supports composition, production, or performance by allowing users to edit and interpolate between effect presets interactively.

2606.09266 2026-06-09 cs.SD cs.AI 新提交

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

物理引导的序列生成框架用于声学超材料逆向设计

Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang

发表机构 * National University of Singapore(新加坡国立大学) UT Austin(德克萨斯大学奥斯汀分校)

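AI summary: Proposes MetaSeq, a framework that represents acoustic metamaterials as structured sequences and combines a sequence-to-sequence model with a physics-based solver and reinforcement learning for broadband inverse design, reducing response error by 45% over the best baseline.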

详情
AI中文摘要

声学超材料(AMM)逆向设计对于宽带目标响应尤其具有挑战性,原因是声学色散:在一个频率上匹配期望响应的结构可能在其它频率上偏离,而修改几何以改善一个子带通常会扰动相邻子带。然而,现有的宽带逆向设计方法要么受限于预定义模板,要么依赖于无法保持声学结构所需的几何精度和结构连通性的图像表示。我们提出了MetaSeq,一个物理引导的、基于序列的生成框架,用于声学超材料逆向设计。其核心是,MetaSeq引入了一种语言,将每个AMM表示为结构化序列,而不是像素网格或固定模板。这种表示保留了精确的几何形状,显式编码了连通性,并将逆向设计转化为从目标响应到结构序列的序列到序列任务。MetaSeq进一步构建了一个平衡、高保真的数据集,具有高效的校准和基于复杂度的采样。为了解决逆向设计的一对多性质,MetaSeq结合了监督预训练和基于物理求解器及有效性检查器引导的强化学习微调。针对COMSOL和五个基线的广泛评估表明,MetaSeq在最佳基线基础上将响应误差降低了45%。

英文摘要

Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.

2606.09271 2026-06-09 cs.SD cs.LG new submission

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

基于上下文引导跨模态注意力的多视角语音表示学习用于帕金森病检测

George Theodosiou, Loukas Ilias, Dimitris Askounis

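Affiliations * National Technical University of Athens

AI summary Proposes a multi-branch deep learning framework that fuses three complementary speech modalities (Log-Mel spectrograms, MFCCs, and HuBERT embeddings) and uses a context-guided cross-modal attention mechanism to weight the temporal HuBERT embeddings dynamically, reaching 91.51% accuracy and 95.97% AUC on the PC-GITA corpus and supporting the value of heterogeneous speech modeling for Parkinson's disease detection.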

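Details
AI abstract (translated from the Chinese)

Parkinson's disease (PD) is a progressive neurodegenerative disorder that often causes speech impairments associated with hypokinetic dysarthria. Because speech production depends on the precise coordination of complex neuromuscular mechanisms, speech analysis has become a promising non-invasive, cost-effective biomarker for early PD detection. Recent deep learning methods show encouraging results; however, most existing methods rely on a single speech representation and may miss complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is split into 5-second segments and represented with three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from the raw waveform. The spectrograms are processed by a pre-trained ResNet-18 encoder, the MFCC sequences are modeled by a BiLSTM network, and the raw speech is encoded by a pre-trained HuBERT model. To integrate these heterogeneous representations effectively, we introduce a context-guided cross-modal attention mechanism that dynamically weights the temporal HuBERT embeddings according to the global acoustic context from the spectrogram and MFCC branches. Experiments on the public Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed method. The proposed architecture achieves 91.51% accuracy, a 91.24% F1-score, and 95.97% AUC. In addition, ablation studies confirm the contribution of both the context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.

English abstract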

Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.
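
The context-guided weighting described in this abstract can be pictured with a small sketch. It is only an illustration under assumptions: the abstract does not give the scoring function, so the example assumes scaled dot-product scores, a context vector formed by averaging hypothetical spectrogram-branch and MFCC-branch summary vectors, and a shared embedding size. The names `context_guided_pool`, `spec_vec`, and `mfcc_vec` are invented for the example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_guided_pool(hubert_frames, spec_vec, mfcc_vec):
    """Weight temporal frames by their agreement with a global context vector.

    Assumption: the context is the mean of the two branch vectors and the
    score is a scaled dot product; the paper may use a different form.
    """
    dim = len(spec_vec)
    context = [(s + m) / 2.0 for s, m in zip(spec_vec, mfcc_vec)]
    scores = [
        sum(h * c for h, c in zip(frame, context)) / math.sqrt(dim)
        for frame in hubert_frames
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, hubert_frames))
        for i in range(dim)
    ]
    return weights, pooled


# Three hypothetical 2-dimensional frame embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = context_guided_pool(frames, spec_vec=[2.0, 0.0], mfcc_vec=[2.0, 0.0])
```

Frames that align with the context vector receive larger weights, so the pooled HuBERT representation emphasizes them.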

2606.09780 2026-06-09 cs.SD cs.NE new submission

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

声音生成中的质量-多样性搜索:用于音频探索的创新引擎研究

Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani, Kyrre Glette

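Affiliations * University of Oslo

AI summary Combines quality-diversity algorithms with a supervised discriminative model, generates diverse synthetic sounds with multi-band CPPNs and DSP graphs, and analyzes evolutionary paths and temporal niches, showing the potential of Innovation Engines for sound discovery.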

Comments This is an extended version of the previously published conference paper "Towards Sound Innovation Engines Using Pattern-Producing Networks and Audio Graphs": https://doi.org/10.1007/978-3-031-56992-0_14

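Details
AI abstract (translated from the Chinese)

This study addresses the challenges that composers and sound designers face when creating and refining tools to achieve their musical goals. By using evolutionary processes to promote diversity and foster serendipitous discovery, we automate the search for sounds in uncharted sonic spaces, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and the practical accessibility of sounds. We describe a generative sound synthesis system that combines Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and we explore different configurations as well as the interplay between the chosen synthesis method and the discriminative model. We study the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs and introduce a novel approach that uses multiple CPPNs, each specialised for a different frequency range; this produces simpler networks while keeping performance comparable to the single-CPPN setup. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages cross seemingly unlikely paths to reach the current elites. Extending the behaviour space of a previous study to include various sound durations, we find specialisation within temporal niches. The results show that CPPNs and DSP graphs, combined with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier, can generate a large variety of synthetic sounds that are diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and rendered sound files and, in the context of music composition, an experimental application that demonstrates their creative potential across durations and contexts.

English abstract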

This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for sound discovery, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and practical accessibility of sounds. We describe a system for generative sound synthesis combining Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and explore different configurations and the interplay between the chosen synthesis approach and the discriminative model. We examine the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, introducing a novel approach that uses multiple specialised CPPNs for different frequency ranges; this yields simpler networks while maintaining performance comparable to single-CPPN setups. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages traverse unlikely paths to current elites. Expanding the behaviour space of a previous study to include various sound durations, we uncover specialisation within temporal niches. Results indicate that CPPN and DSP graphs coupled with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier can generate a substantial variety of synthetic sounds, diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and as rendered sound files, and, in the context of music composition, an experimental application that showcases their creative potential across various durations and contexts.
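
A minimal MAP-Elites loop, the archive scheme named in this abstract, might look like the sketch below. The `evaluate` function is a toy stand-in (an assumption) for the real pipeline, in which a CPPN and DSP graph render a sound and a classifier assigns it a class and a confidence; here the niche and quality come from two numbers in a hypothetical genome.

```python
import random


def evaluate(genome):
    """Toy stand-in for synthesis plus classifier: returns (niche, quality)."""
    x, y = genome
    niche = min(int(x * 5), 4)        # five niches derived from the first gene
    quality = 1.0 - abs(y - 0.5)      # quality peaks when the second gene is 0.5
    return niche, quality


def mutate(genome, rng, sigma=0.1):
    """Gaussian mutation, clipped to the unit interval."""
    return tuple(min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in genome)


def map_elites(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}  # niche -> (quality, genome); one elite per niche
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Usually vary an existing elite chosen uniformly at random.
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            # Occasionally inject a fresh random genome.
            child = (rng.random(), rng.random())
        niche, quality = evaluate(child)
        # Keep the child only if its niche is empty or it beats the elite.
        if niche not in archive or quality > archive[niche][0]:
            archive[niche] = (quality, child)
    return archive


archive = map_elites()
```

Because a child competes only within its own niche, an elite from one niche can seed improvements in another, which is the stepping-stone effect the abstract analyses.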

2606.08385 2026-06-09 eess.SP cs.IT cs.SD cs.SY eess.SY math.IT stat.ML cross-list

A Switching Beamformer for Highly Non-Stationary Environments

一种适用于高度非平稳环境的切换波束形成器

Manan Mittal, Ryan M. Corey, John R. Buck, Andrew C. Singer

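Affiliations * Electrical and Computer Engineering, Stony Brook University; Electrical and Computer Engineering, University of Illinois Chicago; Electrical and Computer Engineering, University of Massachusetts Dartmouth; College of Applied Science and Engineering, Stony Brook University

AI summary To address the degradation of adaptive beamforming under complex, rapidly changing interference, proposes the Universal Switching Beamformer (USB), which adjusts its effective memory length dynamically through competitive sequential prediction and a linear transition diagram; a regret upper bound is proven, and experiments confirm that it combines the agility of short windows with the precision of long windows.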

Comments 11 pages, 19 figures, under review

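Details
AI abstract (translated from the Chinese)

Adaptive beamforming is a cornerstone of array signal processing, but its performance often collapses under complex, rapidly changing interference. When interferers appear or move unpredictably, conventional estimators face a fundamental memory trade-off: short windows track quickly but have high estimation variance, while long windows give stable rejection but cannot adapt to change. The Universal Switching Beamformer (USB) resolves this challenge by bringing competitive sequential prediction into the beamforming architecture. Using a linear transition diagram, the USB implicitly maintains an exponentially large family of candidate covariance histories and dynamically re-weights them according to their cumulative output power. This mechanism lets the beamformer change its effective memory length automatically, without explicit change detection or heuristic parameter tuning. A theoretical upper bound is proven on the regret relative to an omniscient oracle that chooses the best piecewise-stationary covariance model in hindsight. Extensive simulations and experiments on the SwellEx-96 dataset show that the USB achieves the agility of short-window estimators and the precision of long-term integration, offering a principled solution for tracking highly non-stationary scenes.

English abstract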

Adaptive beamforming is a cornerstone of array signal processing, yet its performance often collapses in the face of complex, rapidly changing interference. When interferers appear or move unpredictably, conventional estimators encounter a fundamental memory trade-off: short windows enable rapid tracking but suffer from high estimation variance, while long windows provide stable rejection but fail to adapt to shifts. This challenge is resolved by introducing the Universal Switching Beamformer (USB), which integrates competitive sequential prediction into the beamforming architecture. By employing a linear transition diagram, the USB implicitly maintains an exponentially large family of candidate covariance histories and dynamically re-weights them based on their cumulative output power. This mechanism allows the beamformer to automatically vary its effective memory length without explicit change detection or heuristic parameter tuning. A theoretical upper bound is proven on the regret relative to an omniscient oracle that selects the best piecewise-stationary covariance model in hindsight. Extensive simulations and experiments on the SwellEx-96 dataset demonstrate that the USB achieves the agility of short-window estimators and the precision of long-term integration, providing a principled solution for tracking highly non-stationary scenes.
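
The idea of re-weighting competing estimators by their cumulative loss can be sketched with fixed-share exponential weighting, a standard scheme from competitive sequential prediction. This is a stand-in, not the USB's linear transition diagram: the two experts, their per-step losses (meant to suggest output power from a short and a long covariance window), and the parameters `eta` and `alpha` are all assumptions chosen for illustration.

```python
import math


def fixed_share_update(weights, losses, eta=1.0, alpha=0.05):
    """One round of fixed-share exponential weighting.

    Each expert is penalised by exp(-eta * loss); a fraction alpha of the
    mass is then shared uniformly so that a currently poor expert can
    recover quickly after the environment switches.
    """
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    n = len(weights)
    return [(1.0 - alpha) * p + alpha / n for p in posterior]


# Expert 0: short window, expert 1: long window (both hypothetical).
weights = [0.5, 0.5]

# Stationary phase: the long window yields lower output power.
for _ in range(20):
    weights = fixed_share_update(weights, losses=[1.0, 0.2])
stationary = list(weights)

# After an abrupt change: the short window yields lower output power.
for _ in range(20):
    weights = fixed_share_update(weights, losses=[0.2, 1.0])
switched = list(weights)
```

In this toy run the mixture favours the long window while the scene is stationary and moves to the short window after the change, without any explicit change detector.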

2509.02167 2026-06-09 cs.SD version update

AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition

AudioRWKV:用于音频模式识别的高效稳定双向RWKV

Jing Wang, Maoxiang Wu, Jiayu Xiong, Jianlong Kwan, Jun Xue

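Affiliations * arXiv

AI summary Proposes the AudioRWKV architecture, which uses a 2D depthwise separable convolution and a bidirectional WKV kernel to model global context while keeping linear complexity, addressing the high complexity of Transformers and the instability of Mamba.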

Comments 6 pages, 3 figures

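Details
AI abstract (translated from the Chinese)

Recently, Transformers (e.g., the Audio Spectrogram Transformer, AST) and state-space models (e.g., Audio Mamba, AuM) have made remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when parameters and data are scaled up. To address these challenges, this paper proposes AudioRWKV (A-RWKV), an efficient and stable architecture for audio modeling. Specifically, we inherit the stable and efficient recurrent formulation of RWKV7 and replace its 1D token-shift operation with a 2D depthwise separable convolution to better capture local spectro-temporal patterns. In addition, we adapt the original causal WKV kernel into a bidirectional WKV kernel (Bi-WKV), which enables global context modeling over the whole audio sequence while keeping linear computational complexity. Thanks to the inherent stability of the RWKV7 foundation, A-RWKV scales seamlessly to larger model sizes. Experimental results show that, under the same linear-model regime, A-RWKV-S (22M) matches the performance of AuM-B (92M) while showing more stable throughput than AST; for long audio (about 5 minutes 28 seconds), WKV7 achieves a speedup of up to 13.3x in processing.

English abstract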

Recently, Transformers (e.g., Audio Spectrogram Transformers, AST) and state-space models (e.g., Audio Mamba, AuM) have achieved remarkable progress in audio modeling. However, the O(L^2) computational complexity of the Transformer architecture hinders efficient long-sequence processing, while the Mamba architecture tends to become unstable when scaling parameters and data. To address these challenges, this paper proposes AudioRWKV (A-RWKV), a highly efficient and stable architecture for audio modeling. Specifically, we inherit the stable and efficient recurrent formulation of RWKV7 and replace its 1D token-shift operation with a 2D depthwise separable convolution to better capture local spectro-temporal patterns. Furthermore, we adapt the original causal WKV kernel into a bidirectional WKV kernel (Bi-WKV), enabling global context modeling over the entire audio sequence while maintaining linear computational complexity. Benefiting from the inherent stability of the RWKV7 foundation, A-RWKV scales seamlessly to larger model sizes. Experimental results demonstrate that, under the same linear-model regime, A-RWKV-S (22M) achieves performance parity with AuM-B (92M) while exhibiting more stable throughput than AST; for long-form audio (~5 minutes 28 seconds), WKV7 achieves up to a 13.3X speedup in processing.
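
To see why a bidirectional kernel can still run in linear time, consider a deliberately simplified scalar analogue: a sum over all positions weighted by a decay in the distance, computed with one forward and one backward scan. This is not the WKV7 or Bi-WKV kernel itself, whose exact form the abstract does not give; the function names and the fixed scalar `decay` are assumptions.

```python
def bidirectional_decay_sum(values, decay):
    """O(L) computation of y[t] = sum_s decay**|t - s| * values[s]."""
    n = len(values)
    forward = [0.0] * n
    acc = 0.0
    for t in range(n):                 # causal scan: covers positions s <= t
        acc = decay * acc + values[t]
        forward[t] = acc
    backward = [0.0] * n
    acc = 0.0
    for t in range(n - 1, -1, -1):     # anti-causal scan: covers positions s >= t
        acc = decay * acc + values[t]
        backward[t] = acc
    # Both scans include position t itself, so subtract it once.
    return [f + b - v for f, b, v in zip(forward, backward, values)]


def naive_decay_sum(values, decay):
    """O(L^2) reference that visits every pair of positions."""
    n = len(values)
    return [
        sum(decay ** abs(t - s) * values[s] for s in range(n))
        for t in range(n)
    ]


values = [1.0, 2.0, 3.0, 4.0]
fast = bidirectional_decay_sum(values, 0.5)
slow = naive_decay_sum(values, 0.5)
```

The two-scan version gives every position access to the whole sequence, yet its cost grows linearly with length, in contrast to the pairwise reference.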

2602.05027 2026-06-09 cs.SD cs.AI version update

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

AudioSAE:利用稀疏自编码器理解音频处理模型

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

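Affiliations * Huawei Noah’s Ark Lab

AI summary Trains sparse autoencoders (SAEs) on the encoder layers of Whisper and HuBERT, evaluates their stability and interpretability, and demonstrates their practical value for feature disentanglement, concept erasure, reducing false speech detections, and alignment with human EEG activity.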

Comments Accepted to EACL 2026, main track

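Details
Journal ref
Proceedings of EACL 2026, pages 3221-3254
AI abstract (translated from the Chinese)

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, but their application to audio remains underexplored. We train SAEs on all encoder layers of Whisper and HuBERT, evaluate their stability and interpretability extensively, and demonstrate their practical utility. More than 50% of the features stay consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noise and paralinguistic sounds (such as laughter and whispering), and disentangle them effectively: erasing a concept requires removing only 19-27% of the features. Feature steering reduces Whisper's false speech detections by 70% with a negligible increase in word error rate (WER), demonstrating practical value. Finally, we find that SAE features correlate with human EEG activity during speech perception, indicating alignment with human neural processing. Code and checkpoints are available at https://github.com/audiosae/audiosae_demo.

English abstract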

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
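
The concept-erasure experiment can be pictured as zeroing selected SAE features before decoding. The sketch below assumes a top-k ReLU autoencoder with tiny hand-written weights; the abstract does not state the SAE variant, and `encode`, `decode`, `ablate`, and all weights are hypothetical.

```python
def encode(x, w_enc, b_enc, k):
    """ReLU encoder that keeps only the k largest activations (top-k sparsity)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    acts = [max(0.0, p) for p in pre]
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def decode(z, w_dec):
    """Linear decoder: w_dec[j] is the dictionary direction of feature j."""
    dim = len(w_dec[0])
    return [sum(z[j] * w_dec[j][i] for j in range(len(z))) for i in range(dim)]


def ablate(z, feature_ids):
    """Zero the chosen features, as in a concept-erasure intervention."""
    return [0.0 if j in feature_ids else a for j, a in enumerate(z)]


# Toy weights: three features over a 2-dimensional activation (untrained).
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -0.5]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

x = [2.0, 1.0]
z = encode(x, w_enc, b_enc, k=2)          # sparse code with two active features
erased = decode(ablate(z, {2}), w_dec)    # reconstruction without feature 2
```

Steering works the same way in this picture, except that the chosen features are scaled or shifted instead of being set to zero before decoding.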

2507.02606 2026-06-09 cs.SD cs.AI cs.CR cs.LG eess.AS version update

De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

De-AntiFake:重新思考对抗语音克隆攻击的保护扰动

Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu

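Affiliations * University of Science and Technology of China

AI summary Proposes a two-stage purification method that purifies perturbed speech and then refines it with phoneme guidance; experiments show that it outperforms existing purification methods at defeating protective perturbations against voice cloning, exposing the limits of such defenses.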

Comments Accepted by ICML 2025

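Details
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267, 2025
AI abstract (translated from the Chinese)

With the rapid development of speech generation models, the privacy and security concerns raised by voice cloning (VC) have become increasingly prominent. Recent studies have tried to stop unauthorized voice cloning by introducing adversarial perturbations, but determined attackers can mitigate these protective perturbations and carry out VC successfully. This paper presents the first systematic evaluation of the effectiveness of these protective perturbations under realistic threat models that include perturbation purification. We find that although existing purification methods can neutralize a large share of the protective perturbations, they still distort the feature space of VC models, which degrades VC performance. We therefore propose a new two-stage purification method: (1) purify the perturbed speech; (2) refine it with phoneme guidance so that it matches the clean speech distribution. Experimental results show that our method outperforms existing methods at defeating VC defenses. This study reveals the limitations of VC defenses based on adversarial perturbations and stresses the need for more robust solutions to mitigate the security and privacy risks posed by VC. Code and audio samples are available at https://de-antifake.github.io.

English abstract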

The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.