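arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.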

2606.07608 2026-06-09 cs.CL cs.AI cs.LG cs.SD cross-list

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

Felix Akeret

Affiliations * Independent Researcher, Zurich, Switzerland; ETH Zürich; University of Bern; FHNW; CeTIM Leiden/Munich

AI summary Systematically fine-tunes Whisper large-v3 for Swiss German speech recognition using weak supervision from 1,367 hours of broadcast speech paired with Standard German subtitles, finds that published results are inflated by benchmark contamination, and releases two honestly evaluated models.

Comments 15 pages, 21 tables. Models available at https://huggingface.co/Flix-AI

Abstract

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
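The gap between measured WER and content WER is easier to see with a concrete computation. The following is a minimal sketch of plain word error rate (word-level edit distance divided by reference length); it is not the paper's harmonized cWER procedure, which the abstract does not specify. The function name and the example sentences are illustrative assumptions.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the hypothesis uses a different tense than the
# subtitle-style reference, so plain WER counts two edits out of six words
# even though the meaning is preserved.
example_wer = word_error_rate(
    "er ist gestern nach hause gegangen",
    "er ging gestern nach hause",
)

# Ratio of the bias-corrected estimate to the measured WER, using the
# figures reported in the abstract (8.5% and 25.6%).
corrected_to_measured = 8.5 / 25.6
```

A tense change like the one in the example is the kind of stylistic variation the abstract says cWER excludes, which is why plain WER can overstate genuine recognition failures.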


2606.08210 2026-06-09 eess.AS cs.CL cs.SD cross-list

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Rashini Liyanarachchi, Rachael Mackay, Alison Short, Aditya Joshi, Erik Meijering

Affiliations * University of New South Wales; Western Sydney University; Resourced Music Therapy

AI summary To address the high acoustic variability of children's speech and the difficulty of separating pathological stuttering from developmental disfluency, proposes the Paediatric-HGNN framework, which builds a heterogeneous graph capturing hierarchical relations between lexical and acoustic segments, reaching 82.4% weighted accuracy and an F1 score of 0.386 on typical disfluencies on a children's speech corpus.

Comments Accepted at INTERSPEECH 2026 (Main)

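2606.09535 2026-06-09 cs.CL cs.SD cross-list

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

Affiliations * Sony Research India

AI summary To address Whisper's high word error rate on Dravidian languages, linguistic and dataset analysis identifies lexical sparsity and character-level substitution errors; two decoder enhancements, Weighted-Attention and Self-Conditioning, are proposed and consistently reduce WER for low-resource and agglutinative languages.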

Comments Accepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tables

Abstract

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.
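One way to picture the Weighted-Attention idea is a learned gate that mixes the decoder's self-attention and cross-attention outputs. The sketch below is illustrative only: the abstract does not give the formulation, so the sigmoid gate, the function names, and the parameters `gate_weights` and `gate_bias` are all assumptions rather than the authors' method.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def weighted_attention_mix(
    self_attn_out: Sequence[float],
    cross_attn_out: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[float]:
    """Blend two attention outputs with a single scalar gate (assumed design).

    The gate is computed from the concatenation of both outputs, so the mix
    can adapt per decoding step. A gate near 1 favors self-attention
    (linguistic context); a gate near 0 favors cross-attention (acoustics).
    """
    if len(self_attn_out) != len(cross_attn_out):
        raise ValueError("attention outputs must have the same dimension")
    if len(gate_weights) != 2 * len(self_attn_out):
        raise ValueError("gate_weights must cover the concatenated outputs")

    concatenated = list(self_attn_out) + list(cross_attn_out)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concatenated)) + gate_bias)
    return [gate * s + (1.0 - gate) * c for s, c in zip(self_attn_out, cross_attn_out)]
```

With all-zero gate weights and zero bias the gate is 0.5, so the result is the plain average of the two sources; training would move the gate away from that neutral point.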


2604.24278 2026-06-09 cs.SD cs.AI updated version

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

Wenbin Huang, Yuhang Qiu, Bohan Li, Yiwei Guo, Jing Peng, Hankun Wang, Xie Chen, Kai Yu

Affiliations * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing, China

AI summary Proposes RAS, a reliability-oriented metric for evaluating how reliably automatic speech recognition systems transcribe uncertain segments; an abstention-aware (back-off) transcription framework with parameters calibrated to human preferences improves transcription reliability while maintaining accuracy.

Comments 5 pages, 4 figures; Accepted at InterSpeech 2026

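2606.08843 2026-06-09 cs.SD cs.LG new submission

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

Moshe Mandel, Shlomo E. Chazan

Affiliations * Independent, Israel; OriginAI, Israel

AI summary Proposes aligning non-parallel speech through k-nearest-neighbor retrieval over WavLM representations to build synthetic training pairs, combined with a speaker loss, for zero-shot voice conversion; trained on English data only, the method performs well across languages.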

2606.09019 2026-06-09 cs.SD cs.AI new submission

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim

Affiliations * Sungkyunkwan University; University of Seoul

AI summary Proposes the TLDR framework, which shifts causal modeling from the token level to the patch level using a lightweight compressor and a frozen pretrained backbone adapted with LoRA, achieving a 1.8x inference speedup and up to 75% KV-cache reduction.

Abstract

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
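The "up to 75%" figure follows directly from the patch size if the global KV cache grows linearly with the number of positions the backbone processes. The sketch below works through that arithmetic under this assumption; it ignores text-prompt tokens and any local state in the compressor or extractor, and the function names are hypothetical.

```python
import math


def patch_sequence_length(num_tokens: int, patch_size: int) -> int:
    """Number of patches needed to cover num_tokens codec tokens."""
    if num_tokens <= 0 or patch_size <= 0:
        raise ValueError("num_tokens and patch_size must be positive")
    return math.ceil(num_tokens / patch_size)


def kv_cache_reduction(num_tokens: int, patch_size: int) -> float:
    """Fraction of global KV-cache entries saved by patch-level modeling.

    Assumes one cache entry per backbone position, so the cache shrinks
    from num_tokens entries to ceil(num_tokens / patch_size) entries.
    """
    return 1.0 - patch_sequence_length(num_tokens, patch_size) / num_tokens


# With a patch size of 4, a 1000-token utterance needs 250 patch positions.
reduction_exact = kv_cache_reduction(1000, 4)    # 0.75
# A length that is not a multiple of 4 leaves a partly filled last patch,
# so the saving is slightly below 75%.
reduction_ragged = kv_cache_reduction(1001, 4)
```

The partly filled final patch is one reason the reduction is an upper bound rather than a fixed value.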


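2606.09234 2026-06-09 cs.SD cs.AI new submission

Evaluating the Energy Consumption and Carbon Emissions of Neural Speaker Verification Models During Training and Inference

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

Affiliations * LIA, UPR 4128; Aday; Avignon University

AI summary Measures the energy consumption and carbon emissions of different ResNet architectures on VoxCeleb2, finding that deeper or wider models bring marginal accuracy gains at sharply higher energy cost, while mid-sized networks such as ResNet-50 offer a good balance between performance and environmental impact.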

Comments Accepted to Speaker Odyssey 2026 Lisbon

2606.08505 2026-06-09 eess.AS cs.SD cross-list

Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines

Fumiaki Yamaguchi

Affiliations * University of Tokyo

AI summary To reduce the inference cost of on-device speaker diarization, proposes a relative minimum cluster size (mcs = round(f * n), f = 0.01) that adapts to the embedding budget; it keeps DER on AMI unchanged, restores VoxConverse DER from 0.113 to 0.079, and reaches up to a 12.2x speedup.

Abstract

Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe, coarser segmentation stride and per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI, but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.
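The rule mcs = round(f * n) and the "about 89%" recovery figure can both be checked with a few lines of arithmetic. This is a minimal sketch, not the paper's pipeline code: the function names are hypothetical, and the lower bound `floor=1` is an added assumption so that very short recordings never get a cluster size of zero.

```python
def relative_min_cluster_size(num_embeddings: int, fraction: float = 0.01, floor: int = 1) -> int:
    """Minimum cluster size that scales with the recording's embedding count.

    Implements mcs = round(fraction * num_embeddings). Python's round() uses
    round-half-to-even, which may differ from other rounding conventions at
    exact .5 values. The floor argument is an assumption of this sketch.
    """
    if num_embeddings < 0:
        raise ValueError("num_embeddings must be non-negative")
    return max(floor, round(fraction * num_embeddings))


def recovered_fraction(baseline: float, degraded: float, restored: float) -> float:
    """Share of the lost accuracy that is won back (all values are DER)."""
    return (degraded - restored) / (degraded - baseline)


# A long recording with many embeddings gets a larger threshold than a
# short, stride-accelerated one with few embeddings.
mcs_long = relative_min_cluster_size(1200)   # round(12.0) -> 12
mcs_short = relative_min_cluster_size(40)    # round(0.4) -> 0, lifted to 1

# VoxConverse figures from the abstract: 0.075 -> 0.113 -> 0.079.
voxconverse_recovery = recovered_fraction(0.075, 0.113, 0.079)
```

Because the threshold shrinks with the number of embeddings, a speaker with only a handful of embeddings is less likely to be discarded as noise, which matches the under-counting explanation in the abstract.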


2602.15519 2026-06-09 eess.AS cs.SD updated version

Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

Kevin Wilkinghoff, Gordon Wichern, Jonathan Le Roux, Zheng-Hua Tan

Affiliations * Department of Electronic Systems, Aalborg University; Pioneer Centre for Artificial Intelligence; Mitsubishi Electric Research Laboratories (MERL)

AI summary Addresses the sensitivity of local density score normalization to neighborhood size in anomalous sound detection by proposing a cluster exit detection mechanism that adaptively selects the neighborhood size by identifying distance discontinuities, improving robustness and performance.

2606.08722 2026-06-09 cs.SD cs.CL new submission

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rodà

Affiliations * University of Padova; Universitat Pompeu Fabra

AI summary Proposes LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and understanding in open-weight LLMs; experiments show that executable LilyPond can be generated zero-shot, that structural understanding tasks remain challenging, and that metrics disagree systematically.

Comments Accepted at Ital-IA 2026

Abstract

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
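Jensen-Shannon similarity between two descriptor distributions can be sketched as follows. The abstract does not say how LilyBench turns the divergence into a similarity, so the choice here (1 minus the base-2 Jensen-Shannon divergence, which lies in [0, 1]) is an assumption, as are the function names and the histogram-normalization helper.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn non-negative histogram counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [c / total for c in counts]


def kl_divergence_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """1 - JSD(p, q), where JSD uses log base 2 and so lies in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence_bits(p, m) + 0.5 * kl_divergence_bits(q, m)
    return 1.0 - jsd


# Hypothetical pitch-class histograms for generated and reference pieces.
generated = normalize([12, 0, 7, 3, 9, 5, 1, 10, 2, 6, 4, 8])
reference = normalize([10, 1, 8, 2, 9, 6, 1, 11, 2, 5, 3, 9])
similarity = js_similarity(generated, reference)
```

A descriptor-based score like this compares hand-defined statistics, whereas an embedding-based score such as FMD compares learned features, which is one plausible reason the two can rank models differently.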


2312.15946 2026-06-09 cs.SD cs.GR eess.AS updated version

EnchantDance: Unveiling the Potential of Music-Driven Dance Movement

Bo Han, Teng Zhang, Zeyu Ling, Feilin Han

Affiliations * Zhejiang University; Tongji University

AI summary Proposes the EnchantDance framework, which builds a dance latent space and a diffusion model and combines the large-scale ChoreoSpectrum3D dataset with a music genre prediction network to improve the quality, diversity, and consistency of generated dance.

Comments Project Page: https://fluide1022.github.io/EnchantDance/

Abstract

The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements; 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music; and 3) the protracted nature of dance movements, which makes it difficult to maintain a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.


2605.03395 2026-06-09 cs.SD cs.AI cs.LG cs.MM updated version

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Jaavid Aktar Husain, Dorien Herremans

Affiliations * AMAAI Lab, Singapore University of Technology and Design

AI summary Proposes the APEX framework, which uses MERT audio embeddings to jointly predict popularity metrics and five aesthetic quality dimensions for AI-generated music, and shows on the Music Arena dataset that aesthetic features generalize for preference prediction.

Abstract

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. A key yet unexplored factor in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio. It jointly predicts engagement-based popularity signals (streams and likes scores) alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.


2606.08425 2026-06-09 cs.SD cs.CL eess.AS new submission

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

Vinh-Thuan Ly

Affiliations * University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary Proposes TinyGiantALM, a compact 1.5B-parameter audio-language model whose instruction-aware feature refinement framework (a query-guided projector plus semantic gating) filters acoustic signals relevant to user intent; it reaches 46.4% zero-shot accuracy on the MMAR benchmark, surpassing 7B-13B baselines and outperforming models 8x larger.

Comments Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app

2606.07547 2026-06-09 cs.CL cs.AI cs.SD cross-list

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation Across Speech, Music, and Sound

Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Xingjian Du, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

Affiliations * The Hong Kong University of Science and Technology; Inner Mongolia University; Beihang University; Queen Mary University of London; The Chinese University of Hong Kong; National University of Singapore; University of Surrey; University of Rochester; Independent Researcher

AI summary Proposes the Audio-FLAN dataset, with 80 tasks and 100 million instances, supporting zero-shot learning for unified audio understanding and generation.

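A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: Multi-Corpus Training and Cross-Linguistic Analysis

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

Affiliations * Avignon Universite; EURECOM

AI summary Through multi-corpus training and cross-linguistic analysis, compares four self-supervised learning feature extractors paired with four back-end classifiers for spoofing detection, exposes a domain bias in the ASVspoof 5 dataset, and finds that fine-tuning on just 8 hours of target-language data improves detection robustness.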

Abstract

Voice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self-supervised learning feature extractors paired with four back-end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph-based back-ends. Through multi-corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross-linguistic analysis reveals that fine-tuning with just 8 hours of target-language data enhances detection robustness. Together, these findings emphasize the critical need for domain-aware and language-specific adaptation in spoofing detection.


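2606.08678 2026-06-09 cs.SD cs.LG new submission

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

Affiliations * Avignon Universite; EURECOM

AI summary To address poor generalization caused by speaker bias in spoofing detection, proposes a teacher-student framework that disentangles identity information using a gradient reversal layer and a variational information bottleneck, achieving a 25.7% relative EER reduction across nine datasets.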

Abstract

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.
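A gradient reversal layer is simple to state: it passes features through unchanged on the forward pass and flips the sign of the gradient on the backward pass, so the feature extractor is pushed to make the identity branch's task harder. The sketch below shows only that mechanism with explicit forward and backward methods in plain Python; it is not the authors' implementation, and the class name and the `strength` parameter are illustrative assumptions.

```python
from typing import List, Sequence


class GradientReversal:
    """Identity on the forward pass, sign-flipped gradient on the backward pass."""

    def __init__(self, strength: float = 1.0) -> None:
        # strength scales the reversed gradient; 0 disables the adversarial signal.
        self.strength = strength

    def forward(self, features: Sequence[float]) -> List[float]:
        # Features reach the speaker-identity branch unchanged.
        return list(features)

    def backward(self, upstream_grad: Sequence[float]) -> List[float]:
        # The gradient flowing back to the student encoder is negated, so
        # lowering the identity loss downstream raises it upstream.
        return [-self.strength * g for g in upstream_grad]


layer = GradientReversal(strength=0.5)
forward_out = layer.forward([0.2, -1.3, 4.0])
backward_out = layer.backward([2.0, -4.0, 0.0])
```

In a real training setup this would be written as a custom autograd function in a deep learning framework; the explicit methods here only make the sign flip visible.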


2501.08238 2026-06-09 cs.SD eess.AS updated version

CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech

Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

Affiliations * Graduate Institute of Communication Engineering, National Taiwan University; Department of Computer Science and Information Engineering, National Taiwan University; Center for Language and Speech Processing at Johns Hopkins University; Department of Electrical Engineering, National Taiwan University; Research Center for Information Technology Innovation, Academia Sinica; NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)

AI summary To address the emerging challenge of detecting CodecFake deepfake speech, proposes the large-scale CodecFake+ dataset, with training data re-synthesized by 31 open-source codecs and web-sourced data from 17 advanced CoSG models, establishes a codec taxonomy, and validates re-synthesized speech as training data.

Comments Accepted by TASLP 2026

Abstract

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.


2606.07673 2026-06-09 cs.SD cs.AI cs.LG new submission

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

Affiliations * Department of Electronic Engineering, Wonkwang University; AI Convergence Research Institute, Wonkwang University; GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology; School of Electrical Engineering, KAIST; Department of AI Convergence, Gwangju Institute of Science and Technology

AI summary Proposes a hierarchical feature engineering framework with static, dynamic, ratio, and coupling features to distinguish phonotraumatic from non-phonotraumatic vocal hyperfunction, finding coupling features critical for both classes, with an AUC of 0.891 for PVH and 0.728 for NPVH.

Comments Interspeech 2026

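2606.08286 2026-06-09 cs.SD new submission

FXplorer: A Map-Based Interface for Exploratory Audio Effect Design

Annie Chu, Jason Brent Smith, Bryan Pardo

Affiliations * Northwestern University

AI summary Proposes the FXplorer interface, which organizes audio effects in a perceptual 2D space and unifies continuous browsing with parameter refinement through spatial interaction and embedding methods, supporting interactive preset editing and interpolation.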

Comments Accepted to NIME 2026. Project page: https://anniejchu.github.io/fxplorer/

Abstract

Audio effects (FX) shape sound in contemporary music practice. However, most interfaces present them as discrete modules and parameters that favor targeted adjustment over exploratory listening. This separation can make it difficult to build intuition about the broader space of possible transformations or to move fluidly between searching and refinement. We present FXplorer, an interface that organizes audio effects within a perceptually informed 2D space, allowing sound transformations to be browsed as a continuous landscape rather than as isolated presets. By combining established spatial interaction approaches and interpretable DAW-style controls with recent embedding-based machine learning methods for similarity and semantic search, the system brings exploration and parameter refinement into a single workspace. FXplorer supports composition, production, or performance by allowing users to edit and interpolate between effect presets interactively.


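2606.09266 2026-06-09 cs.SD cs.AI new submission

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang

Affiliations * National University of Singapore; UT Austin

AI summary Proposes the MetaSeq framework, which represents acoustic metamaterials as structured sequences and combines a sequence-to-sequence model with a physics solver and reinforcement learning for broadband inverse design, reducing response error by 45% over the best baseline.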

Abstract

Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.


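2606.09271 2026-06-09 cs.SD cs.LG new submission

AudioSAE: Understanding Audio Processing Models with Sparse Autoencoders

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

Affiliations * Huawei Noah’s Ark Lab

AI summary Trains sparse autoencoders (SAEs) on the encoder layers of Whisper and HuBERT, evaluates their stability and interpretability, and demonstrates their practical value for feature disentanglement, concept erasure, reducing false speech detections, and alignment with human EEG activity.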

Comments Accepted to EACL 2026, main track


DOI: 10.18653/v1/2026.eacl-long.149
Journal ref: Proceedings of EACL 2026, pages 3221-3254

Abstract

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g., laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find that SAE features correlate with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
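
As a rough illustration of the technique the abstract refers to, the sketch below shows a generic sparse autoencoder applied to a matrix of encoder activations, plus a simple feature ablation of the kind used for concept erasure. It is a minimal sketch under stated assumptions: the ReLU encoder, the L1 sparsity penalty, the initialization, and all names (`init_sae`, `sae_forward`, `sae_loss`, `ablate_features`) are illustrative choices and are not taken from the paper or its repository.

```python
import numpy as np

def init_sae(d_model, d_hidden, seed=0):
    """Randomly initialise a generic SAE (hypothetical layout, not the paper's)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, scale, size=(d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Encode activations x of shape (n_frames, d_model) and reconstruct them."""
    # ReLU keeps feature activations non-negative; the L1 penalty in sae_loss
    # is what pushes most of them towards zero during training.
    z = np.maximum(0.0, (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"])
    x_hat = z @ params["W_dec"] + params["b_dec"]
    return z, x_hat

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (assumed objective)."""
    z, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon + l1_coeff * sparsity

def ablate_features(params, x, feature_ids):
    """Zero the selected SAE features and decode back to activation space."""
    z, _ = sae_forward(params, x)
    z = z.copy()
    z[:, feature_ids] = 0.0
    return z @ params["W_dec"] + params["b_dec"]
```

In this sketch, `x` stands in for one encoder layer's frame-level activations; erasing a concept would correspond to calling `ablate_features` with the indices of the features associated with that concept and passing the result on to the rest of the model.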

2507.02606 2026-06-09 cs.SD cs.AI cs.CR cs.LG eess.AS Updated version

De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu

Affiliation: University of Science and Technology of China

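AI summary: This paper proposes a two-stage purification method that first purifies speech carrying protective perturbations and then refines it with phoneme guidance; experiments show it outperforms existing purification methods at defeating perturbation-based voice cloning defenses.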

Comments Accepted by ICML 2025

Journal ref: Proceedings of the 42nd International Conference on Machine Learning, PMLR 267, 2025

Abstract

The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.

1. Speech Recognition and Keyword Spotting (4 papers)

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

2. Speech Synthesis and Sound Generation (10 papers)

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

End-to-End Training for Discrete Token LLM based TTS System

BareWave: Waveform-Native Flow-Matching Text-to-Speech

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Universal Speech Content Factorization

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

3. Speaker Recognition, Verification, and Separation (4 papers)

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference

Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines

Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

4. Speech Enhancement, Denoising, and Audio Restoration (5 papers)

G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching

Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

Speech Enhancement Based on Drifting Models

Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

SEMamba++: A General Speech Restoration Framework Leveraging Global, Local, and Periodic Spectral Patterns

5. Audio Event Detection and Scene Understanding (3 papers)

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

Sound Event Detection with Boundary-Aware Optimization and Inference

Mind the Gap: Detecting Cluster Exits for Robust Local Density-Based Score Normalization in Anomalous Sound Detection

6. Music Information Retrieval and Music Generation (3 papers)

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

EnchantDance: Unveiling the Potential of Music-Driven Dance Movement

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

7. Speech Translation and Speech Language Models (2 papers)

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

Liberating LLM Capabilities in Full-Duplex Speech Models

8. Multimodal Audio and Audio-Visual Learning (3 papers)

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

9. Low-Resource, Multilingual, and Dialectal Speech (1 paper)

Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?

10. Datasets, Benchmarks, and Evaluation (6 papers)

Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

BMdataset: A Musicologically Curated LilyPond Dataset

An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

11. Security, Privacy, and Deepfake Audio (3 papers)

A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech

12. Other / General Speech and Audio (9 papers)

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

FXplorer: A Map-Based Interface for Exploratory Audio Effect Design

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

A Switching Beamformer for Highly Non-Stationary Environments

AudioRWKV: Efficient and Stable Bidirectional RWKV for Audio Pattern Recognition

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks