2605.29613 2026-05-29 eess.AS cs.SD Updated version

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha, Yong Man Ro

Affiliations: KAIST (Korea Advanced Institute of Science and Technology); Google DeepMind

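AI summary: This paper systematically evaluates three decoding strategies for ASR based on diffusion language models and proposes a negative-log-likelihood-based uncertainty measure to monitor decoding progress. Threshold-based strategies beat fixed-number decoding in both accuracy and speed, and the static-threshold strategy matches the accuracy of autoregressive decoding while being more efficient.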

Abstract

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
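
As a rough illustration of the static confidence threshold scheme described in the abstract, the following minimal sketch unmasks, in each round, every position whose confidence reaches a fixed threshold. Everything here is an assumption made for illustration and not the authors' implementation: the names `static_threshold_decode` and `toy_predict`, the threshold `tau=0.9`, the fallback that commits the single most confident position when nothing passes the threshold, the toy confidence schedule, and the use of the summed negative log-likelihood of the still-masked positions as the progress proxy.

```python
import math

MASK = None  # marks a position that has not been decoded yet


def static_threshold_decode(predict, length, tau=0.9, max_rounds=50):
    """Commit every masked position whose confidence is >= tau, round by round."""
    tokens = [MASK] * length
    nll_history = []  # summed NLL of the still-masked positions, one value per round
    for round_idx in range(max_rounds):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # predict returns {position: (token, confidence)} for the masked positions.
        preds = predict(tokens, round_idx)
        nll_history.append(sum(-math.log(preds[i][1]) for i in masked))
        accept = [i for i in masked if preds[i][1] >= tau]
        if not accept:
            # Assumed fallback: always commit the most confident position
            # so that every round makes progress.
            accept = [max(masked, key=lambda i: preds[i][1])]
        for i in accept:
            tokens[i] = preds[i][0]
    return tokens, nll_history


# Hypothetical stand-in for a diffusion language model: most positions are
# confident from the start, two "hard" positions gain confidence over rounds.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
BASE_CONF = [0.95, 0.97, 0.60, 0.92, 0.96, 0.40]


def toy_predict(tokens, round_idx):
    return {
        i: (TARGET[i], min(0.99, BASE_CONF[i] + 0.2 * round_idx))
        for i, t in enumerate(tokens)
        if t is MASK
    }


decoded, nll_history = static_threshold_decode(toy_predict, len(TARGET))
```

In this toy run the four easy positions are committed in the first round and the two hard ones in later rounds, which mirrors the abstract's observation that most tokens reach high confidence early. A fixed-number scheme would instead commit the same number of positions per round regardless of confidence.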

2605.00969 2026-05-29 cs.SD cs.AI cs.CL Updated version

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

Affiliations: Centific Global Solutions Inc.; University of Maryland, College Park, MD, USA

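AI summary: To address the scarcity of medical audio data and the shortcomings of existing benchmarks, the authors introduce MedMosaic, a dataset that covers multiple medical audio types and contains 46,701 question-answer pairs for evaluating language and audio reasoning models. Experiments show that reasoning remains challenging.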

Comments Accepted at ICML 2026

Abstract

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short and long clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question answering, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model such as Gemini-2.5-pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here: https://shorturl.at/Lyp33

2603.27667 2026-05-29 cs.SD cs.AI Updated version

AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal Large Language Models with Audio-Visual Cues

Dingkun Zhou, Krish Patel, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli

Affiliations: UC Berkeley; South China University of Technology; Zhejiang University; National Taiwan University; University of Southern California

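AI summary: Proposes the AV-EMO-Reasoning benchmark, which systematically evaluates the emotional reasoning capabilities of omni-modal large language models using synthetic and real-world audio-visual dialogue datasets together with metrics for emotion perception and interactive reasoning.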

2502.20838 2026-05-29 cs.SD cs.AI cs.LG eess.AS Updated version

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Takeshi Ashizawa, Kazuhiro Nakadai

Affiliations: Systems and Control Engineering, School of Engineering, Institute of Science Tokyo, Japan

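AI summary: Proposes DSMIL-LocNet, a framework that uses weakly supervised multiple instance learning to classify and temporally localize whale calls from recording-level labels alone, outperforming fully supervised baselines on long recordings.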

Comments Accepted in European Signal Processing Conference (EUSIPCO) 2026

Abstract

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2–30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88–0.91 on recordings of 300–1800 s, where fully supervised CNN baselines degrade to 0.19–0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
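
The general idea of multiple instance learning with recording-level labels can be sketched as follows: a recording is a "bag" of short windows, the bag is scored from its window scores, and the same window scores are reused for temporal localization. This is only a minimal sketch under stated assumptions. The abstract does not describe the pooling or the dual-stream network, so plain max pooling stands in for whatever DSMIL-LocNet actually uses, and the names `bag_score` and `localize`, the 15-second window, the 0.5 threshold, and the example scores are all hypothetical.

```python
def bag_score(instance_scores):
    """Recording-level presence score, assuming simple MIL max pooling."""
    return max(instance_scores)


def localize(instance_scores, window_s, threshold=0.5):
    """Merge consecutive windows scoring >= threshold into (start_s, end_s) intervals."""
    intervals = []
    start = None  # index of the first window in the currently open interval
    for i, score in enumerate(instance_scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            intervals.append((start * window_s, i * window_s))
            start = None
    if start is not None:
        # The recording ends while an interval is still open.
        intervals.append((start * window_s, len(instance_scores) * window_s))
    return intervals


# Hypothetical per-window scores for a 2-minute recording cut into 15 s windows.
scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.1, 0.7, 0.2]
present = bag_score(scores) >= 0.5
calls = localize(scores, window_s=15)
```

Only the bag-level decision (`present`) needs a training label in this setup; the intervals in `calls` come from the window scores themselves, which is why no frame-level annotation is required.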

2605.29531 2026-05-29 cs.SD cs.CV cs.LG Updated version

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

S. Sutharya, Remya K. Sasi

Affiliations: Department of Computer Science

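AI summary: Proposes CAFNet, a model that jointly detects partially fake audio through ternary classification and boundary regression, reaching 92.71% accuracy and a localization error of 0.075 s on the MLADDC dataset.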

Comments 13 pages, 5 figures, 11 tables

Abstract

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
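
To make the boundary localisation metrics concrete, the sketch below computes a mean and a median absolute error over predicted versus reference boundaries of the synthesised region. It is an illustration only: the abstract does not say whether errors are averaged per boundary or per utterance, so per-boundary averaging is an assumption, the function name `boundary_errors` is hypothetical, and the timestamps are made-up values rather than results from the paper.

```python
from statistics import mean, median


def boundary_errors(predicted, reference):
    """Absolute error in seconds for every start and end boundary."""
    errors = []
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        errors.append(abs(p_start - r_start))
        errors.append(abs(p_end - r_end))
    return errors


# Made-up (start_s, end_s) pairs for two half-truth utterances.
predicted = [(1.00, 2.50), (0.40, 1.10)]
reference = [(1.10, 2.50), (0.50, 1.30)]

errors = boundary_errors(predicted, reference)
mae = mean(errors)             # mean absolute error in seconds
median_error = median(errors)  # median absolute error in seconds
```

Reporting the median alongside the mean, as the abstract does, shows whether a few badly localised utterances dominate the average.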

2605.29300 2026-05-29 cs.CL cs.AI cs.SD Updated version

A Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Xinlu He, Jacob Whitehill

Affiliations: Worcester Polytechnic Institute

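AI summary: This paper systematically surveys end-to-end multi-speaker automatic speech recognition, covering the neural architectural paradigms (SIMO vs. SISO), recent improvements, and strategies for extending to long-form speech, and compares the methods on standard benchmarks.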

Comments Accepted for publication in Computer Speech & Language (CSL)

Abstract

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, along with their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategies and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.
