arXivDaily arXiv daily academic digest, updated Monday to Friday
2605.30339 2026-05-29 cs.CV cs.MM cs.SD eess.AS Updated version

Benchmarking Single-Factor Physical Video-to-Audio Generation

Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu

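Affiliations UC Berkeley; NVIDIA; University of Washington

AI summary Proposes FlatSounds, a benchmark that evaluates the physical reasoning of video-to-audio models through controlled counterfactual pairs and single-video pattern tests. It finds that models rely on text captions rather than the visual stream, and that physical accuracy trades off against temporal alignment.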

Comments CVPR 2026

Abstract

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/

2605.30031 2026-05-29 cs.SD cs.AI cs.CL Updated version

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

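Affiliations National Taiwan University

AI summary Presents a unified taxonomy and a controlled empirical evaluation of audio jailbreak attacks and defenses for large audio-language models. It shows that Acoustic Best-of-N exposes worst-case audio-space vulnerabilities, that Narrative Framing is an effective low-latency semantic threat, and that existing defenses trade robustness against benign usability.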

Comments Submitted to ACL ARR 2026 May

Abstract

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.

2605.29862 2026-05-29 eess.AS cs.AI cs.SD Updated version

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

Heejoon Koo, Yoon Tae Kim, Miika Toikkanen, June-Woo Kim

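Affiliations University of Illinois Urbana-Champaign; RSC LAB; Wonkwang University

AI summary Targets the domain shift that stethoscope device differences cause in respiratory sound classification. Proposes a causality-inspired multimodal federated domain generalization framework that learns device-invariant representations through content-preserving style perturbation, counterfactual text augmentation, and gradient alignment, and that outperforms conventional methods on the ICBHI and SPRSound datasets.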

Comments 2 figures, 4 tables, and 5 pages

Abstract

AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.

2605.29628 2026-05-29 cs.SD cs.AI cs.CL cs.LG eess.AS Updated version

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

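Affiliations School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey

AI summary Proposes the COMET framework, which uses a PLS-SVD decomposition to show that the modality gap in CLAP models is driven mainly by a small number of shared-concept axes. A training-free spectral truncation method then mitigates the gap and brings zero-shot audio captioning close to fully supervised performance.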

Abstract

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component represents only partially the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.

2605.29613 2026-05-29 eess.AS cs.SD Updated version

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha, Yong Man Ro

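Affiliations KAIST; Google DeepMind

AI summary Systematically evaluates three decoding strategies for ASR built on diffusion language models and proposes a negative log-likelihood-based uncertainty measure to monitor decoding progress. Threshold-based strategies beat the fixed-number strategy in both accuracy and speed, and the static-threshold strategy matches autoregressive decoding accuracy at higher efficiency.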

Abstract

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
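The static confidence threshold scheme described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `predict` stands in for a diffusion language model call that returns a (token, confidence) pair for every position, `toy_predict` is a hand-written stub, and the fallback that commits the single most confident slot when nothing clears the threshold is one common way to guarantee progress, not necessarily what the paper does.

```python
MASK = None  # marker for a position that has not been decoded yet


def threshold_decode(predict, length, threshold=0.9, max_rounds=16):
    """Fill `length` masked slots in parallel rounds.

    `predict(tokens)` is a hypothetical model call returning one
    (token, confidence) pair per position for the partial sequence.
    """
    tokens = [MASK] * length
    rounds = 0
    while any(t is MASK for t in tokens) and rounds < max_rounds:
        rounds += 1
        proposals = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Commit every masked slot whose confidence clears the static threshold.
        accepted = [i for i in masked if proposals[i][1] >= threshold]
        if not accepted:
            # Nothing cleared the bar: commit the most confident slot anyway.
            accepted = [max(masked, key=lambda i: proposals[i][1])]
        for i in accepted:
            tokens[i] = proposals[i][0]
    return tokens, rounds


def toy_predict(tokens):
    """Stub model: one hard token whose confidence grows with committed context."""
    target = ["the", "cat", "sat", "down"]
    base = [0.95, 0.97, 0.65, 0.92]
    filled = sum(t is not MASK for t in tokens)
    return [(w, min(1.0, c + 0.1 * filled)) for w, c in zip(target, base)]


decoded, used_rounds = threshold_decode(toy_predict, 4, threshold=0.9)
print(decoded, used_rounds)
```

In this toy run, three easy tokens are harvested in the first round and the hard one in the second, which mirrors the abstract's observation that most tokens reach high confidence early.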

2605.00969 2026-05-29 cs.SD cs.AI cs.CL Updated version

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

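Affiliations Centific Global Solutions Inc.; University of Maryland, College Park, MD, USA

AI summary Addresses the scarcity of medical audio data and the shortcomings of existing benchmarks by proposing the MedMosaic dataset, which covers multiple medical audio types and 46,701 question-answer pairs for evaluating language and audio reasoning models. Experiments show that reasoning remains challenging.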

Comments Accepted at ICML 2026

Abstract

Medical audio data is difficult to collect because of privacy regulations and the high annotation costs that come with requiring domain expertise. As a result, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset contains a total of 46,701 question-answer pairs spanning categories such as multiple-choice, sequential multi-turn, and open-ended question answering, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model such as Gemini-2.5-pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of the benchmark data is available here: https://shorturl.at/Lyp33

2603.27667 2026-05-29 cs.SD cs.AI Updated version

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang

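Affiliations The Chinese University of Hong Kong, Shenzhen; Didi Chuxing

AI summary Proposes EvA, a dual-path architecture that strengthens acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. Under a unified zero-shot protocol it achieves the best open-source perception results on MMAU, MMAR, and MMSU, supporting the evidence-first hypothesis.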

Abstract

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck: state-of-the-art systems show larger deficits in acoustic evidence extraction than in downstream reasoning, suggesting that upstream perception is often the limiting factor. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that enhances acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. We also build EvA-Perception, a large-scale training set with about 54K event-ordered captions and 500K evidence-grounded QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception results on MMAU, MMAR, and MMSU, with the largest gains on perception-heavy splits. Human evaluation on open-ended captioning further shows improved fine-grained acoustic coverage and caption quality. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning. Project can be found at https://satsuki2486441738.github.io/EvA/.

2602.18527 2026-05-29 cs.CV cs.AI cs.SD Updated version

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

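Affiliations Tsinghua University; Zhejiang University; The Chinese University of Hong Kong; Tencent AI Lab

AI summary Proposes the JAEGER framework, which extends audio-visual large language models to 3D space by integrating RGB-D observations with multi-channel first-order ambisonics for joint spatial grounding and reasoning, and introduces the neural intensity vector (Neural IV) to make direction-of-arrival estimation more robust.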

Comments Accepted to ICML 2026

Abstract

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets are available at https://github.com/liuzhan22/JAEGER.
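As background for the intensity-vector idea mentioned in this abstract, the sketch below shows the classical, non-learned pseudo-intensity estimate of direction of arrival from first-order ambisonic channels. It is not the paper's Neural IV, which is a learned representation. The helper names `intensity_doa` and `encode_foa` are hypothetical, and the code assumes a single dominant source and an encoding convention in which X, Y, and Z equal W scaled by the source's direction cosines (real B-format conventions differ in channel scaling and ordering).

```python
import math


def intensity_doa(w, x, y, z):
    """Estimate (azimuth, elevation) in degrees from time-domain FOA channels.

    Uses the time-averaged products W*X, W*Y, W*Z as a pseudo-intensity vector.
    """
    n = len(w)
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as an ideal plane wave under the assumed convention."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    w = list(signal)
    x = [s * math.cos(az) * math.cos(el) for s in signal]
    y = [s * math.sin(az) * math.cos(el) for s in signal]
    z = [s * math.sin(el) for s in signal]
    return w, x, y, z


# A 440 Hz tone sampled at 16 kHz, placed at azimuth 60 and elevation 20 degrees.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
azimuth, elevation = intensity_doa(*encode_foa(signal, 60.0, 20.0))
print(round(azimuth, 3), round(elevation, 3))
```

With overlapping sources or reverberation the averaged vector blends several directions, which is the kind of adverse condition the abstract says a learned representation is meant to handle.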

2602.12304 2026-05-29 cs.SD cs.AI cs.MM eess.AS Updated version

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

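Affiliations The University of Hong Kong; Shanda AI Research Tokyo; XIntelligence Technology Co., Limited

AI summary Proposes OmniCustom, a DiT-based zero-shot audio-video customization framework that uses a reference image and reference audio to synchronously generate video that preserves identity and timbre, with the spoken content specified by text.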

Comments code: https://github.com/OmniCustom-project/OmniCustom

Abstract

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity. Project page: https://omnicustom-project.github.io/page/.

2602.08979 2026-05-29 cs.SD cs.CL Updated version

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel

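Affiliations Karlsruhe Institute of Technology; Carnegie Mellon University

AI summary Systematically studies audio chaptering by proposing the audio-only architecture AudioSeg, analyzing the factors that affect performance, and formalizing evaluation protocols. AudioSeg substantially outperforms text-based methods, pauses are the most effective acoustic feature, and multimodal LLMs show promise on short audio.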

Comments Accepted at ACL 2026 (Main Conference)

Abstract

Audio chaptering, the task of segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.

2510.18416 2026-05-29 cs.SD Updated version

SegTune: Structured and Fine-Grained Control for Song Generation

Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan

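Affiliations Kuaishou Technology

AI summary Proposes SegTune, a non-autoregressive framework for structured, controllable song generation driven by segment-level local descriptions and global prompts, and introduces an LLM-based duration predictor for precise lyric-to-music alignment.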

Comments This technical report was later revised and published at ACL 2026 (oral). ACL paper link: https://openreview.net/forum?id=FKf2S4u8at , code: https://github.com/KlingAIResearch/SegTune

Abstract

Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting fine-grained control over musical structure and dynamics. In this paper, we propose SegTune, a non-autoregressive framework for structured and controllable song generation. SegTune enables segment-level control by allowing users or large language models to specify local musical descriptions aligned to song sections. The segmental prompts are injected into the model by temporally broadcasting them to corresponding time windows, while global prompts influence the whole song to ensure stylistic coherence. To obtain accurate segment durations and enable precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamped lyrics in LRC format. We further construct a large-scale data pipeline for collecting high-quality songs with aligned lyrics and prompts, and propose new evaluation metrics to assess segment-level alignment and vocal attribute consistency. Experimental results show that SegTune achieves superior controllability and musical coherence compared to existing baselines. See https://cai525.github.io/SegTune_demo for demos of our work.
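Two mechanics named in this abstract, LRC-format timestamped lyrics and temporal broadcasting of segment prompts, can be sketched in a few lines. This is an illustration under assumptions rather than SegTune's implementation: the function names `parse_lrc` and `broadcast_prompts`, the frame rate, and the rule that a segment lasts until the next one starts are all hypothetical, and only the common `[mm:ss.xx]text` form of an LRC line is handled.

```python
import re

# Matches the common LRC line form "[mm:ss.xx]lyric text".
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text):
    """Return a list of (start_seconds, lyric) tuples from LRC-style lines."""
    entries = []
    for line in text.strip().splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


def broadcast_prompts(segments, total_seconds, frame_rate):
    """Give every frame the prompt of the segment that covers it.

    `segments` is a list of (start_seconds, prompt) sorted by start time;
    each segment is assumed to last until the next one begins.
    """
    n_frames = int(total_seconds * frame_rate)
    frames = [None] * n_frames
    for index, (start, prompt) in enumerate(segments):
        end = segments[index + 1][0] if index + 1 < len(segments) else total_seconds
        first = int(start * frame_rate)
        last = min(int(end * frame_rate), n_frames)
        for f in range(first, last):
            frames[f] = prompt
    return frames


lyrics = parse_lrc("[00:00.00]First line\n[01:02.50]Second line")
frames = broadcast_prompts(
    [(0.0, "soft piano intro"), (2.0, "full band chorus")],
    total_seconds=4.0,
    frame_rate=2,
)
print(lyrics)
print(frames)
```

In a real system the per-frame entries would be prompt embeddings rather than strings, and the segment boundaries would come from the predicted timestamps.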

2510.07355 2026-05-29 cs.MM cs.SD Updated version

AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues

Dingkun Zhou, Krish Patel, Ajay Kankipati, Akshaj Gupta, Zeyi Austin Li, Mohul Shukla, Vibhor Narang, Sara Kofman, Zongli Ye, Grace Wang, Xiaoyu Shi, Tingle Li, Guan-Ting Lin, Kan Jen Cheng, Huang-Cheng Chou, Jiachen Lian, Gopala Anumanchipalli

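Affiliations UC Berkeley; South China University of Technology; Zhejiang University; National Taiwan University; University of Southern California

AI summary Proposes the AV-EMO-Reasoning benchmark, which systematically evaluates the emotional reasoning of omni-modal large language models using synthetic and real-world audiovisual dialogue datasets together with emotion perception and interaction reasoning metrics.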

Abstract

Emotions conveyed through voice and face shape engagement and context in human-AI interaction. Despite rapid progress in omni-modal large language models, the holistic evaluation of emotional reasoning with audiovisual cues remains limited. To address this gap, we introduce AV-EMO-Reasoning, a benchmark designed to systematically assess emotional reasoning abilities in large language models. The framework uses a curated audiovisual corpus comprising synthetic single-turn and multi-turn dialogues and a real-world subset, together with emotion perception and interaction reasoning metrics, to evaluate whether models can understand user emotions and produce appropriate responses. By releasing a systematic evaluation benchmark, AV-EMO-Reasoning offers a reproducible standard for evaluating emotion-aware dialogue and advances toward more natural, adaptive human-AI interaction.

2502.20838 2026-05-29 cs.SD cs.AI cs.LG eess.AS Updated version

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Takeshi Ashizawa, Kazuhiro Nakadai

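Affiliations Systems and Control Engineering, School of Engineering, Institute of Science Tokyo, Japan

AI summary Proposes the DSMIL-LocNet framework, which uses weakly supervised multiple instance learning to classify and temporally localize whale calls from recording-level labels only, and which outperforms fully supervised baselines on long recordings.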

Comments Accepted in European Signal Processing Conference (EUSIPCO) 2026

Abstract

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2 to 30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88 to 0.91 on recordings of 300 to 1800 s, where fully supervised CNN baselines degrade to 0.19 to 0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
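The way multiple instance learning yields both a recording-level decision and temporal localization from per-window scores can be shown with a generic sketch. This is not the DSMIL-LocNet architecture: the per-window scores would come from a trained model, and the max-pooling rule, the 0.5 threshold, the 15-second window, and the function names are assumptions chosen for illustration.

```python
def bag_prediction(instance_scores, threshold=0.5):
    """Recording-level decision: positive if any window scores high enough."""
    return max(instance_scores) >= threshold


def localize(instance_scores, window_seconds, threshold=0.5):
    """Merge consecutive above-threshold windows into (start, end) intervals in seconds."""
    intervals = []
    start = None
    for i, score in enumerate(instance_scores):
        if score >= threshold and start is None:
            start = i  # a new positive run begins at this window
        elif score < threshold and start is not None:
            intervals.append((start * window_seconds, i * window_seconds))
            start = None
    if start is not None:
        # The recording ended while a positive run was still open.
        intervals.append((start * window_seconds, len(instance_scores) * window_seconds))
    return intervals


# Hypothetical per-window call probabilities for a 2-minute recording.
scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.1, 0.7, 0.2]
print(bag_prediction(scores))
print(localize(scores, window_seconds=15))
```

Only the recording-level output needs a label during training. The intervals are read off the instance scores afterwards, which is why no frame-level annotation is required.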

2605.29531 2026-05-29 cs.SD cs.CV cs.LG Updated version

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

S. Sutharya, Remya K. Sasi

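Affiliations Department of Computer Science

AI summary Proposes the CAFNet model, which jointly detects partially forged audio through ternary classification and boundary regression, reaching 92.71% accuracy and 0.075 s localization error on the MLADDC dataset.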

Comments 13 pages, 5 figures, 11 tables

Abstract

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
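The Equal Error Rate (EER) quoted for binary detection is the operating point at which the false acceptance rate and the false rejection rate coincide. The following sketch approximates it by sweeping thresholds over the observed scores. The function name, the convention that a higher score means "more likely genuine", and the toy scores are assumptions, and the paper's evaluation code may interpolate differently.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Returns the mean of the false acceptance rate (spoof accepted as genuine)
    and the false rejection rate (genuine rejected) where they are closest.
    """
    best_gap, best_eer = None, None
    for threshold in sorted(set(genuine_scores) | set(spoof_scores)):
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores: one genuine clip scores low and one spoofed clip scores high.
genuine = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, spoof))
```

At the threshold 0.6 one of four spoofed clips is accepted and one of four genuine clips is rejected, so both error rates are 25% and the sketch reports an EER of 0.25.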

2605.29300 2026-05-29 cs.CL cs.AI cs.SD Updated version

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

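Affiliations Seoul National University; Sony Group Corporation; Sony AI

AI summary Proposes the MusTBENCH benchmark and the four-stage MusT optimization recipe to evaluate and improve the temporal grounding of music large language models in audio.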

Abstract

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.

2605.29257 2026-05-29 cs.SD 版本更新

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

ChildVox:理解与表征儿童期声音的语音、音频及大型音频语言模型基准

Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan

发表机构 * University of Southern California(南加州大学) The Ohio State University(俄亥俄州立大学) University of California, Los Angeles(加州大学洛杉矶分校) Harvard University(哈佛大学) Boston University(波士顿大学) University of Miami(迈阿密大学)

AI总结 提出ChildVox基准,整合17个儿童音频数据集和20多个子任务,评估多种基础模型在儿童生理声、非语言发声、规范音节和口语识别上的性能。

Comments preprint under review

详情
AI中文摘要

我们提出了ChildVox,这是一个新颖的基准,用于表征儿童通过其交流的多样化声学信号。具体来说,ChildVox遵循从出生到学龄的完整发展轨迹,涵盖生理声音、非语言发声、规范音节和口语。ChildVox整合了来自17个以儿童为中心的音频和语音数据集的20多个子任务,实现了系统的跨语料库和跨领域比较。我们评估了一系列代表性的音频和语音基础模型,包括自监督、面向ASR和大型音频语言模型,在生理声音分类、发声和规范音节建模以及语音质量评估和识别等任务上的表现。基准测试结果表明,ChildVox提供了一套高性能模型,用于识别来自儿童的广泛声学信号,支持下游应用,如表征儿童语言水平和追踪随年龄变化的语音产生。

英文摘要

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

2605.13841 2026-05-29 cs.SD cs.AI cs.CL cs.LG 版本更新

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

EVA-Bench:一种用于评估语音代理的新型端到端框架

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

发表机构 * ServiceNow

AI总结 提出EVA-Bench框架,通过机器人间音频对话模拟和复合指标(EVA-A和EVA-X)全面评估语音代理的准确性和体验质量。

Comments Work in progress

详情
AI中文摘要

语音代理是一种通过口语对话完成任务的人工智能系统,越来越多地部署在企业应用中。然而,现有基准测试未能同时解决两个核心评估挑战:生成逼真的模拟对话,以及全面衡量语音特定故障模式的质量。我们提出了EVA-Bench,一个端到端评估框架,同时解决这两个问题。在模拟方面,EVA-Bench通过动态多轮对话协调机器人间的音频对话,并自动进行模拟验证,检测用户模拟器错误并在评分前适当重新生成对话。在测量方面,EVA-Bench引入了两个复合指标:EVA-A(准确性),捕捉任务完成度、忠实度和音频级语音保真度;以及EVA-X(体验),捕捉对话进展、口语简洁性和话轮转换时机。这两个指标适用于所有主要的代理架构,支持直接的跨架构比较。EVA-Bench包含三个企业领域的213个场景、一个用于口音和噪声鲁棒性的受控扰动套件,以及区分峰值能力和可靠能力的pass@1、pass@k、pass^k测量。在跨越所有三种架构的12个系统中,我们发现:(1)没有系统在EVA-A pass@1和EVA-X pass@1上同时超过0.5;(2)峰值性能和可靠性能差异显著(EVA-A上pass@k与pass^k的中位数差距为0.44);(3)口音和噪声扰动暴露了显著的鲁棒性差距,其影响因架构、系统和指标而异(平均Δ高达0.314)。我们在开源许可下发布了完整的框架、评估套件和基准数据。

英文摘要

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator errors and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median gap of 0.44 between pass@k and pass^k on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean Δ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
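The abstract contrasts pass@k (peak capability) with pass^k (reliable capability) but does not state how either is computed. The sketch below assumes the commonly used combinatorial estimators over n independent trials of one scenario with c successes; the function names `pass_at_k` and `pass_hat_k` are hypothetical, and EVA-Bench's own implementation may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds), given c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Complement of drawing k trials that are all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed), given c successes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fraction of k-subsets of the n trials that contain only successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only: 3 successes out of 5 trials, k = 2.
    n, c, k = 5, 3, 2
    print(pass_at_k(n, c, k), pass_hat_k(n, c, k))
```

Under these assumed definitions the two quantities coincide at k = 1 and move apart as k grows, which is one way to read the reported gap between peak and reliable performance.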

2603.01006 2026-05-29 cs.SD cs.AI cs.LG cs.MM 版本更新

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

AG-REPA:音频流匹配中表示对齐的因果层选择

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

发表机构 * AI Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)信息枢纽人工智能学域)

AI总结 提出AG-REPA方法,通过前向门控消融量化各层对速度场的因果贡献,实现稀疏层选择和自适应加权对齐,在音频流匹配中优于传统REPA基线。

Comments Accepted to ICML 2026. 17 pages, 4 figures, 12 tables

详情
AI中文摘要

表示对齐(REPA)通过将中间隐藏状态与预训练教师特征对齐来改进生成流模型的训练,但在令牌条件音频流匹配中,其有效性关键取决于监督层的选择,而监督层通常基于深度启发式地选择。在这项工作中,我们引入了归因引导的表示对齐(AG-REPA),一种用于音频流匹配中表示对齐的新型因果层选择策略。首先,我们发现最能存储语义/声学信息(高教师空间相似性)的层不一定是那些对驱动生成的速度场贡献最大的层,我们称之为存储-贡献分离(SCD)。为了将这一见解转化为可操作的训练指导,我们提出了一种前向门控消融(FoG-A),通过预测速度场中的诱导变化来量化每个层的因果贡献,从而实现稀疏层选择和自适应加权对齐。在统一的语音和通用音频训练(LibriSpeech + AudioSet)中,在不同的令牌条件拓扑下,AG-REPA始终优于REPA基线。总体而言,我们的结果表明,当对齐应用于因果主导的驱动速度场的层时,而不是应用于表示丰富但功能被动的层时,对齐最为有效。

英文摘要

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on the depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. Firstly, we find that layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, and we call it Store-Contribute Dissociation (SCD). To turn this insight into an actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.

2601.22661 2026-05-29 cs.SD 版本更新

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

基于平均延续对数概率评估与奖励面向表达性角色扮演TTS的LALM

Yong Ren, Jingbei Li, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Beihang University(北京航空航天大学)

AI总结 提出平均延续对数概率(MCLP)作为评估指标和奖励信号,用于提升大型音频语言模型在角色扮演文本转语音任务中的风格一致性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型音频语言模型(LALM)的最新进展已将文本转语音(TTS)扩展到交互式角色扮演场景,这要求高表达力和严格遵守角色扮演指令。然而,现有模型在多轮对话中难以保持与角色档案和场景描述一致的风格。一个关键瓶颈是缺乏量化说话风格的客观指标。为弥补这一差距,我们提出平均延续对数概率(MCLP)作为评估指标和奖励信号,并在基于LALM的角色扮演TTS(RP-TTS)任务上进行了验证。MCLP利用预训练LALM的上下文学习能力,测量真实语音标记在由转录文本、生成语音和重复转录文本组成的上下文历史条件下的似然性,作为风格连续性的代理。此外,我们使用MCLP作为强化学习奖励,以增强生成语音与角色扮演指令之间的风格对齐。为支持该任务,我们构建了一个大规模带有丰富场景和角色标注的RP-TTS数据集。实验表明,MCLP与人类对风格一致性的判断高度一致,并且作为改进RP-TTS的有效奖励,在客观指标和主观评估中均带来一致提升。我们的代码已公开于https://github.com/y-ren16/MCLP。

英文摘要

Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. MCLP leverages the in-context learning capability of pretrained LALMs to measure the likelihood of ground-truth speech tokens conditioned on a contextual history consisting of the transcript, generated speech, and repeated transcript, serving as a proxy for stylistic continuity. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and role-play instructions. To support this task, we construct a large-scale RP-TTS dataset with rich scene and character annotations. Experiments demonstrate that MCLP is well aligned with human judgments of stylistic consistency and serves as an effective reward for improving RP-TTS, leading to consistent gains in both objective metrics and subjective evaluations. Our code is publicly available at https://github.com/y-ren16/MCLP.
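As a rough illustration of the quantity named in the abstract, the sketch below averages the log-probability that a model assigns to each ground-truth continuation token. It assumes the LALM has already been run on the context (transcript, generated speech, repeated transcript) and that its per-step logits over the speech-token vocabulary are available as plain lists; the function name and input layout are hypothetical and are not taken from the authors' repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_continuation_log_prob(step_logits, target_ids):
    """Mean log-probability of the ground-truth tokens, one logit vector per step.

    step_logits: list of per-step logit lists (assumed to come from the LALM,
                 conditioned on the context and on the preceding true tokens).
    target_ids:  ground-truth speech-token ids, one per step.
    """
    if len(step_logits) != len(target_ids) or not target_ids:
        raise ValueError("need one logit vector per target token, and at least one token")
    total = sum(log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids))
    return total / len(target_ids)


if __name__ == "__main__":
    # Toy logits over a 4-token vocabulary; values are illustrative only.
    print(mean_continuation_log_prob([[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]], [0, 2]))
```

Averaging over tokens rather than summing keeps scores comparable across continuations of different lengths, which matters if the value is used as a reward.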

2509.15629 2026-05-29 cs.SD eess.AS 版本更新

An Extensive Analysis of the Singing Voice Conversion Challenge 2025 Evaluation Results

2025年歌唱声音转换挑战赛评估结果的深入分析

Lester Phillip Violeta, Xueyao Zhang, Jiatong Shi, Yusuke Yasuda, Wen-Chin Huang, Zhizheng Wu, Tomoki Toda

发表机构 * Graduate School of Informatics, Nagoya University, Japan(名古屋大学信息学研究科,日本) Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)) Carnegie Mellon University, USA(卡内基梅隆大学,美国) National Institute of Informatics, Japan(日本国立情报学研究所) Information Technology Center, Nagoya University, Japan(名古屋大学信息技术中心,日本)

AI总结 本文对2025年歌唱声音转换挑战赛的评估结果进行了深入分析,通过新数据库、两个任务、开源基线和大规模众包测试,比较了33个系统在歌手身份和歌唱风格转换上的表现,发现顶级系统在身份相似性上接近真实样本,但风格建模(如气息、滑音、颤音)仍具挑战,且现有客观指标无法完全替代主观评分。

Comments Submitted to IEEE TASLP

详情
AI中文摘要

我们呈现了对最新一届歌唱声音转换挑战赛结果的分析,该科学活动旨在在受控环境中比较和理解不同的声音转换系统。与以往仅关注转换歌手身份的迭代相比,今年我们还关注了转换歌手的歌唱风格。为了创建受控环境和进行彻底评估,我们开发了一个新的挑战数据库,引入了两个任务,开源了基线系统,并进行了大规模的众包听力测试和客观评估。该挑战赛持续了两个月,我们总共评估了33个不同的系统。大规模众包听力测试的结果表明,顶级系统在歌手身份评分上与真实样本相当。然而,建模歌唱风格并因此实现高自然度仍然是该任务中的一个挑战,主要原因是难以对气息、滑音和颤音歌唱风格中的动态信息进行建模。对挑战赛的进一步分析还讨论了传统相似性测试和动态偏好测试在评估歌唱风格相似性方面的局限性。此外,计算斯皮尔曼秩相关系数表明,依赖客观指标(如色度对齐)和非匹配指标(如说话人嵌入)与主观评分相关性最高,但仍未达到可被视为真正替代主观评分的水平。

英文摘要

We present a thorough analysis of the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event aiming to compare and understand different voice conversion systems in a controlled environment. Compared to previous iterations, which focused solely on converting the singer identity, this year we also focused on converting the singing style of the singer. To create a controlled environment and thorough evaluations, we developed a new challenge database, introduced two tasks, open-sourced baselines, and conducted large-scale crowd-sourced listening tests and objective evaluations. The challenge was run for two months and in total we evaluated 33 different systems. The results of the large-scale crowd-sourced listening test showed that top systems had comparable singer identity scores to ground truth samples. However, modeling the singing style and consequently achieving high naturalness remains a challenge in this task, primarily due to the difficulty in modeling dynamic information in breathy, glissando, and vibrato singing styles. Further analyses of the challenge also discuss the limitations of both the traditional similarity test and the dynamic preference test in evaluating singing style similarity. Moreover, calculating Spearman's rank correlation coefficient shows that dependent objective metrics such as chroma-alignment and non-match metrics such as speaker embeddings are the most correlated to subjective scores, but are still not at a level where they could be considered a true replacement for subjective scores.
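For readers unfamiliar with the statistic used to compare objective metrics against listening-test scores, here is a minimal sketch of Spearman's rank correlation: convert both score lists to ranks (averaging ranks over ties), then take the Pearson correlation of the ranks. The function names are hypothetical and the example scores are made up for illustration; they are not challenge results.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group order[i..j]
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Made-up per-system scores: an objective metric vs. a subjective rating.
    objective = [0.61, 0.72, 0.55, 0.80, 0.68]
    subjective = [3.1, 3.0, 2.7, 4.2, 3.6]
    print(spearman_rho(objective, subjective))
```

Because only the ordering of systems matters, a metric can reach a high rho while being on a completely different scale from the subjective scores.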

2505.10975 2026-05-29 cs.CL cs.AI cs.SD eess.AS 版本更新

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

单声道音频的端到端多说话人自动语音识别综述

Xinlu He, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 本文系统综述了端到端多说话人自动语音识别的神经架构范式(SIMO与SISO)、近期改进方法及长语音扩展策略,并通过标准基准评估比较了各类方法。

Comments Accepted for publication in Computer Speech & Language (CSL)

详情
AI中文摘要

单声道多说话人自动语音识别(ASR)由于数据稀缺以及识别并将词语归因于单个说话人的内在困难(尤其是在重叠语音中)仍然具有挑战性。最近的进展推动了从级联系统向端到端(E2E)架构的转变,这减少了错误传播并更好地利用了语音内容与说话人身份之间的协同作用。尽管端到端多说话人ASR取得了快速进展,但该领域缺乏对近期发展的全面综述。本综述为多说话人ASR的端到端神经方法提供了一个系统的分类法,突出了近期进展和比较分析。具体而言,我们分析了:(1)用于预分割音频的架构范式(SIMO与SISO),分析了它们的不同特征和权衡;(2)基于这两种范式的近期架构和算法改进;(3)对长语音的扩展,包括分割策略和说话人一致性的假设拼接。此外,我们(4)在标准基准上评估和比较了各种方法。最后,我们讨论了构建鲁棒且可扩展的多说话人ASR所面临的开放挑战和未来研究方向。

英文摘要

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, along with their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.