Multimodal Large Models - arXivDaily Topic

2606.14702 2026-06-18 cs.CV New submission 90%

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

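Affiliations: Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

Topic match (audio-visual multimodal): audio-visual reasoning dataset and QA

AI summary: Proposes the OmniVideo-100K dataset, which uses entity-anchored video scripts and a clue-guided QA generation mechanism to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA; fine-tuned models improve markedly on several benchmarks.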

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) **Entity-Anchored Video Scripting** transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) **Clue-Guided QA Generation** prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset **OmniVideo-100K** and a human-verified test set, **OmniVideo-Test**. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench (improvements of up to 12.64%).
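The structured script described in the abstract (summary, main entity list, segment-wise audio-visual descriptions) can be pictured as a small data structure. The sketch below is only an illustration: the class names, field names, the `[E1]`-style entity tags, and the helper methods are all assumptions made here, not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str      # hypothetical stable identifier, e.g. "E1"
    description: str    # canonical description shared by every segment


@dataclass
class Segment:
    start_s: float
    end_s: float
    visual: str         # visual description, citing entities as "[E1]"
    audio: str          # audio description, citing entities as "[E1]"


@dataclass
class VideoScript:
    summary: str
    entities: list[Entity]
    segments: list[Segment] = field(default_factory=list)

    def segments_mentioning(self, entity_id: str) -> list[int]:
        """Indices of segments whose audio or visual text cites entity_id."""
        tag = f"[{entity_id}]"
        return [i for i, s in enumerate(self.segments)
                if tag in s.visual or tag in s.audio]

    def cross_segment_entities(self) -> list[str]:
        """Entities cited in two or more segments (candidate long-range clues)."""
        return [e.entity_id for e in self.entities
                if len(self.segments_mentioning(e.entity_id)) >= 2]


# Invented toy content, used only to exercise the structure.
script = VideoScript(
    summary="A street musician plays while a dog watches.",
    entities=[Entity("E1", "man with a red guitar"), Entity("E2", "brown dog")],
    segments=[
        Segment(0.0, 10.0, "[E1] tunes his instrument.", "Plucked strings from [E1]."),
        Segment(10.0, 20.0, "[E2] sits nearby.", "[E2] barks once."),
        Segment(20.0, 30.0, "[E1] bows to the crowd.", "Applause."),
    ],
)
```

Because every segment refers to entities through the shared list, the same entity keeps one description across segments, and a sound can be tied to its visual source by citing the same identifier in both the `audio` and `visual` fields.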

2606.19157 2026-06-18 eess.AS cs.CL New submission 85%

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

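Affiliations: AI4Bharat, Indian Institute of Technology Madras, India; Sarvam AI, India

Topic match (audio-visual multimodal): evaluating context utilisation in audio large language models

AI summary: Proposes the IndicContextEval benchmark, 56 hours of natural speech from 555 speakers across 8 Indian languages, and uses a 7-level prompting framework to evaluate whether audio large language models genuinely use the supplied context rather than relying on parametric knowledge.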

Comments Accepted at Interspeech 2026

Abstract

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

2606.18924 2026-06-18 cs.SD New submission 85%

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

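Affiliations: School of Electrical Engineering, KAIST

Topic match (audio-visual multimodal): analysis of the text-bias mechanism in audio LLMs

AI summary: A mechanistic analysis reveals a text dominance bias in audio LLMs, finds that the text pathway actively suppresses intact audio representations, and proposes back-patching, a training-free intervention that strengthens audio representations and mitigates text dominance.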

Comments Preprint

Abstract

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically present across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.
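The abstract describes back-patching only at a high level: late-layer audio activations are routed back into earlier layers. The toy sketch below shows one plausible reading of that idea on a stack of plain Python "layers". The two-pass procedure, the function names `run` and `back_patch`, and the layer and position indices are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Sequence

Hidden = list[list[float]]            # one vector per token position
Layer = Callable[[Hidden], Hidden]


def run(layers: Sequence[Layer], x: Hidden, patch=None) -> list[Hidden]:
    """Run all layers and return the hidden states after each layer.

    patch, if given, is (layer_index, positions, vectors): the output of that
    layer is overwritten at `positions` with `vectors` before continuing.
    """
    trace = []
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch is not None and patch[0] == i:
            _, positions, vectors = patch
            h = [list(v) for v in h]          # copy before overwriting
            for pos, vec in zip(positions, vectors):
                h[pos] = list(vec)
        trace.append(h)
    return trace


def back_patch(layers, x, audio_positions, source_layer, target_layer):
    """Copy late-layer (source) audio activations into an earlier (target) layer."""
    assert target_layer < source_layer
    clean = run(layers, x)                                    # pass 1: record
    late = [clean[source_layer][p] for p in audio_positions]
    return run(layers, x, patch=(target_layer, audio_positions, late))  # pass 2


# Toy model: four layers that each add 1.0 to every component.
toy_layers = [lambda h: [[v + 1.0 for v in vec] for vec in h] for _ in range(4)]
inputs = [[0.0], [10.0]]              # position 0 plays the role of an audio token

clean_trace = run(toy_layers, inputs)
patched_trace = back_patch(toy_layers, inputs, audio_positions=[0],
                           source_layer=2, target_layer=0)
```

In the patched pass, the audio position effectively goes through the intermediate layers a second time, while the text position is untouched. In a real model the same pattern would typically be implemented with forward hooks rather than a hand-written loop.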

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS New submission 85%

Continuous Audio Thinking for Large Audio Language Models

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

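Affiliations: KAIST

Topic match (audio-visual multimodal): proposes the CoAT framework to strengthen continuous audio thinking in audio language models.

AI summary: Proposes the Continuous Audio Thinking (CoAT) framework, which organizes acoustic information in a continuous latent space through expert distillation so that audio language models can use rich acoustic features before generating a response, with no additional autoregressive decoding cost, and improves performance on multiple audio tasks.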

Comments Preprint

Abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

2606.19203 2026-06-18 eess.AS New submission 80%

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

Jaeeun Baik, Ui-Hyeop Shin, Jiwoon Lee, Woocheol Jeong, Hyung-Min Park

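Topic match (audio-visual multimodal): proposes a self-distillation framework that improves speech recognition robustness; falls under audio processing

AI summary: Proposes DASH, a self-distillation framework that learns clean-noisy consistency from two views, distills hidden representations from multiple encoder layers, and minimizes the KL divergence between prototype assignment distributions, improving noise robustness while preserving clean accuracy at an extra cost of only about 4% of fine-tuning time.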

Comments Accepted to Interspeech 2026

Abstract

Automatic Speech Recognition (ASR) often degrades in real-world noisy environments, making noise robustness essential for deployment. Supervised noise-augmented fine-tuning is a common remedy, but it can introduce a robustness-clean trade-off and overfit to specific corruptions, degrading recognition in clean conditions. We propose DASH, a self-distillation framework that improves robustness by learning clean-noisy consistency from paired views. DASH distills hidden representations from multiple encoder layers to capture features from low-level acoustics to high-level semantics, and stabilizes training by minimizing KL divergence between prototype assignment distributions of clean and noisy views. Experiments on LibriSpeech show that DASH consistently improves recognition under diverse noisy conditions while preserving clean accuracy. This is achieved by adding a label-free pre-training stage on top of standard fine-tuning, with minimal additional overhead (about 4% of fine-tuning time).
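The consistency term mentioned in the abstract, a KL divergence between prototype assignment distributions of the clean and noisy views, can be sketched in a few lines. Everything beyond that phrase is an assumption made here for illustration: dot-product similarity, the softmax temperature, and the KL direction (clean view as the target) are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_assignment(feature, prototypes, temperature=0.1):
    """Soft assignment of one feature vector over prototypes (dot-product similarity)."""
    logits = [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]
    return softmax(logits, temperature)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same prototypes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Invented toy values: two prototypes and one frame seen in both views.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
clean_feature = [1.0, 0.0]
noisy_feature = [0.8, 0.3]

p_clean = prototype_assignment(clean_feature, prototypes)
p_noisy = prototype_assignment(noisy_feature, prototypes)
consistency_loss = kl_divergence(p_clean, p_noisy)   # assumed direction: clean as target
```

The loss is zero when both views assign the frame to prototypes identically and grows as noise shifts the noisy view's assignment away from the clean one.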
