AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning
Xiangyu Zhao, Junyu Yan, Yaling Shen, Zimu Wang, Yiwen Jiang, Stephanie Fong, Qingyang Xu, Jiahe Liu, Dominic Dwyer, Zongyuan Ge
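AI summary: Proposes the AudioProcessBench benchmark for evaluating how well audio-language models identify process errors in reasoning steps, covering three paradigms: step correctness, error-type detection, and chain-level aggregation.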
Large audio-language models (LALMs) increasingly use explicit reasoning traces for complex audio understanding, yet the evaluation of reasoning quality remains underexplored. Although process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in text and multi-modal domains, comparable evaluation for audio reasoning remains limited. In this paper, we present AudioProcessBench, a comprehensive benchmark for step-level process error identification in audio reasoning. AudioProcessBench contains diverse reasoning traces generated by six audio and omni language models. Each trace is segmented into discrete reasoning steps and annotated with binary step correctness and fine-grained error types. Our benchmark evaluates models under three complementary paradigms: (1) step correctness identification, (2) error-type-conditioned detection for diagnosing audio-specific verifier capabilities, and (3) chain-level aggregation, where verifiers select or aggregate among multiple reasoning traces for the same question. This design enables a systematic analysis of whether current models can detect process errors, whether their weaknesses differ across audio-specific error types, and whether process verification translates into improved answer selection. AudioProcessBench provides a testbed for future research on audio reasoning verifiers, process reward models, and reliable omni-modal reasoning.
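To make the annotation scheme and the chain-level paradigm concrete, the following is a minimal Python sketch, not the benchmark's actual data format or scoring code. The class names, field names, the `step_accuracy` metric, and the min-aggregation rule in `select_trace` are all assumptions introduced for illustration; the abstract only states that steps carry a binary correctness label and a fine-grained error type, and that verifiers select or aggregate among multiple traces for the same question.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    text: str
    is_correct: bool                  # binary step-correctness annotation
    error_type: Optional[str] = None  # fine-grained label; values are hypothetical


@dataclass
class Trace:
    question_id: str
    steps: List[Step]
    final_answer: str


def step_accuracy(trace: Trace, predicted: List[bool]) -> float:
    """Fraction of steps whose predicted correctness matches the annotation."""
    if not trace.steps or len(predicted) != len(trace.steps):
        raise ValueError("need exactly one prediction per annotated step")
    hits = sum(p == s.is_correct for p, s in zip(predicted, trace.steps))
    return hits / len(trace.steps)


def select_trace(traces: List[Trace], score_step: Callable[[Step], float]) -> Trace:
    """Pick the trace whose weakest step scores highest (assumed min-aggregation)."""
    return max(
        traces,
        key=lambda t: min((score_step(s) for s in t.steps), default=0.0),
    )
```

Here `score_step` stands in for any verifier that maps a step to a confidence in its correctness. Min-aggregation is only one plausible choice; a product or mean of step scores could be swapped in without changing the interface.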