LLM Reasoning Capabilities - arXivDaily Topic

2604.28076 2026-06-18 cs.CL cs.AI cs.LG Version update 85%

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

Affiliations: School of Artificial Intelligence, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China

Topic match (Complex Problem Solving): a benchmark for implicit predictive reasoning in tabular question answering

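AI summary: Proposes TopBench, a benchmark of 779 samples across four sub-tasks that evaluates whether large language models can recognize implicit predictive intent in tabular question answering and reason reliably about it; finds that current models struggle with intent recognition.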

Abstract

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

2509.22363 2026-06-18 cs.LG eess.AS Version update 70%

Investigating Faithfulness in Large Audio Language Models

Pooneh Mousavi, Lovenya Jain, Mirco Ravanelli, Cem Subakan

Affiliations: Concordia University; Mila - Quebec AI Institute; Université Laval; Birla Institute of Technology and Science, Pilani

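Topic match (Complex Problem Solving): studies the faithfulness of chain-of-thought reasoning, which bears on reasoning evaluation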

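AI summary: Proposes a systematic framework for evaluating the faithfulness of reasoning chains in large audio language models, defines three criteria for audio faithfulness, and finds through benchmarking a potential disconnect between model reasoning and the audio input.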

Comments: Accepted to Interspeech 2026

Abstract

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness (the benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/). Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

2603.05128 2026-06-18 eess.AS cs.SD Version update 70%

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

Affiliations: Harbin University of Science and Technology; The University of Melbourne; KAIST; University of Surrey

Topic match (Complex Problem Solving): evaluates the compositional reasoning ability of large audio models

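AI summary: Addresses the lack of evaluation for compositional reasoning in polyphonic audio by proposing the PolyBench benchmark, which comprises five subsets (counting, classification, detection, concurrency, and duration estimation); the evaluation finds that existing large audio language models consistently degrade in polyphonic scenarios.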

Comments: Accepted by INTERSPEECH 2026

2503.01805 2026-06-18 cs.LG cs.AI cs.CL Version update 70%

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

Affiliations: Courant Institute of Mathematical Sciences, New York University; Google Research; Meta AI; Bar-Ilan University; Department of Bio-Medical Engineering, Edmond J. Safra Center for Bioinformatics, Tel-Aviv University; Tel Aviv University

Topic match (Complex Problem Solving): studies the reasoning ability of Transformers on graph algorithmic tasks

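AI summary: Studies the depth-width trade-off of Transformers on graph algorithmic tasks. With linear width, constant depth suffices for many graph problems, while some problems require quadratic width; experiments find tasks where wide models match the accuracy of deep models while training and running inference faster.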

Comments: Updated ISF grant number

Abstract

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

2601.17226 2026-06-18 cs.CL cs.AI Version update 70%

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

Affiliations: University of New South Wales

Topic match (Complex Problem Solving): improves the logic and rationality of story retelling

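AI summary: Proposes RRR, a reinforcement learning framework that combines Structuralist Narratology with scalar narrativity and uses d-RLAIF to derive training signals from textual features without reference outputs, improving the logic, rationality, and completeness of LLM story retelling.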

Comments: 8 pages, 7 figures

Abstract

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.
