大模型推理能力 - arXivDaily 专题

2606.18557 2026-06-18 cs.AI cs.LG cs.LO 新提交 85%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb：基础模型中可废止溯因的可验证基准

Patrick Cooper, Alvaro Velasquez

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）

专题命中逻辑推理：测试逻辑推理和理论推理能力

AI总结提出DeFAb基准，通过将知识库转换为可验证的溯因实例，评估基础模型在可废止推理中的创造力与理论推理能力，发现前沿模型准确率远低于符号求解器。

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

详情

AI中文摘要

一个基于规则的逻辑求解器在不到50微秒内以100%的准确率解决了我们基准中的每个实例；而最佳前沿语言模型在渲染鲁棒评估下最高仅达65%，最差降至23.5%（四种表面渲染的最坏情况）。我们引入DeFAb（可废止溯因基准），这是一个数据集和生成流水线，将四十年的公共资助知识库转换为形式化可废止溯因实例：通过覆盖默认值同时保留无关期望来构建解释异常假设。由于每个假设必须通过多项式时间检查（有效推导、保守性和最小性），DeFAb将逻辑严谨性作为衡量创造性和理论推理的工具，评分的是理论修正的规范构建，而非流畅但破坏理论的散文。该流水线将分类层次结构（OpenCyc、YAGO、Wikidata）与行为属性图（ConceptNet、UMLS）配对，从18个来源生成372,648+个实例，涉及33.75M条实例化规则，分为三个级别，并具有多项式时间可验证的金标准。四个前沿模型未能可靠内化可废止推理：渲染鲁棒的Level 2准确率为7.8-23.5%；思维链方差（约36个百分点）超过任何模型间差距；匹配的污染控制隔离出+19.4个百分点的Level 3差距。我们进一步发布了DeFAb-Hard（235个实例的Level 3难度变体；最佳模型53.3% vs 符号100%）和CONJURE（一个内核验证的变革性创造力变体，包含560个Lean 4/Mathlib实例，其金答案证明内核先前未包含的定义，无需判断的验证器；试点发现零新概念）。同一验证器还可作为偏好优化（DPO、RLVR/GRPO）的精确奖励。基于MIT许可发布于此https URL。

英文摘要

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

URL PDF HTML ☆

赞 0 踩 0

2606.15633 2026-06-18 cs.LG 新提交 85%

Formalizing and Mitigating Structural Distortion in LLM Attention for Graph Reasoning

形式化并缓解大语言模型注意力中的结构失真以实现零样本图推理

Donald Loveland, Puja Trivedi, Ari Weinstein, Edward W Huang, Danai Koutra

发表机构 * University of Michigan（密歇根大学）； Amazon（亚马逊）

专题命中逻辑推理：图推理中的结构失真缓解，提升LLM推理

AI总结本文形式化了大语言模型处理文本属性图时因图线性化导致的结构失真机制，并提出轻量级推理时修改方法GaLA，通过校正注意力偏差提升零样本图推理性能。

Comments Accepted to KDD 2026

详情

AI中文摘要

大语言模型（LLM）在文本属性图（TAG）推理中展现出潜力。然而，将LLM应用于图需要将其结构线性化为序列，这引入了根源于图带宽问题的失真。虽然这种失真已被证明会降低性能，但通常归因于提示设计或模型规模，其潜在机制尚不清楚。在这项工作中，我们展示了旋转位置嵌入如何将图线性化为带宽相关的注意力衰减，抑制了序列化序列中被强制分隔开的图相邻节点之间的注意力。这将基于LLM的图推理的焦点从提示工程和规模缩放转向纠正注意力错位。受此分析启发，我们提出了图对齐语言注意力（GaLA），一种轻量级的、推理时修改LLM的方法。GaLA将注意力偏向图相邻节点，同时保留LLM的序列归纳偏差。在TAG基准测试中，GaLA以可忽略的开销提升了性能，表明失真是基于LLM的图推理中可纠正的瓶颈。

英文摘要

Large Language Models (LLMs) have shown promise for reasoning over Text-Attributed Graphs (TAGs). However, applying LLMs to graphs requires linearizing their structure into sequences, introducing distortion rooted in the graph bandwidth problem. While this distortion has been shown to degrade performance, it is often attributed to prompt design or model scale, leaving the underlying mechanism unclear. In this work, we show \textit{how} rotary positional embeddings turn graph linearization into bandwidth-dependent attention decay, suppressing attention between graph-adjacent nodes that are forced far apart in the serialized sequence. This shifts the focus of LLM-based graph reasoning from prompt engineering and scaling toward correcting attention misalignment. Motivated by this analysis, we propose \textbf{G}raph-\textbf{a}ligned \textbf{L}anguage \textbf{A}ttention (\textbf{GaLA}), a lightweight, inference-time modification for LLMs. GaLA biases attention toward graph-adjacent nodes while preserving the LLM's sequential inductive biases. Across TAG benchmarks, GaLA improves performance with negligible overhead, demonstrating that distortion is a correctable bottleneck in LLM-based graph reasoning.

URL PDF HTML ☆

赞 0 踩 0

2606.18624 2026-06-18 cs.CL 新提交 80%

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST：用于语用语言理解的自我强化反事实推理

Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）

专题命中逻辑推理：自我强化反事实推理提升语用语言理解

AI总结提出PragReST框架，通过自监督构建语用问答数据、生成反事实推理轨迹，结合监督微调和强化学习提升大语言模型的语用推理能力，在四个基准上显著优于基线模型。

Comments First two authors contributed equally. Code and models: https://github.com/jihyung803/PragReST

详情

AI中文摘要

自然语言理解通常依赖于隐含而非明确陈述的含义，需要语用推理。尽管大语言模型（LLMs）在数学和逻辑推理上表现强劲，但在进行语用推理时仍存在困难，往往选择字面解释。为了提升LLM的语用推理能力，我们提出了PragReST，一个自监督框架，它构建语用问答数据，生成反事实推理轨迹，并通过监督微调和强化学习训练模型内化这些轨迹，无需人工标注训练数据或从更强的教师模型蒸馏。在四个语用基准（PragMega、Ludwig、MetoQA和AltPrag）上，PragReST相比骨干模型、任务特定的语用微调基线以及同一流水线的非反事实变体均有提升。在基于准确率的基准上，PragReST在Qwen3-8B和Qwen3-14B上分别比指令骨干模型提升了5.37%和5.50%（绝对值）。我们的错误分析和消融实验强调了反事实推理的重要性：PragReST主要减少了因未能将观察到的话语与合理的替代方案进行对比而导致的错误，而去除反事实推理会显著降低性能。此外，我们的训练保留了对通用知识和数学推理基准的域外性能。

英文摘要

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

URL PDF HTML ☆

赞 0 踩 0