Large Model Reasoning

2606.19257 2026-06-18 cs.CL New submission 90%

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

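Affiliations: The University of Hong Kong; Peking University

Topic match: mathematical reasoning (block diffusion language models for long chain-of-thought reasoning)

AI summary: Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long chain-of-thought reasoning. DreamReasoner-8B reaches results comparable to Qwen3-8B on math and code reasoning.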

Abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
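
The fine-to-coarse curriculum described in this abstract can be illustrated with a small schedule function. This is only a sketch: the abstract does not give the actual block sizes, stage lengths, or transition rule, so the sizes `(4, 8, 16, 32)`, the equal-length stages, and both function names are assumptions made for illustration.

```python
def block_size_schedule(step, total_steps, block_sizes=(4, 8, 16, 32)):
    """Return the training block size to use at a given step.

    Assumption: training is split into equal-length stages, one per block
    size, ordered from fine-grained (small) to coarse-grained (large).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = step * len(block_sizes) // total_steps
    return block_sizes[stage]


def split_into_blocks(tokens, block_size):
    """Partition a token sequence into consecutive blocks of at most block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
```

With these hypothetical defaults, early steps denoise 4-token blocks and the final quarter of training denoises 32-token blocks.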

2606.19236 2026-06-18 cs.LG cs.AI cs.CL New submission 90%

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

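Affiliations: Shenzhen International Graduate School, Tsinghua University; Tencent Hunyuan

Topic match: mathematical reasoning (a policy-entropy stabilization method for GRPO that improves reasoning)

AI summary: Targets policy entropy collapse in RL algorithms such as GRPO. STARE identifies entropy-critical tokens via surprisal quantiles, reweights their advantages, and adds a target-entropy closed-loop gate to stabilize entropy. Training remains stable across 1.5B to 32B models and several task families, with AIME24/25 accuracy gains of 4%-8%.

Comments: LLM, Reinforcement Learning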

Abstract

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
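
The three ingredients named in this abstract (surprisal quantiles, selective advantage reweighting, and a target-entropy gate) can be sketched roughly as follows. The abstract does not specify which tokens are scaled, in which direction, or how the gate behaves, so the upper-quantile selection, the `scale` factor, and the simple on/off gate are all assumptions; the released code at the URL above is the authoritative version.

```python
import math


def quantile(values, q):
    """Nearest-rank empirical quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def reweight_advantages(token_logprobs, advantage, entropy, target_entropy,
                        q=0.8, scale=1.5):
    """Return one effective advantage per token.

    Assumptions (not from the paper): tokens whose surprisal is at or above
    the batch quantile q are treated as entropy-critical, their advantage is
    multiplied by `scale`, and the gate is open only while the measured
    policy entropy is below the target.
    """
    surprisals = [-lp for lp in token_logprobs]  # surprisal = -log p(token)
    threshold = quantile(surprisals, q)
    gate_open = entropy < target_entropy
    return [
        advantage * (scale if gate_open and s >= threshold else 1.0)
        for s in surprisals
    ]
```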

2606.18844 2026-06-18 cs.LG New submission 85%

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang

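Affiliations: Qwen Business Unit of Alibaba; Tsinghua University; Peking University

Topic match: mathematical reasoning (self-distillation for math reasoning via contrastive trajectory learning)

AI summary: Proposes TAPO, which contrasts correct and incorrect rollouts to construct micro-reflective corrections, moving self-distillation from implicit distributional alignment to explicit trajectory construction. It outperforms GRPO on several math reasoning benchmarks.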

Abstract

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.
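
The shape of a micro-reflective correction (erroneous prefix, then a diagnosis, then corrected reasoning) can be shown with a toy assembly function. How TAPO locates the failure point and generates the diagnosis is not described in the abstract, so `failure_index`, `diagnosis`, and the wording of the reflection line are hypothetical inputs here, not the authors' procedure.

```python
def build_micro_reflective_trajectory(incorrect_steps, failure_index,
                                      diagnosis, corrected_steps):
    """Assemble a new training trajectory from an incorrect rollout.

    Keeps the learner's own steps up to and including the failing step,
    then appends a natural-language diagnosis and the corrected
    continuation (assumed to be guided by a correct rollout from the
    same sampling group).
    """
    if not 0 <= failure_index < len(incorrect_steps):
        raise ValueError("failure_index must point at a step of the incorrect rollout")
    prefix = incorrect_steps[:failure_index + 1]
    reflection = f"Wait, that step is wrong: {diagnosis}"
    return prefix + [reflection] + corrected_steps
```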

2606.18453 2026-06-18 cs.CL New submission 85%

LLM Parameters for Math Across Languages: Shared or Separate?

Behzad Shomali, Luisa Victor, Tim Selbach, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali, Markus Frey

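Affiliations: Lamarr Institute; University of Bonn; Fraunhofer IAIS

Topic match: mathematical reasoning (mechanistic analysis of cross-lingual mathematical reasoning)

AI summary: A cross-lingual mechanistic analysis finds that math-associated parameters in multilingual LLMs overlap partially across languages, with the overlap concentrated in intermediate layers. English yields the largest parameter set, and lower-resource languages yield smaller ones.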

Comments 5 pages. Accepted at ACL Student Research Workshop (SRW) 2026. Code: https://github.com/luisavictor/math-across-languages Translated Datasets: https://huggingface.co/math-across-languages Webpage: https://math-across-languages.github.io

Abstract

Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.

2603.01221 2026-06-18 cs.MA Updated version 85%

Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

Dan Qiao, Binbin Chen, Fengyu Cai, Jianlong Chen, Wenhao Li, Fuxin Jiang, Zuzhi Chen, Hongyuan Zha, Tieying Zhang, Baoxiang Wang

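Topic match: mathematical reasoning (uncertainty decomposition for math reasoning in multi-agent debate)

AI summary: Introduces a Bayesian uncertainty analysis framework that decomposes predictive uncertainty in multi-agent debate into epistemic and aleatoric components, and designs an uncertainty-guided multi-agent reinforcement learning algorithm that raises epistemic gain while controlling aleatoric cost, improving reasoning accuracy and debate efficiency.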

Comments ICML2026

Abstract

Multi-Agent Debate (MAD) has shown promise in improving reasoning and reducing hallucinations, yet it remains unclear how information exchange shapes individual reasoning behavior. Empirically, MAD exhibits paradoxical phenomena, including rising accuracy with increasing token entropy and marked differences between homogeneous and heterogeneous agent combinations. In this paper, we introduce a Bayesian uncertainty analysis framework for MAD, which decomposes answer-level predictive uncertainty into epistemic uncertainty and aleatoric uncertainty, corresponding to the potential gain and cost of debate. Across multiple agent configurations, we find that effective debate depends on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning algorithm that encourages lower aleatoric cost and more effective epistemic information utilization. Experiments show that our approach simultaneously enhances each agent's accuracy and promotes a more productive debate process, providing an operational Bayesian perspective for understanding and improving MAD.
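
One common way to split answer-level predictive uncertainty into epistemic and aleatoric parts is the entropy-based decomposition: total entropy of the averaged distribution equals the average per-sample entropy (aleatoric) plus the remaining disagreement term (epistemic). Whether the paper uses exactly this estimator is an assumption; the sketch below only illustrates the general idea, with each inner list standing for one sampled answer distribution.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def decompose_uncertainty(answer_dists):
    """Split predictive uncertainty into (total, aleatoric, epistemic).

    answer_dists: list of distributions over the same answer set, e.g. one
    per agent or per posterior sample (a hypothetical input format).
    """
    n = len(answer_dists)
    k = len(answer_dists[0])
    mean_dist = [sum(d[j] for d in answer_dists) / n for j in range(k)]
    total = entropy(mean_dist)                           # entropy of the mixture
    aleatoric = sum(entropy(d) for d in answer_dists) / n  # expected entropy
    epistemic = total - aleatoric                        # disagreement (mutual information)
    return total, aleatoric, epistemic
```

Two confident agents that disagree give purely epistemic uncertainty, while two identical but unsure agents give purely aleatoric uncertainty.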

2505.23851 2026-06-18 cs.CL cs.AI cs.SC Updated version 85%

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Michael Shalyt, Rotem Elimelech, Ido Kaminer

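Affiliations: MIT; Technion - Israel Institute of Technology

Topic match: mathematical reasoning (a benchmark evaluating the robustness of LLM symbolic mathematical reasoning)

AI summary: Presents the ASyMOB benchmark of 35,368 symbolic math problems. Perturbation tests reveal that LLMs lack robustness in symbolic mathematical reasoning, and the results point to complementary strengths of LLMs and CAS.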

Comments Published in ICML2026: https://icml.cc/virtual/2026/poster/63549 Code repository: https://github.com/RamanujanMachine/ASyMOB Complete benchmark dataset: https://huggingface.co/datasets/Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark

Abstract

Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, ASyMOB systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. ASyMOB serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.

2605.03460 2026-06-18 cs.AI cs.LG Updated version 80%

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

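Affiliations: LG AI Research

Topic match: mathematical reasoning (financial time series reasoning involving mathematical reasoning and chain-of-thought)

AI summary: To address the failure of time series reasoning models in finance, proposes FinSTaR, built on a 2 x 2 capability taxonomy. Using Compute-in-CoT and Scenario-Aware CoT strategies, it reaches 78.9% average accuracy on FinTSR-Bench.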

Comments KDD Workshop on SciSoc Agents & LLMs 2026 (Oral Presentation)

Abstract

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

2606.18810 2026-06-18 cs.LG cs.AI New submission 75%

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

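Affiliations: Beijing Institute of Technology; Beihang University; Independent Researcher

Topic match: mathematical reasoning (improved reasoning on math, code, and agentic tasks)

AI summary: Proposes SC-GRPO, which uses the KL divergence between self-conditioned distributions as a multiplicative weight on GRPO gradients for fine-grained credit assignment, improving on GRPO by 8.1% across math, code, and agentic tasks.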

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
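
The core idea here, a per-token KL divergence between the original and self-conditioned next-token distributions used as a multiplicative weight, can be sketched on plain lists. The KL direction, the normalization to mean 1, and the function names are assumptions for illustration; the abstract does not fix these details.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def token_weights(original_dists, conditioned_dists):
    """One weight per token position, proportional to the per-token KL.

    Assumptions: KL is taken from the conditioned to the original
    distribution, and weights are rescaled to have mean 1 so that the
    overall gradient scale stays comparable to unweighted GRPO.
    """
    kls = [kl_divergence(c, o) for o, c in zip(original_dists, conditioned_dists)]
    mean_kl = sum(kls) / len(kls)
    if mean_kl == 0:
        return [1.0] * len(kls)  # no signal: fall back to uniform credit
    return [k / mean_kl for k in kls]


def weighted_token_advantages(advantage, weights):
    """Apply the weights to a trajectory-level advantage, token by token."""
    return [advantage * w for w in weights]
```

Positions where conditioning on a verified trajectory changes the prediction get more credit; positions where it changes nothing get none.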

2606.18910 2026-06-18 cs.LG cs.CL New submission 90%

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

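Affiliations: Northwestern University; Amazon AGI; Qualcomm AI Research; University of Minnesota

Topic match: test-time compute (revision- and verification-augmented training for test-time scaling)

AI summary: Proposes the REVES framework, which converts "near-miss" answers from intermediate steps into decoupled revision and verification prompts, enabling efficient off-policy data generation and improving multi-step reasoning in LLMs. It scores 6.5 points above the RL baseline on LiveCodeBench.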

Abstract

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps, which the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
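
The prompt-augmentation step can be pictured as turning one successful recovery trajectory into two kinds of single-turn prompts. The prompt wording, the dictionary layout, and the `(answer, is_correct)` input format below are all hypothetical; the abstract states only that near-miss answers become decoupled revision and verification prompts.

```python
def build_training_prompts(problem, attempts):
    """Convert a recovery trajectory into revision and verification prompts.

    attempts: list of (answer, is_correct) pairs in the order produced,
    ending with a correct answer (a "successful recovery").
    Returns (revision_prompts, verification_prompts).
    """
    if not attempts or not attempts[-1][1]:
        raise ValueError("expected a trajectory that ends in a correct answer")

    # Revision: each incorrect intermediate answer becomes a starting point to fix.
    revision = [
        f"Problem: {problem}\nPrevious answer: {answer}\nRevise the previous answer."
        for answer, is_correct in attempts
        if not is_correct
    ]

    # Verification: every answer becomes a judgment task with a known label.
    verification = [
        {
            "prompt": f"Problem: {problem}\nAnswer: {answer}\nIs this answer correct?",
            "label": is_correct,
        }
        for answer, is_correct in attempts
    ]
    return revision, verification
```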

2606.11918 2026-06-18 cs.AI New submission 90%

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas

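Affiliations: The University of California, Berkeley; ETH Zurich; University of Oxford; Stanford University

Topic match: planning and reasoning (self-supervised reinforcement learning to improve spatial reasoning)

AI summary: Proposes a self-supervised RL framework that uses geometric and semantic consistency verifiers (such as image flips and swapping object order in the text) to align the spatial reasoning already present in pretrained models, approaching the accuracy of supervised methods without labeled data.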

Abstract

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
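
A consistency verifier of the kind described here needs no ground-truth label: it only checks that two answers agree with each other under a known transformation. The sketch below assumes a left/right question; the answer vocabulary and the binary 0/1 reward are illustrative choices, and OT-GRPO itself is not reproduced.

```python
OPPOSITE = {"left": "right", "right": "left"}


def flip_consistency_reward(answer_original, answer_flipped):
    """Reward for an image-flip pair.

    If object A is left of B in an image, it must be right of B in the
    horizontally mirrored image, so consistent answers are opposites.
    """
    expected = OPPOSITE.get(answer_original)
    return 1.0 if expected is not None and answer_flipped == expected else 0.0


def swap_consistency_reward(answer_a_vs_b, answer_b_vs_a):
    """Reward for a text-swap pair on the same image.

    "Where is A relative to B?" and "Where is B relative to A?" must also
    receive opposite answers.
    """
    expected = OPPOSITE.get(answer_a_vs_b)
    return 1.0 if expected is not None and answer_b_vs_a == expected else 0.0
```

Note that a pair can be consistent and still wrong, which is why such rewards supervise coherence rather than correctness directly.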

2606.18686 2026-06-18 cs.AI cs.CL cs.LG New submission 85%

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

Jaeho Lee, Nick Merrill, Ezra Karger

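Affiliations: Forecasting Research Institute

Topic match: planning and reasoning (a simulated-world forecasting benchmark for evaluating probabilistic reasoning)

AI summary: Presents ForecastBench-Sim, a forecasting benchmark built on Freeciv game simulation. Game rollouts generate controlled, immediately resolvable forecasting questions for evaluating the probabilistic reasoning of AI systems.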

Comments 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

Abstract

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.
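
Once the simulation has been continued and a binary question has resolved, a forecast can be scored with a proper scoring rule. The abstract does not name the benchmark's scoring protocol, so the Brier score below is an assumption, chosen only because it is a standard choice for binary probabilistic forecasts.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and always answering 0.5
    scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
```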

2605.29649 2026-06-18 cs.AI Updated version 85%

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Elliot Gestrin, Jendrik Seipp

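Topic match: planning and reasoning (LLM-evolved domain-independent heuristics for symbolic planning)

AI summary: Uses evolutionary search to have an LLM generate domain-independent heuristics that beat the best hand-engineered heuristics on unseen test domains, and systematically evaluates the informedness-speed tradeoff of heuristics, which the authors say had not been done before.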

Comments Accepted at the LM4Plan workshop at ICAPS 2026

Abstract

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed, and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.
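
A MAP-Elites archive keeps the best candidate per cell of a discretized feature space, here informedness and speed. The sketch below assumes both features are normalized to [0, 1], uses a hypothetical 4 x 4 grid, and blends coverage with solving time using an arbitrary weight `alpha`; none of these specifics come from the paper.

```python
class MapElitesArchive:
    """Grid archive that stores the fittest candidate per (informedness, speed) cell."""

    def __init__(self, bins=4):
        self.bins = bins
        self.cells = {}  # (informedness_bin, speed_bin) -> (candidate, fitness)

    def _bin(self, value):
        """Map a feature value in [0, 1] to a bin index, clipping out-of-range input."""
        clipped = min(max(value, 0.0), 1.0)
        return min(int(clipped * self.bins), self.bins - 1)

    def add(self, candidate, informedness, speed, fitness):
        """Insert the candidate if its cell is empty or it beats the incumbent."""
        key = (self._bin(informedness), self._bin(speed))
        incumbent = self.cells.get(key)
        if incumbent is None or fitness > incumbent[1]:
            self.cells[key] = (candidate, fitness)
            return True
        return False


def blended_fitness(coverage, solve_time, time_limit, alpha=0.9):
    """Blend coverage (fraction of tasks solved) with a time score in [0, 1]."""
    time_score = 1.0 - min(solve_time, time_limit) / time_limit
    return alpha * coverage + (1.0 - alpha) * time_score
```

Keeping one elite per cell is what lets slow-but-informed and fast-but-weak heuristics survive side by side as parents for later mutations.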

2606.18543 2026-06-18 cs.AI cs.CL cs.SE New submission 80%

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

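Affiliations: Princeton University

Topic match: planning and reasoning (decision-making in long-horizon, uncertain environments)

AI summary: Presents CEO-Bench, which simulates operating a startup for 500 days to evaluate the combined decision-making ability of language model agents in long-horizon, uncertain, dynamic environments.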

Abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.19328 2026-06-18 cs.LG cs.AI cs.RO New submission 70%

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

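Affiliations: Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto

Topic match: planning and reasoning (uncertainty-balanced preference planning)

AI summary: Proposes UBP2, which actively directs exploration by jointly reasoning over uncertainty in the reward, dynamics, and value functions, substantially improving sample efficiency on the Meta-World benchmark.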

Abstract

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
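
The unified score described here can be approximated with ensemble statistics: the mean predicted return (reward plus terminal value) stands for exploitation, and disagreement across ensemble members stands for epistemic uncertainty. The additive form, the use of standard deviation, and the weight `beta` are assumptions; the paper's actual objective may differ.

```python
import statistics


def trajectory_score(reward_preds, terminal_value_preds, beta=1.0):
    """Score one candidate trajectory from per-member ensemble predictions.

    reward_preds[i]: total predicted reward along the trajectory by member i.
    terminal_value_preds[i]: predicted value of the final state by member i.
    """
    returns = [r + v for r, v in zip(reward_preds, terminal_value_preds)]
    expected = statistics.fmean(returns)      # exploitation term
    uncertainty = statistics.pstdev(returns)  # ensemble disagreement
    return expected + beta * uncertainty      # optimistic bonus for uncertain plans


def plan(candidates, beta=1.0):
    """Pick the name of the best-scoring candidate.

    candidates: dict mapping a name to (reward_preds, terminal_value_preds).
    """
    return max(candidates, key=lambda name: trajectory_score(*candidates[name], beta=beta))
```

With `beta=0` the planner is purely greedy; a larger `beta` steers it toward trajectories the ensemble disagrees about.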

2603.09344 2026-06-18 cs.AI stat.ML 版本更新 70%

Robust Regularized Policy Iteration under Transition Uncertainty

转移不确定性下的鲁棒正则化策略迭代

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China（浙江大学计算机科学与技术学院）； School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China（西北工业大学人工智能、光学与电子学院（iOPEN））； School of Software Technology, Zhejiang University, Hangzhou, China（浙江大学软件技术学院）； School of Software Engineering, Xi'an Jiaotong University, Xi'an, China（西安交通大学软件工程学院）； School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China（中山大学系统科学与工程学院）

专题命中规划推理：鲁棒策略迭代用于离线强化学习

AI总结提出鲁棒正则化策略迭代（RRPI），通过将离线强化学习建模为鲁棒策略优化，使用KL正则化替代难解的双层目标，并基于鲁棒正则化贝尔曼算子实现高效策略迭代，理论保证收敛性，实验在D4RL基准上表现优异。

AI中文摘要

离线强化学习（RL）无需在线探索即可实现数据高效且安全的策略学习，但其性能常因分布偏移而下降。学习到的策略可能访问分布外的状态-动作对，其中价值估计和学习到的动态不可靠。为了在统一框架中处理策略引发的外推和转移不确定性，我们将离线RL建模为鲁棒策略优化，将转移核视为不确定性集内的决策变量，并针对最坏情况动态优化策略。我们提出鲁棒正则化策略迭代（RRPI），用可处理的KL正则化替代难解的最大-最小双层目标，并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们提供了理论保证，证明所提出的算子是$\gamma$-压缩算子，且迭代更新替代目标能单调改进原始鲁棒目标并收敛。在D4RL基准上的实验表明，RRPI实现了强大的平均性能，在大多数环境中优于包括基于百分位数方法在内的最新基线，并在其余环境中保持竞争力。此外，RRPI通过将较低的$Q$值与高认知不确定性对齐，展现出鲁棒性能，从而防止策略执行不可靠的分布外动作。

英文摘要

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
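
To make the idea of a KL-regularized robust backup concrete, here is a tabular sketch. It assumes the adversary pays a KL penalty (temperature `tau`) for deviating from a nominal kernel `P0`, in which case the inner minimization has the well-known closed form $-\tau \log \mathbb{E}_{P_0}[\exp(-V/\tau)]$. The paper's actual operator may differ; the function name, the greedy `max` over actions, and all parameters are illustrative assumptions.

```python
import numpy as np

def robust_soft_backup(Q, R, P0, gamma=0.9, tau=1.0):
    """One KL-regularized worst-case Bellman backup on a tabular MDP (sketch).

    Q:  (S, A) current action values.
    R:  (S, A) rewards.
    P0: (S, A, S) nominal transition kernel (assumed given).
    """
    V = Q.max(axis=1)                      # greedy state values, shape (S,)
    z = -V / tau
    m = z.max()                            # shift for a stable log-sum-exp
    # Closed form of min_P { E_P[V] + tau * KL(P || P0) } for every (s, a).
    soft_min = -tau * (m + np.log(P0 @ np.exp(z - m)))   # shape (S, A)
    return R + gamma * soft_min
```

The soft minimum lies between the worst next-state value and the nominal expectation, and the backup is a $\gamma$-contraction in the sup norm, which is consistent with the kind of guarantee the abstract mentions.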

2606.18633 2026-06-18 cs.MA 新提交 60%

PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

PersonalPlan: 面向个性化编程学习的多智能体系统规划

Zhiyuan Wen, Jiannong Cao, Peng Gao, Haochen Shi, Wengpan Kuan, Bo Yuan, Xiuxiu Qi

专题命中规划推理：分层SFT和奖励自适应生成可执行计划

AI总结提出PersonalPlan，一种两阶段多智能体规划器，通过分层SFT和奖励自适应GRPO生成可执行、个性化且具有教学支架的计划，在MAP-PPL数据集上优于现有方法。

AI中文摘要

有效的编程教育需要针对不同学习者背景进行个性化教学。然而，虽然基于LLM的多智能体系统（MAS）擅长复杂规划，但现有规划器通常缺乏轮廓基础（profile-grounding）和教学支架（pedagogical scaffolding），从而削弱了个性化编程学习。为填补这一空白，我们首先引入MAP-PPL（Multi-Agent Plans for Personalized Programming Learning），这是一个基于轮廓的多智能体规划数据集，包含来自1,730个Stack Overflow问题组和2,738个学习者轮廓的3,043个查询-轮廓-计划实例。每个计划指定了智能体、子任务、可执行步骤和先决依赖关系。然后，我们提出PersonalPlan，一个两阶段MAS规划器，首先使用独立的LoRA适配器进行分层SFT，用于轮廓感知的任务分解和步骤依赖规划，然后应用奖励自适应GRPO，鼓励模型生成可执行、个性化且具有教学支架的计划。在MAP-PPL上进行的广泛实验，将PersonalPlan与前沿LLM、通用MAS框架和智能体规划器进行比较，证明了其优越性。仅使用8B和32B变体，PersonalPlan在计划可执行性、个性化和教学质量方面达到了最先进水平，有效协调了MAS进行智能体-学生交互。

英文摘要

Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill this gap, we first introduce MAP-PPL (Multi-Agent Plans for Personalized Programming Learning), a profile-conditioned multi-agent planning dataset with 3,043 query-profile-plan instances from 1,730 Stack Overflow question groups and 2,738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose PersonalPlan, a two-stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile-aware task decomposition and step dependency planning, then applies a Reward-Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP-PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.

2606.18954 2026-06-18 cs.CL 新提交 85%

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO：基于图的推理模型策略优化

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学高瓴人工智能学院）； Ant Group（蚂蚁集团）

专题命中复杂问题求解：基于图的策略优化提高推理模型效率。

AI总结提出GraphPO框架，将推理轨迹建模为有向无环图，通过合并语义等价路径减少冗余探索，并利用边级优势函数提高推理效率，在多个基准上优于链式和树式方法。

AI中文摘要

基于可验证奖励的强化学习（RLVR）已成为增强大型推理模型能力的标准范式。RLVR通常独立采样响应并根据最终答案优化策略。该范式有两个局限性：首先，独立响应常包含相似的中间推理步骤，导致冗余探索和计算浪费；其次，稀疏的最终答案奖励难以识别有用步骤。基于树的方法通过共享前缀并比较同一前缀下的分支来提供细粒度信号，部分解决了这一问题。然而，树分支仍然是独立扩展的。当不同分支达到相似的推理状态时，它们无法共享信息并重复类似的探索。此外，基于树的方法忽略了这种分散性，仅在不同分支内进行局部比较，这可能导致优势估计的方差更高。为了解决这一挑战，我们提出了GraphPO（基于图的策略优化），一种新颖的RL框架，将轨迹表示为有向无环图，其中推理步骤作为边，从推理路径中总结的语义状态作为节点。GraphPO将语义等价的推理路径合并为等价类，允许它们共享后缀，并将预算从冗余扩展重新分配到多样化探索。此外，我们为入边分配效率优势，为出边分配正确性优势，从而在从结果中推导过程监督的同时提高推理效率。理论表明，GraphPO降低了优势估计方差并提高了推理效率。在三个LLM上的推理和智能体搜索基准实验表明，在相同的token预算或响应预算下，GraphPO始终优于基于链和基于树的基线方法。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

2604.28076 2026-06-18 cs.CL cs.AI cs.LG 版本更新 85%

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

TopBench：表格问答中隐式预测推理的基准

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

发表机构 * School of Artificial Intelligence, Nanjing University, China（南京大学人工智能学院，中国）； National Key Laboratory for Novel Software Technology, Nanjing University, China（南京大学计算机软件新技术国家重点实验室，中国）

专题命中复杂问题求解：表格问答中隐式预测推理的基准

AI总结提出TopBench基准，包含779个样本和四个子任务，评估大语言模型在表格问答中识别隐式预测意图并进行可靠推理的能力，发现当前模型在意图识别上存在困难。

AI中文摘要

大型语言模型（LLM）推动了表格问答的发展，其中大多数查询可以通过提取信息或简单聚合来回答。然而，一类常见的现实世界查询是隐式预测性的，需要从历史模式中推断未观察到的答案，而不仅仅是检索。这些查询带来了两个挑战：识别潜在意图和对大规模表格进行可靠的预测推理。为了评估LLM在带有隐式预测任务的表格问答中的表现，我们引入了TopBench，一个包含779个样本的基准，涵盖四个子任务，从单点预测到决策制定、处理效应分析和复杂过滤，要求模型生成涵盖推理文本和结构化表格的输出。我们在基于文本和代理工作流下评估了多种模型。实验表明，当前模型通常在意图识别上存在困难，默认进行查找。更深入的分析发现，准确的意图消歧是引导这些预测行为的前提。此外，提升预测精度的上限需要整合更复杂的建模或推理能力。

英文摘要

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and performing reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to simple lookups. Deeper analysis shows that accurate intent disambiguation is a prerequisite for eliciting these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

2606.18831 2026-06-18 cs.CL cs.AI 新提交 80%

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

超越奖励工程：长上下文强化学习的数据配方

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

发表机构 * OpenBMB ； Tsinghua University（清华大学）

专题命中复杂问题求解：提升长上下文推理，涉及检索、多证据合成和推理任务

AI总结提出一种简单有效的数据配方，结合最小化基于结果的GRPO设置，显著提升大语言模型的长上下文推理能力，在多个基准和智能体任务上取得平均+3.2至+7.2点的提升。

Comments 15 pages, 6 figures, 12 tables

AI中文摘要

长上下文推理是大语言模型的一项关键能力，特别是当它们作为必须推理长轨迹的自主智能体部署时。强化学习最近成为提升这一能力的主要范式，然而现有工作主要关注奖励工程，而多样化的训练数据仍然稀缺。我们从数据为中心的角度重新审视这个问题，并表明仅凭一种简单有效的数据配方，结合最小化基于结果的GRPO设置，就足以显著提升长上下文推理。我们的配方针对三个互补的任务族——检索、多证据合成和推理——我们构建并整理了八个数据集，总计约1.4万个示例。在三个模型（Qwen3-4B/8B/30B-A3B）上的实验在七个长上下文基准上取得了平均+7.2/+3.2/+6.4分的提升，超过了之前的强化学习训练集。我们进一步证明这些增益可以迁移到智能体任务中，在基于智能体调整的模型上继续使用我们的数据配方进行强化学习训练，GAIA提升+4.8分，BrowseComp提升+7.0分。我们将发布我们的数据集以促进未来研究。

英文摘要

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families (retrieval, multi-evidence synthesis, and reasoning), for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.

2509.22363 2026-06-18 cs.LG eess.AS 版本更新 70%

Investigating Faithfulness in Large Audio Language Models

大型音频语言模型中的忠实性研究

Pooneh Mousavi, Lovenya Jain, Mirco Ravanelli, Cem Subakan

发表机构 * Concordia University（康科迪亚大学）； Mila - Quebec AI Institute（魁北克人工智能研究院）； Université Laval（拉瓦尔大学）； Birla Institute of Technology and Science, Pilani（博拉理工学院皮拉尼校区）

专题命中复杂问题求解：研究链式推理的忠实性，涉及推理评估

AI总结提出系统框架评估大型音频语言模型在推理链忠实性上的表现，定义三个音频忠实性标准，并通过基准测试发现模型推理与音频输入存在脱节。

Comments Accepted to Interspeech 2026

AI中文摘要

大型音频语言模型（LALMs）将音频编码器与预训练的大型语言模型集成，以执行复杂的多模态推理任务。虽然这些模型可以生成思维链（CoT）解释，但这些推理链的忠实性仍不清楚。在这项工作中，我们提出了一个系统框架来评估LALMs中CoT在输入音频和最终模型预测方面的忠实性。我们定义了音频忠实性的三个标准：无幻觉、整体性和专注聆听。我们还引入了一个基于音频和CoT干预的基准来评估忠实性（基准测试界面和评估结果可在 https://poonehmousavi.github.io/faithfulness/ 获取）。在Audio Flamingo 3和Qwen2.5-Omni上的实验表明存在潜在的多模态脱节：推理通常与最终预测一致，但并不总是强烈基于音频，并且可能容易受到幻觉或对抗性扰动的影响。

英文摘要

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness (the benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/). Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

2603.05128 2026-06-18 eess.AS cs.SD 版本更新 70%

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

PolyBench：多声部音频中组合推理的基准测试

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

发表机构 * Harbin University of Science and Technology（哈尔滨理工大学）； The University of Melbourne（墨尔本大学）； KAIST（韩国科学技术院）； University of Surrey（萨里大学）

专题命中复杂问题求解：评估音频大模型的组合推理能力

AI总结针对多声部音频中组合推理评估缺失的问题，提出PolyBench基准，包含计数、分类、检测、并发和时长估计五个子集，评估发现现有大音频语言模型在多声部场景下性能持续下降。

Comments Accepted by INTERSPEECH 2026

2503.01805 2026-06-18 cs.LG cs.AI cs.CL 版本更新 70%

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

图任务算法推理中Transformer的深度-宽度权衡

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

发表机构 * Courant Institute of Mathematical Sciences, New York University（纽约大学库朗数学科学研究所）； Google Research（谷歌研究院）； Meta AI ； Bar-Ilan University（巴伊兰大学）； Department of Bio-Medical Engineering, Edmond J. Safra Center for Bioinformatics, Tel-Aviv University（特拉维夫大学生物医学工程系，埃德蒙·J·萨夫拉生物信息学中心）； Tel Aviv University（特拉维夫大学）

专题命中复杂问题求解：研究Transformer在图算法任务中的推理能力。

AI总结研究Transformer在图算法任务中深度与宽度的权衡，发现线性宽度下常数深度足以解决许多图问题，而某些问题需要二次宽度，实验验证了宽模型在保持精度的同时训练和推理更快。

Comments Updated ISF grant number

AI中文摘要

Transformer已经彻底改变了机器学习领域。特别是，它们可用于解决复杂的算法问题，包括基于图的任务。在此类算法任务中，一个关键问题是能够实现该任务的Transformer的最小尺寸是多少。最近的工作开始探索图任务的这个问题，表明对于次线性嵌入维度（即模型宽度），对数深度就足够了。然而，我们在这里解决的一个开放问题是，如果允许宽度线性增长而深度保持固定，会发生什么。我们分析了这种情况，并得出了一个令人惊讶的结果：在线性宽度下，常数深度足以解决一系列基于图的问题。这表明宽度的适度增加可以允许更浅的模型，这在推理和训练时间方面是有利的。对于其他问题，我们表明需要二次宽度。我们的结果展示了Transformer实现图算法的复杂而有趣的格局。我们通过实验研究了深度和宽度相对能力之间的这些权衡，并发现宽模型在具有与深模型相同准确度的任务中，由于可并行化的硬件，训练和推理时间更快。

英文摘要

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks, a key question is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly while depth is kept fixed. We analyze this setting and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and training time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster training and inference time due to parallelizable hardware.

2601.17226 2026-06-18 cs.CL cs.AI 版本更新 70%

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

复述、奖励、重复：面向叙事理论启发的故事复述的强化学习

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

发表机构 * University of New South Wales（新南威尔士大学）

专题命中复杂问题求解：提升故事复述的逻辑性和合理性

AI总结提出RRR强化学习框架，结合结构主义叙事学与标量叙事性，通过d-RLAIF从文本特征中获取训练信号，无需参考输出，提升LLM故事复述的逻辑性、合理性和完整性。

Comments 8 Pages, 7 figures

AI中文摘要

反事实故事复述暴露了LLM在受限叙事解空间中的缺陷，此时它们无法依赖回忆记忆的训练数据。基于真实值的后训练（如SFT）无法教会LLM生成逻辑合理的叙事事件。本文提出Retell, Reward, Repeat (RRR)，一个基于强化学习的流水线，将结构主义叙事学与标量叙事性相结合，以教授故事结构。我们扩展了TimeTravel数据集，加入人工标注的叙事平衡阶段，以评估奖励模型。通过d-RLAIF，RRR从文本特征的叙事性中推导训练信号，无需参考输出。评估表明，RRR训练的LLM在逻辑性、合理性和完整性上优于少样本和SFT基线，输出质量还通过人类盲评偏好得到验证。RRR仅依赖小型的纯查询数据集，为故事讲述这一目前缺乏有效后训练方法的领域提供了一种基于语言学、成本效益高的后训练机制。RRR强调了将既定语言学理论整合到当代NLP中的持续相关性。

英文摘要

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling, a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

2606.19185 2026-06-18 cs.LG 新提交 60%

AGDN: Learning to Solve Traveling Salesman Problem with Anisotropic Graph Diffusion Network

AGDN：利用各向异性图扩散网络学习求解旅行商问题

Bolin Shen, Ziwei Huang, Zhiguang Cao, Yushun Dong

发表机构 * Florida State University（佛罗里达州立大学）； Singapore Management University（新加坡管理大学）

专题命中复杂问题求解：图神经网络求解TSP，属于组合优化推理

AI总结提出各向异性图扩散网络（AGDN），通过MixScore转移矩阵和各向异性扩散策略，有效利用图结构信息求解旅行商问题，在多种实例规模和分布上优于现有方法。

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

DOI: 10.1145/3770855.3817789

AI中文摘要

旅行商问题（TSP）是组合优化的基石，出现在许多实际场景中。尽管基于图的学习方法已被探索用于TSP，但如何更有效地利用图结构的问题仍然悬而未决。我们提出了各向异性图扩散网络（AGDN），一种新的图神经网络框架，旨在求解TSP。我们的方法解决了两个核心难点：（1）完全连接TSP图中缺乏信息丰富的拓扑先验，以及（2）在常用的图稀疏化技术后，最优解中丢失连接节点。为了克服这些问题，我们构建了一个MixScore转移矩阵，将节点相似性与成对距离相结合，并开发了一种各向异性图扩散策略，支持跨多跳的高效信息交换。涵盖不同实例规模和节点分布的全面实验表明，AGDN在保持计算时间竞争力的同时，始终优于现有方法。此外，AGDN能够很好地泛化到训练期间未见的问题规模和分布。实现代码已公开在：https://github.com/LabRAI/AGDN。

英文摘要

The Traveling Salesman Problem (TSP) is a cornerstone of combinatorial optimization and arises in many practical scenarios. Although graph-based learning approaches have been explored for TSP, the question of how to exploit graph structure more effectively remains open. We present the Anisotropic Graph Diffusion Network (AGDN), a new Graph Neural Network framework designed to solve TSP. Our method tackles two central difficulties: (1) the lack of informative topological priors in fully connected TSP graphs, and (2) the loss of nodes connected in the optimal solution after commonly used graph sparsification techniques are applied. To overcome these issues, we construct a MixScore transition matrix that merges node similarity with pairwise distance, and we develop an anisotropic graph diffusion strategy that supports efficient information exchange across multiple hops. Comprehensive experiments spanning diverse instance sizes and node distributions show that AGDN consistently outperforms existing methods while keeping computation time competitive. Furthermore, AGDN generalizes well to problem sizes and distributions beyond those seen during training. The implementation is publicly available at: https://github.com/LabRAI/AGDN.
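
As a rough illustration of a transition matrix that mixes node similarity with pairwise distance and is then used for multi-hop diffusion, consider the sketch below. The mixing formula, the `alpha` weight, the cosine-similarity and `exp(-distance)` terms, and both function names are assumptions made for this example; the authors' MixScore definition should be taken from their repository.

```python
import numpy as np

def mixscore_transition(coords, feats, alpha=0.5):
    """Row-stochastic matrix mixing feature similarity and closeness (sketch).

    coords: (n, 2) city coordinates.
    feats:  (n, d) node features with non-zero norm (hypothetical inputs).
    alpha:  assumed weight between similarity and distance terms.
    """
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit @ unit.T                               # cosine similarity in [-1, 1]
    score = alpha * (sim + 1.0) / 2.0 + (1.0 - alpha) * np.exp(-dist)
    np.fill_diagonal(score, 0.0)                      # no self-transitions
    return score / score.sum(axis=1, keepdims=True)   # rows sum to 1

def diffuse(P, feats, hops=3):
    """Average node features over 0..hops steps of diffusion with P."""
    out, h = feats.copy(), feats
    for _ in range(hops):
        h = P @ h                                     # one more hop of propagation
        out = out + h
    return out / (hops + 1)
```

Because the rows are normalized independently, the matrix is generally asymmetric, so information flows with direction-dependent weights even though the underlying TSP graph is fully connected.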

2606.18557 2026-06-18 cs.AI cs.LG cs.LO 新提交 85%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb：基础模型中可废止溯因的可验证基准

Patrick Cooper, Alvaro Velasquez

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）

专题命中逻辑推理：测试逻辑推理和理论推理能力

AI总结提出DeFAb基准，通过将知识库转换为可验证的溯因实例，评估基础模型在可废止推理中的创造力与理论推理能力，发现前沿模型准确率远低于符号求解器。

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

AI中文摘要

一个基于规则的逻辑求解器在不到50微秒内以100%的准确率解决了我们基准中的每个实例；而最佳前沿语言模型最高仅达65%，在渲染鲁棒评估（四种表面渲染中的最坏情况）下降至23.5%。我们引入DeFAb（可废止溯因基准），这是一个数据集和生成流水线，将四十年来公共资助的知识库转换为有形式化依据的可废止溯因实例：构建假设，通过覆盖默认规则来解释异常，同时保留无关的预期。由于每个假设必须通过有效推导、保守性和最小性的多项式时间检查，DeFAb将逻辑严谨性作为衡量创造性和理论推理的工具，评分的是理论修正的规范构建，而非流畅但破坏理论的散文。该流水线将分类层次结构（OpenCyc、YAGO、Wikidata）与行为属性图（ConceptNet、UMLS）配对，从18个来源生成372,648+个实例，涉及33.75M条实例化规则，分为三个级别，并具有多项式时间可验证的金标准。四个前沿模型未能可靠内化可废止推理：渲染鲁棒的Level 2准确率为7.8-23.5%；思维链方差（约36个百分点）超过任何模型间差距；匹配的污染控制隔离出+19.4个百分点的Level 3差距。我们进一步发布了DeFAb-Hard（235个实例的Level 3难度变体；最佳模型53.3%，符号求解器100%）和CONJURE（一个经内核验证的变革性创造力变体，包含560个Lean 4/Mathlib实例，其金答案是证明内核先前未包含的定义，验证器无需评判模型；试点发现零新概念）。同一验证器还可作为偏好优化（DPO、RLVR/GRPO）的精确奖励。基于MIT许可发布于 https://huggingface.co/datasets/PatrickAllenCooper/DeFAb。

英文摘要

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

2606.15633 2026-06-18 cs.LG 新提交 85%

Formalizing and Mitigating Structural Distortion in LLM Attention for Graph Reasoning

形式化并缓解大语言模型注意力中的结构失真以用于图推理

Donald Loveland, Puja Trivedi, Ari Weinstein, Edward W Huang, Danai Koutra

发表机构 * University of Michigan（密歇根大学）； Amazon（亚马逊）

专题命中逻辑推理：图推理中的结构失真缓解，提升LLM推理

AI总结本文形式化了大语言模型处理文本属性图时因图线性化导致的结构失真机制，并提出轻量级推理时修改方法GaLA，通过校正注意力偏差提升零样本图推理性能。

Comments Accepted to KDD 2026

AI中文摘要

大语言模型（LLM）在文本属性图（TAG）推理中展现出潜力。然而，将LLM应用于图需要将其结构线性化为序列，这引入了根源于图带宽问题的失真。虽然这种失真已被证明会降低性能，但通常归因于提示设计或模型规模，其潜在机制尚不清楚。在这项工作中，我们展示了旋转位置嵌入如何将图线性化转变为带宽相关的注意力衰减，抑制了序列化序列中被强制分隔开的图相邻节点之间的注意力。这将基于LLM的图推理的焦点从提示工程和规模缩放转向纠正注意力错位。受此分析启发，我们提出了图对齐语言注意力（GaLA），一种轻量级的、在推理时修改LLM的方法。GaLA将注意力偏向图相邻节点，同时保留LLM的序列归纳偏差。在TAG基准测试中，GaLA以可忽略的开销提升了性能，表明失真是基于LLM的图推理中可纠正的瓶颈。

英文摘要

Large Language Models (LLMs) have shown promise for reasoning over Text-Attributed Graphs (TAGs). However, applying LLMs to graphs requires linearizing their structure into sequences, introducing distortion rooted in the graph bandwidth problem. While this distortion has been shown to degrade performance, it is often attributed to prompt design or model scale, leaving the underlying mechanism unclear. In this work, we show how rotary positional embeddings turn graph linearization into bandwidth-dependent attention decay, suppressing attention between graph-adjacent nodes that are forced far apart in the serialized sequence. This shifts the focus of LLM-based graph reasoning from prompt engineering and scaling toward correcting attention misalignment. Motivated by this analysis, we propose Graph-aligned Language Attention (GaLA), a lightweight, inference-time modification for LLMs. GaLA biases attention toward graph-adjacent nodes while preserving the LLM's sequential inductive biases. Across TAG benchmarks, GaLA improves performance with negligible overhead, demonstrating that distortion is a correctable bottleneck in LLM-based graph reasoning.
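
One simple way to picture "biasing attention toward graph-adjacent nodes" is an additive bonus on the attention logits before the softmax. The sketch below is only an assumed illustration with one token per node: the additive form, the `bias` value, and the function names are hypothetical and are not claimed to match GaLA's actual mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_biased_attention(logits, adjacency, bias=1.0):
    """Attention weights with an additive bonus for graph-adjacent pairs.

    logits:    (n, n) raw attention scores between node tokens.
    adjacency: (n, n) 0/1 matrix, 1 where two nodes share a graph edge.
    bias:      assumed strength of the graph bonus; 0 recovers plain attention.
    """
    return softmax(logits + bias * adjacency, axis=-1)
```

In a toy setting where logits decay with sequence distance, two neighbors that were serialized far apart regain attention mass once the bonus is applied, while non-adjacent pairs keep their original relative ordering.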

2606.18624 2026-06-18 cs.CL 新提交 80%

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST：用于语用语言理解的自我强化反事实推理

Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）

专题命中逻辑推理：自我强化反事实推理提升语用语言理解

AI总结提出PragReST框架，通过自监督构建语用问答数据、生成反事实推理轨迹，结合监督微调和强化学习提升大语言模型的语用推理能力，在四个基准上显著优于基线模型。

Comments First two authors contributed equally. Code and models: https://github.com/jihyung803/PragReST

AI中文摘要

自然语言理解通常依赖于隐含而非明确陈述的含义，需要语用推理。尽管大语言模型（LLMs）在数学和逻辑推理上表现强劲，但在进行语用推理时仍存在困难，往往选择字面解释。为了提升LLM的语用推理能力，我们提出了PragReST，一个自监督框架，它构建语用问答数据，生成反事实推理轨迹，并通过监督微调和强化学习训练模型内化这些轨迹，无需人工标注训练数据或从更强的教师模型蒸馏。在四个语用基准（PragMega、Ludwig、MetoQA和AltPrag）上，PragReST相比骨干模型、任务特定的语用微调基线以及同一流水线的非反事实变体均有提升。在基于准确率的基准上，PragReST在Qwen3-8B和Qwen3-14B上分别比指令骨干模型提升了5.37%和5.50%（绝对值）。我们的错误分析和消融实验强调了反事实推理的重要性：PragReST主要减少了因未能将观察到的话语与合理的替代方案进行对比而导致的错误，而去除反事实推理会显著降低性能。此外，我们的训练保留了对通用知识和数学推理基准的域外性能。

英文摘要

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37% and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

2505.12369 2026-06-18 cs.AI cs.LG cs.LO 版本更新 70%

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

知识图谱上具有传递关系的全几何多跳推理

Fernando Zhapa-Camacho, Robert Hoehndorf

发表机构 * KAUST Center of Excellence for Smart Health (KCSH)（阿卜杜拉国王科技大学智能健康卓越中心）； KAUST Center of Excellence for Generative AI（阿卜杜拉国王科技大学生成式人工智能卓越中心）

专题命中逻辑推理：知识图谱多跳逻辑推理，几何嵌入方法

AI总结提出GeometrE方法，将逻辑操作映射为纯几何变换，并引入传递损失函数，在保持可解释性的同时提升多跳推理性能。

Comments Accepted at ESWC 2026

Journal ref The Semantic Web. ESWC 2026. Lecture Notes in Computer Science, vol 16549. Springer, Cham (2026)

DOI: 10.1007/978-3-032-25156-5_14

AI中文摘要

知识图谱上的多跳逻辑推理需要将逻辑语义忠实地映射到潜在空间。当前的几何嵌入方法通过将实体映射到几何区域、逻辑操作映射到潜在变换，在此任务上表现出有效性。虽然几何嵌入可以为查询回答提供直接的可解释性框架，但当前方法仅利用了实体的几何构造，未能将逻辑操作映射为纯几何变换，而是使用神经组件来学习这些操作。另一方面，纯神经方法优于几何方法，但在潜在空间中缺乏可解释性。我们提出了GeometrE，一种用于多跳推理的几何嵌入方法，它将每个逻辑操作映射为潜在空间中的纯几何操作。此外，我们引入了一个传递损失函数，并表明与现有方法不同，它可以保留对所有a,b,c的逻辑规则：r(a,b)和r(b,c) -> r(a,c)。我们的实验表明，GeometrE优于当前最先进的几何方法，并在标准基准数据集上与现有的神经方法保持竞争力。

英文摘要

Multi-hop logical reasoning on knowledge graphs requires faithfully mapping the logical semantics to latent space. Current geometric embedding methods have proven useful for this task by mapping entities to geometric regions and logical operations to latent transformations. While a geometric embedding can provide a direct interpretability framework for query answering, current methods have only leveraged the geometric construction of entities, failing to map logical operations to pure geometric transformations and, instead, using neural components to learn these operations. On the other hand, purely neural-based methods outperform geometric methods, but they lack interpretability in the latent space. We introduce GeometrE, a geometric embedding method for multi-hop reasoning that maps every logical operation to a purely geometric operation in the latent space. Additionally, we introduce a transitive loss function and show that, unlike existing methods, it can preserve the logical rule r(a,b) and r(b,c) -> r(a,c) for all a, b, c. Our experiments show that GeometrE outperforms current state-of-the-art geometric methods and remains competitive with existing neural-based methods on standard benchmark datasets.

2606.18521 2026-06-18 cs.LG cs.AI 新提交 60%

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

稀疏性诅咒：从模型合并理解RLVR模型参数空间

Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang

发表机构 * Zhejiang University（浙江大学）； Simon Fraser University（西蒙菲莎大学）； The Chinese University of Hong Kong（香港中文大学）； Zhejiang Key Lab of Accessible Perception and Intelligent Systems（浙江省无障碍感知与智能系统重点实验室）

专题命中其他推理：RLVR增强推理能力

AI总结本文发现RLVR模型的稀疏更新在参数空间中分散更远，形成近正交捷径导致合并脆弱，并提出SAR-Merging方法解决该问题。

Comments Accepted by KDD 2026

详情

AI中文摘要

可验证奖励强化学习（RLVR）已成为一种强大的后训练范式，在激发推理智能和抵抗灾难性遗忘方面超越了监督微调（SFT）。最近的研究进一步揭示，与SFT相比，RLVR会引发高度稀疏且偏离主成分的参数更新。这自然引出一个问题：这种稀疏性是否使RLVR模型更易于模型合并？如果是，模型合并将提供一种可扩展的、无需训练的方法，来聚合来自独立训练的RLVR模型的多样化推理能力。令人惊讶的是，我们发现相反的情况，揭示了一种稀疏性诅咒：稀疏的RLVR更新在参数空间中分散得更远，形成近正交的捷径，使得聚合本质上是脆弱的。这很可能源于RL优化的随机性和涌现推理模式的多样性。与SFT模型收敛到共享的平坦盆地并自然合并不同，RLVR模型在标准合并方法下遭受严重退化。通过对更新几何的系统性实证分析，我们描述了这种失败背后的机制，并提出了敏感性感知解析合并（SAR-Merging），这是一种针对RLVR参数空间独特结构定制的合并方案。SAR-Merging通过基于Fisher信息的敏感性仲裁解决重叠更新区域中的冲突，然后通过幅度感知稀疏化和重新缩放来保留脆弱的推理路径。在数学和编程基准上的实验表明，SAR-Merging在RLVR模型上显著优于现有合并方法，实现了单任务增强和多能力融合。

English abstract

Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.
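
The two stages named in the abstract (Fisher-based arbitration on overlapping updates, then magnitude-aware sparsification and rescaling) can be sketched on flat parameter lists. This is a minimal illustration under stated assumptions, not the authors' SAR-Merging implementation. The winner-takes-the-coordinate arbitration rule, the top-k sparsification, the L1-preserving rescale, and all function names and example values are assumptions made for this sketch.

```python
from typing import List, Sequence


def task_vector(base: Sequence[float], tuned: Sequence[float]) -> List[float]:
    """Parameter update of a fine-tuned model relative to the shared base."""
    return [t - b for b, t in zip(base, tuned)]


def arbitrate(deltas: Sequence[Sequence[float]], fishers: Sequence[Sequence[float]]) -> List[float]:
    """Assumed rule: where several models update a coordinate, keep the update
    from the model whose (diagonal) Fisher value is largest there."""
    merged = []
    for i in range(len(deltas[0])):
        active = [j for j, d in enumerate(deltas) if d[i] != 0.0]
        if not active:
            merged.append(0.0)
            continue
        winner = max(active, key=lambda j: fishers[j][i])
        merged.append(deltas[winner][i])
    return merged


def sparsify_and_rescale(delta: Sequence[float], keep_ratio: float) -> List[float]:
    """Keep the largest-magnitude entries, then rescale the survivors so the
    total absolute update is unchanged (an assumed rescaling choice)."""
    k = max(1, int(round(len(delta) * keep_ratio)))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    kept = [d if i in keep else 0.0 for i, d in enumerate(delta)]
    total = sum(abs(d) for d in delta)
    kept_total = sum(abs(d) for d in kept)
    scale = total / kept_total if kept_total > 0 else 1.0
    return [d * scale for d in kept]


def merge(base, tuned_models, fishers, keep_ratio=0.5) -> List[float]:
    """Arbitrate overlapping updates, sparsify and rescale, then add to the base."""
    deltas = [task_vector(base, m) for m in tuned_models]
    resolved = sparsify_and_rescale(arbitrate(deltas, fishers), keep_ratio)
    return [b + d for b, d in zip(base, resolved)]


# Toy values for illustration only.
base = [0.0, 0.0, 0.0, 0.0]
model_a = [0.4, 0.0, 0.1, 0.0]
model_b = [-0.2, 0.3, 0.0, 0.0]
fisher_a = [0.9, 0.1, 0.1, 0.1]
fisher_b = [0.2, 0.8, 0.1, 0.1]
merged = merge(base, [model_a, model_b], [fisher_a, fisher_b], keep_ratio=0.5)
```

In the toy example the two models conflict only at the first coordinate, where `model_a` has the larger Fisher value and its update is kept. The smallest surviving update is then dropped and the remaining two are scaled up to compensate.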

1. Mathematical reasoning: 8 papers

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

LLM Parameters for Math Across Languages: Shared or Separate?

Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

2. Test-time compute: 1 paper

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

3. Planning and reasoning: 7 papers

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

CEO-Bench: Can Agents Play the Long Game?

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Robust Regularized Policy Iteration under Transition Uncertainty

PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

4. Complex problem solving: 8 papers

GraphPO: Graph-based Policy Optimization for Reasoning Models

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Investigating Faithfulness in Large Audio Language Models

PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

AGDN: Learning to Solve Traveling Salesman Problem with Anisotropic Graph Diffusion Network

5. Logical reasoning: 4 papers

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Formalizing and Mitigating Structural Distortion in LLM Attention for Graph Reasoning

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

6. Other reasoning: 1 paper

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging