Large Model Reasoning - arXivDaily Topic

2606.18954 2026-06-18 cs.CL New submission 85%

GraphPO: Graph-based Policy Optimization for Reasoning Models

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Ant Group

Topic match (complex problem solving): graph-based policy optimization improves the efficiency of reasoning models.

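AI summary: Proposes GraphPO, a framework that models reasoning trajectories as a directed acyclic graph, merges semantically equivalent paths to reduce redundant exploration, and uses edge-level advantages to improve reasoning efficiency. It outperforms chain- and tree-based methods on multiple benchmarks.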

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.
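
To make the merging idea in the abstract concrete, here is a minimal sketch of how rollouts that reach the same semantic state could be folded into one graph, with states as nodes and reasoning steps as edges. It is not the authors' implementation: the `state_key` strings, `build_state_graph`, and `count_tree_nodes` are hypothetical names, the state summaries are assumed to be given, and the sketch does not verify acyclicity or compute any advantages.

```python
from collections import defaultdict

ROOT = "<root>"  # hypothetical sentinel for the shared starting state


def build_state_graph(rollouts):
    """Merge rollouts into a graph keyed by semantic state.

    Each rollout is a list of (step_text, state_key) pairs, where state_key is
    an assumed canonical summary of the state reached after that step.
    Returns a dict mapping (src_state, dst_state) -> set of step texts.
    """
    edges = defaultdict(set)
    for rollout in rollouts:
        src = ROOT
        for step_text, state_key in rollout:
            # Steps that lead to the same state share one destination node.
            edges[(src, state_key)].add(step_text)
            src = state_key
    return dict(edges)


def count_tree_nodes(rollouts):
    """Count the distinct step prefixes a prefix tree would store (root excluded)."""
    prefixes = set()
    for rollout in rollouts:
        for i in range(1, len(rollout) + 1):
            prefixes.add(tuple(step for step, _ in rollout[:i]))
    return len(prefixes)


# Two differently worded rollouts that pass through the same intermediate state.
rollouts = [
    [("expand (x+1)^2", "x^2+2x+1=0"), ("factor", "x=-1")],
    [("multiply out the square", "x^2+2x+1=0"), ("use quadratic formula", "x=-1")],
]
graph = build_state_graph(rollouts)
nodes = {ROOT} | {dst for _, dst in graph}
```

In this toy case a prefix tree keeps four separate nodes because the step texts differ, while the merged graph keeps only two non-root states, so any continuation sampled from the shared state would serve both rollouts.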

2606.18831 2026-06-18 cs.CL cs.AI New submission 80%

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

Affiliations: OpenBMB; Tsinghua University

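Topic match (complex problem solving): improves long-context reasoning across retrieval, multi-evidence synthesis, and reasoning tasks.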

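AI summary: Proposes a simple yet effective data recipe that, paired with a minimal outcome-based GRPO setup, substantially improves the long-context reasoning of large language models, with average gains of +3.2 to +7.2 points across multiple benchmarks and further gains on agentic tasks.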

Comments: 15 pages, 6 figures, 12 tables

Abstract

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
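
For readers unfamiliar with the outcome-based GRPO setup mentioned in the abstract, the sketch below shows the group-relative advantage commonly associated with GRPO: each sampled response's reward is normalized by the mean and standard deviation of its group. The paper's exact configuration is not given here, so the function name, the `eps` term, and the use of the population standard deviation are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of responses to the same prompt.

    rewards: one scalar per sampled response, e.g. 1.0 for a correct final
    answer and 0.0 otherwise. eps (an assumed value) avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one long-context question, two of them judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, and a group whose responses all score the same contributes no learning signal.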

2606.19185 2026-06-18 cs.LG New submission 60%

AGDN: Learning to Solve Traveling Salesman Problem with Anisotropic Graph Diffusion Network

Bolin Shen, Ziwei Huang, Zhiguang Cao, Yushun Dong

Affiliations: Florida State University; Singapore Management University

Topic match (complex problem solving): graph neural networks for solving TSP, a form of combinatorial optimization reasoning.

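AI summary: Proposes the Anisotropic Graph Diffusion Network (AGDN), which uses a MixScore transition matrix and an anisotropic diffusion strategy to exploit graph structure when solving the Traveling Salesman Problem, outperforming existing methods across multiple instance sizes and distributions.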

Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

DOI: 10.1145/3770855.3817789

Abstract

The Traveling Salesman Problem (TSP) is a cornerstone of combinatorial optimization and arises in many practical scenarios. Although graph-based learning approaches have been explored for TSP, the question of how to exploit graph structure more effectively remains open. We present the Anisotropic Graph Diffusion Network (AGDN), a new Graph Neural Network framework designed to solve TSP. Our method tackles two central difficulties: (1) the lack of an informative topological prior in fully connected TSP graphs, and (2) the loss of nodes that are connected in the optimal solution after commonly used graph sparsification techniques are applied. To overcome these issues, we construct a MixScore transition matrix that merges node similarity with pairwise distance, and we develop an anisotropic graph diffusion strategy that supports efficient information exchange across multiple hops. Comprehensive experiments spanning diverse instance sizes and node distributions show that AGDN consistently outperforms existing methods while keeping computation time competitive. Furthermore, AGDN generalizes well to problem sizes and distributions beyond those seen during training. The implementation is publicly available at: https://github.com/LabRAI/AGDN.
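
The abstract does not define MixScore or the anisotropic diffusion rule, so the following is only a generic illustration of the two ingredients it names: a transition matrix that blends node similarity with pairwise distance, and multi-hop propagation over that matrix. The blend weight `alpha`, the `exp(-distance)` closeness term, the cosine similarity, and all function names are assumptions for this sketch, not the paper's formulation (see the linked repository for the actual method).

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    norm = math.hypot(*u) * math.hypot(*v)
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def mixed_transition_matrix(coords, features, alpha=0.5):
    """Row-stochastic matrix blending feature similarity with spatial closeness.

    Assumes at least two nodes and 0 <= alpha < 1 so every row has positive mass.
    """
    n = len(coords)
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)  # no self-transition
                continue
            closeness = math.exp(-math.dist(coords[i], coords[j]))
            similarity = max(cosine(features[i], features[j]), 0.0)
            row.append(alpha * similarity + (1 - alpha) * closeness)
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix


def diffuse(matrix, signal, hops=2):
    """Propagate one scalar per node for `hops` steps: signal <- matrix @ signal."""
    for _ in range(hops):
        signal = [sum(w * s for w, s in zip(row, signal)) for row in matrix]
    return signal


# Four cities on a unit square with toy two-dimensional node features.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
P = mixed_transition_matrix(coords, features)
spread = diffuse(P, [1.0, 0.0, 0.0, 0.0], hops=2)
```

In this toy setup, node 0 places more weight on node 1 (equally distant as node 2 but with matching features) than on node 2, which shows how a blended score can rank neighbors differently from distance alone.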
