LLM Reasoning - arXivDaily Topic

2606.19257 2026-06-18 cs.CL New submission 90%

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

Affiliations: The University of Hong Kong; Peking University

Topic match: mathematical reasoning (block diffusion language models for long-chain reasoning)

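AI summary: Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long-chain reasoning; DreamReasoner-8B reaches results comparable to Qwen3-8B on math and code reasoning.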

Abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
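
As a rough illustration of the fine-to-coarse idea described in the abstract, the sketch below maps a training step to a block size. The stage sizes (4, 8, 16, 32), the equal-length stages, and the function name `block_size_at` are assumptions made for this example only; the abstract does not specify the actual schedule.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for a step, moving from fine to coarse.

    Assumption: stages have equal length and `sizes` is a hypothetical list,
    not the schedule used for DreamReasoner-8B.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    # Integer arithmetic splits the run into len(sizes) equal stages.
    stage = step * len(sizes) // total_steps
    return sizes[stage]


# Toy run of 8 steps: two steps per stage.
schedule = [block_size_at(s, 8) for s in range(8)]
print(schedule)
```

A step-indexed function keeps the schedule stateless, so a data loader or trainer could query it at any step without tracking the current stage.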

2606.19236 2026-06-18 cs.LG cs.AI cs.CL New submission 90%

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

Affiliations: Shenzhen International Graduate School, Tsinghua University; Tencent Hunyuan

Topic match: mathematical reasoning (a policy-entropy stabilization method for GRPO that improves reasoning)

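AI summary: Addresses policy entropy collapse in RL algorithms such as GRPO with STARE, which uses surprisal quantiles to identify entropy-critical tokens and reweight their advantages, and adds a target-entropy closed-loop gate to stabilize entropy; training is stable across 1.5B-32B models and multiple task families, and AIME24/25 accuracy improves by 4%-8%.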

Comments LLM, Reinforcement Learning

Abstract

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
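
The three ingredients named in the abstract (surprisal quantiles, selective advantage reweighting, and a target-entropy gate) can be put together mechanically as in the sketch below. This is not the STARE algorithm: the quantile rule, the `boost` factor, the one-sided gate, and all function names are placeholders chosen for illustration, and the direction and size of the reweighting in the actual method are not given in the abstract.

```python
import math


def quantile(values, q):
    """Simple rank-based quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def reweight_advantages(token_probs, advantage, entropy, target_entropy,
                        q=0.8, boost=1.5):
    """Scale the advantage of high-surprisal tokens when the gate is open.

    token_probs: probability the policy gave to each sampled token, in (0, 1].
    advantage:   trajectory-level advantage shared by all tokens.
    Assumption:  the gate opens only when entropy is below the target, and
                 `boost` is a hypothetical constant.
    """
    # Surprisal of a sampled token is the negative log of its probability.
    surprisals = [-math.log(p) for p in token_probs]
    if entropy >= target_entropy:
        # Gate closed: every token keeps the plain trajectory advantage.
        return [advantage] * len(token_probs)
    threshold = quantile(surprisals, q)
    return [advantage * boost if s >= threshold else advantage
            for s in surprisals]


probs = [0.9, 0.5, 0.1, 0.8, 0.05]
open_gate = reweight_advantages(probs, 1.0, entropy=0.2, target_entropy=0.5, q=0.5)
closed_gate = reweight_advantages(probs, 1.0, entropy=0.6, target_entropy=0.5, q=0.5)
print(open_gate)
print(closed_gate)
```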

2606.18844 2026-06-18 cs.LG New submission 85%

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang

Affiliations: Qwen Business Unit of Alibaba; Tsinghua University; Peking University

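Topic match: mathematical reasoning (self-distillation for improved mathematical reasoning; contrastive learning over trajectories)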

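AI summary: Proposes TAPO, which contrasts correct and incorrect trajectories to construct micro-reflective corrections, moving self-distillation from implicit distributional alignment to explicit trajectory construction; it outperforms GRPO on several mathematical reasoning benchmarks.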

Abstract

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.
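
The trajectory layout described in the abstract (erroneous prefix, then a diagnosis, then corrected reasoning) can be pictured with a small assembly helper. The sketch assumes a trajectory is a list of step strings and that the failing step index and diagnosis text are already available; how TAPO locates the failure point or generates the diagnosis is not covered by the abstract, and every name here is hypothetical.

```python
def build_micro_reflective_trajectory(wrong_steps, failure_index,
                                      diagnosis, corrected_steps):
    """Assemble: erroneous prefix + diagnosis + corrected reasoning.

    wrong_steps:     reasoning steps from an incorrect rollout.
    failure_index:   index of the first wrong step (assumed to be known).
    diagnosis:       natural-language note on what went wrong.
    corrected_steps: continuation guided by a correct rollout from the
                     same sampling group.
    """
    if not 0 <= failure_index < len(wrong_steps):
        raise ValueError("failure_index must point at an existing step")
    # Keep the model's own reasoning up to and including the failing step.
    prefix = wrong_steps[:failure_index + 1]
    return prefix + [diagnosis] + list(corrected_steps)


wrong = ["Let x = 3.", "Then 2x = 5.", "So the answer is 5."]
fixed = ["Then 2x = 6.", "So the answer is 6."]
trajectory = build_micro_reflective_trajectory(
    wrong, 1, "Wait, 2 * 3 is 6, not 5.", fixed)
print("\n".join(trajectory))
```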

2606.18453 2026-06-18 cs.CL New submission 85%

LLM Parameters for Math Across Languages: Shared or Separate?

Behzad Shomali, Luisa Victor, Tim Selbach, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali, Markus Frey

Affiliations: Lamarr Institute; University of Bonn; Fraunhofer IAIS

Topic match: mathematical reasoning (mechanistic analysis of cross-lingual mathematical reasoning)

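AI summary: Through a cross-lingual mechanistic analysis, finds that math-associated parameters in multilingual LLMs overlap partially across languages, with the overlap concentrated in intermediate layers; English yields the largest parameter set, while lower-resource languages yield smaller ones.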

Comments 5 pages. Accepted at ACL Student Research Workshop (SRW) 2026. Code: https://github.com/luisavictor/math-across-languages Translated Datasets: https://huggingface.co/math-across-languages Webpage: https://math-across-languages.github.io

Abstract

Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.
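
One simple way to quantify per-layer overlap between two languages' parameter sets is the Jaccard index, sketched below. The abstract does not say which overlap measure the authors use, so the metric, the data layout (layer index mapped to a set of parameter ids), and the toy values are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def layerwise_overlap(params_a, params_b):
    """Per-layer overlap between two {layer: set_of_parameter_ids} maps."""
    layers = sorted(set(params_a) | set(params_b))
    return {layer: jaccard(params_a.get(layer, set()),
                           params_b.get(layer, set()))
            for layer in layers}


# Toy, made-up parameter ids; not data from the paper.
lang_a = {0: {1, 2, 3, 4}, 1: {5, 6, 7}, 2: {8, 9}}
lang_b = {0: {4, 10}, 1: {5, 6}, 2: {11}}
overlap = layerwise_overlap(lang_a, lang_b)
print(overlap)
```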

2606.18810 2026-06-18 cs.LG cs.AI New submission 75%

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

Affiliations: Beijing Institute of Technology; Beihang University; Independent Researcher

Topic match: mathematical reasoning (improves reasoning on math, code, and agentic tasks)

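AI summary: Proposes SC-GRPO, which uses the KL divergence between self-conditioned distributions as a multiplicative weight on GRPO gradients for fine-grained credit assignment, improving over GRPO by 8.1% on average across math, code, and agentic tasks.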

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
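
To make the idea of a per-token KL weight concrete, the sketch below computes a KL divergence at each token position between the original and the self-conditioned next-token distributions and multiplies the trajectory advantage by it. The KL direction (conditioned relative to original), the absence of any normalization or clipping, and all names are assumptions for this example; the abstract does not give these details.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def token_weights(original, conditioned):
    """Per-token KL between conditioned and original next-token distributions.

    Assumption: the weight at each position is KL(conditioned || original).
    """
    return [kl_divergence(c, o) for o, c in zip(original, conditioned)]


def weighted_advantages(advantage, weights):
    """Multiply the shared trajectory advantage by each token's weight."""
    return [advantage * w for w in weights]


# Toy two-token trajectory over a two-word vocabulary (made-up numbers).
original = [[0.5, 0.5], [0.5, 0.5]]
conditioned = [[0.5, 0.5], [0.9, 0.1]]
weights = token_weights(original, conditioned)
credits = weighted_advantages(2.0, weights)
print(weights, credits)
```

In this toy case the first token gets zero weight because conditioning leaves its distribution unchanged, while the second token, whose distribution shifts, receives all of the credit.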
