LLM Reasoning

2606.20227 2026-06-19 cs.AI cs.SE New submission 95%

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette, Kailong Wang

Affiliations: Huazhong University of Science and Technology; Nanyang Technological University; Hubei University; East China Normal University; National University of Singapore

Topic match (logical reasoning): Proposes the QMFOL framework, which generates reasoning tasks from first-order logic to evaluate the logical reasoning ability of LLMs.

AI summary: Proposes the QMFOL framework, which generates monadic first-order logic reasoning tasks from conjunction and disjunction patterns with controllable complexity, and builds QMFOLBench, a benchmark of 2880 instances; the evaluation shows that performance drops and computational overhead rises as logical complexity increases.

Abstract

Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.
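
Illustrative sketch (not from the paper): the abstract describes logical structures with controlled depth and width and three label types. The toy generator below builds a conjunction-pattern rule chain for a single individual and labels a query by forward chaining. The predicate naming scheme, the `chain_rules` and `label` helpers, and the open-world reading of "Unknown" are all assumptions made for illustration; disjunction patterns, distractors, and the natural-language translation step are not covered.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule "forall x. (P1(x) and ... and Pk(x)) -> Q(x)", instantiated for one
# individual, is stored as (premises, conclusion). A leading "~" marks negation.
Rule = Tuple[FrozenSet[str], str]


def chain_rules(depth: int, width: int) -> List[Rule]:
    """Hypothetical generator: all `width` predicates of level i jointly
    imply each predicate of level i + 1, for `depth` levels."""
    rules: List[Rule] = []
    for i in range(depth):
        premises = frozenset(f"P{i}_{j}" for j in range(width))
        for j in range(width):
            rules.append((premises, f"P{i + 1}_{j}"))
    return rules


def negate(literal: str) -> str:
    """Return the complementary literal."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply rules until no new literal can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


def label(goal: str, facts: Set[str], rules: List[Rule]) -> str:
    """True if the goal is derivable, False if its negation is, else Unknown."""
    closure = forward_chain(facts, rules)
    if goal in closure:
        return "True"
    if negate(goal) in closure:
        return "False"
    return "Unknown"
```

With `chain_rules(3, 2)`, the query `P3_0` needs both level-0 facts; dropping one of them turns the label from True into Unknown, which is one way depth and width can act as separate difficulty knobs.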

2606.20526 2026-06-19 cs.AI New submission 70%

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

Saimun Habib, Vaishak Belle, Fengxiang He

Affiliations: University of Edinburgh

Topic match (logical reasoning): Counterfactual reasoning for neural probabilistic logic programs.

AI summary: Proposes DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs that performs exact counterfactual inference through neural materialization, SWIPs, and weighted model counting; experiments show a 2.14x inference speedup over the twin-network construction.

Abstract

Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs. Using neural materialization, we reduce fixed-context neural predicates to ordinary ProbLog choices, apply Single World Intervention Programs (SWIPs), and compute counterfactuals by weighted model counting (WMC) over a single transformed program. Under finite grounding and unique-supported-model assumptions, DeepSWIP is exact relative to the learned materialized FCM. The standard quotient-WMC form of ProbLog conditionals identifies active neural probabilities and explains intervention cleaning, calibration sensitivity, and rare-evidence instability. Experiments on MPI3D confirm the transformation against a DeepTwin construction on 12,000 queries, as predicted, and show a 2.14× inference speedup from avoiding the Twin's endogenous duplication. A SUMO HOV experiment shows that neural calibration degradation biases plug-in estimates, while a correctly scoped randomized-policy AIPW estimator removes most first-order bias for population mean and ATE estimands. Code is at https://github.com/saibib/deep_SWIP.

2606.20068 2026-06-19 cs.AI New submission 90%

Process-Verified Reinforcement Learning for Theorem Proving via Lean

Minsu Kim, Se-Young Yun

Affiliations: KAIST AI

Topic match (mathematical reasoning): Reinforcement learning for theorem proving.

AI summary: Uses the Lean proof assistant as a source of process-level verification signals and combines them with a GRPO-style reinforcement learning objective, improving theorem-proving performance through tactic-level supervision.

Abstract

While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
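
Illustrative sketch (not from the paper): the abstract says the verifier marks locally sound tactics and the earliest failing tactic, and that outcome-level and process-level signals are balanced. One simple way to turn such marks into per-tactic credit is shown below. The reward values, the choice to give zero credit to tactics after the first failure, and the mixing weight `lam` are assumptions; the paper's actual first-error propagation and first-token credit rules are not specified in the abstract.

```python
from typing import List


def tactic_rewards(step_ok: List[bool], pos: float = 1.0, neg: float = -1.0) -> List[float]:
    """Assign per-tactic credit from verifier marks.

    step_ok[i] is True if the verifier accepted tactic i. Tactics before the
    earliest failure receive `pos`, the earliest failing tactic receives `neg`,
    and later tactics receive 0.0 (an assumption: they were never checked in a
    valid proof state).
    """
    rewards: List[float] = []
    failed = False
    for ok in step_ok:
        if failed:
            rewards.append(0.0)
        elif ok:
            rewards.append(pos)
        else:
            rewards.append(neg)
            failed = True
    return rewards


def blend(outcome_adv: float, step_rewards: List[float], lam: float = 0.5) -> List[float]:
    """Mix one outcome-level advantage with per-tactic process rewards."""
    return [(1.0 - lam) * outcome_adv + lam * r for r in step_rewards]
```

For a four-tactic attempt whose third tactic fails, `tactic_rewards([True, True, False, True])` gives positive credit to the first two tactics, a penalty to the third, and no credit to the fourth.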

2606.19788 2026-06-19 cs.AI cs.CL New submission 90%

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

Affiliations: School of Artificial Intelligence, Jilin University; Czech Technical University in Prague; CRRC Zhuzhou Institute; Tengen Intelligence Institute; International Center of Future Science, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE

Topic match (mathematical reasoning): Evaluates the reasoning ability of LLMs on combinatorial counting problems.

AI summary: Proposes CombEval, a dynamic benchmark that generates combinatorial counting problems from typed Cofola specifications and evaluates 11 LLMs in direct and code-augmented settings; models prove brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies.

Comments: Under review. Code: https://github.com/YuxuZhou-CN/combination-problem-generation

Abstract

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at https://github.com/YuxuZhou-CN/combination-problem-generation.

2606.19399 2026-06-19 cs.LG cs.AI cs.LO cs.PL New submission 90%

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

Manish Acharya, Zhenyu Liao, Yueke Zhang, Kevin Leach, Yu Huang, Yifan Zhang

Affiliations: Department of Computer Science, Vanderbilt University; Amazon

Topic match (mathematical reasoning): Verifier-guided zero-shot formal theorem proving.

AI summary: Proposes VERITAS, a framework that feeds verifier feedback into zero-shot theorem proving through a two-phase protocol (Best-of-N sampling followed by critic-guided MCTS); it reaches 40.6% on miniF2F and releases the combinatorics benchmark VERITAS-CombiBench.

Abstract

LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.

2606.19697 2026-06-19 cs.LG cs.AI cs.CL New submission 85%

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

Affiliations: Allen Institute for AI; ETH Zürich

Topic match (mathematical reasoning): Proves that CoT transformers efficiently simulate Word RAM algorithms, covering tasks such as sorting.

AI summary: Proves that chain-of-thought transformers can simulate Word RAM algorithms, including sorting and Dijkstra's algorithm, with poly-logarithmic overhead, in contrast to the quadratic overhead of known Turing machine simulations.

Abstract

The increasing popularity of reasoning models (language models that output a series of reasoning or thought tokens before producing an answer) is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $O(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $O(n \log n)$ steps or run Dijkstra's algorithm in $O(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and only logarithmic for multiplication-free flat instructions, in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

2606.19636 2026-06-19 cs.LG cs.AI New submission 85%

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

Luca Zhou, Sajel Shah, Emanuele Rodolà, Roberto Dessì

Affiliations: Sapienza University of Rome

Topic match (mathematical reasoning): Diagnoses the sampling blind spot in math-reasoning difficulty estimation and proposes a deterministic sampling regime.

AI summary: Finds that pass@k has a blind spot as a difficulty estimate for math reasoning: deterministic sampling via activation grafting recovers 10.3-22.9% of the examples that no sampled chain solves, showing that this stratum is structurally identifiable.

Comments: 9 pages of main paper, 4 figures and 5 tables in the main paper, with more in the appendix

Abstract

Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic curricula, and verifier training. We show this proxy has a persistent blind spot on its hardest stratum: on the eight free-form math cells we test (GSM8K and MATH across four open-weight models), 10.3-22.9% of the examples that no sampling seed solves in six tries are instead solved at matched compute by a six-chain deterministic regime. These are greedy decoding plus five cheap residual-stream perturbations applied via activation grafting, while greedy alone solves at most 6% on these math cells. Recovery scales with the additional budget, across perturbations whose mechanistic distinctness we verify across all twelve cells (cross-kind fix-set Jaccard <= 0.47 in every setup). Activation grafting is used as an intervention on internal representations, not a decoding method; we use it purely as a diagnostic and diversification tool, and our recovered items show that the pass@k = 0% stratum is structurally identifiable in the residual stream, not that the unmodified model reaches them under ordinary inference.
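
Illustrative sketch (not from the paper): the quantities named in the abstract are simple set and ratio computations. The helpers below compute the fraction of sampled chains that reach gold, collect the examples that no sampled chain solves, measure how many of those a second regime recovers, and compute the Jaccard overlap between two fix-sets. All function names and the toy data layout (example id mapped to a list of per-chain correctness flags) are assumptions.

```python
from typing import Dict, List, Set


def pass_fraction(outcomes: List[bool]) -> float:
    """Fraction of sampled chains that reached the gold answer."""
    return sum(outcomes) / len(outcomes)


def zero_solve_stratum(samples: Dict[str, List[bool]]) -> Set[str]:
    """Examples for which no sampled chain was correct."""
    return {ex for ex, outs in samples.items() if not any(outs)}


def recovery_rate(stratum: Set[str], solved_by_other: Set[str]) -> float:
    """Share of the zero-solve stratum that another regime solves."""
    if not stratum:
        return 0.0
    return len(stratum & solved_by_other) / len(stratum)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two fix-sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A low Jaccard value between the fix-sets of two perturbation kinds means they recover mostly different examples, which is how the abstract's "cross-kind fix-set Jaccard" reads.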

2606.20008 2026-06-19 cs.LG New submission 80%

VIMPO: Value-Implicit Policy Optimization for LLMs

Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao

Affiliations: UC Berkeley; Yale University

Topic match (mathematical reasoning): The method outperforms GRPO on mathematical reasoning benchmarks.

AI summary: Proposes VIMPO, which derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning, giving fine-grained credit assignment without training a critic and outperforming GRPO on mathematical reasoning benchmarks.

Abstract

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially large gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
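
Illustrative derivation (not taken from the paper): the abstract does not give the recurrence itself, but the textbook optimality conditions of KL-regularized RL already produce a value recurrence in policy-reference log-ratios with a zero terminal value. The block below writes that standard identity under assumed conditions (deterministic token transitions, no discounting, regularization strength $\beta > 0$, per-step reward $r_t$, terminal state $s_T$). VIMPO's actual loss and advantage may differ in form.

```latex
% Standard KL-regularized RL identities; the notation is an assumption, not the paper's.
\begin{align}
\pi^*(a_t \mid s_t) &= \pi_{\mathrm{ref}}(a_t \mid s_t)\,
  \exp\!\left(\frac{Q^*(s_t, a_t) - V^*(s_t)}{\beta}\right), \\
Q^*(s_t, a_t) &= r_t + V^*(s_{t+1}), \qquad V^*(s_T) = 0, \\
V^*(s_t) &= r_t + V^*(s_{t+1})
  - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}, \\
V^*(s_0) &= \sum_{t=0}^{T-1} r_t
  - \beta \sum_{t=0}^{T-1} \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}.
\end{align}
```

The third line follows by taking logarithms of the first and substituting the second; the fourth telescopes the third using the terminal condition. With an outcome-only verifiable reward, every $r_t$ is zero except the last one.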

2605.20531 2026-06-19 cs.LO cs.LG Version update 80%

Pseudo-Formalization for Automatic Proof Verification

Slim Barkallah, Luke Bailey, Kaiyue Wen, Mohammed Abouzaid, Tengyu Ma

Affiliations: GitHub

Topic match (mathematical reasoning): Pseudo-formalization for automatic proof verification.

AI summary: Proposes Pseudo-Formalization, a proof format that keeps the flexibility of natural language while retaining the modularity and precision of formal proofs; a Block Verification algorithm enables efficient verification of natural-language proofs and outperforms existing baselines on error-finding precision and recall.

Comments: 31 pages, code available at https://github.com/Slim205/pseudo-formalization

Abstract

Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it Pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.

2606.11537 2026-06-19 cs.AI cs.CE New submission 70%

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

Affiliations: University of Innsbruck; University of British Columbia; Toronto Metropolitan University

Topic match (mathematical reasoning): Handles financial numerical reasoning that involves multi-step computation.

AI summary: Proposes MoCA-Agent, which addresses numerical reasoning errors in financial table question answering through claim-level verification and code generation, with strong results on ten benchmarks.

Abstract

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.

2606.19354 2026-06-19 cs.CL cs.LG New submission 90%

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

Affiliations: European University of Tirana

Topic match (test-time compute): A theoretical framework for adapting verification granularity in test-time scaling.

AI summary: Proposes GRACE, a theoretical framework that models verification granularity as a function of problem difficulty, verifier accuracy, and compute budget. It proves a phase transition: fine-grained verification dominates when the compute budget is large or the problem is hard, coarse-grained verification is better for easy problems under a low budget, and an adaptive strategy reaches the compute-performance Pareto frontier.

Abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-N, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.

2606.19919 2026-06-19 cs.LG New submission 85%

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

Tingyun Li, Zishang Jiang, Jinyi Han, Xinyi Wang, Sihang Jiang, Han Xia, Zhaoqian Dai, Shuguang Ma, Fei Yu, Jiaqing Liang, Yanghua Xiao

Affiliations: School of Data Science, Fudan University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University; College of Computer Science and Artificial Intelligence, Fudan University; Ant Group

Topic match (test-time compute): Proposes ADaPT, a token-level decoupling framework that improves reasoning efficiency.

AI summary: Proposes ADaPT, a token-level dual-process framework that decouples efficiency and correctness signals and introduces a mode-selection token to switch between fast and slow reasoning, giving precise, continuous control over the efficiency-performance trade-off at inference time and lowering inference cost while keeping strong reasoning ability.

Abstract

Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.

2606.19808 2026-06-19 cs.AI cs.CL New submission 85%

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

Affiliations: Department of Computer Science, Virginia Tech; Fralin Biomedical Research Institute, Virginia Tech; FBRI Cancer Research Center

Topic match (test-time compute): A selective verification framework for budget-aware reasoning that optimizes test-time compute.

AI summary: Proposes SEVRA, a selective verification framework in which a serving-layer controller decides whether to verify a frozen solver's initial answer; on Math500 it reaches higher accuracy with fewer tokens and reduces harmful flips.

Abstract

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On Math500, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.

2606.19771 2026-06-19 cs.AI New submission 85%

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo

Affiliations: The Hong Kong University of Science and Technology; Sichuan University; The Hong Kong Polytechnic University

Topic match (test-time compute): Strengthens LLM reasoning by learning from token-level distributional deviations.

AI summary: To counter the entropy collapse or explosion caused by token updates in RLVR, proposes the ICT framework, which uses JS divergence to identify critical tokens and selectively updates them to balance policy concentration and improve reasoning performance.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
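The three quantities the abstract relies on (Jensen-Shannon divergence, Shannon entropy, and second-order Rényi entropy) have standard definitions, sketched below in plain Python. How ICT pairs up distributions is not stated in the abstract, so comparing each token's distribution against a single reference distribution, and the helper name `top_fraction_by_js`, are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def renyi2_entropy(p):
    """Second-order Renyi (collision) entropy: H_2(p) = -log sum_i p_i^2."""
    return -math.log(sum(pi * pi for pi in p))


def kl_divergence(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """JS(p, q) = 0.5 KL(p || m) + 0.5 KL(q || m), with m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def top_fraction_by_js(logits_per_token, reference_logits, fraction=0.10):
    """Hypothetical selector: indices of the tokens whose distribution
    deviates most (by JS divergence) from an assumed reference distribution."""
    ref = softmax(reference_logits)
    scores = [js_divergence(softmax(l), ref) for l in logits_per_token]
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

For any distribution the Rényi-2 entropy is at most the Shannon entropy, with equality for a uniform distribution, which is why the two can move differently as probability mass concentrates.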

2606.19750 2026-06-19 cs.LG cs.AI cs.CL 新提交 80%

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

流形赌博机：大语言模型潜在几何上的贝叶斯课程学习

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

发表机构 * University of California, San Diego（加州大学圣迭戈分校）

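Topic match (test-time compute): a Bayesian curriculum learning framework for reinforcement learning of LLM reasoning.

AI summary: Proposes the Bayesian Manifold Curriculum (BMC) framework, which models problem sampling as a manifold-structured bandit problem and guides sampling with a hierarchical task tree and Bayesian learning, balancing learning signal, diversity, and utility.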

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

AI中文摘要

强化学习（RL）是提高大语言模型（LLMs）推理能力的关键方法，其中训练效率关键取决于优化过程中问题的采样方式。现有的自适应课程学习方法通常优先考虑中等难度的提示，将问题选择视为具有独立臂的标准赌博机问题，忽略了任务空间的结构化和异质性。在这项工作中，我们将问题采样框架化为具有内生非平稳性的流形结构赌博机问题：问题通过模型的潜在表示空间相关联，采样决策可以影响学习信号在该空间中的演变方式。为了实现这一视角，我们引入了贝叶斯流形课程（BMC），这是一个结构感知框架，将问题组织成层次任务树，并应用贝叶斯学习来指导采样。实验发现，不同的采样策略在生产性（学习信号）、多样性（任务流形覆盖）和实用性（评估相关性）之间引入了非平凡的权衡。这些结果表明，仅优先考虑难度不足以获得强大的下游性能，突出了将结构和类型感知纳入问题采样中的重要性。

英文摘要

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

2606.19350 2026-06-19 cs.CL 新提交 85%

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

基于因果归因的剪枝保留大型语言模型的推理性能

Amogh Sheth, Biruk Assefa, Yi Wen Huang, Andrew Lin, Yuhao Ge

发表机构 * Edison Academy Magnet School（爱迪生学院磁石学校）； Massachusetts Institute of Technology（麻省理工学院）； State University of New York College at Plattsburgh（纽约州立大学普拉茨堡学院）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Independent Researcher（独立研究员）

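Topic match (other reasoning): causal attribution pruning preserves reasoning performance.

AI summary: Proposes Causal Attribution Pruning (CAP), a training-free method that measures the causal impact of attention heads on reasoning tasks to guide fine-grained pruning, achieving relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity.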

Comments Accepted at the ICLR 2026 Workshop on LLM Reasoning. 13 pages, 2 figures

AI中文摘要

大型语言模型（LLMs）在多步推理方面表现出色，但推理成本高昂。我们引入了因果归因剪枝（CAP），一种无需训练的方法，通过测量注意力头对推理任务的因果影响来识别关键注意力头，并利用这些头级分数指导细粒度的权重剪枝。对于每个注意力头，CAP估计在推理问题的小型校准集上前向传播时掩码该头所导致的预期性能下降。这些因果分数随后被转换为对应投影矩阵的权重级重要性值。与仅基于幅度或激活的标准不同，CAP的干预测量直接捕捉每个头的功能贡献，在20%稀疏度下，相比Wanda在ARC-Challenge上获得高达61%的相对准确率提升。我们在GSM8K、StrategyQA和ARC-Challenge上使用Llama-3-8B-Instruct和Mistral-7B-Instruct在10%、20%和50%稀疏度下评估CAP。在中等稀疏度（10-20%）下，CAP在大多数模型-基准配置中优于Wanda，尤其在Llama-3的ARC-Challenge上提升显著。我们的结果表明，在相同稀疏度下，注意力头级因果归因比相关性剪枝标准能更好地保留下游基准的推理性能，但在50%稀疏度下仍受限于粗粒度的MLP归因。

英文摘要

Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.
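The two steps described above (score each head by the performance drop when it is masked, then turn head scores into weight-level importance) can be sketched on a toy interface. This is not the authors' implementation: `evaluate` is a hypothetical callable returning calibration-set accuracy for a given head mask, and the conversion rule used here (weight magnitude scaled by the head's score) is an assumption, since the abstract does not specify it.

```python
def head_causal_scores(evaluate, num_heads):
    """Score each head by the accuracy drop observed when only it is masked.

    evaluate(mask) -> float is a hypothetical callable; mask[h] is False
    when head h is masked during the calibration forward passes.
    """
    baseline = evaluate([True] * num_heads)
    scores = []
    for h in range(num_heads):
        mask = [True] * num_heads
        mask[h] = False
        scores.append(baseline - evaluate(mask))
    return scores


def prune_weights(weights_per_head, scores, sparsity, eps=1e-6):
    """Zero out the lowest-importance fraction of weights.

    Assumed importance rule: |w| * (max(head_score, 0) + eps).
    """
    scored = []
    for h, weights in enumerate(weights_per_head):
        scale = max(scores[h], 0.0) + eps
        for j, w in enumerate(weights):
            scored.append((abs(w) * scale, h, j))
    scored.sort()
    num_pruned = int(sparsity * len(scored))
    pruned = [list(ws) for ws in weights_per_head]
    for _, h, j in scored[:num_pruned]:
        pruned[h][j] = 0.0
    return pruned
```

Because the score comes from an intervention (masking) rather than from weight or activation statistics, a head with large weights but no effect on accuracy receives a score near zero and is pruned first under this rule.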

2606.20075 2026-06-19 cs.LG cs.CL 新提交 80%

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

什么使得潜在思维链中的监督有效：一种信息论分析

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology（宁波数字孪生研究院，东方理工大学）； Department of Computing, The Hong Kong Polytechnic University（香港理工大学计算学系）

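Topic match (complex problem solving): information-theoretic analysis of supervision in latent chain-of-thought.

AI summary: Analyzes supervision failure in latent chain-of-thought from an information-theoretic perspective, proposes two dimensions (trajectory supervision and space supervision), and introduces the Unified Latent Probe (ULP) to quantify information fidelity, revealing an information-performance binding.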

AI中文摘要

潜在思维链（Latent Chain-of-Thought, CoT）将推理内化到连续隐藏状态中，为冗长的离散推理轨迹提供了一种有前景的替代方案。然而，鲁棒的潜在推理仍然困难，因为结果监督提供的学习信号较弱，且容易导致潜在轨迹发生语义漂移。在这项工作中，我们从信息论角度分析潜在CoT，并将这种失效识别为双重崩溃：优化路径上的梯度衰减和潜在空间中的表征漂移。我们进一步将过程监督分解为两个互补维度：轨迹监督（注入密集的逐步推理信号）和空间监督（保持潜在流形的语义结构）。我们的分析表明，刚性几何压缩可能坍缩推理空间，而生成式重建提供了更灵活的语义锚点，更好地保留了信息容量。为了衡量这些效应，我们引入了统一潜在探针（Unified Latent Probe, ULP），用于量化潜在轨迹与显式推理步骤之间的互信息。实验揭示了清晰的信息-性能绑定关系：推理准确性取决于潜在链中保留的信息保真度。这些发现为潜在推理监督提供了一个原则性框架，并建议从几何模仿转向互信息最大化。我们的代码可在 https://github.com/EIT-NLP/Supervision-in-Latent-CoT 获取。

英文摘要

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at https://github.com/EIT-NLP/Supervision-in-Latent-CoT.

2606.19427 2026-06-19 astro-ph.CO astro-ph.IM physics.comp-ph physics.data-an 新提交 80%

Physics-guided discovery of dynamical dark-energy equations of state through iterative AI reasoning

通过迭代AI推理进行物理引导的动力学暗能量状态方程发现

Clecio R. Bom, Bernardo M. Fraga, Miguel A. Sabogal, Armando Bernui, Phelipe Darc, Gustavo Schwarz

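Topic match (complex problem solving): iterative LLM reasoning discovers dark-energy equations of state.

AI summary: Proposes an iterative AI reasoning framework that uses a large language model to generate and refine dark-energy equations of state, combined with literature retrieval and automated evaluation, finding two new parameterizations that attain higher Bayesian evidence than the traditional models considered on supernova, baryon acoustic oscillation, and Planck data.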

Comments 6 figures, 45 pages, submitted. Code: https://iadev.cbpf.br/labia/cosmoai

AI中文摘要

现象学模型构建传统上依赖人类推理：方程从理论直觉、类比或经验便利中提出，然后才与数据对比。这里我们展示，这一循环可以重构为动力学暗能量的迭代AI推理过程。我们的框架使用大语言模型提出状态方程及宇宙学理由，通过从暗能量文献中检索来奠定基础，并通过自主评估进行优化。每个候选方程嵌入宇宙学模型，针对观测进行优化，并使用似然性能和理论一致性进行评估。独立的语言模型评判者对方程及其理由的物理动机、新颖性、清晰度、稳定性和实现有效性进行评分，使得后续提议在数学结构和物理推理上共同演化。应用于包括超新星、重子声学振荡和Planck似然在内的宇宙学数据组合，该框架识别出两种参数化形式，据我们所知，这些形式此前未被探索过，且与已有形式竞争。对于Pantheon+超新星、DESI DR2重子声学振荡和完整的Planck 2018温度、极化和透镜似然，AI选择的最佳模型获得的贝叶斯证据比这里考虑的传统参数化大一个单位以上。这些结果表明，AI引导的推理可以通过提出和评估动力学暗能量的可解释现象学参数化来补充物理模型构建。

英文摘要

Phenomenological model building has traditionally relied on human reasoning: equations are proposed from theoretical intuition, analogy, or empirical convenience, and only then tested against data. Here we show that this cycle can be recast as an iterative AI reasoning process for dynamical dark energy. Our framework uses a large language model to propose equations of state together with cosmological rationales, grounded by retrieval from the dark-energy literature and refined through autonomous evaluation. Each candidate is embedded in a cosmological model, optimized against observations, and assessed using likelihood performance and theoretical consistency. An independent language-model critic scores the physical motivation, novelty, clarity, stability and implementation validity of both the equation and its rationale, allowing subsequent proposals to evolve jointly in mathematical structure and physical reasoning. Applied to cosmological data combinations including supernovae, baryon acoustic oscillations and Planck likelihoods, the framework identifies two parameterizations that, to the best of our knowledge, have not previously been explored and that are competitive with established forms. For Pantheon+ supernovae, DESI DR2 baryon acoustic oscillations and the full Planck 2018 temperature, polarization, and lensing likelihoods, the best AI-selected model attains larger Bayesian evidence than the traditional parameterizations considered here by more than one unit. These results show that AI-guided reasoning can complement physical model building by proposing and evaluating interpretable phenomenological parameterizations for dynamical dark energy.

2606.20401 2026-06-19 eess.SY cs.SY 新提交 70%

PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies

PowerAgentBench-Dyn：电力系统动态研究中智能体AI的基准测试

Qian Zhang, Andrea Pomarico, Costas Mylonas, Magda Foti, Alberto Berizzi, Le Xie

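Topic match (complex problem solving): involves multi-step reasoning and engineering judgment.

AI summary: Proposes the PowerAgentBench-Dyn benchmark for evaluating LLM-based agents on power system dynamic-analysis tasks, covering two tasks: model quality review and security risk screening.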

AI中文摘要

基于大型语言模型（LLM）的智能体越来越多地被用于通过与软件工具交互、解释中间结果以及自主规划后续行动来自动化多步骤工程工作流。电力系统动态研究是这些智能体一个特别有前景但尚未充分探索的应用领域。与静态计算任务不同，动态研究通常需要更多时间进行模型参数校准、工程判断以及在受限动作空间下的决策。本文介绍了PowerAgentBench-Dyn，一个旨在评估智能体AI系统在电力系统动态分析任务上的基准测试。该基准针对那些不能简化为单一优化或编码任务的问题，而是需要经验丰富的电力系统工程师日常执行的那种推理、工具使用和迭代实验。所提出的框架包括两个初始基准任务。第一个是动态模型质量审查基准，评估智能体根据系统运营商指定的模型质量合规标准验证和诊断动态模型的能力。第二个是动态安全风险筛选基准，评估智能体利用语义记忆和有限的仿真预算从未见故障数据集中识别、排序和分析最关键短路事故，并提出和评估可能的缓解措施的能力。对于每个任务，我们定义了仿真环境、观测和动作空间以及评估指标。该基准在基于度量的意义上是可复现的：发布案例和仿真器设置定义了确定性评估器，而随机智能体行为通过重复运行使用成功率和其他指标进行评估。该基准支持未来用于电力系统运行和规划的智能体AI的开发。

英文摘要

Large Language Model (LLM)-based agents are increasingly being used to automate multi-step engineering workflows by interacting with software tools, interpreting intermediate results, and autonomously planning subsequent actions. Power system dynamic studies represent a particularly promising yet largely unexplored application domain for these agents. Unlike static computational tasks, dynamic studies often require more time on model parameter calibration, engineering judgment, and decision making under constrained action spaces. This paper introduces PowerAgentBench-Dyn, a benchmark designed to evaluate Agentic AI systems on power system dynamic-analysis tasks. The benchmark targets problems that cannot be reduced to a single optimization or coding task, but instead require the type of reasoning, tool usage, and iterative experimentation routinely performed by experienced power system engineers. The proposed framework includes two initial benchmark tasks. The first, the Dynamic Model Quality Review Benchmark, evaluates agents' ability to validate and diagnose dynamic models based on model-quality compliance criteria specified by system operators. The second, the Dynamic Security Risk Screening Benchmark, assesses agents' capability to leverage semantic memory and a limited simulation budget to identify, rank, and analyze the most critical short-circuit contingencies from an unseen fault dataset, as well as propose and evaluate possible mitigation measures. For each task, we define the simulation environment, observation and action spaces, and evaluation metrics. The benchmark is reproducible in a metric-based sense: released cases and simulator settings define a deterministic evaluator, while stochastic agent behavior is assessed over repeated runs using success rates and other metrics. The benchmark supports the development of future Agentic AI for power system operation and planning.

2606.19893 2026-06-19 cs.AI 新提交 70%

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

MetaResearcher: 通过对抗虚拟环境中的自我反思强化学习扩展深度研究

Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

发表机构 * School of Digital Arts, Jiangxi Arts & Ceramics Technology Institute（江西陶瓷工艺美术职业技术学院数字艺术学院）； Universiti Sains Malaysia（马来西亚理科大学）

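Topic match (complex problem solving): discovery-oriented tasks that go beyond fact retrieval.

AI summary: Proposes the MetaResearcher framework, which scales the training of deep research agents in adversarial environments through an evolving virtual world, discovery-oriented tasks, a self-reflective meta-reward, and a heterogeneous multi-agent architecture, aiming to improve benchmark performance and epistemic robustness.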

AI中文摘要

深度研究智能体在自主信息收集和综合方面展现了卓越的能力，但其训练仍受限于模拟环境的静态性、仅限事实检索的任务设计的局限性以及基于结果的强化学习的低效性。在这项工作中，我们提出了MetaResearcher，一个新颖的框架，在四个协同维度上扩展深度研究智能体的训练。首先，我们引入了一个演化虚拟世界，将时间动态和对抗性错误信息注入训练环境，迫使智能体发展来源可信度评估和时间冲突解决技能。其次，我们设计了发现导向任务——包括假设生成和矛盾解决——超越了简单的事实检索，推动智能体走向真正的研究行为。第三，我们在GRPO框架内提出了一种自我反思元奖励机制，共同优化答案正确性、搜索路径效率、反思深度和工具调用多样性，直接解决了先前工作中观察到的重复动作循环问题。第四，我们引入了一个异构多智能体群体架构，包括专门的侦察、过滤和合成模型，通过协调强化学习学习协作研究策略。基于LiteResearcher基础设施，MetaResearcher在训练中需要零边际API成本，同时目标是在基准性能（GAIA，Xbench-DS）和对抗条件下的认知鲁棒性方面实现显著改进。我们展示了完整的框架设计、训练方法和计划的实验验证。

英文摘要

Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.

2606.19741 2026-06-19 cs.AI cs.LG 新提交 65%

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

通过演化程序瓶颈解释神经组合优化

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Nanyang Technological University（南洋理工大学）； Microsoft Research（微软研究院）； Massachusetts Institute of Technology（麻省理工学院）

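Topic match (complex problem solving): interpretability of combinatorial optimization, related to reasoning.

AI summary: Proposes the Evolving Programmatic Bottlenecks (EPB) framework, which achieves interpretability by distilling black-box neural combinatorial optimization models into readable program portfolios using an LLM and hybrid gradient descent, and reveals how model behavior relates to variants of classic heuristics.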

Comments Under Review

AI中文摘要

神经组合优化（NCO）取得了强劲性能，但其黑盒性质仍然是部署和科学诊断的关键障碍。标准可解释性工具（如概念瓶颈模型）不适用于NCO，因为其决策是动态的、状态依赖的，且缺乏适当的概念词汇定义。为弥合这一差距，我们引入了演化程序瓶颈（EPB），据我们所知，这是首个通过将黑盒NCO模型蒸馏为人类可读程序组合来解释NCO策略的框架。EPB利用LLM自主演化一组程序，其中每个程序的每步动作分布作为瓶颈。EPB通过迭代框架工作：模块I固定程序库容量，并引入混合文本-数值梯度下降方案，该方案将学生路由器更新的数值梯度和基于LLM程序修订的文本梯度相结合；模块II通过故障目标扩展和冗余剪枝动态调整库容量。大量实验证明了EPB的有效性和广泛适用性，蒸馏后的程序组合在很大程度上保持了原始性能。EPB还揭示了NCO行为在优化阶段的变化，并且可以近似为经典启发式变体的组合。我们的工作推进了可解释NCO，并将EPB建立为解释序列决策模型的有前途工具。

英文摘要

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2305.14985 2026-06-19 cs.CV cs.CL 版本更新 65%

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

IdealGPT: 通过大型语言模型迭代分解视觉与语言推理

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University（哥伦比亚大学）； HKUST（香港科技大学）； University of California, Los Angeles（加州大学洛杉矶分校）

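Topic match (complex problem solving): an LLM generates sub-questions and reasons to the final answer.

AI summary: Proposes the IdealGPT framework, which uses large language models to iteratively decompose vision-and-language reasoning tasks through a loop of sub-question generation, sub-answer collection, and final-answer reasoning, significantly improving multi-step reasoning performance in the zero-shot setting.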

Comments 13 pages, 5 figures

AI中文摘要

视觉与语言（VL）理解领域通过端到端的大型预训练VL模型（VLM）取得了前所未有的进展。然而，它们在需要多步推理的零样本推理任务中仍存在不足。为了实现这一目标，先前的工作采用了分而治之的流程。本文认为，先前的工作存在几个固有的缺点：1）它们依赖于特定领域的子问题分解模型。2）即使子问题或子答案提供的信息不足，它们也强制模型预测最终答案。我们通过IdealGPT框架解决了这些局限性，该框架利用大型语言模型（LLM）迭代分解VL推理。具体来说，IdealGPT使用一个LLM生成子问题，一个VLM提供相应的子答案，另一个LLM进行推理以得出最终答案。这三个模块迭代地执行分而治之的过程，直到模型对主问题的最终答案有信心。我们在零样本设置下对多个具有挑战性的VL推理任务评估了IdealGPT。特别是，我们的IdealGPT在VCR上比现有最好的GPT-4类模型绝对提高了10%，在SNLI-VE上提高了15%。代码可在以下网址获取：https://github.com/Hxyou/IdealGPT

英文摘要

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT

2606.20206 2026-06-19 stat.ML cs.LG 新提交 60%

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

马尔可夫决策过程中奖励非随机缺失的缺失感知策略的离线评估

Ziheng Wei, Annie Qu, Rui Miao

发表机构 * Department of Statistics, University of Michigan at Ann Arbor（密歇根大学安娜堡分校统计学系）； Department of Statistics and Applied Probability, University of California at Santa Barbara（加州大学圣巴巴拉分校统计与应用概率系）； Department of Mathematical Sciences, University of Texas at Dallas（德克萨斯大学达拉斯分校数学科学系）

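Topic match (complex problem solving): off-policy evaluation with missing rewards.

AI summary: For offline reinforcement learning with rewards missing not at random, proposes an identification method that uses future states as shadow variables and recovers the conditional mean reward with a bridge function and a min-max estimator, enabling off-policy evaluation of missingness-aware policies.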

Comments Accepted at ICML 2026. 31 pages, 6 figures

AI中文摘要

在离线强化学习中，由于记录稀疏或不规则，或超出特定奖励值的审查，记录批次数据中的即时奖励通常未被观测到。这个问题出现在实际场景中，包括医疗和营销。我们研究了有限时域马尔可夫决策过程中奖励非随机缺失时的离线策略评估，这破坏了可忽略性，并即使在以状态和行动为条件后也会引起选择偏差。为了解决这个问题，我们形式化了一个依赖于奖励的倾向模型，并使用未来状态作为影子变量来识别完整数据的条件均值奖励。我们进一步引入了一个桥函数，无需显式建模MNAR机制即可恢复条件均值奖励，并通过min-max过程进行估计以避免双重采样。基于这些识别结果，我们提出了一个类似Fitted-Q-Evaluation的估计器，该估计器传播恢复的奖励，同时允许目标策略依赖于过去的缺失指示符。最后，我们为我们的OPE估计器建立了一致性和有限样本误差界，并通过实验在模拟数据和MIMIC-III脓毒症数据上展示了我们方法相比现有方法的强性能。

英文摘要

In offline Reinforcement Learning, immediate rewards in logged batch data are often unobserved due to sparse or irregular record-keeping, or censored beyond certain reward values. This issue arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure to avoid double sampling. Building upon these identification results, we propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through experiments the strong performance of our method compared to existing methods on simulated and MIMIC-III Sepsis data.

2606.19846 2026-06-19 econ.GN q-fin.EC 新提交 55%

What Capital After Labor? Forecasting the Talent ROI Transition in the Human-AI Era

劳动力之后是什么资本？预测人机时代的人才ROI转型

Kwan Soo Shin, In Seok Kang

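Topic match (complex problem solving): a talent ROI forecasting framework for the AI era.

AI summary: Because AI augmentation breaks the accounting link between labor time and contribution, the paper builds a forecasting framework for the shift from time-based to output-based talent ROI, with ROI inversion as its core theorem; it uses Korea's 52-hour workweek case as evidence of an early overhead-pressure signal and forecasts that output-based firms will lead in TFP growth by 1.5-2.0 percentage points by 2032.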

Comments 90 pages, 6 figures

AI中文摘要

AI增强打破了劳动时间与生产贡献之间的会计联系，但企业仍通过基于时间的间接费用包来评估人才。本文开发了一个预测框架，用于在人机时代从基于时间的人才会计向基于产出的人才ROI转型。该框架以定理3（在τ*处的ROI反转）为实证主轴，包含四个机制定理：间接费用非加性、增强节省时间路径、创新溢价放大以及人机二元归因不确定性。韩国分阶段实施的52小时工作制规定提供了一个实证预警案例。在一个包含365家上市公司的DART面板数据（2281个公司-年观测值）中，SG&A与收入比率从2018年的18.26%上升至2020年的20.06%，在2021-2022年略有修正，并于2024年达到20.10%的峰值。在收入百分位队列代理下，双向固定效应（+1.56个百分点，p=0.049）、合并事件研究估计（t=+3时为+4.21个百分点，p=0.001）以及Callaway-Sant'Anna双重稳健交错DID估计（t=+4时为+4.51个百分点）收敛于一个正向间接费用压力特征。2015-2017年的向后扩展（224家公司，601个观测值）提供了预处理数据，提供了反对预先存在的上升趋势混杂因素的证据。我们将韩国证据解读为，据我们所知，第一个经验记录的τ*前间接费用压力制度特征，其中基于时间的会计仍占主导地位，而AI增强和劳动时间压缩共同推高了间接费用。预计到2032年，基于产出的公司在公司层面TFP增长上比基于时间的同行高出1.5-2.0个百分点。贡献在于为向AI增强的人才ROI会计转型提供了一个预测模型和管理规划工具。

英文摘要

AI augmentation breaks the accounting link between labor time and productive contribution, yet firms continue to evaluate talent through time-based overhead bundles. This paper develops a forecasting framework for the transition from time-based talent accounting to output-based talent ROI in the human-AI era. The framework centres on Theorem 3 (ROI Inversion at τ*) as the empirical spine, with four mechanism theorems: overhead non-additivity, augmentation-saved-time pathways, innovation-premium amplification, and human-AI dyad attribution uncertainty. Korea's staged 52-hour workweek mandate provides an empirical early-warning case. In a DART panel of 365 listed firms (2,281 firm-year observations), the SG&A-to-revenue ratio rose from 18.26 percent in 2018 to 20.06 percent in 2020, corrected mildly in 2021-2022, and peaked at 20.10 percent in 2024. Under the revenue-percentile cohort proxy, two-way fixed effects (+1.56 pp, p = 0.049), pooled event-study estimates (+4.21 pp at t = +3, p = 0.001), and Callaway-Sant'Anna doubly-robust staggered DiD estimates (+4.51 pp at t = +4) converge on a positive overhead-pressure signature. A 2015-2017 backward extension (224 firms, 601 observations) supplies pre-treatment data, providing evidence against pre-existing upward-trend confounds. We read the Korean evidence not as a direct τ* estimate or a point causal magnitude, but as, to our knowledge, the first empirically documented signature of the pre-τ* overhead-pressure regime, where time-based accounting still dominates while AI augmentation and labor-time compression jointly raise overhead. Output-based firms are forecast to outperform time-based peers by 1.5-2.0 percentage points in firm-level TFP growth by 2032. The contribution is a forecasting model and managerial planning tool for the shift to AI-augmented talent ROI accounting.

1702.06162 2026-06-19 cs.CR 版本更新 55%

Survey of Automated Vulnerability Detection and Exploit Generation Techniques in Cyber Reasoning Systems

网络推理系统中自动化漏洞检测与利用生成技术综述

Teresa Nicole Brooks

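Topic match (complex problem solving): a survey of automated vulnerability detection and exploit generation, involving reasoning.

AI summary: Surveys the automated vulnerability detection and exploit generation techniques of Mayhem and Mechanical Phish, two winning systems of the DARPA Cyber Grand Challenge, summarizing their core methods, underlying technologies, and related work.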

Comments This is the accepted submitted version of this paper that was published in the Intelligent Computing Proceedings of the 2018 Computing Conference, Volume 2

Journal ref Intelligent Computing: Proceedings of the 2018 Computing Conference, Vol. 2, Springer, 2019, pp. 1083-1102

DOI: 10.1007/978-3-030-01177-2_79

AI中文摘要

软件无处不在，从工业电站、心脏起搏器等关键任务系统到家用电器。对技术日益增长的依赖以及软件复杂性的增加带来了严重的安全隐患，因为我们可能被含有可利用漏洞的软件所包围。这些挑战使得二进制分析成为计算机科学中的一个重要研究领域，并强调了构建能够以规模、速度和效能运行的自动化分析系统的必要性，同时具备人类专家的技能。尽管该领域的研究取得了巨大进展，但仍存在局限性和有待解决的开放挑战。认识到这一需求，DARPA赞助了网络大挑战赛（CGC），这是一场展示当前最先进系统的竞赛，这些系统执行自动化漏洞检测、利用生成和软件修补。本文是对两个获胜系统Mayhem和Mechanical Phish的漏洞检测与利用生成技术、底层技术及相关工作的综述。

英文摘要

Software is everywhere, from mission-critical systems such as industrial power stations and pacemakers to household appliances. This growing dependence on technology and the increasing complexity of software have serious security implications, as it means we are potentially surrounded by software that contains exploitable vulnerabilities. These challenges have made binary analysis an important area of research in computer science and have emphasized the need for building automated analysis systems that can operate at scale, speed, and efficacy, all while performing with the skill of a human expert. Though great progress has been made in this area of research, there remain limitations and open challenges to be addressed. Recognizing this need, DARPA sponsored the Cyber Grand Challenge (CGC), a competition to showcase the current state of the art in systems that perform automated vulnerability detection, exploit generation, and software patching. This paper is a survey of the vulnerability detection and exploit generation techniques, underlying technologies, and related works of two of the winning systems, Mayhem and Mechanical Phish.

2606.15197 2026-06-19 cs.LG cs.AI 新提交 80%

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR: 协同树搜索与测试时强化学习用于优化建模

Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Northwest A&F University（西北农林科技大学）

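Topic match (planning and reasoning): combines MCTS with GRPO for reasoning optimization.

AI summary: Proposes the StarOR framework, which combines Monte Carlo tree search with test-time reinforcement learning; through a four-stage decomposition and GRPO updates to a LoRA adapter, it optimizes intermediate decisions with unsupervised fine-grained rewards and reaches state-of-the-art performance on five benchmarks with a 4B model.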

Comments 41 pages, v1, preprint

AI中文摘要

优化建模本质上是层次化的，需要精确的符号承诺序列。传统的基于学习的自动化优化建模方法通过大规模标注或策划的训练数据改进建模策略，但适应新问题分布成本高昂。同时，一次性生成在层次化建模中仍然脆弱，早期符号错误可能传播为无效公式。测试时缩放通过额外的实例级计算实现结构探索，提供了一种有前景的替代方案；然而，现有的基于搜索的方法通常依赖固定策略，导致重复展开继承相似的建模偏差，并为中间决策提供有限的信用分配。为了解决这些限制，我们提出了StarOR，一种协同搜索与适应的框架，将MCTS与测试时强化学习相结合用于优化建模。StarOR将建模过程分解为四个阶段，并通过GRPO在每个非终端节点更新瞬态LoRA适配器。通过使用MCTS生成的兄弟节点作为局部比较集，StarOR将搜索时的探索转化为实例特定的策略细化。此外，无监督的多方面奖励系统为中间公式决策提供细粒度反馈，无需真实标签。在五个优化基准上的实验表明，即使使用4B骨干网络，StarOR也实现了最先进的性能，优于现有方法和前沿LLMs。

英文摘要

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.
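The idea of using MCTS-generated siblings as a local comparison set can be illustrated with the group-relative advantage that GRPO-style updates are commonly built on: each sibling's reward is normalized against the mean and spread of its own group. The sketch below is a generic illustration, not StarOR's code; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sibling's reward against its own group.

    rewards: scalar rewards for the sibling candidates expanded from one
    search node (assumed to come from some reward function).
    Returns (r_i - mean) / (std + eps) for each sibling.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]
```

When all siblings receive the same reward the advantages are all zero, so such a group contributes no policy-gradient signal; informative updates need siblings whose rewards differ.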


2606.20014 2026-06-19 cs.LG cs.AI New submission 75%

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

多智能体博弈中的层次化控制：基于LLM的规划与RL执行

Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

Affiliations: Electronic Arts; KTH Royal Institute of Technology

Topic match: Planning reasoning. The LLM performs high-level planning, which falls under planning reasoning.

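AI summary: Proposes a hierarchical architecture in which an LLM acts as a central strategic controller that selects among RL skill policies. In a 2v2 adversarial environment it reaches a win rate comparable to a hand-crafted BT and is perceived as the most human-like.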

Comments 12 pages, 9 figures


AI-generated Chinese abstract

强化学习（RL）在序列决策中取得了强劲表现，但由于稀疏奖励、大状态-动作空间以及学习协调策略的困难，扩展到复杂多智能体环境仍具挑战。我们提出一种层次化架构，其中预训练的大语言模型（LLM）作为集中式策略控制器，为一组智能体选择专门的RL技能策略，而RL策略负责反应式底层执行。我们在竞争性2v2 King of the Hill环境中评估该混合系统，与行为树（BT）和“扁平”RL（无技能分解的端到端训练）基线进行比较。LLM+RL系统实现了与手工BT统计上相当的任务性能（胜率46.4% vs 51.5%，p=0.103），而两者均显著优于无技能分解训练的扁平RL。一项用户研究（n=15）显示，60%的参与者认为LLM+RL智能体最像人类（p=0.027），归因于行为适应性和战术变异性。这些结果表明，预训练LLM推理可以有效编排预训练RL技能，实现具有竞争力的多智能体协调和优越的感知可信度，而无需手动规则工程。

English abstract

Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and "Flat" RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4% vs 51.5% win rate, p=0.103) while both significantly outperform Flat RL trained without skill decomposition. A user study (n=15) reveals that 60% of participants perceive LLM+RL agents as the most human-like (p=0.027), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.
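The division of labor described above (a central planner picks one skill per agent, and the chosen skill policy produces the low-level action) can be sketched roughly as follows. Everything here is a hypothetical stand-in: the skill names, the `dist_to_hill` observation field, and the rule-based `stub_planner` that takes the place of an actual LLM call are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List

Observation = Dict[str, float]
SkillPolicy = Callable[[Observation], str]


# Hypothetical skill policies; in the paper's setting these would be trained RL policies.
def capture(obs: Observation) -> str:
    return "move_to_hill"


def defend(obs: Observation) -> str:
    return "hold_position"


SKILLS: Dict[str, SkillPolicy] = {"capture": capture, "defend": defend}


def stub_planner(team_obs: List[Observation]) -> List[str]:
    """Stand-in for the LLM controller: returns one skill name per agent.

    A real system would prompt an LLM with the team state; this rule is only
    a placeholder so the control loop can run end to end.
    """
    return ["defend" if obs["dist_to_hill"] < 5.0 else "capture" for obs in team_obs]


def team_step(team_obs: List[Observation],
              planner: Callable[[List[Observation]], List[str]] = stub_planner) -> List[str]:
    """One hierarchical step: high-level skill selection, then low-level execution."""
    chosen = planner(team_obs)
    return [SKILLS[name](obs) for name, obs in zip(chosen, team_obs)]


# Two agents: one already near the hill, one far away.
actions = team_step([{"dist_to_hill": 2.0}, {"dist_to_hill": 12.0}])
```

Keeping the planner behind a plain callable means the placeholder rule could be swapped for an LLM-backed selector without touching the execution side.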


1. Logical reasoning (2 papers)

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

2. Mathematical reasoning (8 papers)

Process-Verified Reinforcement Learning for Theorem Proving via Lean

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

VIMPO: Value-Implicit Policy Optimization for LLMs

Pseudo-Formalization for Automatic Proof Verification

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

3. Test-time compute (5 papers)

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

4. Other reasoning (1 paper)

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

5. Complex problem solving (9 papers)

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Physics-guided discovery of dynamical dark-energy equations of state through iterative AI reasoning

PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

What Capital After Labor? Forecasting the Talent ROI Transition in the Human-AI Era

Survey of Automated Vulnerability Detection and Exploit Generation Techniques in Cyber Reasoning Systems

6. Planning reasoning (2 papers)

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution