LLM Reasoning Capabilities - arXivDaily Topic

2606.11918 2026-06-18 cs.AI New submission 90%

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas

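Affiliations: The University of California, Berkeley; ETH Zurich; University of Oxford; Stanford University

Topic match (planning and reasoning): self-supervised reinforcement learning improves spatial reasoning

AI summary: Proposes a self-supervised reinforcement learning framework that uses geometric and semantic consistency verifiers (for example, image flips and swapping the order of objects in the text) to align the spatial reasoning ability already present in pre-trained models, approaching the accuracy of supervised methods without labeled data.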

Abstract

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
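
As a rough illustration of what a consistency verifier could look like, the sketch below rewards a pair of answers when the answer to a left/right question on the original image and the answer on its horizontally flipped copy mirror each other. The function names, the one-word answer vocabulary, and the 0/1 reward are assumptions made for this example; the abstract does not specify the exact form of the verifiers, and OT-GRPO itself is not implemented here.

```python
# Hypothetical sketch of a pairwise consistency verifier (not the paper's code).
# Assumption: the model answers a left/right question with a single word.

MIRROR = {"left": "right", "right": "left"}


def normalize(answer: str) -> str:
    """Lowercase a one-word answer and strip whitespace and end punctuation."""
    return answer.strip().strip(".!?").lower()


def flip_consistency_reward(answer_original: str, answer_flipped: str) -> float:
    """Return 1.0 if the two answers are consistent under a horizontal flip.

    A horizontal flip swaps left and right, so a consistent model must give
    mirrored answers on the two images. No ground-truth label is used.
    """
    a = normalize(answer_original)
    b = normalize(answer_flipped)
    if a not in MIRROR or b not in MIRROR:
        return 0.0  # unparseable answers earn no reward
    return 1.0 if MIRROR[a] == b else 0.0


if __name__ == "__main__":
    print(flip_consistency_reward("Left.", "right"))  # mirrored pair
    print(flip_consistency_reward("left", "left"))    # non-mirrored pair
```

In this sketch the reward only checks agreement between the two answers, so a pair that is consistently wrong still scores 1.0. That is what makes such a signal label-free.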

2606.18686 2026-06-18 cs.AI cs.CL cs.LG New submission 85%

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

Jaeho Lee, Nick Merrill, Ezra Karger

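Affiliations: Forecasting Research Institute

Topic match (planning and reasoning): simulated-world forecasting benchmark for evaluating probabilistic reasoning

AI summary: Proposes ForecastBench-Sim, a forecasting benchmark built on Freeciv game simulation that uses game rollouts to generate controlled, immediately resolvable forecasting questions for evaluating the probabilistic reasoning of AI systems.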

Comments: 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

Abstract

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.
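
The abstract does not name the scoring rule. As an assumption, the sketch below uses the Brier score, a standard proper scoring rule for binary forecasts, to show how a forecast could be scored once the simulation has resolved the question. The question texts and probabilities are invented for illustration.

```python
# Illustrative only: the Brier score is an assumed scoring rule here,
# not necessarily the one ForecastBench-Sim uses.

def brier_score(probability: float, outcome: bool) -> float:
    """Squared error between a forecast probability and the 0/1 outcome.

    Lower is better: 0.0 is a perfect forecast, 1.0 is the worst possible.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return (probability - float(outcome)) ** 2


def mean_brier(forecasts: list[tuple[float, bool]]) -> float:
    """Average Brier score over (probability, resolved outcome) pairs."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)


if __name__ == "__main__":
    # Hypothetical questions, resolved by continuing the simulation.
    resolved = [
        (0.8, True),   # "Will player A found a new city within 10 turns?"
        (0.3, False),  # "Will player B lose a unit within 5 turns?"
    ]
    print(mean_brier(resolved))
```

Because a simulated world resolves every question as soon as the rollout continues, a score like this can be computed immediately instead of waiting for a real-world outcome.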

2606.18543 2026-06-18 cs.AI cs.CL cs.SE New submission 80%

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

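Affiliations: Princeton University

Topic match (planning and reasoning): decision-making under long-horizon uncertainty

AI summary: Proposes CEO-Bench, which simulates operating a startup for 500 days to evaluate the overall decision-making ability of language model agents in long-horizon, uncertain, dynamic environments.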

Abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.19328 2026-06-18 cs.LG cs.AI cs.RO New submission 70%

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

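Affiliations: Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto

Topic match (planning and reasoning): uncertainty-balanced preference planning

AI summary: Proposes UBP2, which actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions, and substantially improves sample efficiency on the Meta-World benchmark.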

Abstract

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
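
One way to picture a unified score over ensemble predictions is sketched below. The additive form, the weight `beta`, and the use of ensemble disagreement (standard deviation of predicted returns) as a proxy for epistemic uncertainty are all assumptions for this example; the abstract does not give the actual UBP2 objective. The inputs are assumed to come from rolling each candidate trajectory through the ensemble members beforehand.

```python
# Hypothetical sketch of an uncertainty-aware trajectory score.
# The additive form and the weight `beta` are assumptions, not UBP2's
# actual objective.
from statistics import mean, pstdev


def trajectory_score(member_rewards: list[list[float]],
                     member_terminal_values: list[float],
                     beta: float = 1.0) -> float:
    """Score one candidate trajectory from ensemble predictions.

    member_rewards[k] holds the per-step rewards predicted by ensemble
    member k; member_terminal_values[k] is its value estimate at the final
    state. Disagreement between members (population standard deviation of
    their returns) stands in for epistemic uncertainty.
    """
    if len(member_rewards) != len(member_terminal_values):
        raise ValueError("one terminal value is needed per ensemble member")
    returns = [sum(r) + v for r, v in zip(member_rewards, member_terminal_values)]
    return mean(returns) + beta * pstdev(returns)


def pick_best(candidates: dict[str, tuple[list[list[float]], list[float]]],
              beta: float = 1.0) -> str:
    """Return the name of the highest-scoring candidate trajectory."""
    return max(candidates,
               key=lambda name: trajectory_score(*candidates[name], beta=beta))
```

With `beta = 0` this sketch reduces to ranking by expected return. A larger `beta` favors trajectories on which the ensemble disagrees, which is one simple way to trade exploitation against information acquisition.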

2606.18633 2026-06-18 cs.MA New submission 60%

PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

Zhiyuan Wen, Jiannong Cao, Peng Gao, Haochen Shi, Wengpan Kuan, Bo Yuan, Xiuxiu Qi

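Topic match (planning and reasoning): hierarchical SFT and reward-adaptive generation of executable plans

AI summary: Proposes PersonalPlan, a two-stage multi-agent planner that uses hierarchical SFT and Reward-Adaptive GRPO to generate executable, personalized, and pedagogically scaffolded plans, outperforming existing methods on the MAP-PPL dataset.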

Abstract

Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill in the gap, we first introduce MAP-PPL (Multi-Agent Plans for Personalized Programming Learning), a profile-conditioned multi-agent planning dataset with 3,043 query-profile-plan instances from 1,730 Stack Overflow question groups and 2,738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose PersonalPlan, a two-stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile-aware task decomposition and step dependency planning, then applies a Reward-Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP-PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.
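
Since each plan carries prerequisite dependencies between steps, one basic executability check is whether the steps can be ordered at all. The sketch below runs that check with Python's standard `graphlib` module. The step names and the dictionary layout are hypothetical and are not taken from the MAP-PPL dataset, and this is not the paper's executability metric.

```python
# Hypothetical plan structure; step names are illustrative and not taken
# from the MAP-PPL dataset.
from graphlib import CycleError, TopologicalSorter
from typing import Optional


def execution_order(dependencies: dict[str, set[str]]) -> Optional[list[str]]:
    """Return one valid step order, or None if the plan cannot be executed.

    `dependencies` maps each step to the steps that must finish before it.
    A plan with a dependency cycle has no valid order.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None


if __name__ == "__main__":
    plan = {
        "explain_recursion": set(),
        "trace_example": {"explain_recursion"},
        "write_exercise": {"explain_recursion", "trace_example"},
    }
    print(execution_order(plan))

    cyclic = {"a": {"b"}, "b": {"a"}}
    print(execution_order(cyclic))  # None: the two steps depend on each other
```

A check like this only covers structural validity. Whether a plan is personalized or pedagogically sound would need separate criteria, which the abstract does not detail.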
