大模型推理能力 - arXivDaily 专题

2606.15197 2026-06-19 cs.LG cs.AI 新提交 80%

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR: 协同树搜索与测试时强化学习用于优化建模

Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Northwest A&F University（西北农林科技大学）

专题命中规划推理：结合MCTS与GRPO进行推理优化

AI总结提出StarOR框架，结合蒙特卡洛树搜索与测试时强化学习，通过四阶段分解和GRPO更新LoRA适配器，实现无监督细粒度奖励的中间决策优化，在5个基准上以4B模型达到最优性能。

Comments 41pages, V1, preprint

详情

AI中文摘要

优化建模本质上是层次化的，需要精确的符号承诺序列。传统的基于学习的自动化优化建模方法通过大规模标注或策划的训练数据改进建模策略，但适应新问题分布成本高昂。同时，一次性生成在层次化建模中仍然脆弱，早期符号错误可能传播为无效公式。测试时缩放通过额外的实例级计算实现结构探索，提供了一种有前景的替代方案；然而，现有的基于搜索的方法通常依赖固定策略，导致重复展开继承相似的建模偏差，并为中间决策提供有限的信用分配。为了解决这些限制，我们提出了StarOR，一种协同搜索与适应的框架，将MCTS与测试时强化学习相结合用于优化建模。StarOR将建模过程分解为四个阶段，并通过GRPO在每个非终端节点更新瞬态LoRA适配器。通过使用MCTS生成的兄弟节点作为局部比较集，StarOR将搜索时的探索转化为实例特定的策略细化。此外，无监督的多方面奖励系统为中间公式决策提供细粒度反馈，无需真实标签。在五个优化基准上的实验表明，即使使用4B骨干网络，StarOR也实现了最先进的性能，优于现有方法和前沿LLMs。

英文摘要

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.

URL PDF HTML ☆

赞 0 踩 0

2606.20014 2026-06-19 cs.LG cs.AI 新提交 75%

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

多智能体博弈中的层次化控制：基于LLM的规划与RL执行

Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

发表机构 * Electronic Arts ； KTH Royal Institute of Technology（皇家理工学院）

专题命中规划推理：LLM进行高层规划，属于规划推理。

AI总结提出LLM作为中央策略控制器选择RL技能策略的层次化架构，在2v2对抗环境中达到与手工BT相当的胜率，且被感知为最类人。

Comments 12 pages, 9 figures

详情

AI中文摘要

强化学习（RL）在序列决策中取得了强劲表现，但由于稀疏奖励、大状态-动作空间以及学习协调策略的困难，扩展到复杂多智能体环境仍具挑战。我们提出一种层次化架构，其中预训练的大语言模型（LLM）作为集中式策略控制器，为一组智能体选择专门的RL技能策略，而RL策略负责反应式底层执行。我们在竞争性2v2 King of the Hill环境中评估该混合系统，与行为树（BT）和“扁平”RL（无技能分解的端到端训练）基线进行比较。LLM+RL系统实现了与手工BT统计上相当的任务性能（胜率46.4% vs 51.5%，p=0.103），而两者均显著优于无技能分解训练的扁平RL。一项用户研究（n=15）显示，60%的参与者认为LLM+RL智能体最像人类（p=0.027），归因于行为适应性和战术变异性。这些结果表明，预训练LLM推理可以有效编排预训练RL技能，实现具有竞争力的多智能体协调和优越的感知可信度，而无需手动规则工程。

英文摘要

Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and \emph{``Flat''} RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4\% vs 51.5\% win rate, $p=0.103$) while both significantly outperform Flat RL trained without skill decomposition. A user study ($n=15$) reveals that 60\% of participants perceive LLM+RL agents as the most human-like ($p=0.027$), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.

URL PDF HTML ☆

赞 0 踩 0