AI Agent - arXivDaily Topic

2606.19787 2026-06-19 cs.AI New submission 90%

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

Jiajun Li, Mingshu Cai, Yixuan Li, Yu Ding, Ran Hou, Guanyu Nie, Xiongwei Han, Wanyuan Wang

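Institutions: Southeast University; Waseda University; Nanyang Technological University

Topic match (planning and decision-making): evaluates LLM agents' end-to-end performance on operations research tasks.

AI summary: Proposes the ORAgentBench benchmark for evaluating LLM agents on end-to-end operations research tasks; the best current agent passes only 35.51% of all tasks, limited mainly by strategic weaknesses.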

Comments 31 pages, preprint, v1

Abstract

Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on challenging end-to-end operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard-constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible-solution construction, and insufficient solution improvement. OR-specific procedural skills increase hard-task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high-quality operational decision-making.
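
The three checks named in the abstract (schema validity, hard-constraint feasibility, normalized objective quality) can be pictured with a toy validator. This is only a sketch under stated assumptions: the knapsack-style task, the `selected` field, the reference value used for normalization, and the 0.9 pass threshold are all hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    schema_ok: bool
    feasible: bool
    quality: float  # normalized objective in [0, 1]; 0.0 when not evaluated
    passed: bool


def validate(submission, weights, values, capacity, reference_value, threshold=0.9):
    """Toy validator for a hypothetical knapsack-style task (illustrative only)."""
    # Stage 1: schema validity. Expect {"selected": [distinct valid item indices]}.
    items = submission.get("selected") if isinstance(submission, dict) else None
    schema_ok = (
        isinstance(items, list)
        and all(isinstance(i, int) and not isinstance(i, bool) for i in items)
        and all(0 <= i < len(weights) for i in items)
        and len(set(items)) == len(items)
    )
    if not schema_ok:
        return Verdict(False, False, 0.0, False)
    # Stage 2: hard-constraint feasibility (total weight within capacity).
    if sum(weights[i] for i in items) > capacity:
        return Verdict(True, False, 0.0, False)
    # Stage 3: objective quality, normalized against an assumed reference value.
    quality = min(1.0, sum(values[i] for i in items) / reference_value)
    return Verdict(True, True, quality, quality >= threshold)
```

A submission can clear the first two stages and still fail, which mirrors the abstract's observation that many feasible submissions fall below the required quality threshold.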

2606.15862 2026-06-19 cs.AI New submission 90%

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

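Institutions: Ant Group; City University of Hong Kong

Topic match (planning and decision-making): evaluates long-horizon decision-making by LLM agents in a retail environment.

AI summary: Proposes the RetailBench benchmark, which simulates single-store supermarket operation to evaluate LLM agents on long-horizon decision-making; most models fail to survive the full evaluation horizon, and even the strongest remain well behind the oracle policy.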

Comments This paper is my paper's second version [see arXiv:2603.16453v2]

Abstract

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

2606.20376 2026-06-19 cs.LG cs.AI New submission 85%

CRAX: Fast Safe Reinforcement Learning Benchmarking

Tristan Tomilin, Mourad Boustani, Mickey Beurskens, Thiago D. Simão

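Institutions: Eindhoven University of Technology

Topic match (planning and decision-making): a safe RL benchmark that evaluates agents' planning and decision-making under constraints.

AI summary: Proposes CRAX, a JAX-accelerated safe RL benchmark built on the MJX physics engine that achieves speedups of up to about 100x; it comprises six environment suites and three agent-specific tasks, and an evaluation of six methods reveals trade-offs between performance and safety.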

Abstract

Safety is a core concern for deploying reinforcement learning (RL) agents in real-world domains such as robotics and autonomous driving. While benchmarks have been central to progress in RL, existing safety benchmarks with high-fidelity 3D physics remain computationally slow, limiting large-scale experimentation and rapid prototyping. To address this gap, we propose CRAX (Constrained RL Accelerated with JAX). Built on top of the MuJoCo XLA (MJX) physics engine with realistic 3D dynamics, CRAX leverages vectorized operations and hardware acceleration, yielding up to ~100x speedups over comparable CPU-based safety benchmarks. The benchmark features six environment suites and three agent-specific tasks, each spanning three difficulty levels. Evaluating six popular safe RL methods shows that no single approach dominates across all tasks, and reveals the trade-offs between performance and safety. We find that curriculum learning across difficulty levels and safety transfer can improve performance over direct training in harder settings.

2606.20142 2026-06-19 cs.AI cs.MA New submission 85%

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

Antón Asla Manzárraga

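Institutions: Independent Researcher

Topic match (planning and decision-making): a reasoning-agent control layer for optimizing metaheuristics.

AI summary: Proposes RACL, which places a reasoning agent above a metaheuristic optimizer to control its search behavior through observation, reasoning, and intervention; on vehicle routing problems it lowers average cost by 0.641% versus STP overall and by 8.337% versus Fixed in the Sevilla-9/10 sample.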

Comments 10 pages, 5 tables

Abstract

This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.

2606.20122 2026-06-19 cs.AI cs.MA New submission 85%

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

Zhibang Yang, Xinke Jiang, Yuzhen Xiao, Ruizhe Zhang, Yue Fang, XinFei Wan, Zhengxing Song, Yuxuan Liu, Yuheng Huang, Xu Chu, Junfeng Zhao, Yasha Wang

Institutions: National Engineering Research Center of Software Engineering, Peking University; School of Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies, Ministry of Education; GRG Banking Equipment Co., Ltd.; Center on Frontiers of Computing Studies, Peking University; Peking University Information Technology Institute (Tianjin Binhai)

Topic match (planning and decision-making): an agent framework that optimizes outlines for deep research.

AI summary: Proposes the ScaffoldAgent framework, which addresses scaffold drift in open-ended deep research through utility-guided dynamic outline optimization (Expansion, Contraction, and Revision operations), improving long-form report generation and factual grounding on DeepResearch Bench and DeepResearch Gym.

Comments 9 pages, 6 figures

Abstract

Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.

2606.20014 2026-06-19 cs.LG cs.AI New submission 85%

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

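Institutions: Electronic Arts; KTH Royal Institute of Technology

Topic match (planning and decision-making): an LLM acts as a planner that selects RL skill policies.

AI summary: Proposes a hierarchical architecture in which an LLM acts as a centralized strategic controller that selects RL skill policies; in a 2v2 competitive environment it achieves a win rate statistically equivalent to a hand-crafted BT and is perceived as the most human-like.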

Comments 12 pages, 9 figures

Abstract

Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and "Flat" RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4% vs 51.5% win rate, $p=0.103$) while both significantly outperform Flat RL trained without skill decomposition. A user study ($n=15$) reveals that 60% of participants perceive LLM+RL agents as the most human-like ($p=0.027$), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.

2606.19729 2026-06-19 cs.RO cs.AI New submission 85%

VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents

Marcus Hoerger, Rishikesh Joshi, Rahul Shome, Ian Manchester, Hanna Kurniawati

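Institutions: Australian National University; The University of Sydney

Topic match (planning and decision-making): an online planning agent for partially observable environments.

AI summary: Proposes the VOiLA framework, which learns POMDP models with conditional diffusion models, accelerates sampling through distillation, and integrates with a vectorized online planner, achieving efficient online planning on three benchmark tasks and a physical robot.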

Comments Submitted to the 2026 International Symposium of Robotics Research (ISRR)

Abstract

Planning under uncertainty is an essential capability for autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such a capability. Although POMDP-based planning has advanced significantly, its application to real-world problems is often limited by the difficulty of obtaining faithful POMDP models. We present Vectorized Online planning wIth Learned diffusion model for POMDP Agents (VOiLA), a framework that learns task-agnostic POMDP models for online planning under uncertainty. VOiLA learns transition and observation samplers using conditional diffusion models and learns observation-likelihood models for particle-based belief updates. To enable efficient online planning, the diffusion samplers are distilled into compact feedforward generators and integrated with the Vectorized Online POMDP Planner (VOPP), an online POMDP planner designed to leverage GPU parallelization. Experimental results indicate that the distillation strategy reduces sampling cost by up to nearly three orders of magnitude, making learned generative POMDP models practical for online planning. Evaluation of VOiLA on three benchmark problems indicates that VOiLA achieves equal or better performance than Recurrent Soft Actor Critic while using less than 10% of the training data, and generalizes much better to unseen environment configurations. Physical robot evaluation indicates that VOiLA, using models learned only from simulated data, generates a policy that successfully accomplishes the task in 10 of 10 runs.
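
The particle-based belief update mentioned in the abstract follows the usual propagate, weight, resample pattern. The sketch below is a plain bootstrap particle filter on a 1-D state; the hand-written Gaussian `transition_sample` and `observation_likelihood` functions are hypothetical stand-ins for the learned samplers and likelihood models, not the paper's implementation.

```python
import math
import random


def transition_sample(state, action, rng):
    # Stand-in for a learned transition sampler: shift by the action plus noise.
    return state + action + rng.gauss(0.0, 0.1)


def observation_likelihood(obs, state, sigma=0.5):
    # Stand-in for a learned likelihood model: unnormalized Gaussian density.
    return math.exp(-0.5 * ((obs - state) / sigma) ** 2)


def belief_update(particles, action, obs, rng):
    """One propagate-weight-resample step of a bootstrap particle filter."""
    propagated = [transition_sample(s, action, rng) for s in particles]
    weights = [observation_likelihood(obs, s) for s in propagated]
    if sum(weights) == 0.0:
        # No particle explains the observation; fall back to the prediction.
        return propagated
    # Resample with replacement in proportion to the observation likelihood.
    return rng.choices(propagated, weights=weights, k=len(particles))
```

Starting from a broad prior, a single update concentrates the particles around states consistent with the observation, which is the belief a planner would then search over.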

2606.15197 2026-06-19 cs.LG cs.AI New submission 85%

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang

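Institutions: School of Computer Science and Engineering, Southeast University; Northwest A&F University

Topic match (planning and decision-making): tree search and reinforcement learning for optimization modeling.

AI summary: Proposes the StarOR framework, which combines Monte Carlo tree search with test-time reinforcement learning; using a four-stage decomposition and GRPO updates to a LoRA adapter, it refines intermediate decisions with unsupervised fine-grained rewards and achieves state-of-the-art performance on five benchmarks with a 4B model.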

Comments 41pages, V1, preprint

Abstract

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.
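
Using sibling nodes as a local comparison set suggests the group-relative advantage that GRPO-style updates are built on: each candidate's reward is standardized against its siblings. A minimal sketch, assuming scalar rewards per sibling and a population standard deviation (both choices are illustrative, and the function name is hypothetical):

```python
from statistics import mean, pstdev


def group_relative_advantages(sibling_rewards, eps=1e-8):
    """Standardize each sibling's reward against the group mean and spread."""
    mu = mean(sibling_rewards)
    sigma = pstdev(sibling_rewards)
    # eps keeps the division defined when all siblings score the same.
    return [(r - mu) / (sigma + eps) for r in sibling_rewards]
```

Siblings that beat the group mean receive positive advantages and the rest negative ones, so no ground-truth label or learned value function is needed to rank intermediate decisions.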

2606.10616 2026-06-19 cs.AI New submission 85%

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

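Institutions: Huawei Noah's Ark Lab; Department of Computer Science, City University of Hong Kong

Topic match (planning and decision-making): memory-retention optimization for long-horizon language agents.

AI summary: Targeting the limited context windows of long-horizon language agents, proposes the OSL-MR framework, which formulates memory retention as a constrained stochastic optimization problem and learns query-conditioned evidence value under a strict separation between online-observable features and offline supervision; experiments show it outperforms existing methods under tight budgets.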

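AI abstract (translated from Chinese)

The observations, reasoning traces, and retrieved facts that long-horizon language agents accumulate exceed their limited context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but most treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs (including miss penalties, reacquisition latency, and stale-information risk). We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LoCoMo and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis shows robustness across a wide range of cost configurations.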

Abstract

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts exceeding context windows, making memory retention a fundamental resource-allocation problem. Existing systems treat retention as local and do not model long-term consequences under observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization with budget feasibility, evidence utility, and delayed costs including miss, reacquisition, and stale penalties. We show this multi-step problem is NP-hard, making exact solution intractable. Moreover, deployment decisions must be made under partial observability. To address these challenges, we propose OSL-MR (Observability-Safe Learning for Memory Retention), a learning-augmented framework that enforces a strict separation between online-observable features and offline-available supervision. OSL-MR combines an evidence learner trained from realized evidence with a Mixed-Score heuristic that serves as a deployable online-safe baseline and an inductive prior. The policy learns query-conditioned evidence from interaction data and remains deployable under the same constraints. Experiments on LoCoMo and LongMemEval show OSL-MR outperforms recency-based, Generative Agents-style, and other heuristic baselines, especially under tight budgets. The Mixed-Score prior improves precision and recall, and sensitivity analysis shows robustness across cost settings. On small solvable instances, single-step optimization is insufficient to anticipate future demand shifts, while OSL-MR stays significantly closer to the dynamic-programming optimum, confirming the necessity of the sequential formulation and reinforcing our learning-guided approximation. These results establish constrained stochastic optimization and optimization-guided learning as a principled foundation for memory management in long-horizon agents.
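
To make the budget constraint concrete, here is a sketch of a purely heuristic, single-step retention rule of the kind the abstract treats as a baseline: score each memory item from online-observable features, then keep items greedily by score per token until the budget is full. The `mixed_score` formula, its weights, the 10-step half-life, and the item fields are all hypothetical; this is not the paper's Mixed-Score heuristic or its learned policy.

```python
def mixed_score(item, now, w_recency=0.5, w_relevance=0.5):
    # Hypothetical score: exponential recency decay plus a relevance feature.
    recency = 0.5 ** ((now - item["t"]) / 10.0)
    return w_recency * recency + w_relevance * item["relevance"]


def retain(items, budget, now):
    """Greedily keep the items with the best score per token that fit the budget."""
    ranked = sorted(
        items, key=lambda it: mixed_score(it, now) / it["tokens"], reverse=True
    )
    kept, used = [], 0
    for it in ranked:
        if used + it["tokens"] <= budget:
            kept.append(it["id"])
            used += it["tokens"]
    return kept, used
```

A greedy pass like this respects the budget but looks only at the current step, which is the limitation the abstract points to when it says single-step optimization cannot anticipate future demand shifts.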

2606.19659 2026-06-19 cs.CL New submission 80%

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

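Institutions: Meta AI

Topic match (planning and decision-making): a multi-turn on-policy distillation framework that selectively intervenes on student responses.

AI summary: Proposes the SAGE-OPD framework, which uses environment feedback and teacher judgment to selectively intervene on student responses and combines teacher-confidence weighting with loss normalization to address compounding errors in multi-turn on-policy distillation, achieving up to a 13.3% relative improvement over standard OPD on ALFWorld.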

Comments 21 pages, 3 figures

Abstract

On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single-turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token-level OPD becomes brittle, as it may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. We propose SAGE-OPD, a verifier-free selective intervention framework specifically designed for multi-turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE-OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE-OPD weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE-OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi-turn OPD should remain on-policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.
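
The three ingredients in the abstract (turn-level skip or intervene, teacher-confidence weighting of token losses, and normalization) can be combined in a few lines. This is a sketch under assumptions: per-token distillation losses and teacher confidences are taken as given inputs, and normalization is implemented as division by the total confidence weight, which is one plausible reading rather than the paper's exact formula.

```python
def selective_weighted_loss(turns):
    """Confidence-weighted mean of token losses over intervened turns only.

    Each turn is a dict with 'intervene' (bool), 'token_losses' (list of float)
    and 'teacher_conf' (list of float in [0, 1]); all names are hypothetical.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for turn in turns:
        if not turn["intervene"]:
            continue  # skipped turns receive no teacher supervision
        for loss, conf in zip(turn["token_losses"], turn["teacher_conf"]):
            weighted_sum += conf * loss
            weight_total += conf
    # Dividing by the total weight keeps the result on the scale of a mean loss.
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

Tokens where the teacher is uncertain contribute less, and a skipped turn contributes nothing regardless of how large its token losses are.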

2606.19559 2026-06-19 cs.AI cs.CL New submission 80%

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Gregory Matsnev

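Institutions: AI Talent Hub, ITMO University

Topic match (planning and decision-making): proposes an uncertainty decomposition that lets LLM agents proactively seek clarification.

AI summary: Proposes a prompt-based uncertainty decomposition that separates action confidence from request uncertainty, allowing an agent to seek clarification when the task specification is ambiguous; averaged over five LLM backbones, clarification F1 on ALFWorld-Clarification improves by 73% over ReAct+UE and by 36% over UAM.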

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

Abstract

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.18272 2026-06-19 cs.NI cs.AI cs.SY eess.SY New submission 80%

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah

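Institutions: i2CAT Foundation; Universitat Politècnica de Catalunya; Research Institute for Digital Future

Topic match (planning and decision-making): resource negotiation by LLM agents in 6G network slicing.

AI summary: Proposes a randomized anchoring strategy based on a truncated three-parameter Weibull distribution to mitigate anchoring bias in LLM agents for 6G network slicing, combined with CVaR-based digital twins that safeguard SLA tail latency, achieving energy savings of up to 25%.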

Comments 7 pages, 4 figures

Abstract

This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the Bimodal Constraint-Avoidance Utility Theorem, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model (otel-llm-1b-it) confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings by up to 25%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC). Our source code is available for non-commercial use at https://github.com/HatimChergui.
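
Two building blocks named in the abstract have standard textbook forms: inverse-CDF sampling from a three-parameter Weibull restricted to an interval, and CVaR as the mean of the worst tail of a sample. The sketch below shows both; the function names and every parameter value a caller might pass are illustrative assumptions, and nothing here reproduces the paper's anchoring strategy or digital twin.

```python
import math
import random


def weibull_cdf(x, shape, scale, loc):
    # CDF of the 3-parameter Weibull: 1 - exp(-((x - loc) / scale) ** shape).
    if x <= loc:
        return 0.0
    return 1.0 - math.exp(-(((x - loc) / scale) ** shape))


def sample_truncated_weibull(shape, scale, loc, low, high, rng):
    """Inverse-CDF sample restricted to the finite interval [low, high]."""
    f_low = weibull_cdf(low, shape, scale, loc)
    f_high = weibull_cdf(high, shape, scale, loc)
    u = rng.uniform(f_low, f_high)  # uniform over the retained probability mass
    x = loc + scale * (-math.log(1.0 - u)) ** (1.0 / shape)
    return min(max(x, low), high)  # clamp away floating-point round-off


def cvar(samples, alpha=0.95):
    """Mean of the worst (1 - alpha) share of samples, where larger is worse."""
    ordered = sorted(samples)
    tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[-tail:]) / tail
```

Because the uniform draw is confined to the CDF values at the two bounds, every sample lands inside the interval without rejection, which is what makes the randomized anchor bounded.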

1805.08357 2026-06-19 cs.NI 80%

Multi-UAV Cooperative Trajectory for Servicing Dynamic Demands and Charging Battery

Kai Wang, Xiao Zhang, Lingjie Duan, Jun Tie

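Topic match (planning and decision-making): cooperative multi-UAV path planning, a form of autonomous task execution.

AI summary: This paper proposes a cooperative multi-UAV path planning method that efficiently serves dynamically released demands while accommodating battery charging; by reducing computational complexity and designing a fast iterative algorithm, it enables efficient path planning for large UAV swarms.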

DOI: 10.1109/TMC.2021.3110299

Abstract

Unmanned Aerial Vehicle (UAV) technology is a promising solution for providing high-quality mobile services to ground users, where a UAV with limited service coverage travels among multiple geographical user locations (e.g., hotspots) to service their demands locally. How to dynamically determine a UAV swarm's cooperative path planning to best meet many users' spatio-temporally distributed demands is an important question that remains unaddressed in the literature. To the best of our knowledge, this paper is the first to design and analyze cooperative path planning algorithms of a large UAV swarm for optimally servicing many spatial locations, where ground users' demands are released dynamically over a long time horizon. Regarding a single UAV's path planning design, we manage to substantially simplify the traditional dynamic program and propose an optimal algorithm of low computational complexity, which is only polynomial with respect to both the numbers of spatial locations and user demands. After coordinating a large number $K$ of UAVs, this simplified dynamic optimization problem becomes intractable, and we instead present a fast iterative cooperation algorithm with provable approximation ratio $1-(1-\frac{1}{K})^{K}$ in the worst case, which is proved to outperform the traditional approach of partitioning UAVs to serve different location clusters separately. To relax UAVs' battery capacity limit for sustainable service provisioning, we further allow UAVs to travel to charging stations in the meantime and thus jointly design UAVs' path planning over users' locations and charging stations. Despite the difficulty of the problem, we successfully transform it into an integer linear program for the optimal solution by creating a novel directed acyclic graph of the UAV-state transition diagram, and propose an iterative algorithm with constant approximation ratio.
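
The approximation ratio quoted in the abstract is easy to evaluate. It equals 1 for a single UAV, decreases as K grows, and approaches 1 - 1/e (about 0.632) in the limit, since (1 - 1/K)^K tends to 1/e. A small check, with a hypothetical helper name:

```python
import math


def approximation_ratio(k):
    """Worst-case guarantee 1 - (1 - 1/K)^K for a swarm of K UAVs."""
    return 1.0 - (1.0 - 1.0 / k) ** k


# Limit of the ratio as K grows without bound.
LIMIT = 1.0 - math.exp(-1.0)
```

So the guarantee never drops below roughly 63% of the optimum no matter how many UAVs are coordinated.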

2606.20495 2026-06-19 cs.RO New submission 70%

Increasing Resilience of Continuum Robots via Motion Planning Algorithms

通过运动规划算法提高连续体机器人的韧性

Oxana Shamilyan, Ievgen Kabin, Zoya Dyka, Oleksandr Sudakov, Peter Langendoerfer

Affiliations * IHP – Leibniz-Institut für innovative Mikroelektronik; BTU Cottbus-Senftenberg; Technical Center, National Academy of Sciences of Ukraine

Topic match (planning and decision-making): involves path-planning algorithms and multi-criteria decision-making.

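AI summary: This paper experimentally studies how motion planning algorithms affect the resilience of continuum robots. It modifies a genetic algorithm and the A* algorithm, uses the Analytic Hierarchy Process to evaluate path quality, and finds that the genetic algorithm generates more diverse paths, which improves the robot's resilience.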

AI-generated Chinese abstract

本文介绍了针对韧性连续体机器人的运动规划实验研究。我们主要关注多准则决策、其在路径规划算法中的应用、对生成路径的影响以及执行时间。为此，我们使用了两种著名的路径规划算法，即遗传算法和A*算法，并通过添加层次分析法算法来评估生成路径的质量，对其进行了修改。在我们的实验中，层次分析法考虑了四个不同的准则，即距离、电机损伤、机器人手臂的机械损伤和精度，每个准则都被认为有助于连续体机器人的韧性。使用不同的准则对于延长连续体机器人的维护操作时间是必要的。我们使用两种不同的机器人模拟环境进行了实验。尽管我们显著简化了机器人模型及其环境，但我们仍然基于真实机器人原型实现了环境的一些特征。特别地，其中一个环境包含单路径点和多路径点，另一个环境仅包含多路径点。结果表明，与A*算法相比，遗传算法的性能时间不依赖于环境的基数。它生成更多样化的路径，从而提高了机器人的韧性。

English abstract

This paper presents an experimental study of motion planning for resilient continuum robots. In this study we focus mainly on multi-criteria decision-making, its application to path-planning algorithms, and its impact on the generated path and execution time. To do this, we used two well-known path-planning algorithms, namely the Genetic algorithm and the A star algorithm, and modified them by adding the Analytical Hierarchy Process algorithm to evaluate the quality of the generated paths. In our experiment the Analytical Hierarchy Process considers four different criteria, i.e. distance, motor damage, mechanical damage of the robot's arm, and accuracy, each considered to contribute to the resilience of a continuum robot. The use of different criteria is necessary to increase the time between maintenance operations of the continuum robot. We conducted the experiments using two different simulated environments of the robot. Although we significantly simplified the robot's model and its environment, we still implemented some of the features of the environment based on the real robot prototype. In particular, one of the environments has single- as well as multi-path points, and the other consists of multi-path points only. The results show that, in contrast to A star, the performance time of the Genetic algorithm does not depend on the environment's cardinality. It generates more diverse paths, which increases the robot's resilience.
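
As a rough illustration of how an Analytic Hierarchy Process step can turn the four criteria named in the abstract into path-scoring weights, the sketch below derives weights from a pairwise comparison matrix using the row geometric-mean approximation. The matrix entries, the function names `ahp_weights` and `path_score`, and the weighted-sum scoring are all assumptions made for illustration; the abstract does not state the judgments or the exact scoring rule the authors used.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(pairwise)
    if any(len(row) != n for row in pairwise):
        raise ValueError("pairwise matrix must be square")
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def path_score(weights, normalized_costs):
    """Weighted sum of per-criterion costs scaled to [0, 1]; lower is better."""
    return sum(w * c for w, c in zip(weights, normalized_costs))


CRITERIA = ["distance", "motor damage", "mechanical damage", "accuracy"]

# Hypothetical judgments: entry [i][j] states how much more important
# criterion i is than criterion j, with [j][i] as its reciprocal.
PAIRWISE = [
    [1.0, 1 / 2, 1 / 3, 2.0],
    [2.0, 1.0, 1 / 2, 3.0],
    [3.0, 2.0, 1.0, 4.0],
    [1 / 2, 1 / 3, 1 / 4, 1.0],
]

if __name__ == "__main__":
    weights = ahp_weights(PAIRWISE)
    for name, w in zip(CRITERIA, weights):
        print(f"{name:18s} {w:.3f}")
    # Two made-up candidate paths; accuracy is expressed as an error cost.
    print(path_score(weights, [0.2, 0.6, 0.5, 0.1]))
    print(path_score(weights, [0.5, 0.2, 0.1, 0.3]))
```

With these made-up judgments, mechanical damage receives the largest weight, so a planner ranking candidate paths by `path_score` would favor paths that spare the arm even when they are somewhat longer.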

2606.20236 2026-06-19 cs.AI cs.LG cs.MA New submission 70%

A Multi-Agent system for Multi-Objective constrained optimization

多目标约束优化的多智能体系统

Federica Filippini

Affiliations * University of Milano-Bicocca

Topic match (planning and decision-making): multi-agent reinforcement learning for constrained optimization.

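AI summary: Proposes MAMO, which uses multi-agent reinforcement learning to decouple task execution from objective design and automatically learns reward weights that balance optimizing the primary objective against constraint violations, improving the autonomy and robustness of RL in dynamic environments.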

Comments Presented at the 17th Workshop on Optimization and Learning in Multiagent Systems (OptLearnMAS, https://optlearnmas.github.io), co-located with the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

AI-generated Chinese abstract

计算和网络系统中的许多决策问题可以自然地表述为在性能约束下的成本最小化问题。在动态环境中，强化学习（RL）通常通过在运行时将成本和约束违反通过加权惩罚项嵌入到单个标量奖励中（遵循拉格朗日启发式公式）来解决此类问题。然而，在这种背景下，学习策略的行为关键取决于这些权重的选择，而权重通常是手动选择的。这使得难以在优化主要目标和有效避免约束违反之间找到适当的权衡，特别是在非平稳环境中，它们的相对重要性可能发生变化。本文提出了MAMO（多目标约束优化的多智能体系统），一种通过多智能体RL解决这种平衡问题的方法。MAMO通过将奖励权重的选择表述为一个学习问题，将任务执行与目标设计解耦，为动态环境中约束优化问题的更自主和鲁棒的基于RL的解决方案迈出了第一步。

English abstract

Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a first step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
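
The weighted-penalty reward that the abstract describes as the usual baseline can be sketched as follows. This is a generic Lagrangian-style scalarization, not MAMO itself; the function name `scalar_reward`, the two candidate outcomes, and the weight values are hypothetical and chosen only to show how the preferred action flips when the penalty weight changes.

```python
def scalar_reward(cost, violations, weights):
    """Return -(cost + sum_i w_i * max(0, g_i)) for constraint values g_i.

    A constraint value g_i <= 0 means the constraint is satisfied, so only
    positive values are penalized.
    """
    if len(violations) != len(weights):
        raise ValueError("need exactly one weight per constraint")
    penalty = sum(w * max(0.0, g) for w, g in zip(weights, violations))
    return -(cost + penalty)


# Two hypothetical outcomes: (cost, [constraint value]).
CHEAP_BUT_VIOLATING = (1.0, [0.5])
COSTLY_BUT_FEASIBLE = (2.0, [0.0])


def preferred(weight):
    """Name the outcome with the higher scalar reward for a given weight."""
    r_cheap = scalar_reward(*CHEAP_BUT_VIOLATING, [weight])
    r_costly = scalar_reward(*COSTLY_BUT_FEASIBLE, [weight])
    return "cheap_but_violating" if r_cheap > r_costly else "costly_but_feasible"


if __name__ == "__main__":
    for w in (1.0, 10.0):
        print(f"weight={w:4.1f} -> {preferred(w)}")
```

With a small weight the violating outcome scores higher, while a large weight makes the feasible outcome win. This is the sensitivity to manually chosen weights that the abstract identifies as the motivation for learning them.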
