AI Agent

2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN New submission, Topic 90

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

Masahiro Kato

Affiliations: Mizuho-DL Financial Technology, Co., Ltd.

Topic match (Other Agent): AI economist agent framework that plans, retrieves, and generates reports

AI summary: Proposes a RAG-based AI economist agent framework that uses knowledge graphs and large language models for economic scenario analysis. Agents plan the analysis, retrieve evidence, select models, and generate reports, improving the coherence and traceability of economic narratives.

Abstract

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded by economic theory and real-world data. Based on this motivation, this study proposes an RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.20510 2026-06-19 cs.CR cs.AI New submission, Topic 90

Efficient and Sound Probabilistic Verification for AI Agents

Alaia Solko-Breslin, Pramod Kaushik Mudrakarta, Mihai Christodorescu, Somesh Jha, Krishnamurthy Dj Dvijotham

Affiliations: Google DeepMind; Google; University of Pennsylvania; University of Wisconsin–Madison

Topic match (Other Agent): Proposes a probabilistic verification framework for AI agents to ensure policy compliance

AI summary: Proposes a framework based on distributionally robust optimization that gives sound upper bounds on the probability of policy violation for AI agents in complex digital environments, without independence assumptions. It outperforms existing methods on terminal and tool-calling agent benchmarks.

Abstract

Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.
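
To make the idea of a bound that holds "regardless of possible correlations" concrete, here is a minimal Python sketch. It is not the paper's distributionally robust optimization method; it only uses the classical Fréchet (union) bounds, and the function names and failure probabilities are hypothetical.

```python
def violation_upper_bound_any(probs):
    """Upper bound on P(at least one event occurs), valid under ANY correlation."""
    return min(1.0, sum(probs))


def violation_upper_bound_all(probs):
    """Upper bound on P(all events occur together), valid under ANY correlation."""
    return min(probs)


def independent_any(probs):
    """P(at least one event occurs) if the events were independent (an assumption)."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


# Hypothetical per-invocation failure probabilities of three detectors.
detector_failures = [0.02, 0.05, 0.01]
bound = violation_upper_bound_any(detector_failures)
naive = independent_any(detector_failures)
print(f"correlation-free bound: {bound:.4f}, independence estimate: {naive:.4f}")
```

The independence estimate is never larger than the correlation-free bound, which is why relying on independence can understate the risk when predicates are in fact correlated.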

2606.19704 2026-06-19 cs.AI New submission, Topic 90

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

Affiliations: IBM

Topic match (Other Agent): Evaluates the predictive validity of LLM agent benchmarks and proposes a new method.

AI summary: Drawing on 14 parallel studies, the paper argues that aggregate-score leaderboards do not generalize to out-of-distribution settings, proposes ranking configurations by predictive validity, and designs falsifiable out-of-distribution evaluation criteria.

Comments 17 pages, 2 tables, 5 figures

Abstract

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
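
The abstract defines predictive validity as the correlation between in-sample and out-of-sample rank. A minimal sketch of that computation follows, assuming Spearman's rank correlation as the correlation measure (the abstract does not say which one is used) and using made-up scores for four hypothetical agent configurations.

```python
def ranks(scores):
    """Return ranks where 1 is the highest score; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(in_sample, out_of_sample):
    """Spearman rank correlation for tie-free data."""
    n = len(in_sample)
    rank_in, rank_out = ranks(in_sample), ranks(out_of_sample)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_in, rank_out))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical scores for four agent configurations.
in_scores = [0.81, 0.78, 0.74, 0.60]
out_scores = [0.55, 0.62, 0.58, 0.40]
rho = spearman(in_scores, out_scores)
print(f"rank correlation: {rho:.2f}")
```

A value near 1 would mean the in-sample leaderboard order carries over out of sample; a low value signals the rank instability the abstract describes.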

2606.11537 2026-06-19 cs.AI cs.CE New submission, Topic 90

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

Affiliations: University of Innsbruck; University of British Columbia; Toronto Metropolitan University

Topic match (Other Agent): Proposes a market-of-claims code agent for financial numerical reasoning

AI summary: Proposes MoCA-Agent, which addresses numerical reasoning errors in financial table question answering through claim-level verification and code generation, achieving strong performance on ten benchmarks.

Abstract

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.
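
The clearing step described in the abstract (buy and sell orders turned into confidence-weighted accept/reject decisions) can be sketched as follows. The clearing rule, the threshold, and the claim names are all assumptions made for illustration; the abstract does not specify how the market is actually cleared.

```python
from collections import defaultdict


def clear_market(orders, threshold=0.0):
    """Net the confidence of buy and sell orders per claim (hypothetical rule).

    A claim is accepted when buy confidence minus sell confidence exceeds
    the threshold, and rejected otherwise.
    """
    net = defaultdict(float)
    for claim_id, side, confidence in orders:
        net[claim_id] += confidence if side == "buy" else -confidence
    return {
        claim_id: ("accept" if score > threshold else "reject")
        for claim_id, score in net.items()
    }


# Hypothetical orders placed by trader agents on two atomic claims.
orders = [
    ("revenue_2020_is_in_millions", "buy", 0.9),
    ("revenue_2020_is_in_millions", "sell", 0.2),
    ("growth_uses_2019_as_base", "buy", 0.4),
    ("growth_uses_2019_as_base", "sell", 0.7),
]
decisions = clear_market(orders)
print(decisions)
```

Only accepted claims would then feed the program-synthesis step, so a single doubtful cell reading is filtered out before it reaches the arithmetic.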

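2606.20475 2026-06-19 cs.LG New submission, Topic 85

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

Mingyu Yang, Keye Zheng, Congchao Cheng, Yujie Liu, Xingkang Lu, Fan Jiang, Yefei Zheng

Affiliations: Alibaba International Digital Commerce Group

Topic match (Other Agent): Proposes a memory-driven agent self-evolution method that optimizes agent trajectory distillation.

AI summary: To address missing cross-batch evidence in batch-wise trajectory distillation, proposes Marginal Advantage Accumulation (MAA), which combines differential signal construction, exponential-moving-average accumulation, and semantic identity merging. It achieves the best result in 14 of 16 settings and cuts token consumption in the optimization phase by about 75%.

Comments 26 pages, 4 figures, 10 tables, 42 references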

2606.20529 2026-06-19 cs.AI cs.CL New submission, Topic 90

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

Affiliations: Arizona State University; University of Arizona

Topic match (Tool calling): Proposes a structured-state method for policy-adherent tool-calling agents

AI summary: For policy-adherent tool-calling agents in customer-service domains, proposes LedgerAgent, which maintains task state in a separate ledger, renders it into the prompt, and checks state-dependent policy constraints before executing tool calls, improving multi-trial consistency.

Comments Work in Progress

Abstract

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
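
The two roles of the ledger named in the abstract (rendering state into the prompt, and checking a state-dependent policy before an environment-changing call) can be illustrated with a small sketch. The class, the cancellation policy, and the tool below are hypothetical and are not taken from the paper.

```python
class Ledger:
    """Explicit store for observed task state (hypothetical structure)."""

    def __init__(self):
        self.facts = {}

    def record(self, key, value):
        self.facts[key] = value

    def render(self):
        """Render the state as text that could be placed into a prompt."""
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))


def can_cancel_order(ledger):
    """Hypothetical domain policy: only pending orders may be cancelled."""
    return ledger.facts.get("order_status") == "pending"


def guarded_call(ledger, policy, tool, *args):
    """Run an environment-changing tool only if the policy holds on the ledger."""
    if not policy(ledger):
        return {"blocked": True, "result": None}
    return {"blocked": False, "result": tool(*args)}


def cancel_order(order_id):
    return f"cancelled {order_id}"


ledger = Ledger()
ledger.record("order_id", "A17")
ledger.record("order_status", "shipped")
first = guarded_call(ledger, can_cancel_order, cancel_order, "A17")

ledger.record("order_status", "pending")
second = guarded_call(ledger, can_cancel_order, cancel_order, "A17")
print(first, second)
```

Because the check reads the ledger rather than the prompt history, a syntactically valid call is still blocked when the recorded state does not permit it.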

2606.19992 2026-06-19 cs.SE cs.AI New submission, Topic 90

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

Mugeng Liu, Shuoqi Li, Yixuan Zhang, Yun Ma

Affiliations: School of Computer Science, Peking University, Beijing, China; School of Software & Microelectronics, Peking University, Beijing, China; Institute for Artificial Intelligence, Peking University, Beijing, China

Topic match (Tool calling): Proposes a tool-program interface to optimize agentic web services

AI summary: Proposes ToolPro, which represents tool intent as an executable program and combines constraint-guided construction, effect-aware replay, and policy decisions, reducing latency by up to 53.4% and traffic by up to 96.1% on MCP services.

Comments Accepted by ICML 2026

Abstract

In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4% and client-side traffic by up to 96.1%, with larger gains under higher network latency and workflow complexity.
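
One way to picture effect-aware replay with exactly-once semantics for state-modifying calls is a journal of completed write steps that a retry consults before executing anything. The sketch below is only an illustration under that assumption; the step format, the journal, and the checkout workflow are hypothetical and do not reflect ToolPro's actual design.

```python
def run_program(steps, journal, fail_at=None):
    """Execute (step_id, effect, fn) steps in order.

    Completed "write" steps are stored in the journal, so a replay reuses
    their results instead of executing them again. "read" steps are simply
    re-executed, since repeating them has no side effects.
    """
    results = []
    for index, (step_id, effect, fn) in enumerate(steps):
        if effect == "write" and step_id in journal:
            results.append(journal[step_id])  # replay: reuse, do not re-run
            continue
        if fail_at == index:
            raise RuntimeError(f"simulated failure before step {step_id}")
        value = fn()
        if effect == "write":
            journal[step_id] = value
        results.append(value)
    return results


charges = []  # records every real execution of the charging step


def read_cart():
    return 42


def charge_card():
    charges.append(42)
    return "charged"


def send_receipt():
    return "receipt sent"


steps = [
    ("read_cart", "read", read_cart),
    ("charge", "write", charge_card),
    ("receipt", "write", send_receipt),
]
journal = {}
try:
    run_program(steps, journal, fail_at=2)  # fails after the card was charged
except RuntimeError:
    pass
final = run_program(steps, journal)  # replay finishes without charging twice
print(final, len(charges))
```

Tagging each step with its effect type is what lets the replay distinguish calls that are safe to repeat from calls that must run at most once.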

2605.29483 2026-06-19 cs.AI Version update, Topic 90

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed, Vassilis Kostakos, Ting Dang

Affiliations: The University of Melbourne, Australia; Dartmouth College, US; University of Auckland, New Zealand; Eindhoven University of Technology, Netherlands

Topic match (Tool calling): Agent framework with tool-augmented reasoning and proactive monitoring

AI summary: Proposes the VitalAgent framework, which uses tool-augmented reasoning and longitudinal physiological memory for reactive question answering and proactive monitoring over ECG/PPG signals, improving on baselines by more than 30% on the VitalBench benchmark.

Comments Minor revisions; results unchanged

Abstract

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2606.20515 2026-06-19 cs.CV New submission, Topic 85

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

Affiliations: NTU (Nanyang Technological University); THU (Tsinghua University); ByteDance; NWPU (Northwestern Polytechnical University)

Topic match (Tool calling): Proposes a spatial tool-use agent paradigm with a hierarchical toolset

AI summary: Proposes S-Agent, a spatial tool-use agent paradigm that uses spatio-temporal evidence accumulation and a hierarchical toolset, with the VLM as a semantic planner, to reason spatially over continuous multi-view images and videos. It improves open- and closed-source VLMs without training, and fine-tuning on S-300K trajectories yields the compact spatial agent S-Agent-8B.

Comments Project page: https://Ropedia.github.io/S-Agent

Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on the S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

2606.20401 2026-06-19 eess.SY cs.SY New submission, Topic 85

PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies

Qian Zhang, Andrea Pomarico, Costas Mylonas, Magda Foti, Alberto Berizzi, Le Xie

Topic match (Tool calling): LLM agent benchmark evaluating tool use and reasoning in power system dynamic analysis

AI summary: Proposes the PowerAgentBench-Dyn benchmark for evaluating LLM-based agents on power system dynamic-analysis tasks, covering two tasks: model quality review and security risk screening.

Abstract

Large Language Model (LLM)-based agents are increasingly being used to automate multi-step engineering workflows by interacting with software tools, interpreting intermediate results, and autonomously planning subsequent actions. Power system dynamic studies represent a particularly promising yet largely unexplored application domain for these agents. Unlike static computational tasks, dynamic studies often require more time on model parameter calibration, engineering judgment, and decision making under constrained action spaces. This paper introduces PowerAgentBench-Dyn, a benchmark designed to evaluate Agentic AI systems on power system dynamic-analysis tasks. The benchmark targets problems that cannot be reduced to a single optimization or coding task, but instead require the type of reasoning, tool usage, and iterative experimentation routinely performed by experienced power system engineers. The proposed framework includes two initial benchmark tasks. The first, the Dynamic Model Quality Review Benchmark, evaluates agents' ability to validate and diagnose dynamic models based on model-quality compliance criteria specified by system operators. The second, the Dynamic Security Risk Screening Benchmark, assesses agents' capability to leverage semantic memory and a limited simulation budget to identify, rank, and analyze the most critical short-circuit contingencies from an unseen fault dataset, as well as propose and evaluate possible mitigation measures. For each task, we define the simulation environment, observation and action spaces, and evaluation metrics. The benchmark is reproducible in a metric-based sense: released cases and simulator settings define a deterministic evaluator, while stochastic agent behavior is assessed over repeated runs using success rates and other metrics. The benchmark supports the development of future Agentic AI for power system operation and planning.

2606.20373 2026-06-19 cs.SE cs.AI New submission, Topic 90

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

Affiliations: Shaanxi Normal University; Northwest University; University of Leeds

Topic match (Workflow automation): Multi-agent framework that automatically optimizes compiler performance

AI summary: Proposes AutoPass, a multi-agent framework that queries compiler-internal state and the intermediate representation and uses runtime feedback to iteratively refine compiler options. It improves performance without training, achieving 1.043x and 1.117x speedups on x86-64 and ARM64, respectively.

Abstract

Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
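
The headline numbers are geometric-mean speedups over a baseline. As a reminder of how such a figure is computed, here is a short sketch with made-up runtimes; the function name and the data are hypothetical and unrelated to the paper's measurements.

```python
import math


def geometric_mean_speedup(baseline_times, tuned_times):
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    ratios = [b / t for b, t in zip(baseline_times, tuned_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical runtimes in seconds: one benchmark gets 2x faster, one 2x slower.
baseline = [2.0, 8.0]
tuned = [1.0, 16.0]
speedup = geometric_mean_speedup(baseline, tuned)
arithmetic = sum(b / t for b, t in zip(baseline, tuned)) / len(baseline)
print(f"geometric mean: {speedup:.3f}, arithmetic mean: {arithmetic:.3f}")
```

The geometric mean treats a 2x gain and a 2x loss as cancelling out, whereas the arithmetic mean of the same ratios would suggest a net improvement, which is why speedups are conventionally averaged this way.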

2606.20318 2026-06-19 cs.DB New submission, Topic 90

AgenticDB: Agentic Performance Reconfiguration for Database Workloads

Xinyue Yang, Chaozheng Wang, Chen Zheng, Heng Zhang, Yanjun Wu

Topic match (Workflow automation): Agent framework that automatically reconfigures database performance

AI summary: Proposes the AgenticDB framework, which performs DBMS-level and OS-level reconfiguration through runtime interaction, diagnoses bottlenecks, and accumulates experience, improving performance on MySQL and PostgreSQL by 118.1% on average.

Abstract

Database configuration tuning is critical for workload performance, but practical tuning on real deployments remains difficult. Existing automatic tuners mostly formulate tuning as iterative search over DBMS knob values. This formulation leads to high execution cost, prematurely narrows the configuration space, and leaves practical requirements insufficiently addressed: diagnosing runtime bottlenecks from system feedback, exploring OS-level reconfiguration opportunities, executing changes robustly, and learning from previous trials and tasks. We propose AgenticDB, an agentic framework for database workload reconfiguration. AgenticDB implements a context-grounded harness that interacts with the target database environment by proposing DBMS- and OS-level changes, applying them under safety constraints, observing workload performance and runtime states, and using execution feedback to guide subsequent decisions. This runtime interaction enables AgenticDB to diagnose bottlenecks, explore a broader DBMS- and OS-level reconfiguration space, avoid unsafe or unsupported actions, and accumulate experience within and across reconfiguration tasks. As a result, AgenticDB turns database tuning into a self-refining reconfiguration process in which runtime feedback iteratively improves later decisions. We conduct extensive experiments on MySQL and PostgreSQL using YCSB, Sysbench, and TPC-H workloads. The results show that AgenticDB achieves the best final performance on all evaluated workloads, improving over the strongest baseline by 118.1% on average and reducing aggregate time-to-best by 22.6%. The results also demonstrate that its OS-level action space, robust execution lifecycle, and memory-enhanced planning contribute to more effective and practical database reconfiguration.

2606.19790 2026-06-19 cs.CE New submission, Topic 90

The Orchestration Gap: Why Process Automation Stalls in Operationally Complex Industries

Jiechao Gao, Yuandong Pan, Yuangang Li, Jie Wang, Kincho Law, Michael Lepech

Topic match (Workflow automation): Analyzes the orchestration gap for multi-agent systems in automating complex industries.

AI summary: Introduces the concept of the "orchestration gap", analyzes why multi-agent systems fail to automate complex industries such as logistics and healthcare, and gives a staged automation path based on constraint enforcement and explainability.

Abstract

Agentic systems have advanced quickly on digitally native tasks, yet they have barely touched the industries where coordinated automation could matter most: logistics, healthcare operations, construction, and the many sectors whose work is spread across incompatible tools and many hands. We argue that the reason is a missing abstraction. The value in these settings does not come from a single capable model invocation; it comes from orchestration, the runtime that coordinates multi-step workflows, enforces hard domain constraints, manages human approval, and bridges legacy systems. We develop this idea into a usable conceptual frame. We give an operational test for which workflows are orchestration-bound, a decomposition that separates how tangled a workflow is from how much of its effort is coordination and what that coordination is worth, and a feature-level account of why today's multi-agent frameworks leave a specific gap. We then advance our central claim: the right automation path is staged, and which architectural guarantee carries the most weight depends on a sector's dominant source of friction. Constraint enforcement is load-bearing under regulatory friction; explainability is load-bearing under liability friction. We close with the research program this view implies.

2606.19382 2026-06-19 cs.SE cs.AI New submission, Topic 90

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

Affiliations: Gati Shakti Vishwavidyalaya; IBM Research

Topic match (Workflow automation): Proposes a multi-agent orchestration engine that generates workflow graphs.

AI summary: Proposes the DynAMO engine, which uses a Plan-then-Execute architecture to generate verifiable workflow graphs and supports sequential and parallel execution. By dynamically identifying independent tasks it improves efficiency, achieving a 1.6x latency reduction on an industrial benchmark while preserving correctness and safety.

Comments 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: https://github.com/kushwaha001/DynAMO

Abstract

While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO
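
Dependency-aware concurrency over a workflow graph can be sketched with Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+): tasks whose dependencies are all finished form a batch that could run in parallel. This is a generic illustration, not DynAMO's scheduler, and the workflow and task names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group tasks into batches; tasks in the same batch are mutually independent.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


# Hypothetical asset-management workflow.
workflow = {
    "fetch_sensor_data": set(),
    "fetch_work_orders": set(),
    "detect_anomaly": {"fetch_sensor_data"},
    "write_report": {"detect_anomaly", "fetch_work_orders"},
}
batches = parallel_batches(workflow)
print(batches)
```

Running the batches one after another gives a plain topological (sequential) execution, while dispatching the tasks inside each batch concurrently gives the parallel variant.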

2604.23938 2026-06-19 cs.CL Version update, topic 90

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Xiaochen Zheng, Zhiwen Jiang, David Tokar, Yexiang Cheng, Alvaro Serra, Melanie Guerard, Klas Hatje, Tatyana Doktorova

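Institution * Computational Sciences Center of Excellence

Topic match (workflow automation): a multi-agent framework that automates target safety assessment report generation

AI summary: Proposes TSAssistant, a multi-agent framework that uses a hierarchical instruction architecture and an interactive refinement loop to decompose target safety assessment report generation into specialized subtasks, achieving high reproducibility and evidence traceability.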

Comments Updated with quantitative and expert evaluations

English abstract

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

2606.20363 2026-06-19 cs.AI New submission, topic 90

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Yuexing Hao, Xiaomin Li

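Institution * Massachusetts Institute of Technology; Harvard University

Topic match (software agents): mines a skill library from GUI trajectories for computer-using agents

AI summary: Proposes a three-stage pipeline that mines a readable skill library from GUI trajectories, but finds that readability does not guarantee downstream policy gains: GRPO yields only a marginal improvement, exposing the limits of the current approach.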

English abstract

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows (IW) labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.

2606.19388 2026-06-19 cs.SE cs.CL cs.HC New submission, topic 90

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

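Institution * Mila – Québec AI Institute; Concordia University; University of Toronto; McMaster University

Topic match (software agents): studies mobile agents and compares the GUI and CLI paradigms

AI summary: Challenges the GUI-dominated paradigm for mobile agents and argues that the CLI deserves equal weight; experiments show CLI agents outperforming GUI baselines on AndroidWorld and MobileWorld, and a new CLI-Advantage task suite shows where the CLI helps.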

English abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

2606.20512 2026-06-19 cs.SE cs.LG New submission, topic 85

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Asa Shepard, Jeannie Albrecht

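Institution * Williams College

Topic match (software agents): optimizes repository guidance for coding agents

AI summary: Proposes probe-and-refine tuning, which uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file; on SWE-bench Verified with Qwen3.5-35B-A3B it reaches a 33.0% resolve rate, compared with 28.3% for the static knowledge base and 25.5% for the unguided baseline.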

English abstract

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

2606.20487 2026-06-19 cs.CL New submission, topic 85

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

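Institution * School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute; Southeast University; Tsinghua University

Topic match (software agents): a hierarchical recovery framework for cross-device agent systems

AI summary: Proposes H-RePlan, a hierarchical replanning framework that combines unified API-CLI-GUI execution with a cross-layer failure abstraction to separate device-local strategy recovery from global replanning, substantially improving cross-device task completion and instruction adherence on the HeraBench benchmark.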

English abstract

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.
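
The two recovery scopes described above can be illustrated with a short sketch. It is not H-RePlan's code: the `Failure` record, the function names, and the rule of moving to the next device are assumptions chosen to show the idea that a device exhausts its own strategies before the orchestrator reassigns the subtask.

```python
from dataclasses import dataclass, field

@dataclass
class Failure:
    """Compact failure record handed to the orchestrator (hypothetical fields)."""
    device: str
    subtask: str
    tried: list = field(default_factory=list)

def run_on_device(device, subtask, strategies):
    """Device-local recovery: try each strategy in order before giving up.

    strategies: list of (name, callable) pairs, e.g. API first, then CLI, then GUI.
    Returns (result, None) on success or (None, Failure) when every strategy fails.
    """
    failure = Failure(device, subtask)
    for name, execute in strategies:
        try:
            return execute(subtask), None
        except RuntimeError:
            failure.tried.append(name)  # record it and fall through to the next strategy
    return None, failure

def run_with_recovery(subtask, devices):
    """Stand-in for global replanning: reassign the subtask only after a
    device has exhausted its local strategies.

    devices: list of (device_name, strategies) in preference order.
    Returns (result, device_name, failures_seen).
    """
    failures = []
    for device, strategies in devices:
        result, failure = run_on_device(device, subtask, strategies)
        if failure is None:
            return result, device, failures
        failures.append(failure)
    raise RuntimeError(f"{subtask!r} failed on every device: {failures}")
```

A failed API call that succeeds through the CLI on the same device never reaches the orchestrator, which is the distinction the abstract draws between locally repairable failures and those that need cross-device replanning.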

2606.20158 2026-06-19 cs.SE New submission, topic 85

N-Version Programming with Coding Agents

Javier Ron, Benoit Baudry, Martin Monperrus

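Topic match (software agents): N-version programming with coding agents

AI summary: Revisits N-version programming with contemporary AI coding agents by repeating the Knight-Leveson experiment to evaluate how diversity across agent systems, models, and implementation languages affects failure modes; common-mode failures persist, but majority-voting three-version units markedly reduce the failure count, supporting the strategy's engineering value.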

English abstract

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using the Knight-Leveson Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.
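
A majority-voting three-version unit of the kind evaluated above can be sketched in a few lines. The helper names and the toy "versions" below are hypothetical; the sketch only shows why independent faults are masked by voting while a fault shared by two versions is not.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the output produced by a strict majority of versions, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def count_failures(versions, oracle, inputs):
    """Count inputs on which the voted unit disagrees with the oracle."""
    return sum(majority_vote([v(x) for v in versions]) != oracle(x) for x in inputs)

# Toy specification and three versions, each wrong on a different input.
oracle = lambda x: x > 5
v1 = lambda x: x > 6            # wrong at x == 6
v2 = lambda x: x >= 5           # wrong at x == 5
v3 = lambda x: x > 5 or x == 0  # wrong at x == 0
inputs = range(10)

single = [count_failures([v], oracle, inputs) for v in (v1, v2, v3)]
diverse = count_failures([v1, v2, v3], oracle, inputs)
common_mode = count_failures([v1, v1, v2], oracle, inputs)
```

With these toy versions each single version fails on one input, the diverse triple fails on none, and the triple that contains the same fault twice still fails at `x == 6`, which is the common-mode case the experiment measures.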

2606.20058 2026-06-19 cs.AI New submission, topic 90

Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

Harsh Rao Dhanyamraju, Leonidas Raghav, Aaron Lee

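Institution * SAP SE

Topic match (multi-agent): proposes a multi-agent orchestration framework for enterprise event-driven tasks

AI summary: Addresses the performance degradation of multi-agent systems at enterprise scale by introducing a Task Manager that uses priority inference, related-event merging, and preemption; evaluated on 208 production-derived scenarios, it cuts high-priority queue latency by 14-75% and raises related-event correctness by over 20 percentage points.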

English abstract

Enterprise AI aims to move toward continuous event monitoring, detection, and action across specialist agents, yet existing multi-agent systems largely assume discrete request-response workflows and remain underexplored at enterprise scale. We evaluate DAG Plan and Execute and ReAct across 208 production-derived enterprise scenarios spanning Persona (<10 agents), Department (20-80), and Enterprise (200) scales, and introduce a Task Manager for continuous operation via priority inference, related-event merging, and preemption. Results show that scale, not task complexity, dominates orchestration performance: both architectures perform well at small scale but degrade at enterprise scale as agent discovery noise becomes the primary bottleneck, with simple tasks degrading more sharply than complex ones. DAG Plan and Execute offers higher precision and structured parallelization at smaller scales, but its higher overhead worsens at enterprise scale; ReAct is more robust by handling failures incrementally. The Task Manager reduces high-priority queue latency by 14-75% and improves related-event correctness by over 20 percentage points at enterprise scale.
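
The three Task Manager mechanisms named above (priority, related-event merging, preemption) can be sketched with a priority queue. This is an illustrative toy, not the paper's design: the class, its methods, and the use of a shared `key` to decide that events are related are assumptions, and the priority is passed in directly rather than inferred.

```python
import heapq
import itertools

class TaskManager:
    """Toy event queue with merging and a preemption signal (illustrative only)."""

    def __init__(self):
        self._heap = []                # (priority, arrival order, key)
        self._payloads = {}            # key -> payloads merged under that key
        self._order = itertools.count()
        self.running_priority = None   # priority of the task being executed

    def submit(self, key, priority, payload):
        """Queue an event; a lower number is more urgent.

        Events sharing a key are merged into one pending task, which keeps
        the priority of the first event. Returns True when the new event
        outranks the running task, i.e. the caller should preempt.
        """
        if key in self._payloads:
            self._payloads[key].append(payload)
        else:
            self._payloads[key] = [payload]
            heapq.heappush(self._heap, (priority, next(self._order), key))
        return self.running_priority is not None and priority < self.running_priority

    def next_task(self):
        """Pop the most urgent pending task and mark it as running."""
        priority, _, key = heapq.heappop(self._heap)
        self.running_priority = priority
        return key, self._payloads.pop(key)
```

Merging keeps two updates to the same invoice from being handled as two unrelated tasks, and the boolean returned by `submit` is the hook where a real system would pause the running agent.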

2606.19782 2026-06-19 cs.AI cs.CL New submission, topic 90

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Aravind Narayanan, Shaina Raza

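Institution * Vector Institute

Topic match (multi-agent): a multi-agent pipeline for financial chart QA with an emphasis on auditability

AI summary: Proposes AgentFinVQA, a multi-agent pipeline that decomposes each query into steps and records them in a traceable Model Evaluation Packet, delivering auditability and on-premise deployment for financial chart QA and improving accuracy on FinMME by 7.68 percentage points.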

English abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves +7.68 pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar p ≈ 1.1 × 10^-16), and +4.84 pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion, and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.

2606.19758 2026-06-19 cs.MA New submission, topic 90

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

Kun Zeng, Yu Huo, Siyu Zhang, Yuecheng Zhuo, Yuquan Lu, Haoyue Liu, Siyue Chen, Xiaoying Tang

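Topic match (multi-agent): compositional multi-agent design via skill-incidence graphs

AI summary: Proposes SIGMA, a framework that uses a skill-agent incidence graph to build agents as task-conditioned bundles of reusable skills and then decodes a communication topology; it outperforms baselines on six benchmarks and is robust to unseen skill libraries.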

Comments EMNLP2026

English abstract

Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA-2338/.
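
To make the skill-agent incidence matrix concrete, the sketch below composes agent embeddings from a binary matrix and derives mailbox routes from the same matrix. The abstract does not say how SIGMA pools skill embeddings, so mean pooling, the function names, and the skill names are all assumptions made for illustration.

```python
def compose_agents(incidence, skill_embeddings):
    """Build one embedding per agent from its assigned skills.

    incidence[s][a] is 1 when skill s is assigned to agent a.
    Mean pooling is an assumption of this sketch, not the paper's rule.
    """
    n_skills, n_agents = len(incidence), len(incidence[0])
    dim = len(skill_embeddings[0])
    agents = []
    for a in range(n_agents):
        assigned = [s for s in range(n_skills) if incidence[s][a]]
        if not assigned:
            raise ValueError(f"agent {a} has no skills")
        agents.append([
            sum(skill_embeddings[s][d] for s in assigned) / len(assigned)
            for d in range(dim)
        ])
    return agents

def mailbox_routes(incidence, skill_names):
    """Map each skill to the agents holding it, so a message tagged with a
    skill reaches only those agents."""
    return {
        name: [a for a, flag in enumerate(row) if flag]
        for name, row in zip(skill_names, incidence)
    }

# Three skills (rows) assigned across two agents (columns).
incidence = [[1, 0], [1, 1], [0, 1]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
agents = compose_agents(incidence, embeddings)
routes = mailbox_routes(incidence, ["search", "code", "verify"])
```

Because both the agent embeddings and the routes are read off the same matrix, changing one column reassigns an agent's capabilities and its incoming messages together.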

2606.18325 2026-06-19 cs.CR cs.AI New submission, topic 90

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

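Institution * The University of Alabama, Alabama, USA; Roma Tre University, Rome, Italy

Topic match (multi-agent): proposes a supervisable multi-agent intrusion response framework

AI summary: Proposes Agentra, a supervisable multi-agent intrusion response framework that turns alerts into structured response plans through role-scoped agents, a planner-validator loop, a security gateway, and risk scoring; on a 120-event corpus, F1 rises from 0.61 to 0.84 and the harmful-action rate falls to 0.0%.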

English abstract

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner-Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.
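
Gating actions through a catalog and a risk score, with every decision logged, might look like the sketch below. The catalog entries, the risk formula (base risk times asset criticality), the threshold, and the decision labels are hypothetical, and an in-memory list stands in for the append-only audit log.

```python
# Hypothetical catalog: allowed actions and an assumed base risk for each.
ACTION_CATALOG = {"isolate_host": 0.6, "block_ip": 0.3, "reset_password": 0.2}

def gate_action(action, asset_criticality, audit_log, threshold=0.5):
    """Decide whether a proposed response action may proceed.

    Returns "rejected" for actions outside the catalog, "needs_approval"
    when the risk score reaches the threshold, and "auto_approved" otherwise.
    Every decision is appended to audit_log; entries are never modified.
    """
    if action not in ACTION_CATALOG:
        risk, decision = None, "rejected"
    else:
        risk = ACTION_CATALOG[action] * asset_criticality  # assumed scoring rule
        decision = "needs_approval" if risk >= threshold else "auto_approved"
    audit_log.append({"action": action, "risk": risk, "decision": decision})
    return decision

audit_log = []
decisions = [
    gate_action("isolate_host", 1.0, audit_log),
    gate_action("block_ip", 1.0, audit_log),
    gate_action("wipe_disk", 1.0, audit_log),
]
```

Rejecting anything outside the catalog keeps a planner from inventing actions, and routing high-risk actions to an analyst is one way to preserve the approval step the abstract mentions.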

2606.06971 2026-06-19 cs.MA cs.SI Version update, topic 90

Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator

Chenxu Zhu, Hantao Yao, Wu Liu, Junbo Guo, Yongdong Zhang

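Topic match (multi-agent): an event-steered multi-agent simulator for modeling opinion evolution

AI summary: Proposes an Event-Steered Multi-Agent Simulator (ES-MAS) that uses the CURE dataset, a Dual-Stream Data Integration Engine (DSDIE), and a News-Driven Dynamic Interaction (NDDI) module to simulate the evolution of U.S. public opinion toward China; experiments show it outperforms existing simulators.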

English abstract

Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM-based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event-driven nature of macro-level opinion shifts in real-world settings. To address this limitation, we propose an Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual-Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro-level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.

2606.19787 2026-06-19 cs.AI New submission, topic 90

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

Jiajun Li, Mingshu Cai, Yixuan Li, Yu Ding, Ran Hou, Guanyu Nie, Xiongwei Han, Wanyuan Wang

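Institution * Southeast University; Waseda University; Nanyang Technological University

Topic match (planning and decision-making): evaluates LLM agents end to end on operations research tasks

AI summary: Introduces the ORAgentBench benchmark for evaluating LLM agents on end-to-end operations research tasks; the best current agent passes only 35.51% of tasks, limited mainly by strategic weaknesses.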

Comments 31 pages, preprint, v1

English abstract

Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on challenging end-to-end operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard-constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible-solution construction, and insufficient solution improvement. OR-specific procedural skills increase hard-task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high-quality operational decision-making.

2606.15862 2026-06-19 cs.AI New submission, topic 90

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

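Institution * Ant Group; City University of Hong Kong

Topic match (planning and decision-making): evaluates long-horizon decision-making by LLM agents in a retail environment

AI summary: Introduces RetailBench, a benchmark that simulates single-store supermarket operation to evaluate LLM agents on long-horizon decisions; most models fail to survive the full horizon and trail the oracle policy by a wide margin.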

Comments This paper is my paper's second version [see arXiv:2603.16453v2]

English abstract

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

2606.20376 2026-06-19 cs.LG cs.AI New submission, topic 85

CRAX: Fast Safe Reinforcement Learning Benchmarking

Tristan Tomilin, Mourad Boustani, Mickey Beurskens, Thiago D. Simão

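Institution * Eindhoven University of Technology

Topic match (planning and decision-making): a safe RL benchmark that evaluates agents' planning and decisions under constraints

AI summary: Proposes CRAX, a JAX-accelerated safe RL benchmark built on the MJX physics engine with speedups of up to about 100x; it comprises six environment suites and three agent-specific tasks, and an evaluation of six methods reveals trade-offs between performance and safety.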

English abstract

Safety is a core concern for deploying reinforcement learning (RL) agents in real-world domains such as robotics and autonomous driving. While benchmarks have been central to progress in RL, existing safety benchmarks with high-fidelity 3D physics remain computationally slow, limiting large-scale experimentation and rapid prototyping. To address this gap, we propose CRAX (Constrained RL Accelerated with JAX). Built on top of the MuJoCo XLA (MJX) physics engine with realistic 3D dynamics, CRAX leverages vectorized operations and hardware acceleration, yielding up to ~100x speedups over comparable CPU-based safety benchmarks. The benchmark features six environment suites and three agent-specific tasks, each spanning three difficulty levels. Evaluating six popular safe RL methods shows that no single approach dominates across all tasks, and reveals the trade-offs between performance and safety. We find that curriculum learning across difficulty levels and safety transfer can improve performance over direct training in harder settings.

2606.20142 2026-06-19 cs.AI cs.MA New submission Topic 85

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

Antón Asla Manzárraga

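Affiliation: Independent Researcher

Topic match: Planning and decision-making. A reasoning-agent control layer that optimizes metaheuristic algorithms.

AI summary: Proposes RACL, a method that adds a reasoning agent on top of a metaheuristic optimizer and controls search behavior through observation, reasoning, and intervention. On vehicle routing problems, average cost is reduced by 0.641% to 8.337%.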

Comments 10 pages, 5 tables

Abstract

This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.
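
The negative cost deltas reported in the abstract follow a sign convention in which a negative value means the candidate policy is cheaper than the baseline. A minimal sketch of that arithmetic is given below, assuming the delta is defined relative to the baseline cost; the abstract does not state the exact formula, and `relative_cost_delta` and the example costs are hypothetical.

```python
def relative_cost_delta(candidate_cost, baseline_cost):
    """Percentage cost change of the candidate relative to the baseline.

    Negative values mean the candidate is cheaper than the baseline.
    """
    if baseline_cost == 0:
        raise ValueError("baseline_cost must be non-zero")
    return (candidate_cost - baseline_cost) / baseline_cost * 100.0


# Made-up route costs, not taken from the paper.
baseline = 200.0
candidate = 190.0
delta = relative_cost_delta(candidate, baseline)
print(f"cost delta: {delta:.3f}%")
```

Under this convention, a delta of zero corresponds to a tie and a positive delta to a regression against the baseline.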

2606.20122 2026-06-19 cs.AI cs.MA New submission Topic 85

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

Zhibang Yang, Xinke Jiang, Yuzhen Xiao, Ruizhe Zhang, Yue Fang, XinFei Wan, Zhengxing Song, Yuxuan Liu, Yuheng Huang, Xu Chu, Junfeng Zhao, Yasha Wang

Affiliations: National Engineering Research Center of Software Engineering, Peking University; School of Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies, Ministry of Education; GRG Banking Equipment Co., Ltd.; Center on Frontiers of Computing Studies, Peking University; Peking University Information Technology Institute (Tianjin Binhai)

Topic match: Planning and decision-making. An agent framework that optimizes deep research outlines.

AI summary: Proposes ScaffoldAgent, a framework that addresses outline drift in open-ended deep research through utility-guided dynamic outline optimization (Expansion, Contraction, and Revision operations), improving long-form report generation and factual accuracy on DeepResearch Bench and DeepResearch Gym.

Comments 9 pages, 6 figures

Abstract

Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.
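
The utility-guided selection described in the abstract can be sketched as scoring each candidate outline operation from its three signals and stopping when no candidate is worth applying. This is only an illustration under stated assumptions: the weighted sum, the weights, the `stop_threshold`, and the names `CandidateOp`, `utility`, and `select_operation` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateOp:
    kind: str              # "expansion", "contraction", or "revision"
    node: str              # outline node the operation would touch
    retrieval_gain: float  # assumed to be normalized to [0, 1]
    coherence: float       # assumed to be normalized to [0, 1]
    trial_quality: float   # assumed to be normalized to [0, 1]


def utility(op, weights=(0.4, 0.3, 0.3)):
    """Hypothetical utility: a weighted sum of the three signals."""
    w_gain, w_coh, w_trial = weights
    return (
        w_gain * op.retrieval_gain
        + w_coh * op.coherence
        + w_trial * op.trial_quality
    )


def select_operation(candidates, stop_threshold=0.5):
    """Return the highest-utility operation, or None to signal termination."""
    if not candidates:
        return None
    best = max(candidates, key=utility)
    return best if utility(best) >= stop_threshold else None


# Made-up candidates for a single optimization step.
candidates = [
    CandidateOp("expansion", "2.1 Methods", 0.9, 0.6, 0.7),
    CandidateOp("contraction", "3.4 Appendix", 0.1, 0.8, 0.5),
    CandidateOp("revision", "1. Introduction", 0.3, 0.7, 0.6),
]
chosen = select_operation(candidates)
print(chosen)
```

Returning `None` when no candidate clears the threshold is one simple way to let the same utility signal drive both operation choice and termination.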

1. Other agents (5 papers)

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

Efficient and Sound Probabilistic Verification for AI Agents

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

2. Tool calling (5 papers)

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies

3. Workflow automation (5 papers)

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AgenticDB: Agentic Performance Reconfiguration for Database Workloads

The Orchestration Gap: Why Process Automation Stalls in Operationally Complex Industries

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

4. Software agents (5 papers)

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

N-Version Programming with Coding Agents

5. Multi-agent systems (5 papers)

Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator

6. Planning and decision-making (5 papers)

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

CRAX: Fast Safe Reinforcement Learning Benchmarking

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research