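arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Indexed for the current date: 65 entries. Sources: cs.AI, cs.CL, cs.LG, cs.SE

1. Software Agents (8 papers)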

2606.18448 2026-06-18 cs.CL New submission 95%

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

Institutions: UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab

Topic match (Software Agents): a multimodal skill library for computer-use agents

AI summary: Proposes VISUALSKILL, a hierarchical multimodal skill library built by combining documentation with UI exploration. It raises the agent's average score on CUA benchmarks by 15.3 points, and multimodal skills outperform text-only skills.

Abstract

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain over the matched text-only skill (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.
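
The index-plus-on-demand-loading idea in this abstract can be illustrated with a small sketch. This is not the authors' code: the `Topic` and `SkillIndex` classes, the application name, the topic names, and the figure paths are all hypothetical, and only the `load_topic` name comes from the abstract. It shows an agent seeing topic names up front and fetching one topic's text and figures when needed.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One per-topic file: instructions plus paths to reference figures."""
    text: str
    figures: list[str] = field(default_factory=list)


class SkillIndex:
    """Central index over per-topic files for a single target application."""

    def __init__(self, app: str, topics: dict[str, Topic]):
        self.app = app
        self._topics = topics

    def list_topics(self) -> list[str]:
        # The agent sees only topic names up front, not their contents.
        return sorted(self._topics)

    def load_topic(self, name: str) -> dict:
        # Fetch one topic's text and figures on demand.
        if name not in self._topics:
            raise KeyError(f"unknown topic {name!r}; available: {self.list_topics()}")
        topic = self._topics[name]
        return {"app": self.app, "topic": name,
                "text": topic.text, "figures": list(topic.figures)}


index = SkillIndex(
    app="example-editor",  # hypothetical application name
    topics={
        "export": Topic("Open File > Export, then pick a format.", ["figs/export_dialog.png"]),
        "layers": Topic("Use the Layers panel to reorder items.", ["figs/layers_panel.png"]),
    },
)
payload = index.load_topic("export")
print(payload["figures"])
```

Keeping figures as references inside the returned payload, rather than describing them in words, mirrors the abstract's point that the visual content is retained in the skill artifact.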

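2606.19319 2026-06-18 cs.MA cs.AI cs.DB New submission 90%

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson

Institutions: C3 AI

Topic match (Software Agents): autonomous coding agents for enterprise data integration

AI summary: Proposes Data Intelligence Agents (DIA), a system of three autonomous coding agents that compresses the data integration workflow by executing, validating, and repairing artifacts. It matches or surpasses the best published results on seven SQL benchmarks.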

Abstract

Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions.

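2606.18890 2026-06-18 cs.AI New submission 90%

Skill-Guided Continuation Distillation for GUI Agents

Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Institutions: StepFun; University of Science and Technology Beijing; Tsinghua University; Nanyang Technological University

Topic match (Software Agents): skill-guided distillation that improves GUI agent success rates

AI summary: Proposes Skill-Guided Continuation Distillation (SGCD), in which a skill-guided policy generates successful continuation trajectories to supply the supervision that expert trajectories lack for uncovered states. On OSWorld-Verified it raises the success rate of three base models from the low-30% range to over 50%.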

Abstract

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.
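
The data-collection loop described here (plain-policy prefix, skill-guided continuation, mixing with expert data) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's setup: the environment is a number line, both policies are hand-written stand-ins, and skill extraction from rollouts is omitted.

```python
import random

GOAL = 5


def step(state: int, action: int) -> int:
    # Toy environment: the state is an integer and actions are +1 or -1.
    return state + action


def plain_policy(state: int, rng: random.Random) -> int:
    # Stand-in for the current policy: it sometimes drifts the wrong way.
    return rng.choice([1, 1, -1])


def skill_guided_policy(state: int) -> int:
    # Stand-in for the skill-guided policy: always moves toward the goal.
    return 1 if state < GOAL else -1


def collect_continuation(start: int, prefix_steps: int, rng: random.Random, max_steps: int = 50):
    # 1) Run the plain policy for a few steps to reach an off-trajectory state.
    state = start
    for _ in range(prefix_steps):
        state = step(state, plain_policy(state, rng))
    # 2) From that state, let the skill-guided policy finish the task.
    continuation = []
    for _ in range(max_steps):
        if state == GOAL:
            return continuation  # keep only successful continuations
        action = skill_guided_policy(state)
        continuation.append((state, action))
        state = step(state, action)
    return None  # failed continuations are discarded in this sketch


rng = random.Random(0)
expert_data = [(s, 1) for s in range(GOAL)]  # the expert walks 0 -> 5 directly
continuations = [collect_continuation(0, prefix_steps=3, rng=rng) for _ in range(20)]
# 3) Mix successful continuations with expert data to form the training set.
mixed = expert_data + [pair for c in continuations if c for pair in c]
print(len(expert_data), len(mixed))
```

The continuations can start from states the expert never visits (negative positions in this toy), which is the supervision gap the abstract targets.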

2606.18976 2026-06-18 cs.SE cs.AI New submission 85%

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario

Institutions: Department of Information Engineering, University of Florence, Florence, Italy

Topic match (Software Agents): a multi-agent LLM system that automatically generates software architecture feedback

AI summary: Proposes CAPRA, a multi-agent LLM system that uses multi-modal document extraction, deterministic evidence anchoring, and consistency management to generate personalized LaTeX feedback on software architecture deliverables. It satisfies 88.8% of the evaluation criteria on 10 student reports.

Comments: Accepted for publication at the 38th International Conference on Software Engineering Education and Training

Abstract

Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.
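
The Evidence Anchoring step is described as fuzzy matching via normalized Levenshtein distance. A minimal sketch of that general technique follows. It is not CAPRA's implementation: the `anchor` helper, the whitespace and case normalization, and the 0.85 threshold are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, kept to two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means identical strings.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing.
    return " ".join(text.lower().split())


def anchor(quote: str, source_lines: list[str], threshold: float = 0.85):
    # Return (line_index, score) for the best match, or None if nothing is close enough.
    scored = [(i, similarity(normalize(quote), normalize(line)))
              for i, line in enumerate(source_lines)]
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best is not None and best[1] >= threshold else None


report = [
    "The system uses a layered architecture.",
    "Requirements are traced to components.",
]
print(anchor("the system uses a layerd architecture.", report))
print(anchor("Microservices communicate over gRPC.", report))
```

A finding whose quoted evidence cannot be anchored (the second call returns `None`) can be dropped or flagged, which is one deterministic way to limit hallucinated citations.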

2606.18728 2026-06-18 cs.CL New submission 85%

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei

Institutions: Fudan University; Shanghai Innovation Institute; Northwest University of Political Science and Law

Topic match (Software Agents): a life-cycle interactive environment for legal agents

AI summary: Proposes LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a five-stage causal chain, built from 75,309 paired judgments, and evaluates capability differences among agents across the connected litigation stages.

Abstract

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

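2606.18671 2026-06-18 cs.HC New submission 85%

HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification

Yujin Zhang, Daye Nam

Topic match (Software Agents): extracting evidence from web agent trajectories for verification

AI summary: Proposes HANSEL, a system that extracts interactively verifiable evidence from AI agent trajectories to reduce the user's review burden. It reaches 83.7% precision and 88.9% recall in a benchmark evaluation, and a user study shows significantly lower task completion time and perceived effort.

Comments: 13 pages, 6 figures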

Abstract

AI web agents can perform complex, multi-step tasks such as searching for products, comparing options, and making purchases on behalf of users. However, verifying the correctness of an agent's output remains difficult. Existing transparency mechanisms, including full trajectory logs, source links, screenshots, and LLM-generated summaries, treat verification as a passive reading task, leaving users to sift through overwhelming logs or trust potentially unfaithful explanations. We present HANSEL (Highlighting Agent Navigation Steps as Evidence Links), a system that extracts interactive, verifiable evidence from web-agent trajectories. Given an agent trajectory, HANSEL extracts evidence pages and snippets and presents them as navigable, interactive views with relevant page state preserved (e.g., applied filters, search queries, and scroll positions), enabling users to verify how the agent arrived at its answer. When the agent's answer cannot be traced to any visited page, HANSEL explicitly flags this gap. A technical evaluation on 45 tasks from AssistantBench and Online-Mind2Web shows that HANSEL achieves 83.7% precision and 88.9% recall in identifying evidence pages, while reducing trajectory volume by 61.6%. In a controlled user study with 14 participants, HANSEL significantly reduced task completion time and perceived effort compared to a standard agent interface, while participants rated it significantly higher on usability, verification ease, and error identification. Our results demonstrate that reframing verification as an interactive activity, rather than passive consumption of agent explanations, leads to more efficient human oversight of AI agents.

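2606.16000 2026-06-18 cs.CL cs.LG New submission 85%

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasiya Palienko

Institutions: ITMO University; HSE University

Topic match (Software Agents): an environment for evaluating LLM-powered AutoML agents

AI summary: Proposes GRACE-DS, an isolated environment for evaluating LLM-powered AutoML agents before deployment. Hidden executable validators measure predictive performance, leakage avoidance, reproducibility, and related criteria, and experiments show that the flexible iterative interaction regime outperforms the baselines.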

Abstract

We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

2606.13681 2026-06-18 cs.CL New submission 85%

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

Institutions: National University of Singapore; Singapore Management University; University of Washington; University College London; University of Pennsylvania; Nanyang Technological University; Recursive; Massachusetts Institute of Technology

Topic match (Software Agents): a memory-evolution benchmark for LLM agents in dynamic environments

AI summary: Proposes the EvoArena benchmark suite, which simulates progressive environment changes across terminal, software, and social domains, and EvoMem, a patch-based memory paradigm that records structured update histories so that agents can reason about environmental evolution through memory changes. Experiments show that current agents struggle in dynamic environments and that EvoMem consistently improves performance.

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
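
To make "records memory evolution as structured update histories" concrete, here is a minimal sketch of a patch-based memory. It is only one plausible reading of the idea, not EvoMem itself: the `Patch` and `PatchMemory` names, the flat key-value state, and the example keys are hypothetical, and the sketch assumes updates arrive in step order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Patch:
    step: int             # environment update at which the change was observed
    key: str              # which fact changed
    old: Optional[str]    # previous value, None if the fact is new
    new: Optional[str]    # new value, None if the fact was removed


class PatchMemory:
    def __init__(self) -> None:
        self.state: dict[str, str] = {}
        self.history: list[Patch] = []

    def update(self, step: int, key: str, value: Optional[str]) -> None:
        old = self.state.get(key)
        if old == value:
            return  # nothing changed, so no patch is recorded
        self.history.append(Patch(step, key, old, value))
        if value is None:
            self.state.pop(key, None)
        else:
            self.state[key] = value

    def changes_to(self, key: str) -> list[Patch]:
        # The full update history for one fact, oldest first.
        return [p for p in self.history if p.key == key]

    def state_at(self, step: int) -> dict[str, str]:
        # Replay patches up to and including `step` to rebuild an earlier state.
        snapshot: dict[str, str] = {}
        for p in self.history:
            if p.step > step:
                continue
            if p.new is None:
                snapshot.pop(p.key, None)
            else:
                snapshot[p.key] = p.new
        return snapshot


memory = PatchMemory()
memory.update(1, "cli.output_flag", "--out")
memory.update(2, "cli.output_flag", "--output")  # the tool renamed its flag
memory.update(2, "api.version", "v2")
memory.update(3, "api.version", None)            # the fact no longer holds
print(memory.state_at(1), memory.state)
```

Because old values are kept in the patches instead of being overwritten, an agent can answer both "what is true now" and "what changed, and when".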

2. Tool Calling (6 papers)

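2606.18947 2026-06-18 cs.AI cs.CL cs.IR cs.MA New submission 90%

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

Institutions: DoorDash, Inc.

Topic match (Tool Calling): a decoupled search grounding architecture that strengthens search for LLM agents

AI summary: Proposes the Decoupled Search Grounding (DSG) architecture, which separates search grounding from the reasoning model and exposes controls such as provider routing and caching through an MCP-compatible gateway. It cuts search cost and latency while keeping accuracy close to native search on SimpleQA and matching or slightly exceeding it on a production workload.

Comments: 15 pages, Figure 8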

Abstract

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
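
Some of the controls listed in this abstract (provider routing, configured fallback, retrieval depth, exact caching) can be sketched as a small gateway class. This is an illustrative sketch, not the DSG system: `SearchGateway`, the provider functions, and the TTL value are hypothetical, and semantic caching, source-aware rendering, and the MCP layer are left out.

```python
import time
from typing import Callable

Provider = Callable[[str], list[str]]  # query -> ranked snippets


class SearchGateway:
    """Routes a query to providers in order, with fallback and an exact-match cache."""

    def __init__(self, providers: dict[str, Provider], order: list[str], ttl_s: float = 300.0):
        self.providers = providers
        self.order = order      # routing preference, first entry is tried first
        self.ttl_s = ttl_s      # how long a cached answer stays warm
        self._cache: dict[str, tuple[float, list[str]]] = {}

    def search(self, query: str, depth: int = 3) -> dict:
        # Exact cache key: normalized query text plus retrieval depth.
        key = f"{depth}:{' '.join(query.lower().split())}"
        hit = self._cache.get(key)
        if hit is not None and time.monotonic() - hit[0] < self.ttl_s:
            return {"results": hit[1], "provider": "cache"}
        errors = {}
        for name in self.order:
            try:
                results = self.providers[name](query)[:depth]
            except Exception as exc:  # configured fallback: try the next provider
                errors[name] = repr(exc)
                continue
            self._cache[key] = (time.monotonic(), results)
            return {"results": results, "provider": name}
        raise RuntimeError(f"all providers failed: {errors}")


def flaky(query: str) -> list[str]:
    # Hypothetical primary provider that is currently failing.
    raise TimeoutError("primary provider timed out")


def backup(query: str) -> list[str]:
    # Hypothetical secondary provider returning five snippets.
    return [f"snippet {i} for {query}" for i in range(5)]


gateway = SearchGateway({"primary": flaky, "backup": backup}, order=["primary", "backup"])
first = gateway.search("Capital of  France")
second = gateway.search("capital of france")
print(first["provider"], second["provider"])
```

Because routing, depth, and caching live in the gateway, the reasoning model behind it can be swapped without touching retrieval policy, which is the decoupling the abstract argues for.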

2606.18467 2026-06-18 stat.ML cs.LG New submission 85%

ToolChain-CRC: Conformal Risk Control for Agentic AI Under Retrieval and Tool-Use Drift

Jeffery Opoku, David Banahene

Institutions: The University of Texas Rio Grande Valley; Florida International University

Topic match (Tool Calling): risk control for tool use in agentic AI

AI summary: Addresses risk control for retrieval-augmented and tool-using agents under drift. ToolChain-CRC builds trajectory-level risk scores and calibrates an accept-or-intervene rule to obtain provable trajectory-level risk control.

Comments: 26 pages, 11 figures

Abstract

Modern AI agents retrieve documents, call tools, check intermediate information, and then produce a final answer or action. This creates a risk-control problem that is not visible from the final answer alone. A final response may look acceptable even when the retrieval was weak, a tool output was wrong, or an earlier step was unsupported. We propose ToolChain-CRC, a conformal risk-control method for retrieval-augmented and tool-using agents under drift. The method treats each agent run as a full trajectory of actions, observations, and final output. It builds step-level risk scores, combines them into a trajectory risk score, calibrates an accept-or-intervene rule, and adds an anytime alarm that can stop risky runs before the final answer. We prove trajectory-level risk control under exchangeable calibration runs, give a drift-aware extension with auditable constants, and prove an anytime escalation rule through a supermartingale construction. Experiments cover synthetic tool-chain drift, RAG/tool-use stress tests, public SQuAD-derived retrieval tasks, an API-free agentic QA case study, ablations, target-risk sensitivity checks, 20-seed robustness checks, a drift-margin audit, and a live RAG/tool-use agent benchmark. Across these settings, final-answer-only calibration can miss retrieval and tool failures, while trajectory-level calibration keeps accepted-trajectory risk below the target.
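
The calibration step can be illustrated with the standard conformal risk control recipe under exchangeability. The sketch below is not the paper's method: combining step risks with `max`, the binary loss "accepted and faulty", and the example numbers are assumptions, and the drift-aware extension and anytime alarm are omitted. With a loss bounded by 1 and `k` accepted faulty runs among `n` calibration runs, the finite-sample bound is `(k + 1) / (n + 1)`.

```python
def trajectory_risk(step_risks: list[float]) -> float:
    # Assumed combination rule: a trajectory is as risky as its riskiest step.
    return max(step_risks, default=0.0)


def calibrate_threshold(scores: list[float], is_bad: list[bool], alpha: float) -> float:
    """Largest threshold tau whose conformal bound on 'accepted and faulty' is <= alpha."""
    n = len(scores)
    best = float("-inf")  # accept nothing if no threshold satisfies the bound
    for tau in sorted(set(scores)):
        accepted_bad = sum(1 for s, bad in zip(scores, is_bad) if s <= tau and bad)
        # n/(n+1) * (accepted_bad/n) + 1/(n+1), for a loss bounded by 1
        bound = (accepted_bad + 1) / (n + 1)
        if bound > alpha:
            break  # the loss only grows as tau increases
        best = tau
    return best


def accept(step_risks: list[float], tau: float) -> bool:
    # Accept the run if its trajectory risk is within the calibrated threshold.
    return trajectory_risk(step_risks) <= tau


# Hypothetical calibration runs: (step-level risk scores, whether the run was faulty).
runs = [
    ([0.1, 0.05], False), ([0.2, 0.1], False), ([0.3], False),
    ([0.1, 0.4], False), ([0.5, 0.2], False), ([0.6, 0.3], True),
    ([0.2, 0.7], False), ([0.8], True), ([0.9, 0.4], True),
]
scores = [trajectory_risk(steps) for steps, _ in runs]
labels = [bad for _, bad in runs]
tau = calibrate_threshold(scores, labels, alpha=0.25)
print(tau, accept([0.2, 0.3], tau), accept([0.1, 0.95], tau))
```

Scoring the whole trajectory means a run with one weak retrieval or tool step is rejected even if its final answer looks fine, which is the failure mode the abstract says final-answer-only calibration can miss.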

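2606.19242 2026-06-18 cs.SE New submission 85%

Runtime Compliance Verification for AI Agents

Nafiseh Kahani, Masoud Barati, Diana Addae

Topic match (Tool Calling): a runtime compliance verification framework for AI agents

AI summary: Proposes the C-Trace framework, which uses runtime monitoring and formal policy predicates to keep AI agents compliant with GDPR rules during tool invocations and dialogue, holding the attack success rate at or below 12% under 10% extractor noise.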

Abstract

AI agents now handle personal data through tool use, function calls, and multi-turn dialogue, which can create obligations under the General Data Protection Regulation (GDPR). Current testing practices mainly rely on offline red teaming or static prompt review, but they do not guarantee at runtime that agent behavior follows regulatory rules. We propose C-Trace (Compliance Trace based Runtime Agent Conformance Enforcement), a verification framework that: (i) expresses a subset of GDPR requirements, including consent, purpose limitation, data minimization, and the right to erasure, as formal policy predicates over agent execution traces; (ii) uses a runtime monitor that intercepts every tool invocation and model output and rejects non-compliant actions; and (iii) tests the agent with attack dialogues, including DSPy-generated prompts and verbatim prompts from red-teaming corpora, that try to induce violations. We evaluate the framework on four case studies reframed to GDPR. Under 10 percent per-category extractor noise, including drop-out and over-typing, the monitor keeps the attack success rate (ASR) at less than or equal to 12 percent, below the baselines we compare against, and false positives at less than or equal to 16 percent, and reaches 0 percent ASR under perfect extraction.
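
The idea of policy predicates evaluated by a runtime monitor before each tool call can be sketched as follows. This is a simplified illustration, not C-Trace: the `ToolCall` and `Trace` types, the per-purpose field allowlist, and the three predicates are hypothetical, and they cover only a rough version of consent, data minimization, and erasure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    tool: str
    purpose: str        # declared purpose of this call
    fields: frozenset   # personal-data fields the call would touch


@dataclass
class Trace:
    consented_purposes: set
    erasure_requested: bool = False
    calls: list = field(default_factory=list)  # calls that were allowed so far


# Hypothetical allowlist: which fields each purpose actually needs.
ALLOWED_FIELDS = {"shipping": {"name", "address"}, "support": {"name", "email"}}


def has_consent(trace: Trace, call: ToolCall) -> bool:
    return call.purpose in trace.consented_purposes


def is_minimal(trace: Trace, call: ToolCall) -> bool:
    return call.fields <= ALLOWED_FIELDS.get(call.purpose, set())


def not_erased(trace: Trace, call: ToolCall) -> bool:
    return not trace.erasure_requested


PREDICATES = {
    "consent": has_consent,
    "data_minimization": is_minimal,
    "right_to_erasure": not_erased,
}


def check(trace: Trace, call: ToolCall) -> list:
    # Evaluate every predicate before the call runs; reject if any is violated.
    violations = [name for name, ok in PREDICATES.items() if not ok(trace, call)]
    if not violations:
        trace.calls.append(call)  # only compliant calls enter the trace
    return violations


trace = Trace(consented_purposes={"shipping"})
lookup = ToolCall("crm.lookup", "shipping", frozenset({"name", "address"}))
allowed = check(trace, lookup)
too_much = check(trace, ToolCall("crm.lookup", "shipping", frozenset({"name", "ssn"})))
no_consent = check(trace, ToolCall("mailer.send", "support", frozenset({"email"})))
trace.erasure_requested = True
after_erasure = check(trace, lookup)
print(allowed, too_much, no_consent, after_erasure)
```

Returning the names of violated predicates, rather than a bare boolean, gives each rejection an auditable reason.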

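2606.19047 2026-06-18 cs.AI New submission 85%

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin

Institutions: Zhejiang University; Shanghai Innovation Institute; Westlake University

Topic match (Tool Calling): reward-driven data synthesis for multi-turn tool-use agents

AI summary: Targets the rapid depletion of informative samples in static datasets during multi-turn tool-use reinforcement learning. RODS uses progress reward variance as a zero-cost boundary detector and synthesizes samples online that match the agent's capability boundary, matching the performance of a 17K-sample offline pipeline with a pool of about 800 samples.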

Abstract

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
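
The variance argument can be checked with a few lines. For rewards in [0, 1], Popoviciu's inequality bounds the variance by (1 - 0)^2 / 4 = 0.25, and for binary rewards with success rate p the variance p(1 - p) reaches that bound at p = 0.5. The sketch below flags tasks near the bound from rollouts that already exist. The `boundary_tasks` helper, the 0.6 cutoff, and the rollout data are hypothetical, not taken from the paper.

```python
from statistics import pvariance

POPOVICIU_MAX = 0.25  # (1 - 0)^2 / 4 for rewards bounded in [0, 1]


def boundary_tasks(rollouts: dict[str, list[float]], min_ratio: float = 0.6) -> list[str]:
    # Score each task by its rollout reward variance relative to the maximum
    # possible variance, and keep the tasks close to that maximum.
    ratio = {task: pvariance(rewards) / POPOVICIU_MAX for task, rewards in rollouts.items()}
    kept = [task for task, r in ratio.items() if r >= min_ratio]
    return sorted(kept, key=lambda task: -ratio[task])


# Hypothetical per-task rewards from eight rollouts each.
rollouts = {
    "easy": [1.0] * 8,                                      # always solved, variance 0
    "hard": [0.0] * 7 + [1.0],                              # rarely solved
    "boundary": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],   # solved half the time
}
print(boundary_tasks(rollouts))
```

Since the rewards come from rollouts already computed for training, this selection adds no inference cost, which matches the abstract's description of the detector as zero-cost.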

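2606.18902 2026-06-18 cs.CL New submission 85%

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

Institutions: Slingshot AI; Department of Engineering, University of Cambridge

Topic match (Tool Calling): prompt optimization through multi-agent diagnostic code execution

AI summary: Proposes SPO, a stochastic prompt optimization framework, within which the SAGE method performs black-box search through a multi-agent pipeline with diagnostic code execution. Across several benchmarks, which strategy works best depends on error type, and continuous optimization on a mental-health chatbot yields a statistically robust gain in next-day retention.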

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.

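2606.18789 2026-06-18 eess.SY cs.SY New submission 85%

PowerAgentBench-SS: A Benchmark for Agentic AI in Power System Steady-State Studies

Costas Mylonas, Magda Foti, Andrea Pomarico, Matheus Duarte, Qian Zhang, Emmanouel Varvarigos

Topic match (Tool Calling): LLM agents executing power system workflows

AI summary: Proposes the PowerAgentBench-SS benchmark framework for evaluating how well LLM agents execute engineering workflows in power system steady-state studies, distinguishing agent performance through a tool API, a validation budget, and risk-sensitive metrics.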

Abstract

Power system benchmarks usually evaluate numerical solvers, prediction models, or sequential controllers. These benchmarks are necessary, but they do not directly test whether a Large Language Model (LLM) agent can execute an engineering workflow: inspect a grid case, select tools, call simulators, screen contingencies, propose admissible mitigations, validate results, and produce an auditable evidence trail. This paper introduces PowerAgentBench-SS, a steady-state benchmark framework for evaluating tool-using agents in power system operation and planning studies. The benchmark exposes public case data, action constraints, a tool API, and a validation budget to an agent, while a hidden evaluator recomputes physical validity and scores the submitted report. We define the agent interface, tool contract, evidence log, and risk-sensitive metrics, including submitted recall, evidence-backed recall, found recall, false-safe penalties, severity regret, residual violation score, action cost, tool-use efficiency, and workflow diagnostics. To make the framework concrete, we instantiate the protocol in a reproducible DC thermal N-2 contingency-search pilot on deterministic IEEE 39-bus operating-point variants, with scripted baselines, an LLM JSON-command adapter, three locally hosted Ollama LLM agents, and one OpenAI API agent. The results show why solver-only or answer-only evaluation is insufficient: agents are distinguished not only by top-contingency discovery, but also by validation-budget use, explicit submission, type coercions, duplicate validations, evidence-backed reporting, and mitigation behavior.

3. 多智能体 9 篇

2606.18837 2026-06-18 cs.MA cs.AI cs.LG 新提交 90%

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS:演化元技能以自动生成多智能体系统

Hehai Lin, Qi Yang, Chengwei Qin

发表机构 * Ant Group(蚂蚁集团) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

专题命中 多智能体 :自动生成多智能体系统,元技能演化。

AI总结 提出Skill-MAS,通过将高层编排能力解耦为可演化的元技能,在无需参数更新的情况下实现经验保留,利用多轨迹采样和选择性反思优化元技能,在多个基准和LLM上取得显著性能提升且成本可控。

详情
AI中文摘要

基于大型语言模型(LLM)的自动多智能体系统(MAS)生成已成为处理复杂任务的关键前沿。然而,现有方法在模型能力和经验保留之间面临两难困境。推理时MAS利用冻结的尖端LLM,但重复相同搜索而不从过去经验中学习。相反,训练时MAS通过梯度更新内化经验,但受限于较小模型的低能力上限,且难以扩展到大型尖端LLM。为弥合这一差距,我们提出Skill-MAS,一种新颖的第三条路径,通过将高层编排能力概念化为可演化的元技能,将经验保留与参数更新解耦。Skill-MAS通过一个封闭优化循环来精炼这种架构知识:(1)多轨迹采样在当前元技能下为每个任务采样行为分布;(2)选择性反思自适应选择优先任务,并应用分层对比分析将系统经验蒸馏为可泛化的策略级原则。在四个复杂基准和四个不同LLM上的大量实验表明,Skill-MAS不仅实现了显著的性能提升,而且保持了良好的成本-性能权衡。进一步分析揭示,演化后的元技能高度鲁棒,并在未见任务和不同LLM之间表现出强迁移性。

英文摘要

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

2606.18668 2026-06-18 cs.MA cs.CL 新提交 90%

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

EARS:大规模多智能体系统中可靠子智能体建模的解释性弃权

Shuang Xie, Yunan Lu, Han Li, Lingyun Wang

发表机构 * Shopify Columbia University(哥伦比亚大学)

专题命中 多智能体 :多智能体系统中子智能体弃权机制

AI总结 针对大规模多智能体系统中子智能体过度回答导致幻觉的问题,提出EARS框架,通过将弃权重构为智能体间通信协议,利用校准的LLM裁判模型生成结构化弃权标签和理由,微调子智能体以检测故障并返回理由,在电商助手系统中将响应通过率从68.5%提升至78.9%。

详情
AI中文摘要

在大规模企业环境中,集中式多智能体系统(MAS)日益被采用,其中协调器将用户请求委托给轻量级、领域专业化的子智能体。虽然这种架构提高了模块化、可扩展性和成本效率,但其可靠性不仅取决于准确的路由,还取决于子智能体根据能力约束校准其响应的能力。特别是,基于较小微调模型的子智能体通常难以进行这种校准,导致它们过度回答模糊、未明确说明、路由错误或不支持的请求,并产生幻觉输出,而不是可操作的反馈。为了应对这一挑战,我们提出了EARS(用于可靠子智能体建模的解释性弃权),这是一个面向生产的框架,将子智能体弃权重新定义为智能体间通信协议:子智能体不仅弃权,而且向协调器暴露可操作的故障状态。EARS使用一组校准的LLM裁判模型来策划人机交互数据,在子智能体故障模式的分类法下生成结构化的弃权标签和理由。这些数据用于微调子智能体,使其能够检测故障条件并返回理由,以便协调器进行澄清、重新路由或回退。我们在一个支持企业商业智能工作流程的大规模生产电商助手中评估了EARS。EARS将整体响应通过率从68.5%提高到78.9%,证明了子智能体侧的解释性弃权提高了MAS的可靠性。

英文摘要

In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub-agents' ability to calibrate their responses to capability constraints. In particular, sub-agents built on smaller fine-tuned models often struggle with such calibration, leading them to over-answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub-Agent Modeling), a production-oriented framework that reframes sub-agent abstention as an inter-agent communication protocol: a sub-agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human-agent interaction data using an ensemble of calibrated LLM-as-a-Judge models, producing structured abstention labels and rationales under a taxonomy of sub-agent failure modes. These data are used to fine-tune sub-agents to detect failure conditions and return rationales for coordinator-level clarification, rerouting, or fallback. We evaluate EARS in a large-scale production e-commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub-agent-side explanatory abstention improves MAS reliability.
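The idea of abstention as an inter-agent protocol can be sketched as a structured reply that the coordinator dispatches on. This is an illustrative sketch only: the `FailureMode` values reuse the four request types named in the abstract, while the class names, field names, and the mapping from failure mode to coordinator action are assumptions rather than the paper's taxonomy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureMode(Enum):
    """Hypothetical taxonomy, borrowed from the request types listed in the abstract."""
    AMBIGUOUS = "ambiguous"
    UNDERSPECIFIED = "underspecified"
    MISROUTED = "misrouted"
    UNSUPPORTED = "unsupported"

@dataclass
class SubAgentReply:
    answer: Optional[str] = None           # set when the sub-agent answers
    abstain: Optional[FailureMode] = None  # set when the sub-agent abstains
    rationale: str = ""                    # explanation surfaced to the coordinator

def coordinator_action(reply: SubAgentReply) -> str:
    """Map a reply to one of the coordinator-level actions (assumed mapping)."""
    if reply.abstain is None:
        return "deliver"
    if reply.abstain in (FailureMode.AMBIGUOUS, FailureMode.UNDERSPECIFIED):
        return "clarify"    # ask the user for the missing detail
    if reply.abstain is FailureMode.MISROUTED:
        return "reroute"    # hand the request to a different sub-agent
    return "fallback"       # unsupported: use a default path
```

The point of the structure is that the rationale travels with the abstention, so the coordinator receives an actionable failure state instead of a hallucinated answer.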

2606.18648 2026-06-18 physics.comp-ph 新提交 90%

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

物理科学中的深度研究:多智能体框架与综合基准

Yigeng Jiang, Tengchao Yang, Taoyong Cui, Jiaxing Wan, Yuan Wang, Weida Wang, Zhiyu Liu, Chuyi Peng, Binzhao Luo, Maoli Gao, Huaihai Huang, Yuqianer Zeng, Ziyang Zheng, Dongchen Huang, Chao Chen, Zichao Liu, Weiping Shen, Shuchen Pu, Siyu Zhou, Runmin Ma, Yusong Hu, Fei Chao, Bo Zhang, Xiawu Zheng, Zifu Wang, Lei Bai, Yunqi Cai, Shufei Zhang

专题命中 多智能体 :多智能体框架DelveAgent,物理科学深度研究

AI总结 提出PhySciBench基准评估LLM在物理科学中的深度研究能力,并开发DelveAgent多智能体框架,通过自适应规划、双粒度记忆和分层反思机制提升准确率并降低推理成本。

Comments 19 pages, 5 figures, 1 table

详情
AI中文摘要

深度研究智能体是基于大型语言模型(LLM)的系统,专为自主、多步骤的科学推理而设计,在加速物理科学研究方面具有巨大潜力。然而,目前仍缺乏对其在该领域能力的全面深入评估。为填补这一空白,我们引入了PhySciBench,一个与物理科学研究高度相关的基准,包含200个由专家精选的问题,在物理与化学之间均衡分布,覆盖反映真实科学工作流程的六个任务类别。对最先进模型和智能体系统在PhySciBench上的评估显示其性能有限;即使是最强的基线Gemini Deep Research,准确率也仅为33.5%。对失败案例的分析发现了三个反复出现的缺陷:长推理链的脆弱性、跨步骤的知识迁移有限以及缺乏基于物理的自验证。受这些发现启发,我们开发了DelveAgent,一个模块化的多智能体框架,配备自适应规划循环、双粒度记忆和分层的物理接地反思机制。在四个科学基准上,DelveAgent将准确率最多提高了7.5个百分点,同时将推理成本降低到最强基线的大约三分之一。这些结果确立了PhySciBench作为评估物理科学中AI系统的关键基准的重要性,并表明架构专业化可以有效增强自主科学研究的可靠性。

英文摘要

Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.

2606.19308 2026-06-18 cs.CL cs.MA 新提交 85%

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

通过多智能体虚拟博弈增强大语言模型的决策能力

Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

发表机构 * National University of Singapore(新加坡国立大学)

专题命中 多智能体 :多智能体虚拟博弈增强决策

AI总结 针对多智能体系统中决策任务因立场纠缠而难以分解的问题,提出基于虚拟博弈的多智能体虚拟博弈(MAFP)范式,通过迭代最佳响应实现均衡求解,提升决策质量和鲁棒性。

Comments 18 pages, 8 figures

详情
AI中文摘要

基于大语言模型(LLM)的多智能体系统(MAS)通过将子任务分配给协作智能体,在解决具有执行复杂性的任务方面展现出巨大潜力。然而,这种分而治之的范式在现实世界中同样普遍的决策任务上表现不足。这些任务要求所有相关利益方同时推理,其决策相互依赖,因此无法孤立解决。我们将这一挑战定性为立场纠缠,这是一种区别于执行复杂性的决策复杂性。为了解决这一问题,我们提出了多智能体虚拟博弈(MAFP),一种新颖的MAS范式,将利益方立场表示为智能体,并将决策制定形式化为一个均衡寻求过程。基于博弈论中的虚拟博弈原理,MAFP通过每个智能体对其他智能体过去决策的经验混合做出最佳响应,迭代更新其决策。这使得智能体能够暴露并解决彼此的弱点,逐步提高决策质量和鲁棒性。我们在具有挑战性的决策任务上评估MAFP,这些任务测试在行动前为竞争场景制定策略的能力。MAFP在两个互补指标——锦标赛强度和鲁棒性上,均优于单轮和多轮基线,证明了其在解决立场纠缠方面的有效性。

英文摘要

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.
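For readers unfamiliar with the underlying principle, the following is a minimal sketch of classical two-player fictitious play on rock-paper-scissors. It is not MAFP itself: in the paper the best responses come from LLM agents representing stakeholder stances, whereas here they are exact argmax computations over a small payoff matrix. The function names and the choice of game are illustrative assumptions.

```python
# Row player's payoff in rock-paper-scissors; the column player receives the negative.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def best_response(payoff, opponent_counts):
    """Pure action with the highest expected payoff against the opponent's empirical mixture."""
    total = sum(opponent_counts)
    expected = [
        sum(payoff[a][b] * opponent_counts[b] / total for b in range(len(opponent_counts)))
        for a in range(len(payoff))
    ]
    return max(range(len(expected)), key=expected.__getitem__)

def fictitious_play(rounds=20000):
    """Return the row player's empirical action frequencies after `rounds` updates."""
    # The column player's payoffs, indexed from its own point of view.
    col_payoff = [[-PAYOFF[a][b] for a in range(3)] for b in range(3)]
    row_counts, col_counts = [1, 0, 0], [0, 1, 0]   # arbitrary first moves
    for _ in range(rounds):
        a = best_response(PAYOFF, col_counts)       # row responds to column's history
        b = best_response(col_payoff, row_counts)   # column responds to row's history
        row_counts[a] += 1
        col_counts[b] += 1
    n = sum(row_counts)
    return [c / n for c in row_counts]

frequencies = fictitious_play()
```

In two-player zero-sum games such as this one, the empirical frequencies of fictitious play are known to approach an equilibrium mixture, here the uniform one, although convergence is slow. Whether comparable behavior holds when best responses are produced by LLM agents is the empirical question the paper studies.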

2606.19111 2026-06-18 cs.CL cs.AI cs.MA 新提交 85%

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

领导力作为协调控制:多智能体LLM团队中的行为特征与恢复优势边界

Haewoon Kwak

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

专题命中 多智能体 :多智能体LLM团队中领导力作为协调控制

AI总结 研究多智能体LLM团队中过程级协调控制何时增加价值,通过行为特征和消融实验发现,控制器的优势仅在初始多数投票不可靠、任务可恢复且无指导交互无法修复时出现,验证了权变理论。

Comments 33 pages

详情
AI中文摘要

团队科学认为领导力是权变的:它仅在特定条件下有帮助,而能力强的自主团队可能根本不需要领导。我们对多智能体LLM团队提出类似问题:在什么可测量的条件下,过程级协调控制会增加价值,这些条件是否与团队科学的预测一致?我们使用行为特征(多数锁定、探索、从错误的第0轮共识中恢复)和逐动作消融实验;由于每个控制器是一个显式动作集而非整体式提示,这些消融是干净的。我们将三种经典领导风格(交易型、变革型、情境型)操作化为作用于共享动作词汇(探索、修订、接受、综合)的控制器。一个具有相同动作但使用任意规则的匹配控制器,其恢复效果并不优于多数投票,因此起作用的是理论推导的规则,而非动作词汇。在四个任务体系和三个开放权重模型系列中,没有控制器在准确率上占主导地位,正如权变观点所预测的:交易型控制在全部12个(模型、体系)组合上与共享的第0轮投票持平,差异在1.3个百分点以内,增益仅出现在第0轮多数不可靠的那一个组合上(llama-4-scout的social任务体系;情境型比扁平结构高8个百分点)。经四个边界探针检验的恢复优势解释表明,控制器仅在第0轮多数不可靠、任务可恢复且无引导交互尚未将其修复时才优于纯交互。这些区域对应于权变理论(领导替代、路径-目标冗余、情境准备度差距),因此基本为零的准确率结果正是理论所预测的,而非控制器的失败。我们将过程级协调控制视为一种需要测量并映射到理论的权变因素,而不是一个需要登顶的排行榜。

英文摘要

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

2606.18268 2026-06-18 cs.SI cs.AI 新提交 85%

Towards Multi-Agent-Simulation-Based Community Note Evaluation

迈向基于多智能体模拟的社区笔记评估

Changxi Wen, Shuning Zhang, Bohao Chu, Yuwei Chuai, Hui Wang, Dai Shi, Xin Yi, Hewu Li

发表机构 * Tsinghua University, Beijing, China(清华大学,北京,中国) University of Duisburg-Essen, Duisburg, Germany(杜伊斯堡-埃森大学,杜伊斯堡,德国) University of Luxembourg, Luxembourg(卢森堡大学,卢森堡) Tongji University, Shanghai, China(同济大学,上海,中国)

专题命中 多智能体 :提出MultiCom多智能体框架模拟社区笔记评估。

AI总结 针对社区事实核查中跨共识延迟和低比例问题,提出ComRate数据集和MultiCom多智能体框架,通过矩阵分解聚类与校准聚合实现高精度评估。

详情
AI中文摘要

基于跨共识的社区事实核查在社交媒体平台上迅速扩展。然而,由人类贡献者评定的跨共识社区事实核查的延迟和低比例仍然是一个重大挑战。为解决这一问题,我们首先创建了ComRate,一个大规模数据集,包含来自$\mathbb{X}$的250万条社区笔记和超过2.09亿条评分。然后,我们提出了MultiCom,一个基于角色引导的多智能体评分框架,用于社区笔记评估。MultiCom通过在矩阵分解的评分者空间中对贡献者进行聚类,并提示角色智能体根据官方社区笔记评分模式生成结构化评估,从而模拟多样化的评分者群体。这些智能体输出结构化且可解释的判断,例如置信度、一致信号和原因。一种折外校准聚合算法结合原始投票和诊断性原因信号等特征,实现可靠预测。广泛评估表明,MultiCom优于其他方法,在评估集上平均准确率达到84.7%(平衡准确率68.3%,宏F1分数60.1%)。

英文摘要

Community-based fact-checking that relies on cross-consensus is expanding rapidly on social media platforms. However, the delay and low ratio of cross-consensus community fact-checks rated by human contributors remain a significant challenge. To address this, we first created ComRate, a large-scale dataset comprising 2.5 million community notes and over 209 million ratings sourced from $\mathbb{X}$. We then propose MultiCom, a persona-guided multi-agent rating framework for community note evaluation. MultiCom simulates a diverse rater population by clustering contributors in a matrix-factorized rater space and prompting persona agents to generate structured assessments based on the official community notes rating schema. These agents output structured and explainable judgments, such as confidence, agreement signals, and reasons. An out-of-fold calibrated aggregation algorithm combines features such as raw votes and diagnostic reason signals for reliable prediction. Extensive evaluations demonstrate that MultiCom outperforms alternative methods, achieving an average accuracy of 84.7% (balanced accuracy 68.3%, macro-F1 60.1%) on the evaluation set.
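The spread between the three reported metrics is typical of imbalanced labels, and a small worked example shows how it arises. The sketch below assumes a binary task for simplicity (the abstract does not state the number of classes), and the confusion counts are hypothetical, chosen only to show the effect rather than to reproduce the paper's numbers.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, balanced accuracy, and macro-F1 from binary confusion counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall_pos = tp / (tp + fn)             # recall on the minority class
    recall_neg = tn / (tn + fp)             # recall on the majority class
    balanced_accuracy = (recall_pos + recall_neg) / 2
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    macro_f1 = (f1_pos + f1_neg) / 2        # unweighted mean over both classes
    return accuracy, balanced_accuracy, macro_f1

# Hypothetical evaluation set: 100 minority-class notes out of 1000.
accuracy, balanced_accuracy, macro_f1 = binary_metrics(tp=40, fn=60, fp=90, tn=810)
```

With these counts the majority class dominates plain accuracy, while balanced accuracy and macro-F1 weight both classes equally and therefore come out lower, the same ordering as in the reported results.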

2606.18264 2026-06-18 cs.SI cs.AI cs.CL 新提交 85%

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

使用多LLM智能体模拟仇恨言论级联:实证基础、建模保真度与干预策略

Fan Huang

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

专题命中 多智能体 :使用多LLM智能体模拟仇恨言论传播与干预策略。

AI总结 本研究通过多LLM智能体系统模拟在线仇恨言论传播,发现其能再现实证数据中的立场单一性和毒性同质性,并通过消融实验识别出智能体异质性为关键保真因素,提出针对密集网络的放大器干预策略。

详情
AI中文摘要

在线平台上仇恨内容传播的忠实建模仍然是内容审核研究中的一个开放问题。经典的级联模型没有明确表示与仇恨内容传播相关的用户画像、社区和内容因素,因此在实际场景中部署时可能产生效果较差的审核策略。多智能体大语言模型系统原则上可以使每次转发决策依赖于用户画像、周围社区和帖子内容,但尚不清楚这种增加的灵活性是否比经典基线更忠实地再现真实的仇恨级联。我们研究了三个仇恨Bluesky级联和一个大小匹配的良性对照。在实证Bluesky数据中,我们发现:97.4%至99.7%的转发者采取敌对立场;对于仇恨级联,扩散树上的毒性-参与同质性高于关注图;仇恨级联的拓扑结构是星形(大多数转发直接来自根节点),而良性级联是树形(转发通过多跳链传播)。在模拟中,多LLM智能体模拟器再现了立场单一性和毒性差异方向。结构化消融实验将智能体异质性识别为主要的保真因素,针对密集网络的放大器干预在5.7%良性附带损害下实现了7.5%至12.9%的减少。

英文摘要

Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful-content propagation may yield moderation strategies that behave less effectively when deployed in real-world scenarios. Multi-agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user's profile, the surrounding community, and the post's content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size-matched benign control. In the empirical Bluesky data, we found that: 97.4% to 99.7% of reposters take a hostile stance; toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star-like for the hateful cascades (most reposts come directly from the root) versus tree-like for the benign cascade (reposts propagate through multi-hop chains). In simulation, a multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields a 7.5% to 12.9% reduction at 5.7% benign collateral.
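The star-like versus tree-like distinction can be made concrete with two simple statistics on a repost tree: the share of reposts made directly from the root, and the maximum depth. The sketch below is only an illustration of that idea. The parent-map representation, the function name, and the toy cascades are assumptions, and the paper may measure topology differently.

```python
def cascade_shape(parent):
    """Summarize a repost tree.

    `parent` maps each post to the post it reshared; the root maps to None.
    Returns (share of reposts made directly from the root, maximum depth).
    """
    def depth(node):
        d = 0
        while parent[node] is not None:   # walk up until the root is reached
            node = parent[node]
            d += 1
        return d

    reposts = [n for n, p in parent.items() if p is not None]
    depths = [depth(n) for n in reposts]
    direct_share = sum(d == 1 for d in depths) / len(reposts)
    return direct_share, max(depths)

# Toy cascades (hypothetical): one mostly star-like, one a multi-hop chain.
star_like = {"root": None, "a": "root", "b": "root", "c": "root", "d": "a"}
chain_like = {"root": None, "a": "root", "b": "a", "c": "b", "d": "c"}

star_stats = cascade_shape(star_like)
chain_stats = cascade_shape(chain_like)
```

A high direct share with shallow depth corresponds to the star-like pattern described for the hateful cascades, and a low direct share with greater depth corresponds to the multi-hop pattern described for the benign control.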

2606.15504 2026-06-18 cs.AI 新提交 85%

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

迈向氛围医学(Vibe Medicine):一种用于临床决策支持的自演化多智能体框架

Qianxue Zhang, Yiming Ren, Shihuan Qin, Xiao Zhang, Liao Zhang, Jinyang Huang, Zhengliang Liu, Chenbin Liu, Hongying Feng, Jingyuan Chen, Yuzhen Ding, Weihang You, Hanqi Jiang, Yi Pan, Yifan Zhou, Junhao Chen, Lifeng Chen, Wei Liu, Tianming Liu, Zengren Zhao, Lian Zhang

发表机构 * Medical AI Lab, The First Hospital of Hebei Medical University(河北医科大学第一医院医学人工智能实验室) Hebei Provincial Engineering Research Center for AI-Based Cancer Treatment Decision-Making, The First Hospital of Hebei Medical University(河北省人工智能癌症治疗决策工程研究中心,河北医科大学第一医院) State Key Laboratory of Neurology and Oncology Drug Development(神经与肿瘤药物研发国家重点实验室) School of Computing, University of Georgia(佐治亚大学计算学院) Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital and Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College(中国医学科学院北京协和医学院国家癌症中心/国家肿瘤临床医学研究中心/肿瘤医院深圳医院放射治疗科) Department of Radiation Oncology, Mayo Clinic(梅奥诊所放射肿瘤科) College of Mechanical and Power Engineering, China Three Gorges University(三峡大学机械与动力工程学院) Department of Radiation Oncology, Guangzhou Concord Cancer Center(广州康华肿瘤中心放射治疗科) Gastrointestinal Disease Diagnosis and Treatment Center, The First Hospital of Hebei Medical University(河北医科大学第一医院胃肠疾病诊疗中心) Department of General Surgery, The First Hospital of Hebei Medical University(河北医科大学第一医院普通外科)

专题命中 多智能体 :提出多智能体框架,包含三个专用智能体

AI总结 提出VIBEMed多智能体框架,通过自演化机制和架构级安全沙箱,从交互历史中动态学习,实现个性化临床决策支持。

详情
AI中文摘要

近年来,大型语言模型和自主智能体的进步彻底改变了医疗领域,促进了诊断并改善了治疗结果。然而,大多数现有AI系统依赖预训练知识和预定义流程,难以从包含患者结果和过去失败的交互式聊天会话历史中动态学习。为解决这一限制,我们提出了VIBEMed,一种具有内置自演化机制和架构级安全沙箱的多智能体框架,用于稳健的临床决策支持。该系统集成了三个专门智能体:用于假设生成的临床诊断智能体(CDA)、用于治疗计划的治疗执行智能体(TEA)以及将纵向临床反馈提炼为可重用知识的临床演化管理智能体(CEMA),将多模态患者信息转化为个性化医疗决策。通过自演化机制,该框架实现了跨记忆、模型行为和决策策略的迭代更新,使系统能够随时间改进。实验结果表明,VIBEMed通过其演化机制在复杂临床病例中表现出优越性能,特别是在需要集成决策和纵向规划的任务中。该框架还支持在具有挑战性的场景(如肿瘤治疗规划)中进行可靠的端到端决策,凸显了其在真实临床环境中的可行性。总体而言,VIBEMed为超越静态AI系统、迈向自适应、经验驱动的临床决策支持提供了一条实用路径,展示了将多智能体协作与持续演化相结合以推进精准医学的价值。

英文摘要

In recent years, advances in large language models and autonomous agents have revolutionized the healthcare field, facilitating diagnosis and improving treatment results. However, most existing AI systems rely on pre-trained knowledge and predefined pipelines, which struggle to learn dynamically from the interactive chat session history that contains patient outcomes and past failures. To address this limitation, we propose VIBEMed, a multi-agent framework with a built-in self-evolution mechanism and an architecture-level safety sandbox for robust clinical decision support. The system integrates three specialized agents: a Clinical Diagnostic Agent (CDA) for hypothesis generation, a Therapeutic Execution Agent (TEA) for treatment planning, and a Clinical Evolution Manager Agent (CEMA) that distills longitudinal clinical feedback into reusable knowledge. Together they transform multimodal patient information into personalized medical decisions. Through the self-evolution mechanism, the framework enables iterative updates across memory, model behavior, and decision strategies, allowing the system to improve over time. Experimental results show that VIBEMed demonstrates superior performance through its evolving mechanism in complex clinical cases, particularly in tasks that require integrated decision-making and longitudinal planning. The framework also supports reliable end-to-end decisions in challenging scenarios such as oncology treatment planning, highlighting its feasibility in real-world clinical contexts. Overall, VIBEMed provides a practical path beyond static AI systems toward adaptive, experience-driven clinical decision support, demonstrating the value of combining multi-agent collaboration with continuous evolution for advancing precision medicine.

2606.07150 2026-06-18 cs.CR cs.AI cs.MA cs.NI 新提交 85%

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

从隐私到工作流完整性:自主智能体互操作性中的通信图元数据

Bijaya Dangol

发表机构 * Independent Researcher(独立研究者)

专题命中 多智能体 :研究智能体互操作性协议中的通信图元数据威胁

AI总结 针对智能体通信图元数据泄露问题,提出工作流完整性威胁模型,定义传输层与引导层隐私属性,并通过A2A案例验证元数据保护可有效抑制任务推断。

Comments 22 pages, 7 figures, 6 tables

详情
AI中文摘要

诸如A2A和MCP之类的智能体互操作性协议标准化了智能体之间的通信内容,但假设基于地址的HTTP(S)传输。此类传输保护消息内容,并越来越多地采用端到端加密。它们暴露在明文中的是通信图:哪个智能体联系哪个智能体、何时以及频率如何。在智能体系统中,该图比隐私框架所暗示的更具后果性。端点通常带有能力标签,工作流是结构化和链式的,交互与实际行动耦合,因此观察者恢复的不仅仅是过去的关系。它可以推断出待处理的工作流、正在组装的任务以及可能即将发生的行动。以机器速度,它可以在工作流完成之前根据该推断采取行动。因此,威胁是工作流完整性,而不仅仅是隐私:对自主行动的预测性杠杆。我们为智能体通信图提供了一个威胁模型;识别了使智能体元数据具有独特揭示性的因素(语义性、前瞻性、驱动性);定义了传输层和引导层隐私属性,并评估了候选传输(SimpleX/SMP、Tor、混合网络)与这些属性的匹配程度;并提出了一个A2A案例研究,其中元数据保护绑定是可表达的,但揭示了协议的身份假设。我们在一个基于真实A2A捕获的生成模型上测试了这些。仅凭被动元数据,没有载荷,一个分类器从工作流的开头就能以远高于随机水平的概率恢复任务类别;应用这些属性后,该恢复被急剧拉回随机水平。除了观察者能恢复的内容外,我们衡量了利用泄露的杠杆:在工作流开头和固定预算下,选择对哪些工作流采取行动的对手在此模型中实现了大部分先知攻击者相对于元数据盲攻击者的优势,而相同的属性抑制了这一点。

英文摘要

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to actions, so an observer recovers more than past relationships: it can recognize a recurring pending workflow from its opening and, at machine speed, act on it before it completes. The threat is one of workflow integrity, not privacy alone. We give a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, give them an indistinguishability-game semantics, evaluate transports, and give an A2A case study where a metadata-protecting binding surfaces its implicit identity assumptions. On a corpus of real multi-agent A2A traffic from the official reference agents, on a live A2A binding, and with a generative model as a controlled instrument, a label-blind classifier recovers a task's class from passive metadata at 6x chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. Acting on the leak is distinct from recoverability: under a fixed budget an adversary captures 0.63 of a clairvoyant attacker's advantage on the corpus (0.41 from a workflow's opening), governed by top-ranked precision rather than overall accuracy, so integrity and privacy come apart under defense.

4. 规划决策 2 篇

2606.18543 2026-06-18 cs.AI cs.CL cs.SE 新提交 90%

CEO-Bench: Can Agents Play the Long Game?

CEO-Bench:智能体能否玩转长期博弈?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

专题命中 规划决策 :模拟500天运营初创公司任务

AI总结 提出CEO-Bench,通过模拟500天运营初创公司的任务,评估语言模型智能体在长期、不确定、动态环境下的综合决策能力。

详情
AI中文摘要

语言模型智能体在软件工程、客户服务等孤立、短期的任务上正变得熟练。然而,现实世界的挑战需要结合多种复杂技能,这些技能在很大程度上尚未在智能体中得到测试:(1)在不确定性中导航长期视野;(2)在嘈杂环境中获取信息;(3)适应不断变化的世界;(4)协调多个移动部分以实现连贯目标。我们引入CEO-Bench,通过模拟一个代表性的现实世界任务——运营一家初创公司500天——来共同评估这些能力。智能体通过可编程的Python接口管理一家虚构公司的定价、营销、预算等众多方面,在相同的环境中运行,并面临与人类CEO相同的挑战。成功需要分析嘈杂、相互关联的业务数据库,将信号转化为合理的策略,并通过编程协调许多决策。最强的智能体编写复杂的代码,模拟客户群体以预测未来现金流,并挖掘谈判历史以揭示隐藏的客户偏好。即便如此,大多数最先进的模型在此环境中挣扎。只有Claude Opus 4.8和GPT-5.5的最终余额超过100万美元的起始资金,且两者均未能持续盈利。CEO-Bench迈出了衡量驱动持续、自适应进步所需智能的第一步。

英文摘要

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.18633 2026-06-18 cs.MA 新提交 85%

PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

PersonalPlan:面向个性化编程学习的多智能体系统规划

Zhiyuan Wen, Jiannong Cao, Peng Gao, Haochen Shi, Wengpan Kuan, Bo Yuan, Xiuxiu Qi

专题命中 规划决策 :多智能体规划器用于个性化编程学习

AI总结 提出PersonalPlan,一种两阶段多智能体规划器,通过分层SFT和奖励自适应GRPO生成可执行、个性化且具有教学支架的计划,在MAP-PPL数据集上优于现有方法。

详情
AI中文摘要

有效的编程教育需要针对不同学习者背景进行个性化教学。然而,虽然基于LLM的多智能体系统(MAS)擅长复杂规划,但现有规划器通常缺乏画像锚定(profile-grounding)和教学支架(pedagogical scaffolding),从而削弱了个性化编程学习的效果。为填补这一空白,我们首先引入MAP-PPL(Multi-Agent Plans for Personalized Programming Learning),这是一个以学习者画像为条件的多智能体规划数据集,包含来自1,730个Stack Overflow问题组和2,738个学习者画像的3,043个查询-画像-计划实例。每个计划指定了智能体、子任务、可执行步骤和先决依赖关系。然后,我们提出PersonalPlan,一个两阶段MAS规划器,首先使用独立的LoRA适配器进行分层SFT,分别用于画像感知的任务分解和步骤依赖规划,然后应用奖励自适应GRPO,鼓励模型生成可执行、个性化且具有教学支架的计划。在MAP-PPL上进行的大量实验将PersonalPlan与前沿LLM、通用MAS框架和智能体规划器进行比较,证明了其优越性。仅使用8B和32B变体,PersonalPlan就在计划可执行性、个性化和教学质量方面达到了最先进水平,有效编排MAS以支持智能体与学生的交互。

英文摘要

Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill in the gap, we first introduce MAP-PPL (Multi-Agent Plans for Personalized Programming Learning), a profile-conditioned multi-agent planning dataset with 3,043 query-profile-plan instances from 1,730 Stack Overflow question groups and 2,738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose PersonalPlan, a two-stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile-aware task decomposition and step dependency planning, then applies a Reward-Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP-PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.

5. 工作流自动化 3 篇

2606.18502 2026-06-18 cs.CL 新提交 90%

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

面向企业应用的多智能体系统可扩展定制与部署

Paresh Dashore, Shreyas Kulkarni, Uttam Gurram, Nadia Bathaee, Kartik Balasubramaniam, Genta Indra Winata, Sambit Sahu, Shi-Xiong Zhang

发表机构 * Capital One(第一资本)

专题命中 工作流自动化 :多智能体系统定制与部署框架

AI总结 提出统一框架,通过智能体模型定制(持续预训练、微调、偏好优化)和推理优化(推测解码、FP8量化),实现领域自适应和4.48倍吞吐加速,保持性能并提升长尾场景鲁棒性。

Comments Preprint

详情
AI中文摘要

基于大语言模型的多智能体系统在复杂推理和任务执行上表现出色,支持广泛的企业应用。然而,由于领域特定的定制需求以及智能体工作流中的高延迟和推理成本,生产部署仍然具有挑战性。我们提出了一个统一框架,用于在实际环境中定制和高效部署多智能体系统。第一阶段,智能体模型定制,结合持续预训练、监督微调和偏好优化,将紧凑模型适应到专业领域,同时保留强大的智能体能力。第二阶段,推理优化,集成推测解码和FP8量化与目标校准,以最小质量损失实现成本高效的推理服务。在企业工作负载上,我们的框架实现了快速领域自适应,吞吐量提升4.48倍,同时保持性能并提高长尾场景的鲁棒性。

英文摘要

Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi-agent systems in real-world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48x speedup in throughput while maintaining performance and improving robustness on long-tail scenarios.
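To show where a throughput gain from speculative decoding can come from, here is a minimal sketch of the greedy variant: a cheap draft model proposes several tokens and the target model verifies them, keeping the agreed prefix. The abstract does not say which variant the framework uses, so this is an assumption, and the toy next-token functions below stand in for real models.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; the output matches plain greedy decoding with target_next."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks each proposed position.
        #    A real system scores all positions in one batched forward pass.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)   # always keep the target's own token
            if expected != tok:         # first disagreement ends this round
                break
        else:
            # All k drafts were accepted, so the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes

# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees except right after a 7.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10

output, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
```

Because every emitted token is the target model's own greedy choice, quality is unchanged in this variant, and the saving comes from needing fewer target passes than generated tokens whenever the draft agrees often enough.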

2606.18661 2026-06-18 cs.CV cs.AI 新提交 85%

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench:一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

发表机构 * Central South University(中南大学)

专题命中 工作流自动化 :指令驱动智能体框架,自主识别分析滑坡

AI总结 提出指令驱动智能体框架,包含多模态数据集LandslideBench、滑坡专用视觉语言模型LandslideVLM及领域规则增强智能体LandslideAgent,实现自主滑坡识别与分析。

详情
AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要,然而当前范式难以同时提取视觉特征和高层次地球科学语义,而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战,我们提出一个指令驱动的智能体框架,包含三个组成部分。首先,通过多VLM交叉验证和交互式标注构建LandslideBench,这是一个多模态细粒度数据集,包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后,通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM,以增强地质语义理解。最后,以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent,采用双规则控制器,结合结构化报告元数据约束和交叉验证识别约束,来调控自动化工具调用。实验表明,LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理,实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18425 2026-06-18 cs.SE cs.AI cs.DC New submission 85%

From Specification to Execution: AI Assisted Scientific Workflow Management

从规范到执行:AI辅助的科学工作流管理

Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

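Affiliations * RENCI, University of North Carolina at Chapel Hill, NC, USA; Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

Topic match Workflow automation: AI-assisted generation and debugging of scientific workflows

AI summary Proposes an AI-assisted approach that combines specification-driven workflow generation, automated debugging, and distributed execution. By integrating Pegasus with an MCP layer, it manages scientific workflows end to end, from natural-language intent to large-scale execution.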

AI-generated Chinese abstract

科学工作流管理系统(WMS)支持复杂管道的可扩展和可重复执行,但工作流的设计、实现和调试仍然主要依赖人工,需要大量专业知识。最近使用大型语言模型(LLM)的方法在从自然语言生成工作流方面显示出潜力,但通常依赖于直接的代码合成,这限制了透明度、可重复性以及与工作流系统的集成。我们提出了一种AI辅助的科学工作流管理方法,结合了规范驱动的工作流生成、自动化调试和分布式执行。该方法引入了一个结构化的规范阶段,将工作流意图、设计和实现分离,允许在代码生成之前进行验证。我们还开发了一个基于LLM的调试代理,用于诊断和解决跨多个系统层的故障。为了支持分布式执行和用户交互,我们将广泛使用的WMS Pegasus与模型上下文协议(MCP)层集成,为工作流提交、监控和控制提供统一接口。我们使用一个用于医学影像的联邦学习工作流来评估该方法,该工作流具有并行、迭代和依赖密集的结构。该系统生成并执行了包含数千个作业的大规模工作流,减少了调试工作量,并允许非专家用户使用专家级设计模式构建工作流。这些结果表明,端到端的AI辅助工作流生成和执行是可行的,并指向了用于管理科学工作流生命周期的AI驱动平台。

English abstract

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.
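The abstract's idea of validating a workflow specification before any code is generated can be illustrated with a small sketch. It assumes a specification is simply a mapping from job names to the jobs they depend on; that layout, the job names, and the `validate_spec` helper are hypothetical and are not taken from the paper or from the Pegasus API.

```python
from graphlib import CycleError, TopologicalSorter


def validate_spec(spec: dict[str, list[str]]) -> list[str]:
    """Return one valid execution order, or raise ValueError if the spec is malformed."""
    # Every dependency must itself be a declared job.
    for job, deps in spec.items():
        for dep in deps:
            if dep not in spec:
                raise ValueError(f"job {job!r} depends on undeclared job {dep!r}")
    try:
        # static_order() yields each job only after all of its dependencies.
        return list(TopologicalSorter(spec).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


# Hypothetical federated-learning round: two parallel client jobs feed one aggregator.
spec = {
    "init_model": [],
    "train_client_a": ["init_model"],
    "train_client_b": ["init_model"],
    "aggregate": ["train_client_a", "train_client_b"],
}
order = validate_spec(spec)
print(order)
```

Checking the dependency graph at the specification stage means a missing job or a cycle is reported before any workflow code exists, which is one plausible reading of "validation prior to code generation".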

6. Other agents 2 papers

2606.18142 2026-06-18 cs.AI cs.CL cs.CY New submission 85%

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

你的AI旅行代理会为你预订斗牛:前沿AI模型中隐含动物福利的代理基准

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

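Affiliations * Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

Topic match Other agents: evaluating animal welfare in AI agents' travel bookings

AI summary Introduces TAC, presented as the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when they carry out actions such as travel booking on a user's behalf. Of the seven frontier models evaluated, every one scores below the 64% chance level, and the best reaches only 53%.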

AI-generated Chinese abstract

AI代理正从顾问转变为行动者,代表用户预订旅行、规划菜单和管理采购。现有的AI与动物福利基准评估模型对问答提示的文本响应,但未检验这些响应中的福利推理是否迁移到代理部署中(模型必须使用工具采取行动)。我们引入TAC(旅行代理同情心),这是首个衡量AI代理在代表用户行动时是否避免涉及动物剥削选项的代理基准。TAC向AI代理提供十二个手工编写的旅行预订场景,涵盖六类动物剥削,并扩展至四十八个样本以控制价格、评分和位置混淆因素。我们评估了来自四个实验室的七个前沿模型。每个模型得分均低于随机水平64%,最佳表现者(Claude Opus 4.7)为53%。系统提示中的单一福利意识句子在Claude和GPT-5.5中带来47至63个百分点的提升,在GPT-5.2中提升26个百分点,在DeepSeek和Gemini中提升不足12个百分点。一项辅助的Inspect Scout审计(使用Gemini 2.5 Flash Lite作为评判者,对前两名模型的288个基础条件转录进行审计)未标记任何评估意识转录,表明低于随机水平的比率并非源于模型识别出评估。我们讨论了跨文化领域的类别级变化、文本响应福利基准的局限性以及欧盟通用AI实践准则系统性风险框架的影响。

English abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.
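To make "below the chance level" concrete, the sketch below computes the expected score of an agent that picks uniformly at random and compares it with an observed safe-choice rate. The scenarios, option flags, and picks are invented for illustration, and the assumption that chance is the average share of welfare-safe options per scenario is ours; the abstract does not say how TAC derives its 64% figure.

```python
from fractions import Fraction

# Hypothetical scenarios: each option is True if it involves animal exploitation.
scenarios = [
    [False, True],               # 1 of 2 options is welfare-safe
    [False, False, True],        # 2 of 3
    [False, True, True, True],   # 1 of 4
]


def chance_level(scenarios):
    """Expected safe-choice rate of an agent that picks uniformly at random."""
    per_scenario = [Fraction(opts.count(False), len(opts)) for opts in scenarios]
    return sum(per_scenario) / len(per_scenario)


def safe_rate(scenarios, picks):
    """Fraction of scenarios in which the chosen option index is welfare-safe."""
    safe = sum(1 for opts, i in zip(scenarios, picks) if not opts[i])
    return Fraction(safe, len(scenarios))


picks = [1, 2, 0]  # hypothetical agent choices, one option index per scenario
print(chance_level(scenarios))      # 17/36
print(safe_rate(scenarios, picks))  # 1/3
```

Here the hypothetical agent is safe in 1 of 3 scenarios, below the 17/36 a random picker would average, which is the kind of gap the abstract reports between 53% and 64%.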

2606.12837 2026-06-18 cs.CL New submission 85%

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

LoHoSearch: 超越人类难度上限的长时域搜索代理基准测试

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

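Affiliations * Meituan

Topic match Other agents: benchmarking long-horizon search agents

AI summary Introduces the LoHoSearch benchmark: 544 human-verified complex questions built automatically from a knowledge graph covering over 7 million Wikipedia entities. The strongest model evaluated reaches only 34.74% accuracy, well short of the 90%+ that top models achieve on human-authored benchmarks such as BrowseComp.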

AI-generated Chinese abstract

以BrowseComp为代表的搜索代理基准在过去一年中迅速饱和,最强模型已超过90%准确率。由于这些基准主要由人类编写,标注者缺乏对实体统计的全局视角,无法系统性地最大化搜索空间大小和结构复杂性,这造成了难以突破的难度上限。为解决这一问题,我们引入了LoHoSearch(长时域搜索代理),一个包含544个人工验证问题、覆盖11个领域的挑战性基准。LoHoSearch通过基于覆盖超过700万维基百科实体的知识图谱的自动化流水线构建,该流水线选择具有大搜索空间的关系,并将其组装成结构复杂且具有知识图谱验证的唯一答案的问题。我们的评估表明,即使是最强模型也仅达到34.74%的准确率,且现有的上下文管理策略(最佳提升+6.8%)带来的增益远小于先前基准。LoHoSearch为评估搜索代理中的长时域推理和上下文管理提供了更高要求的标准。

English abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
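The construction idea in the abstract (pick relations with large search spaces, combine them, and keep only questions whose answer the knowledge graph confirms is unique) can be sketched on a toy graph. Everything here is an assumption made for illustration: the triples, the fan-out measure in `search_space`, and the helper names are not from the paper.

```python
from collections import defaultdict
from typing import Optional

# Toy knowledge graph as (subject, relation, object) triples; all entities are made up.
TRIPLES = [
    ("film_1", "directed_by", "alice"),
    ("film_2", "directed_by", "alice"),
    ("film_3", "directed_by", "bob"),
    ("film_1", "released_in", "1999"),
    ("film_2", "released_in", "2004"),
    ("film_3", "released_in", "1999"),
]


def candidates(relation: str, obj: str) -> set[str]:
    """All subjects s such that (s, relation, obj) is in the graph."""
    return {s for s, r, o in TRIPLES if r == relation and o == obj}


def search_space(relation: str) -> float:
    """Average number of subjects per object for a relation (a crude fan-out measure)."""
    by_obj = defaultdict(set)
    for s, r, o in TRIPLES:
        if r == relation:
            by_obj[o].add(s)
    return sum(len(v) for v in by_obj.values()) / len(by_obj)


def unique_answer(constraints: list[tuple[str, str]]) -> Optional[str]:
    """Intersect candidate sets; return the answer only if exactly one entity remains."""
    remaining = set.intersection(*(candidates(r, o) for r, o in constraints))
    return next(iter(remaining)) if len(remaining) == 1 else None


# "Which film was directed by alice and released in 1999?"
question = [("directed_by", "alice"), ("released_in", "1999")]
print(unique_answer(question))                    # film_1
print(unique_answer([("directed_by", "alice")]))  # None: two films match
```

Each constraint on its own matches several entities, so a searcher has to explore a wide candidate set, while the intersection check guarantees the composed question still has exactly one answer.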