arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

AI Agent

智能体、工具调用、规划、工作流、多智能体和自主任务执行。

今日/当前日期收录 9 信号源:cs.AI, cs.CL, cs.LG, cs.SE
2606.20529 2026-06-19 cs.AI cs.CL 新提交 90%

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent: 策略遵从工具调用代理的结构化状态

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

发表机构 * Arizona State University(亚利桑那州立大学) University of Arizona(亚利桑那大学)

专题命中 工具调用 :提出策略遵从工具调用代理的结构化状态方法

AI总结 针对客服领域策略遵从工具调用代理,提出LedgerAgent方法,通过独立账本维护任务状态并渲染到提示中,在执行工具调用前检查状态依赖策略约束,提升多轮一致性。

Comments Work in Progress

详情
AI中文摘要

在客服领域,策略遵从工具调用代理必须在跨轮次调用工具时维护任务状态,并遵守领域策略。任务状态包括通过用户交互和工具调用观察到的事实、标识符、约束和条件。在标准代理中,任务状态没有单独表示。观察结果、工具返回和策略指令被放入提示中,使得代理每次决定下一步时都需要从提示中重建相关状态。这种设计使状态管理变得隐式,导致两种常见失败模式:代理可能检索到正确的事实,但后来基于过时、缺失或不正确的信息做出决策;语法上有效的工具调用可能仍然违反依赖于当前任务状态的领域策略。我们引入了\textsc{LedgerAgent},一种用于工具调用代理的推理时方法,它在单独的账本中维护观察到的任务状态,并将状态渲染到提示中。在执行改变环境的工具调用之前,账本还用于检查状态依赖的策略约束,阻止策略违规。在四个客服领域以及开源和闭源模型的混合面板上,\textsc{LedgerAgent}在标准基于提示的工具调用方法上提高了平均pass^k,在更严格的多轮一致性指标下提升最大。

英文摘要

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce \textsc{LedgerAgent}, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, \textsc{LedgerAgent} improves average pass\textasciicircum{}k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.

2606.19992 2026-06-19 cs.SE cs.AI 新提交 90%

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

超越静态端点:工具程序作为灵活智能体网络服务的接口

Mugeng Liu, Shuoqi Li, Yixuan Zhang, Yun Ma

发表机构 * School of Computer Science, Peking University, Beijing, China School of Software \& Microelectronics, Peking University, Beijing, China Institute for Artificial Intelligence, Peking University, Beijing, China

专题命中 工具调用 :提出工具程序接口,优化智能体网络服务

AI总结 提出ToolPro,将工具意图表示为可执行程序,通过约束引导构建、效应感知重放和策略决策,在MCP服务上实现最高53.4%的延迟降低和96.1%的流量减少。

Comments Accepted by ICML 2026

详情
AI中文摘要

在智能体网络时代,基于LLM的智能体越来越多地将网络服务作为工具调用,然而大多数接口仍然是\emph{静态端点},难以表达包含循环、条件、连接和重试的长周期工作流。我们提出ToolPro,它将智能体的工具意图表示为一个\emph{可执行工具程序},该程序紧凑地编码了多步服务交互并带有显式效应类型。ToolPro结合了约束引导的程序构建、用于精确一次状态修改调用的效应感知重放,以及一个基于配置文件的策略,该策略决定何时程序执行优于逐步调用。我们在具有WebAssembly沙箱的MCP风格服务上实例化ToolPro,并在现实应用的各种工作流上进行了评估。ToolPro将端到端延迟降低了高达53.4%,客户端流量减少了高达96.1%,在网络延迟和工作流复杂度更高时收益更大。

英文摘要

In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain \emph{static endpoints} that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an \emph{executable tool program} that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4\% and client-side traffic by up to 96.1\%, with larger gains under higher network latency and workflow complexity.

2606.20515 2026-06-19 cs.CV 新提交 85%

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

S-Agent:空间工具使用激发空间智能推理

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

发表机构 * NTU(南洋理工大学) THU(清华大学) ByteDance(字节跳动) NWPU(西北工业大学)

专题命中 工具调用 :提出空间工具使用智能体范式,层次化工具集

AI总结 提出S-Agent空间工具使用智能体范式,通过时空证据积累和层次化工具集,将VLM作为语义规划器,实现连续多视图图像和视频的空间推理,在无训练下提升开源和闭源VLM性能,并基于S-300K轨迹微调得到紧凑空间智能体S-Agent-8B。

Comments Project Page : https://Ropedia.github.io/S-Agent

详情
AI中文摘要

现实世界的空间智能需要对连续且不断变化的三维世界进行推理,然而现有的VLM和工具增强智能体大多仍局限于从孤立的视觉观察中进行静态、无状态的推理。我们引入了\textbf{\textsc{S-Agent}},一种用于理解和推理连续多视图图像和视频的空间工具使用智能体范式。通过将空间推理表述为时空证据积累而非孤立的帧级预测,\textsc{S-Agent}将空间感知重塑为以场景为中心的理解,超越以帧为中心的识别。具体而言,\textsc{S-Agent}将VLM作为语义规划器,决定需要哪些证据,而层次化的空间工具和专家将物体锚定在2D中,将其提升为3D几何证据,并将这些证据聚合为高级空间知识(例如,计数、测量、方向和相对位置)。此外,时间记忆机制,包括用于维护不断演变的场景状态的场景记忆和用于积累推理上下文的智能体记忆,实现了跨帧和推理步骤的证据整合。在多视图和视频空间推理基准上的全面实验表明,\textsc{S-Agent}以无需训练的方式持续提升开源和闭源VLM的性能。除了推理时增强,在\textsc{S-Agent}生成的空间轨迹\textsc{S-300K}上进行监督微调(SFT)得到了\textsc{S-Agent-8B},一个紧凑的空间智能体,显著超越了类似规模的基线(例如,Qwen3-VL-8B),并与先进的闭源模型(例如,GPT-5.4和Gemini 3)性能相当。

英文摘要

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce \textbf{\textsc{S-Agent}}, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, \textsc{S-Agent} reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, \textsc{S-Agent} casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (\textit{e.g.}, counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that \textsc{S-Agent} consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on \textsc{S-Agent}-generated spatial trajectories \textsc{S-300K} yields \textsc{S-Agent-8B}, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

2606.20401 2026-06-19 eess.SY cs.SY 新提交 85%

PowerAgentBench-Dyn: A Benchmark for Agentic AI in Power System Dynamic Studies

PowerAgentBench-Dyn:电力系统动态研究中智能体AI的基准测试

Qian Zhang, Andrea Pomarico, Costas Mylonas, Magda Foti, Alberto Berizzi, Le Xie

专题命中 工具调用 :LLM智能体基准测试,评估电力系统动态分析中的工具使用和推理

AI总结 提出PowerAgentBench-Dyn基准,用于评估基于LLM的智能体在电力系统动态分析任务中的能力,涵盖模型质量审查和安全风险筛选两个任务。

详情
AI中文摘要

基于大型语言模型(LLM)的智能体越来越多地被用于通过与软件工具交互、解释中间结果以及自主规划后续行动来自动化多步骤工程工作流。电力系统动态研究是这些智能体一个特别有前景但尚未充分探索的应用领域。与静态计算任务不同,动态研究通常需要更多时间进行模型参数校准、工程判断以及在受限动作空间下的决策。本文介绍了PowerAgentBench-Dyn,一个旨在评估智能体AI系统在电力系统动态分析任务上的基准测试。该基准针对那些不能简化为单一优化或编码任务的问题,而是需要经验丰富的电力系统工程师日常执行的那种推理、工具使用和迭代实验。所提出的框架包括两个初始基准任务。第一个是动态模型质量审查基准,评估智能体根据系统运营商指定的模型质量合规标准验证和诊断动态模型的能力。第二个是动态安全风险筛选基准,评估智能体利用语义记忆和有限的仿真预算从未见故障数据集中识别、排序和分析最关键短路事故,并提出和评估可能的缓解措施的能力。对于每个任务,我们定义了仿真环境、观测和动作空间以及评估指标。该基准在基于度量的意义上是可复现的:发布案例和仿真器设置定义了确定性评估器,而随机智能体行为通过重复运行使用成功率和其他指标进行评估。该基准支持未来用于电力系统运行和规划的智能体AI的开发。

英文摘要

Large Language Model (LLM)-based agents are increasingly being used to automate multi-step engineering work flows by interacting with software tools, interpreting intermediate results, and autonomously planning subsequent actions. Power system dynamic studies represent a particularly promising yet largely unexplored application domain for these agents. Unlike static computational tasks, dynamic studies often require more time on model parameter calibration, engineering judgment, and decision making under constrained action spaces. This paper introduces PowerAgentBench-Dyn, a benchmark designed to evaluate Agentic AI systems on power system dynamic-analysis tasks. The benchmark targets problems that cannot be reduced to a single optimization or coding task, but instead require a type of reasoning, tool usage, and iterative experimentation routinely performed by experienced power system engineers. The proposed framework includes two initial benchmark tasks. The first, the Dynamic Model Quality Review Benchmark, evaluates agents' ability to validate and diagnose dynamic models based on model-quality compliance criteria specified by system operators. The second, the Dynamic Security Risk Screening Benchmark, assesses agents' capability to leverage semantic memory and a limited simulation budget to identify, rank, and analyze the most critical short-circuit contingencies from an unseen fault dataset, as well as propose and evaluate possible mitigation measures. For each task, we define the simulation environment, observation and action spaces, and evaluation metrics. The benchmark is reproducible in a metric-based sense: released cases and simulator settings define a deterministic evaluator, while stochastic agent behavior is assessed over repeated runs using success rates and other metrics. The benchmark supports the development of future Agentic AI for power system operation and planning.

2606.20333 2026-06-19 cs.AI 新提交 80%

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill: 用于上下文适应的行为压缩

Xijia Tao, Yihua Teng, Xinyu Fu, Ziru Liu, Kecheng Chen, Yuzhi Zhao, Suiyun Zhang, Rui Liu, Lingpeng Kong

发表机构 * The University of Hong Kong(香港大学) Huawei Research(华为研究院) City University of Hong Kong(香港城市大学) Huazhong University of Science and Technology(华中科技大学)

专题命中 工具调用 :软技能前缀压缩自然语言技能用于智能体

AI总结 提出SoftSkill方法,通过可训练的软技能前缀压缩自然语言技能为紧凑连续向量,在冻结基模型上提升问答和数学任务性能,减少标记数量。

详情
AI中文摘要

智能体技能通常以自然语言Markdown文件形式部署,编码回答策略、证据使用习惯和任务流程。这些文件可读且可移植,但间接消耗:对于每个任务实例,冻结的语言模型必须将长文本制品转换为生成时行为。本文探讨自然语言技能是否可以初始化一个紧凑的连续上下文对象,通过可训练的软增量进行优化,同时基模型保持冻结。我们提出SoftSkill,一种冻结骨干方法,通过下一词预测调整此类软技能,并在推理时将其部署为潜在行为先验。在我们的主要单轮设置中,在Qwen3.5-4B上使用长度为32的SoftSkill前缀,相比无技能提示在SearchQA上提升8.3分,LiveMath上提升42.1分,DocVQA上提升1.3分。相对于SkillOpt,SoftSkill在SearchQA上准确率提升5.2分,LiveMath上提升12.5分,同时用少量虚拟标记替换数百到数千个Markdown技能标记。我们进一步研究了作为更难边界情况的智能体执行,其中稀疏轨迹模仿提供了有用信号,但尚未稳健地压缩长程过程行为。更广泛地说,结果表明某些任务技能更适合被视为紧凑的潜在控制,而不是在推理时重新解释的额外Markdown,用于控制冻结模型如何进入任务。

英文摘要

Agent skills are commonly deployed as natural-language Markdown files that encode answer policies, evidence-use habits, and task procedures. These files are readable and portable, but they are consumed indirectly: for each task instance, a frozen language model must translate a long textual artifact into generation-time behavior. This paper asks whether a natural-language skill can instead initialize a compact continuous context object, refined by a trainable soft delta while the base model remains frozen. We propose SoftSkill, a frozen-backbone method that tunes such soft skills with next-token prediction and deploys them as latent behavioral priors at inference time. In our main single-round setting, a length-32 SoftSkill prefix on Qwen3.5-4B improves over no-skill prompting by 8.3 points on SearchQA, 42.1 points on LiveMath, and 1.3 points on DocVQA. Relative to SkillOpt, SoftSkill improves accuracy by 5.2 points on SearchQA and 12.5 points on LiveMath, while replacing hundreds to thousands of Markdown skill tokens with a few virtual tokens. We further study agentic execution as a harder boundary case, where sparse trajectory imitation provides useful signal but does not yet robustly compress long-horizon procedural behavior. More broadly, the results suggest that some task skills are better treated not as additional Markdown to be reinterpreted at inference time, but as compact latent controls over how a frozen model enters the task.

2606.20023 2026-06-19 cs.SE cs.AI cs.CL 新提交 80%

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

当较低权限足够时:探究LLM代理中的过度权限工具选择

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) Beijing Academy of Artificial Intelligence(北京人工智能研究院) The Chinese University of Hong Kong(香港中文大学) Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

专题命中 工具调用 :聚焦LLM代理工具选择中的权限问题。

AI总结 针对LLM代理在工具选择中偏好高权限工具的安全问题,提出ToolPrivBench评估框架,发现主流代理普遍存在过度权限选择且被瞬态故障放大,并设计权限感知后训练防御方法有效减少不必要的高权限工具使用。

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

详情
AI中文摘要

随着LLM代理越来越多地自主选择工具,它们在具有不同权限的工具之间的选择变得与安全相关。然而,先前的工具选择研究侧重于安全无关的元数据偏好,使得权限敏感的选择未被充分探索。为填补这一空白,我们研究了过度权限工具选择,即代理在存在足够低权限替代方案时仍选择或升级到更高权限工具。我们引入ToolPrivBench来评估代理是否在存在足够低权限替代方案时仍选择更高权限工具,同时衡量初始选择和瞬态工具故障后的升级。在八个领域和五种重复风险模式中,我们发现过度权限工具选择在主流LLM代理中很常见,并且被瞬态故障进一步放大。我们进一步发现,通用安全对齐不能可靠地迁移到最小权限工具选择,而提示级控制在瞬态故障下仅提供有限的缓解。因此,我们引入了一种权限感知的后训练防御,教导代理偏好足够低权限的工具,仅在必要时升级。我们的缓解实验表明,这种防御在保持通用能力的同时,显著减少了不必要的高权限工具使用。

英文摘要

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.

2606.19245 2026-06-19 cs.AI cs.LG 新提交 80%

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP:分析AI代理在小分子临床前药理学中的表现

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

发表机构 * LatchBio

专题命中 工具调用 :评估AI代理从实验数据恢复药理学结论

AI总结 提出TxBench-PP基准,用于评估AI代理从真实实验数据中恢复临床前药理学结论的能力,测试显示最强配置Claude Opus 4.8 / Pi仅通过59.3%的端点尝试。

详情
AI中文摘要

人工智能(AI)代理有望通过压缩解释和决策循环来加速药物发现,但实际部署需要基于现实程序决策的可信评估。我们引入了TherapeuticsBench临床前药理学(TxBench-PP),这是一个针对小分子临床前药理学的可验证基准,也是更广泛的TherapeuticsBench在药物发现阶段和治疗模式中的首个聚焦切片。TxBench-PP测试代理是否能够从真实实验数据中恢复准确的结论,而非从文献中记忆的事实。该基准包含100个评估,按程序阶段、实验类型和任务结构索引,涵盖作用机制(MoA)和药效学(PD)推理、化合物-靶点结合、因果靶点验证、可开发性与安全性以及转化疗效。代理接收现实的工作流程快照,在编码环境中检查文件,并返回确定性评分的结构化答案。在16个模型-工具配置(包括11个模型和4,800条轨迹)中,没有系统能够可靠地恢复临床前药理学决策。最强配置Claude Opus 4.8 / Pi通过了59.3%的端点尝试(178/300;95% CI, 51.1-67.6),其次是GPT-5.5 / Pi,为55.3%(166/300;47.0-63.6)。

英文摘要

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3\% of endpoint attempts (178/300; 95\% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3\% (166/300; 47.0-63.6).

2606.17041 2026-06-19 cs.CL cs.IR 新提交 80%

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

对Nature Portfolio元分析文章进行LLM代理基准测试

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

发表机构 * Tsinghua University(清华大学)

专题命中 工具调用 :评估LLM代理在元分析检索筛选中的表现

AI总结 提出MetaSyn数据集,包含442篇专家策划的元分析,用于评估LLM代理在检索-筛选-综合全流程中的表现,发现当前系统在筛选阶段存在严重瓶颈。

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

详情
AI中文摘要

元分析是一种要求高的证据综合形式,结合了文献检索、PI/ECO指导的研究选择和统计聚合。其结构化、可验证的工作流程使其成为评估系统科学推理的理想基础,然而现有基准缺乏完整的检索-筛选-综合流程的真相。我们引入了MetaSyn,一个包含来自Nature Portfolio期刊的442篇专家策划的元分析的数据集。每个条目将研究问题与PI/ECO标准、包含140k篇PubMed文章的检索语料库、经过验证的阳性研究、主题相似但不符合PI/ECO的硬负样本以及完整的搜索策略和日期范围配对。对十二种流水线配置(九种RAG变体和一种协议驱动的代理)进行基准测试揭示了关键的筛选瓶颈:尽管在K=200时检索上限达到90.9%的召回率,但没有任何系统能恢复超过52.7%的真相包含文献。当前的LLM无法可靠地将合格研究与主题相关性相当的PI/ECO不合格干扰项区分开来。阶段归因指标捕捉了系统成功和失败的地方;单一的端到端分数则不能。

英文摘要

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.

2606.20047 2026-06-19 cs.IR 新提交 75%

PACMS: Submodular Context Selection as a Pluggable Engine for LLM Agents

PACMS: 作为LLM代理可插拔引擎的子模上下文选择

Manu Ghulyani, Arunabh Singh, Karan Bharadwaj, Ankit Nath, Suranjan Goswami

专题命中 工具调用 :方法用于LLM代理的上下文管理。

AI总结 提出PACMS,一种基于子模函数最大化的上下文选择方法,在提示组装时按相关性从会话、记忆和工具输出中挑选内容,替代截断机制,提升长对话中的信息保持能力。

详情
AI中文摘要

对话和工具使用的LLM代理在上下文窗口中操作,该窗口同时从多个方向填充。随着会话进行,代理积累用户和助手轮次、从持久记忆存储中提取的条目,以及通常最大的工具调用输出(如文件读取、搜索结果和API响应)。一旦累积上下文超过模型的令牌预算,框架必须决定保留什么。当前机制是最近截断,有时辅以定期摘要。这是主题盲目的:会话早期建立的事实仅仅因为陈旧而被丢弃,即使当前用户查询正是关于该事实;相反,冗长但无关的近期材料被保留。必须在多轮中回忆信息的代理(记忆的定义案例)正是最近截断失败的地方。现有替代方案位于代理组装步骤之外。检索增强生成将外部文档提取到提示中,但不仲裁代理的“已存在”池化上下文。上下文压缩方法通过重写或修剪文本来减少令牌计数,但以查询盲目和有损的方式操作。两者都不将记忆条目、对话轮次和工具输出视为一个单一的候选池,在提示组装时按相关性进行选择。

英文摘要

Conversational and tool-using LLM agents operate over a context window that fills from several directions simultaneously. As a session proceeds, the agent accumulates user and assistant turns, entries drawn from a persistent memory store, and often largest of all, the verbatim outputs of tool calls such as file reads, search results, and API responses. Once the cumulative context exceeds the model's token budget, the framework must decide what to keep. The prevailing mechanism is recency truncation, sometimes paired with periodic summarization. This is topic-blind: a fact established early in a session is discarded simply because it is old, even when the current user query is about exactly that fact; conversely, verbose but irrelevant recent material is retained. Agents that must recall information across many turns, the defining case for memory, are precisely where recency truncation fails. Existing alternatives sit outside the agent's assembly step. Retrieval augmented generation fetches external documents into the prompt but does not arbitrate the agent's \emph{already-present} pooled context. Context-compression methods reduce token count by rewriting or pruning text, but operate query-blind and lossily. Neither treats memory entries, conversation turns, and tool outputs as a single candidate pool to be selected from by relevance at the moment the prompt is assembled.