arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Papers indexed for today/current date: 2. Signal sources: cs.AI, cs.CL, cs.LG, cs.SE
2606.01139 2026-06-18 cs.AI Version update 90%

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Yuxuan Liu, Zhaochen Su, Lingyun Xie, Yuhao Zhang, Qing Zong, Jiahe Guo, Zhongwei Xie, Yiyan Ji, Yauwai Yim, Hongyu Luo, Xiyu Ren, Ruan Chenyu, Haoran Li, Yangqiu Song

Affiliations: The Hong Kong University of Science and Technology; Harbin Institute of Technology; Harbin Institute of Technology, Shenzhen; Nanjing University; The University of Hong Kong

Topic match (software agents): iterative refinement of agent skills to improve LLM agent success rates

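AI summary: Proposes the SkillRevise framework, which iteratively refines an initial skill through diagnosis from execution evidence, retrieval of repair principles, and execution-anchored edits. On SkillsBench it raises the base agent's success rate from 36.05% to 61.63%, and the revised skills show cross-model transfer.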

Comments 15 pages, 4 figures

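Details
AI abstract (translated from Chinese)

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories, but they perform poorly in cold-start settings, where only an initial, imperfect skill is available. Skill construction therefore defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not match how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework for iteratively refining these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidate skills and measuring their empirical utility, it systematically retains the best skill version. Evaluations on three benchmarks and five LLMs show that SkillRevise substantially outperforms one-shot baselines, raising the base agent's success rate on SkillsBench from 36.05% to 61.63%. The revised skills also show strong cross-model transfer, capturing general procedural knowledge beyond model-specific artifacts.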

English abstract

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates, it retains the first verifier-passing skill within the revision budget and falls back to empirical utility only when no candidate succeeds. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills transfer across both executors and task environments, suggesting that SkillRevise captures reusable procedural knowledge beyond any single executor.
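The selection rule described in the English abstract (keep the first verifier-passing skill within the revision budget, otherwise fall back to empirical utility) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Outcome`, `revise_skill`, `execute`, and `propose_edit` are hypothetical names, `propose_edit` stands in for the diagnose, retrieve, and edit steps combined, and it is an assumption that each revision builds on the most recent candidate and that the initial skill is executed once before any edit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    passed: bool    # whether the verifier accepted this run
    utility: float  # empirical utility of the run (hypothetical scale)
    trace: str      # execution evidence handed to the next revision


def revise_skill(
    initial_skill: str,
    execute: Callable[[str], Outcome],
    propose_edit: Callable[[str, str], str],
    budget: int,
) -> str:
    """Return the first verifier-passing skill found within `budget`
    revisions; if none passes, return the highest-utility skill tried."""
    skill = initial_skill
    outcome = execute(skill)
    if outcome.passed:
        return skill

    best_skill, best_utility = skill, outcome.utility
    for _ in range(budget):
        # Assumption: the edit is conditioned on the latest trace only.
        skill = propose_edit(skill, outcome.trace)
        outcome = execute(skill)
        if outcome.passed:
            return skill  # stop at the first passing candidate
        if outcome.utility > best_utility:
            best_skill, best_utility = skill, outcome.utility

    # No candidate passed: fall back to empirical utility.
    return best_skill
```

Stopping at the first passing candidate spends fewer executions than ranking every candidate, and the utility fallback only matters when the budget runs out without a pass.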

2604.06367 2026-06-18 cs.CR cs.AI cs.LG Version update 90%

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

Guruprasad Viswanathan Ramesh, Asmit Nayak, Basieem Siddique, Kassem Fawaz

Affiliation: University of Wisconsin-Madison

Topic match (software agents): evaluating web agents on security and privacy tasks

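AI summary: Proposes the WebSP-Eval framework, which uses 200 task instances and an automated evaluator to test multimodal large language models on website security and privacy tasks. It finds that stateful UI elements are a primary cause of failure, with toggles causing more than 45% task failure for many models.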

Comments Accepted at PETS 2026. Project Page: https://wiscprivacy.com/webspeval/

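Details
AI abstract (translated from Chinese)

Web agents automate browser tasks, from simple form filling to complex workflows such as ordering groceries. Current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), but no existing framework assesses an agent's ability to successfully carry out user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To fill this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises: 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system that supports account and initial-state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We evaluate a total of 8 web agent instantiations using state-of-the-art multimodal large language models, with fine-grained analysis across websites, task categories, and UI elements. Our evaluation shows that current models have limited autonomous exploration ability for reliably solving website security and privacy tasks, and that they struggle on specific task categories and websites. Crucially, we find that stateful UI elements are a primary cause of agent failure, with toggles causing more than 45% task failure for many models.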

English abstract

Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent's ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system supporting account and initial state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We evaluate a total of 8 web agent instantiations using state-of-the-art multimodal large language models, conducting a fine-grained analysis across websites, task categories, and UI elements. Our evaluation reveals that current models have limited autonomous exploration capabilities for reliably solving website security and privacy tasks, and struggle with specific task categories and websites. Crucially, we identify stateful UI elements as a primary reason for agent failure, with toggles causing more than 45% task failure across many models.
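To make concrete why stateful UI elements can be harder than stateless ones, here is a toy model of a toggle and a final-state check. It is an illustrative sketch only and does not reflect WebSP-Eval's actual evaluator or agent system: `Toggle`, `blind_click`, `set_state`, and `task_succeeded` are hypothetical names, and the failure mode shown (clicking without first reading the current state) is one plausible cause that the abstract does not itself confirm.

```python
from dataclasses import dataclass


@dataclass
class Toggle:
    name: str
    on: bool

    def click(self) -> None:
        # A click flips the state; its effect depends on the prior state.
        self.on = not self.on


def blind_click(toggle: Toggle) -> None:
    """Act without reading the current state (hypothetical naive agent)."""
    toggle.click()


def set_state(toggle: Toggle, desired: bool) -> None:
    """Read the state first and click only if it differs from the goal."""
    if toggle.on != desired:
        toggle.click()


def task_succeeded(toggle: Toggle, desired: bool) -> bool:
    """Hypothetical evaluator: compare the final state with the goal state."""
    return toggle.on == desired


if __name__ == "__main__":
    # Task: make sure ad personalization is off. It is already off.
    naive = Toggle("ad_personalization", on=False)
    blind_click(naive)
    print(task_succeeded(naive, desired=False))    # False: the click enabled it

    careful = Toggle("ad_personalization", on=False)
    set_state(careful, desired=False)
    print(task_succeeded(careful, desired=False))  # True: no click was needed
```

Because the same click can either complete or undo the task, an agent has to perceive the toggle's current state before acting, and an evaluator has to judge the final state rather than the actions taken.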