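arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Entries for today / current date: 2. Signal sources: cs.AI, cs.CL, cs.LG, cs.SE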
2508.04266 2026-06-19 cs.CL Version update 85%

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng

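Affiliations * Alibaba International Digital Commercial Group

Topic match Software agents: proposes a shopping benchmark for testing LLM agents, which falls under software agents

AI summary Proposes the ShoppingBench benchmark, which comprises multi-level tasks grounded in real-world shopping intents. It evaluates LLM agents in a simulated environment with 2.5 million products, finds that GPT-4.1's success rate is below 50%, and proposes a trajectory distillation strategy to improve the performance of smaller models.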

Comments Accepted for oral presentation at AAAI 2026

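Details
AI-generated Chinese abstract (translated)

Existing e-commerce benchmarks focus mainly on basic user intents, such as finding or purchasing products. Real-world users, however, often pursue more complex goals, such as applying vouchers, managing budgets, and finding sellers that offer multiple products. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to cover increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework that simulates user instructions based on various intents derived from sampled real-world products. To support consistent and reliable evaluation, we provide a large-scale shopping sandbox as an interactive simulated environment containing over 2.5 million real products. Experimental results show that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates below 50% on our benchmark tasks, highlighting the significant challenge posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and use supervised fine-tuning together with reinforcement learning on synthetic trajectories to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves performance competitive with GPT-4.1.

English abstract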

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves performance competitive with GPT-4.1.
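As a rough illustration of what the data-preparation side of trajectory distillation could look like, the sketch below keeps only successful teacher trajectories and turns each step into a (prompt, completion) pair for supervised fine-tuning. The abstract does not describe the actual procedure, so everything here is an assumption: the `Step` and `Trajectory` types, the `to_sft_examples` helper, the success-only filter, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str  # what the agent saw at this step (hypothetical field)
    action: str       # what the teacher agent did in response (hypothetical field)


@dataclass
class Trajectory:
    instruction: str   # simulated user instruction
    steps: List[Step]  # ordered teacher interaction steps
    success: bool      # whether the teacher completed the task


def to_sft_examples(trajectories: List[Trajectory]) -> List[Tuple[str, str]]:
    """Convert successful teacher trajectories into (prompt, completion) pairs.

    Assumption: failed trajectories are simply discarded. Each step yields one
    pair whose prompt contains the instruction plus all earlier steps.
    """
    examples: List[Tuple[str, str]] = []
    for traj in trajectories:
        if not traj.success:
            continue  # keep only trajectories where the teacher succeeded
        history = f"Instruction: {traj.instruction}"
        for step in traj.steps:
            prompt = f"{history}\nObservation: {step.observation}\nAction:"
            examples.append((prompt, " " + step.action))
            # Append the teacher action so later prompts see the full history.
            history = f"{prompt} {step.action}"
    return examples
```

The resulting pairs could feed any standard supervised fine-tuning loop. The reinforcement learning stage mentioned in the abstract is not covered by this sketch.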

2605.25160 2026-06-19 cs.AI Version update 80%

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

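Affiliations * Institute for AI Industry Research (AIR), Tsinghua University; University of Electronic Science and Technology of China; MiLM Plus, Xiaomi Inc.

Topic match Software agents: environment synthesis for GUI agent benchmarking

AI summary To address the gap between existing mobile GUI agent benchmarks and real-world applications, proposes SimuWoB, a fully synthesized benchmark. A robust virtual environment generation framework synthesizes high-fidelity tasks and environments and automatically provides valid rewards, enabling efficient and reproducible evaluation of complex long-horizon interactions.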

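Details
AI-generated Chinese abstract (translated)

Mobile GUI agents powered by large language models are advancing rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks, because rewards are hard to construct in real-world apps, which leaves a gap between benchmark settings and real-world usage. In addition, most benchmarks focus on basic grounding and navigation, with limited coverage of complex long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthesized mobile GUI agent benchmark comprising 120 challenging tasks that span different types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides a valid reward for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals significant weaknesses of current agents in complex scenarios. A comparison with evaluation results on real-world sample tasks shows that agent assessments based on our synthetic environments generalize well. We further provide diagnostic insights into key capability dimensions and discuss implications for the future development of mobile GUI agents.

English abstract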

GUI agents powered by large language models are advancing rapidly, creating urgent needs for evaluation and training based on realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response speeds, which limits efficiency. We present ScaleWoB, a framework that can produce high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms, including mobile, desktop, and automotive/in-vehicle interfaces, based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.
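To make the reported metrics concrete, the sketch below shows one plausible way to compute an average success rate across agents, both over all tasks and over a subset such as the long-horizon tasks. The abstract does not say how the averages are aggregated, so the per-agent averaging, the function names, and the data layout are assumptions for illustration only.

```python
from statistics import mean
from typing import Dict, Iterable, Optional


def success_rate(results: Dict[str, bool],
                 task_ids: Optional[Iterable[str]] = None) -> float:
    """Percentage of tasks solved by one agent, optionally restricted to a subset."""
    ids = list(results) if task_ids is None else [t for t in task_ids if t in results]
    if not ids:
        raise ValueError("no tasks to score")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(results[t] for t in ids) / len(ids)


def average_over_agents(per_agent: Dict[str, Dict[str, bool]],
                        task_ids: Optional[Iterable[str]] = None) -> float:
    """Mean of per-agent success rates (assumed aggregation, not confirmed by the source)."""
    # Materialize the subset once so it can be reused for every agent.
    subset = None if task_ids is None else list(task_ids)
    return mean(success_rate(r, subset) for r in per_agent.values())
```

With this layout, the headline figure would correspond to `average_over_agents(per_agent)`, and the long-horizon figure to the same call with the set of long-horizon task IDs passed as `task_ids`.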