AI Agent - arXivDaily Topic

2508.04266 2026-06-19 cs.CL Version update 85%

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents


Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng

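Affiliation: Alibaba International Digital Commercial Group

Topic match: Software agents. Proposes a shopping benchmark for testing LLM agents, which falls under software agents.

AI summary: Proposes the ShoppingBench benchmark, which comprises multi-level tasks grounded in real shopping intents and evaluates LLM agents in a simulated environment with 2.5 million products. It finds that GPT-4.1's success rate is below 50% and proposes a trajectory distillation strategy to improve small-model performance.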

Comments: Accepted for oral presentation at AAAI 2026


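AI-generated abstract (translated from Chinese)

Existing e-commerce benchmarks focus mainly on basic user intents, such as finding or purchasing products. Real-world users, however, often pursue more complex goals, such as applying vouchers, managing budgets, and finding sellers that offer multiple products. To close this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to cover increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework that simulates user instructions based on a variety of intents derived from sampled real-world products. To support consistent and reliable evaluation, we provide a large-scale shopping sandbox as an interactive simulated environment containing more than 2.5 million real products. Experimental results show that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates below 50% on our benchmark tasks, which highlights the significant challenge that ShoppingBench poses. In addition, we propose a trajectory distillation strategy and use supervised fine-tuning together with reinforcement learning on synthetic trajectories to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves performance competitive with GPT-4.1.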

English abstract

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves performance competitive with GPT-4.1.


2605.25160 2026-06-19 cs.AI Version update 80%

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

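Affiliations: Institute for AI Industry Research (AIR), Tsinghua University; University of Electronic Science and Technology of China; MiLM Plus, Xiaomi Inc.

Topic match: Software agents. Environment synthesis for GUI agent benchmarking.

AI summary: To address the gap between existing mobile GUI agent benchmarks and real-world applications, proposes SimuWoB, a fully synthesized benchmark. A robust virtual-environment generation framework synthesizes high-fidelity tasks and environments and automatically provides valid rewards, enabling efficient and reproducible evaluation of complex long-horizon interactions.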


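AI-generated abstract (translated from Chinese)

Mobile GUI agents driven by large language models are advancing rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are usually limited to open-source apps or file-operation tasks, because rewards are difficult to construct in real applications, which leaves a gap between benchmark settings and real-world usage. In addition, most benchmarks focus on basic grounding and navigation and offer limited coverage of complex long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthesized benchmark for mobile GUI agents that contains 120 challenging tasks spanning different types and difficulty levels. We build a robust virtual-environment generation framework that synthesizes high-fidelity tasks and environments and automatically provides a valid reward for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals significant weaknesses of current agents in complex scenarios. A comparison with evaluation results on real-world sample tasks shows that agent evaluation based on our synthetic environments generalizes well. We further provide diagnostic insights along key capability dimensions and discuss implications for the development of future mobile GUI agents.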

English abstract

GUI agents powered by large language models are advancing rapidly, creating an urgent need for evaluation and training based on realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response times, which limits efficiency. We present ScaleWoB, a framework that can produce high-fidelity synthesized interactive environments with verifiable rewards for GUI agents across platforms. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms, including mobile, desktop, and automotive/in-vehicle interfaces, based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.
