大模型对齐与安全 - arXivDaily 专题

2605.17986 2026-06-18 cs.CR cs.AI 版本更新 95%

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

LivePI：更真实的智能体对抗间接提示注入基准测试

Lei Zhao, Abhay Bhaskar, Edgar Dobriban

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

专题命中提示注入：基准测试AI智能体对抗间接提示注入，核心是安全。

AI总结提出LivePI基准，覆盖7种输入表面、12种攻击/渲染家族和5种恶意目标，在真实虚拟机环境中评估多个AI智能体，发现攻击成功率10.7%-29.6%，并验证了两层防御的有效性。

详情

AI中文摘要

诸如OpenClaw之类的AI智能体越来越多地部署在本地工作流中，并能够访问外部工具。这带来了间接提示注入（IPI）风险：智能体可能会执行嵌入在不可信输入（如电子邮件、下载文件、网页、仓库或群聊消息）中的有害指令。现有的评估通常规模较小、完全模拟或仅关注狭窄的通道。我们引入了LivePI（实时提示注入），这是一个在生产类似但测试可控环境中的IPI风险结构化基准。LivePI覆盖了七个输入表面、十二个攻击/渲染家族和五个恶意目标，包括受保护信息窃取、未经授权的安全控制更改、不安全的代码检索或执行、收件箱摘要窃取以及加密货币转账。我们在一个真实的虚拟机上运行LivePI，该虚拟机具有实时但测试可控的电子邮件、聊天、网页、本地文件、仓库和钱包接口。在GPT-5.3-Codex、Claude Opus 4.6、Gemini 3.1 Pro、Kimi K2.5和GLM-5上，总攻击成功率范围为10.7%至29.6%。群聊注入在我们部署中评估的所有骨干模型上均成功，而仓库链接攻击尽管分母较小，仍产生了高严重性失败。我们还评估了一种由提示级过滤和执行前工具调用授权组成的两层防御。在GPT-5.3-Codex设置中，该防御在LivePI中拦截了所有测试的恶意目标完成，同时保留了PinchBench衍生工作负载上的良性效用。

英文摘要

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email, downloaded files, webpages, repositories, or group-chat messages. Existing evaluations are often small, purely simulated, or focused on a narrow set of channels. We introduce LivePI (Live Prompt Injection), a structured benchmark for IPI risk in a production-like but test-controlled environment. LivePI covers seven input surfaces, twelve attack/rendering families, and five malicious goals, including protected-information exfiltration, unauthorized security-control changes, unsafe code retrieval or execution, inbox-summary exfiltration, and cryptocurrency transfer. We run LivePI on a real virtual machine with live but test-controlled email, chat, web, local-file, repository, and wallet interfaces. Across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5, total attack success rates range from 10.7% to 29.6%. Group-chat injection is uniformly successful across the evaluated backbones in our deployment, and repository-link attacks produce high-severity failures despite a small denominator. We also evaluate a two-layer defense consisting of prompt-level filtering and pre-execution tool-call authorization. In the GPT-5.3-Codex setting, the defense intercepts all tested malicious-goal completions in LivePI before execution while preserving benign utility on PinchBench-derived workloads.

URL PDF HTML ☆

赞 0 踩 0