Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All
Chenxi Huang, Alex Mathai, Feiyang Yu, Aleksandr Nogikh, Petros Maniatis, Franjo Ivančić, Eugene Wu, Kostis Kaffes, Junfeng Yang, Baishakhi Ray
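Topic match: Software agents (LLM agents for kernel crash repair, evaluation framework)
AI summary: Proposes Live-kBench and kEnv, a framework for continuously evaluating LLM agents on newly discovered Linux kernel crashes. Experiments show agents achieve an up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff, yet only about 20% of generated patches closely match developer fixes.
Details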
Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash resolution, their evaluation benchmarks are usually static: they do not capture the evolving nature of the Linux kernel, and they suffer from potential data contamination due to LLM knowledge cutoffs. To address these problems, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes freshly discovered kernel bugs and evaluates agents on them, and (ii) kEnv, an agent-agnostic, standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavyweight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. We curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving an up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt (plausible patches); however, only ~20% of generated patches closely match developer fixes. Additionally, exposing crash-resolution feedback improves the crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time- and attribute-sensitive, complete with a public dashboard to track agent progress on Linux kernel bugs.
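The time-sensitive comparison described in the abstract can be illustrated with a minimal sketch. It assumes each evaluated bug carries the date its developer fix landed and a boolean verdict on whether the agent's patch was judged equivalent. The names `BugResult`, `equivalent_rate`, `split_by_cutoff`, and `cutoff_gap` are hypothetical and not taken from Live-kBench, the toy records are made up, and the gap is computed here as an absolute difference in rates, which may not be how the paper defines its 25% figure.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BugResult:
    """One evaluated bug (hypothetical schema, not the Live-kBench format)."""
    bug_id: str        # illustrative identifier
    fix_date: date     # date the developer fix landed
    equivalent: bool   # agent patch judged equivalent to the developer fix


def equivalent_rate(results: List[BugResult]) -> Optional[float]:
    """Fraction of results with an equivalent patch, or None if empty."""
    if not results:
        return None
    return sum(r.equivalent for r in results) / len(results)


def split_by_cutoff(
    results: List[BugResult], cutoff: date
) -> Tuple[List[BugResult], List[BugResult]]:
    """Partition results into bugs fixed before the cutoff and on/after it."""
    before = [r for r in results if r.fix_date < cutoff]
    after = [r for r in results if r.fix_date >= cutoff]
    return before, after


def cutoff_gap(results: List[BugResult], cutoff: date) -> Optional[float]:
    """Absolute difference in equivalent patch rate (pre minus post cutoff)."""
    before, after = split_by_cutoff(results, cutoff)
    rate_before = equivalent_rate(before)
    rate_after = equivalent_rate(after)
    if rate_before is None or rate_after is None:
        return None  # one side is empty, so no comparison is possible
    return rate_before - rate_after


if __name__ == "__main__":
    # Made-up toy records, for illustration only.
    toy = [
        BugResult("bug-a", date(2024, 1, 10), True),
        BugResult("bug-b", date(2024, 3, 2), True),
        BugResult("bug-c", date(2024, 9, 15), True),
        BugResult("bug-d", date(2024, 11, 1), False),
    ]
    print(cutoff_gap(toy, cutoff=date(2024, 6, 1)))
```

A positive gap on real data would be consistent with contamination, since bugs fixed before the cutoff may have appeared in the model's training data. A continuously refreshed benchmark keeps the post-cutoff partition populated as new bugs are reported.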