Code LLMs / AI Programming - arXivDaily Topic

2602.02690 2026-06-18 cs.SE Version update 90%

Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All

Chenxi Huang, Alex Mathai, Feiyang Yu, Aleksandr Nogikh, Petros Maniatis, Franjo Ivančić, Eugene Wu, Kostis Kaffes, Junfeng Yang, Baishakhi Ray

Topic match: software agents (LLM agents for kernel crash repair; evaluation framework)

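AI summary: Proposes Live-kBench and kEnv for continuously evaluating LLM agents on newly discovered Linux kernel crashes. Agents achieve an up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff, but only about 20% of generated patches closely match developer fixes.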

Abstract

Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash resolution, their evaluation benchmarks are usually static and thus do not capture the evolving nature of the Linux kernel, and they suffer from potential data contamination due to LLM knowledge cutoffs. To address these problems, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes and evaluates agents on freshly discovered kernel bugs, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavy-weight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. To this end, we curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt (plausible patches); however, only ~20% of generated patches closely match developer fixes. Additionally, exposing crash resolution feedback improves crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time- and attribute-sensitive, complete with a public dashboard to track agent progress on Linux kernel bugs.
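
The before/after-cutoff comparison described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record fields (`fixed_on`, `equivalent`), the helper names, the cutoff date, and the toy records are hypothetical and are not taken from Live-kBench or its dataset.

```python
from datetime import date

def split_by_cutoff(results, cutoff):
    """Partition per-bug results by whether the developer fix landed before the cutoff."""
    before = [r for r in results if r["fixed_on"] < cutoff]
    after = [r for r in results if r["fixed_on"] >= cutoff]
    return before, after

def equivalent_patch_rate(results):
    """Fraction of bugs whose generated patch was judged equivalent to the developer fix."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["equivalent"]) / len(results)

# Hypothetical records; real evaluation data would come from the benchmark itself.
toy_results = [
    {"fixed_on": date(2024, 1, 10), "equivalent": True},
    {"fixed_on": date(2024, 3, 2), "equivalent": True},
    {"fixed_on": date(2024, 5, 20), "equivalent": False},
    {"fixed_on": date(2025, 2, 1), "equivalent": True},
    {"fixed_on": date(2025, 4, 15), "equivalent": False},
    {"fixed_on": date(2025, 6, 30), "equivalent": False},
]
assumed_cutoff = date(2024, 10, 1)  # hypothetical model knowledge cutoff

before, after = split_by_cutoff(toy_results, assumed_cutoff)
gap = equivalent_patch_rate(before) - equivalent_patch_rate(after)
```

A positive `gap` on real data would be consistent with contamination from pre-cutoff fixes, which is the effect a continuously refreshed benchmark is meant to expose.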

2411.19099 2026-06-18 cs.SE Version update 85%

Enhancing Software Maintenance: A Learning to Rank Approach for Co-changed Method Identification

Yiping Jia, Safwat Hassan, Ying Zou

Topic match: software agents (learning-to-rank identification of co-changed methods; software maintenance support)

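AI summary: Proposes a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. In experiments, the Random Forest model outperforms other models by 2.5 to 12.8 percent in NDCG@5 and surpasses baseline methods by 4.7 to 537.5 percent.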

DOI: 10.1145/3820772

Abstract

With the increasing complexity of large-scale software systems, identifying all necessary modifications for a specific change is challenging. Co-changed methods, which are methods frequently modified together, are crucial for understanding software dependencies. However, existing methods often produce large results with high false positives. Focusing on pull requests instead of individual commits provides a more comprehensive view of related changes, capturing essential co-change relationships. To address these challenges, we propose a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. Experiments on 150 open-source Java projects, totaling 41.5 million lines of code and 634,216 pull requests, show that the Random Forest model outperforms other models by 2.5 to 12.8 percent in NDCG@5. It also surpasses baselines such as file proximity, code clones, FCP2Vec, and StarCoder 2 by 4.7 to 537.5 percent. Models trained on longer historical data (90 to 180 days) perform consistently, while accuracy declines after 60 days, highlighting the need for bi-monthly retraining. This approach provides an effective tool for managing co-changed methods, enabling development teams to handle dependencies and maintain software quality.
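
For readers unfamiliar with the NDCG@5 metric reported above, here is a minimal Python sketch of one common formulation (linear gain with a log2 rank discount). The abstract does not say which NDCG variant the authors use, so this is an assumption, and the label list is a hypothetical ranking rather than data from the paper.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranked candidates: 1 means the method really was co-changed.
ranked_labels = [1, 0, 1, 0, 0, 1]
score = ndcg_at_k(ranked_labels, k=5)
```

A score of 1.0 means every truly co-changed method is ranked ahead of every other candidate within the top five; lower scores mean relevant methods were pushed down the list or out of the top five.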
