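arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Code LLMs / AI Programming

Code generation, software engineering agents, program repair, test generation, and developer tools.

2026-06-18 to 2026-06-18, 27 papers indexed. Sources: cs.SE, cs.CL, cs.AI, cs.LG, cs.PL

1. Software Agents (11 papers)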

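2606.18733 2026-06-18 cs.SE cs.AI New submission 90%

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun

Institutions: Baidu Inc

Topic match (Software Agents): Data synthesis for future-oriented software engineering agents.

AI summary: Proposes SWE-Future, which uses repository history to forecast future task families (feature implementation, bug fixes, refactoring) and synthesizes 200 coding-agent tasks conditioned on those forecasts, reducing reliance on replaying historical PRs. In an 80-repository study, the forecaster reaches 58.1% future-work relevance.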

Abstract

Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.
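
The retrospective validation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `PullRequest` record, the `family` labels, and the exact-label matching rule are all hypothetical simplifications (the paper reports a semantic matching metric). The sketch only shows the temporal discipline: evidence comes strictly from before the snapshot, and later pull requests are used only for scoring.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PullRequest:
    merged_on: date
    family: str  # hypothetical label: "feature", "bugfix", or "refactor"


def split_at_snapshot(prs, t0):
    """Return (evidence, held_out): PRs strictly before t0, and PRs at or after t0."""
    evidence = [pr for pr in prs if pr.merged_on < t0]
    held_out = [pr for pr in prs if pr.merged_on >= t0]
    return evidence, held_out


def future_work_relevance(forecast_families, held_out):
    """Fraction of forecast families that also appear in later repository work.

    Exact label matching is an assumption made for this sketch only.
    """
    if not forecast_families:
        return 0.0
    future = {pr.family for pr in held_out}
    hits = sum(1 for family in forecast_families if family in future)
    return hits / len(forecast_families)


prs = [
    PullRequest(date(2025, 1, 10), "bugfix"),
    PullRequest(date(2025, 2, 3), "feature"),
    PullRequest(date(2025, 4, 20), "feature"),
    PullRequest(date(2025, 5, 1), "refactor"),
]
evidence, held_out = split_at_snapshot(prs, date(2025, 3, 1))
# The forecast would be produced from `evidence` only; here it is hard-coded.
relevance = future_work_relevance(["feature", "bugfix"], held_out)
print(len(evidence), len(held_out), relevance)
```

In this toy data, only the "feature" forecast is matched by later work, so half of the forecast families count as relevant.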

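2606.15828 2026-06-18 cs.SE New submission 90%

Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents

Helio Victor F. dos Santos, Vitor Costa, Joao Eduardo Montandon, Luciana Lourdes Silva, Marco Tulio Valente

Topic match (Software Agents): Analysis of smells in coding-agent configuration files; software engineering.

AI summary: Presents the first systematic catalog of smells in coding-agent configuration files (AGENTS.md/CLAUDE.md). A grey literature review and repository mining identify six smells, whose prevalence is assessed across 100 open-source repositories; Lint Leakage is the most common (62%).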

Abstract

Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS.md or CLAUDE.md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.

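2602.02690 2026-06-18 cs.SE Version update 90%

Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All

Chenxi Huang, Alex Mathai, Feiyang Yu, Aleksandr Nogikh, Petros Maniatis, Franjo Ivančić, Eugene Wu, Kostis Kaffes, Junfeng Yang, Baishakhi Ray

Topic match (Software Agents): LLM agents for kernel crash resolution; evaluation framework.

AI summary: Proposes Live-kBench and kEnv for continuously evaluating LLM agents on freshly discovered Linux kernel crashes. Agents achieve up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff, yet only about 20% of generated patches closely match developer fixes.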

Abstract

Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash resolution, their evaluation benchmarks are usually static and thus do not capture the evolving nature of the Linux kernel, and they suffer from potential data contamination due to LLM knowledge cutoffs. To address these problems, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes and evaluates agents on freshly discovered kernel bugs, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavy-weight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. To this end, we curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt (plausible patches); however, only ~20% of generated patches closely match developer fixes. Additionally, exposing crash resolution feedback improves crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time- and attribute-sensitive, complete with a public dashboard to track agent progress on Linux kernel bugs.
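
The before/after-cutoff comparison mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(fix_date, is_equivalent_patch)` tuples, the cutoff date, and the toy data are hypothetical and are not drawn from Live-kBench or its dataset.

```python
from datetime import date


def rate(flags):
    """Share of True values in a list of booleans (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def equivalent_patch_rates(results, cutoff):
    """Split results at the knowledge cutoff and return (rate_before, rate_after).

    `results` is a list of (fix_date, is_equivalent_patch) pairs; this record
    shape is an assumption made for the sketch.
    """
    before = [ok for fixed_on, ok in results if fixed_on < cutoff]
    after = [ok for fixed_on, ok in results if fixed_on >= cutoff]
    return rate(before), rate(after)


# Toy data, invented for illustration only.
results = [
    (date(2024, 1, 5), True),
    (date(2024, 2, 9), True),
    (date(2024, 3, 1), False),
    (date(2024, 8, 12), True),
    (date(2024, 9, 30), False),
    (date(2024, 11, 2), False),
]
before_rate, after_rate = equivalent_patch_rates(results, date(2024, 6, 1))
print(f"before cutoff: {before_rate:.2f}, after cutoff: {after_rate:.2f}")
```

A gap between the two rates is the kind of signal the abstract attributes to possible contamination from bugs fixed before a model's knowledge cutoff.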

2606.19216 2026-06-18 cs.SE cs.HC New submission 85%

No Two Developers Think Alike: How Problem-Solving Styles and Experience Shape Needs in Conversational Interaction with Copilot

Jonan Richards, Bruno Alves de Oliveira, Iury Oliveira, Igor Wiese, Mairieli Wessel

Topic match (Software Agents): Studies developer interaction with Copilot; AI programming.

AI summary: Through a mixed-methods think-aloud study, identifies 5 interaction modes and 10 underlying needs, builds a conceptual model, and shows how cognitive diversity shapes developers' interactions with GitHub Copilot.

Comments: Accepted at the International Conference on Software Maintenance and Evolution (ICSME), 2026

Abstract

Conversational LLM-based "programming assistants" provide a range of benefits to developers. However, recent studies demonstrate the variety in individual developers' needs regarding programming assistants, and challenges encountered by only specific groups of developers. In this study, we explore the role of cognitive diversity in shaping interactions with GitHub Copilot chat. Through a mixed-methods think-aloud study with 27 professional developers and students, we characterize 5 distinct "interaction modes" and 10 underlying needs in developers' interactions, forming a conceptual model. We characterize links between these modes, needs, and developers' problem-solving styles and experience profiles, showing how cognitive diversity may shape developers' interactions. We provide insights and recommendations for researchers and practitioners on how to design, research, and employ programming assistants to better account for diverse developer needs.

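2606.19167 2026-06-18 cs.SE New submission 85%

Teaching Software Engineering with LLM and MCP Integration: From Classroom to Industry Practice

Kehui Chen, Jacky Keung, Weining Li, Xiangbing Shao, Yishu Li, Xiaoxue Ma

Topic match (Software Agents): Integrating LLMs and MCP into software engineering teaching to improve programming and tool-use skills.

AI summary: Integrates LLMs and MCP into a collaborative teaching model for software engineering. Embedding LLM- and MCP-driven tools into daily teaching, code assistance, and engineering simulations bridges the gap between traditional instruction and industrial workflows, and improves students' programming, problem-solving, and intelligent-tool skills.

Comments: Accepted by the International Symposium on Educational Technology (ISET) 2026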

Abstract

The rapid integration of Large Language Models (LLMs) and the Model Context Protocol (MCP) into industrial software engineering has created a pressing need to update software engineering education to align with emerging technologies and evolving industry demands. This study investigates an innovative approach that integrates LLMs and MCP into a collaborative teaching model for software engineering education, aiming to build a practical learning framework closely connected to real-world engineering practices. By embedding LLM and MCP driven tools into daily teaching, code assistance, and engineering simulations, the model effectively bridges the gap between traditional instruction and industrial workflows. This integration enhances students' programming competence, practical problem-solving abilities, and proficiency in using intelligent engineering tools. Furthermore, through partnerships with industry internships, students can apply these technologies in real-world settings, further strengthening the connection between academic preparation and professional practice. Overall, this research offers a practical pathway for reforming and innovating software engineering education in the era of artificial intelligence.

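2411.19099 2026-06-18 cs.SE Version update 85%

Enhancing Software Maintenance: A Learning to Rank Approach for Co-changed Method Identification

Yiping Jia, Safwat Hassan, Ying Zou

Topic match (Software Agents): Learning-to-rank approach for identifying co-changed methods to aid software maintenance.

AI summary: Proposes a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. The Random Forest model outperforms other models by 2.5% to 12.8% in NDCG@5 and surpasses baselines by 4.7% to 537.5%.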

Abstract

With the increasing complexity of large-scale software systems, identifying all necessary modifications for a specific change is challenging. Co-changed methods, which are methods frequently modified together, are crucial for understanding software dependencies. However, existing methods often produce large results with high false positives. Focusing on pull requests instead of individual commits provides a more comprehensive view of related changes, capturing essential co-change relationships. To address these challenges, we propose a learning-to-rank approach that combines source code features and change history to predict and rank co-changed methods at the pull-request level. Experiments on 150 open-source Java projects, totaling 41.5 million lines of code and 634,216 pull requests, show that the Random Forest model outperforms other models by 2.5 to 12.8 percent in NDCG@5. It also surpasses baselines such as file proximity, code clones, FCP2Vec, and StarCoder 2 by 4.7 to 537.5 percent. Models trained on longer historical data (90 to 180 days) perform consistently, while accuracy declines after 60 days, highlighting the need for bi-monthly retraining. This approach provides an effective tool for managing co-changed methods, enabling development teams to handle dependencies and maintain software quality.
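
For readers unfamiliar with the NDCG@5 metric cited in this abstract, a minimal Python sketch follows. It uses the common linear-gain form of DCG with a log2 rank discount; whether the paper uses this exact variant is an assumption, and the relevance list is invented for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# 1 = the candidate method was actually co-changed, 0 = it was not.
# Positions follow the model's ranking, best guess first.
ranked = [1, 0, 1, 0, 0, 1]
score = ndcg_at_k(ranked, 5)
print(round(score, 3))
```

The score rewards placing truly co-changed methods near the top of the list: a ranking with all relevant methods first yields 1.0, and relevant items pushed below rank 5 lower the score.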

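2606.19191 2026-06-18 cs.CR New submission 80%

PhantomSkill: Malicious Code Injection in Agent Skill Ecosystems

Yu-Ting Lin, Chia-Mu Yu

Topic match (Software Agents): Malicious code injection attacks against LLM coding agents.

AI summary: Proposes the PhantomSkill attack framework, whose VulMask technique hides malicious behavior in a skill's auxiliary resources as vulnerability-shaped implementations, preserving benign utility while reducing warnings and malware-level detections.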

Abstract

Agent skills allow LLM-based coding agents to acquire domain-specific capabilities from third-party packages, but they also introduce a new supply-chain attack surface. We present PhantomSkill, an attack framework that hides malicious behavior in a skill's auxiliary resources rather than in its textual description. Its core technique, VulMask, rewrites overt malicious scripts into vulnerability-shaped implementations whose malicious behavior is activated only under attacker-controlled trigger conditions. This design shifts the visible signal from explicit malicious intent to ordinary-looking insecure code. Across representative host skills, attack goals, coding agents, generation models, and automated reviewers, VulMask preserves benign utility while reducing warning and malware-level detection compared with overt malicious scripts. Our results show that skill ecosystems require resource-level vetting, execution-time containment, and security policies that treat exploitable vulnerabilities in agent skills as potential malicious payloads.

2602.04341 2026-06-18 cs.SE 80%

Model-Driven Legacy System Modernization at Scale

Tobias Böhm, Jens Guan Su Tien, Mohini Nonnenmann, Tom Schoonbaert, Bart Carpels, Andreas Biesdorf

Topic match (Software Agents): Model-driven legacy system modernization.

AI summary: Presents a model-driven approach to legacy system modernization that inserts an enriched intermediate model between the legacy codebase and the modern target platform, enabling semi-automatic migration of core UI components and page structures and improving maintainability and developer experience.

Comments: Accepted for publication at the 1st Workshop on Code Translation, Transformation, and Modernization (ReCode'26), co-located with ICSE 2026

Journal ref: Proc. ReCode '26, ACM, New York, NY, USA (2026) 13-18

Abstract

This experience report presents a model-driven approach to legacy system modernization that inserts an enriched, technology-agnostic intermediate model between the legacy codebase and the modern target platform, and reports on its application and evaluation. The four-stage process of analysis, enrichment, synthesis, and transition systematically extracts, abstracts, and transforms system artifacts. We apply our approach to a large industrial application built on legacy versions of the .NET Framework and ASP.NET MVC and show that core user interface components and page structures can be migrated semi-automatically to a modern web stack while preserving functional behavior and essential non-functional qualities. By consolidating architectural knowledge into explicit model representations, the resulting codebase exhibits higher maintainability and extensibility, thereby improving developer experience. Although automation is effective for standard patterns, migration of bespoke layout composites remains challenging and requires targeted manual adaptation. Our contributions are: (i) an end-to-end model-driven process, (ii) an enriched intermediate model that captures structure, dependencies, and semantic metadata, (iii) transformation rules that preserve functional behavior and essential non-functional qualities, and (iv) application and evaluation of the approach in an industrial setting. Overall, model-based abstractions reduce risk and effort while supporting scalable, traceable modernization of legacy applications. Our approach generalizes to comparable modernization contexts and promotes reuse of migration patterns.

2606.19121 2026-06-18 cs.SE cs.CL cs.HC New submission 75%

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

Hui Zhang, Shuren Song

Institutions: Shenzhen Yunxi Technology Co., Ltd.; Information Technology Center, Tsinghua University

Topic match (Software Agents): Index Sickness in long-horizon LLM collaboration; involves code engineering.

AI summary: Through action research in a real software project, finds that adding formal constraints in long-horizon LLM collaboration can instead cause "Index Sickness," and proposes "Baseline-Log Physical Separation," after which no recurrence was observed across about 150 subsequent sessions.

Comments: 22 pages, 2 tables, 1 figure. Action research. Bilingual submission (Chinese companion version included as supplementary). Submitted to ICSE 2027 IOR track

Abstract

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

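2606.18855 2026-06-18 cs.SE New submission 70%

Toward Semantically-Seeded, Graph-Propagated Impact Analysis Across Software Artifacts: A Vision

Momil Seedat

Topic match (Software Agents): Impact analysis across software artifacts, fusing semantic and structural signals.

AI summary: Proposes a training-free, interpretable fusion of semantic similarity and structural dependencies, using a heterogeneous artifact graph with propagation to cover the blind spots of each signal and enable impact analysis across the Requirement-Config-Service-Test chain.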

Abstract

When a single software artifact changes - a requirement, a configuration value, or a function - engineers must determine what else is impacted. Existing change-impact-analysis (CIA) tooling tends to rely on one of two signals in isolation: semantic similarity recovered from text (information-retrieval traceability, code search, embeddings), or structural dependency following (call graphs, IDE "find usages", test-impact selection). Each has a characteristic blind spot. A semantically driven tool misses an impacted artifact whose text shares no vocabulary with the change; a structurally driven tool misses artifacts related in meaning but not joined by an edge, and most operate only over code rather than the Requirement-Config-Service-Test chain. We argue for a training-free and interpretable analyzer that fuses both signals over the same embeddings. We model the system as a heterogeneous artifact graph with typed edges recovered by static analysis, compute a semantic prior by cosine similarity to the changed artifact, propagate impact multi-hop with decay over a row-normalized propagation matrix, and blend the two with a single tunable weight lambda. A small but complete proof-of-concept on a payment subsystem (5 labelled change scenarios) shows the mechanism we care about: artifacts with zero textual overlap with the change are still recovered through propagation, and helper functions that propagation alone cannot reach are recovered through the semantic layer. The fusion is the only configuration that covers both blind spots, and lambda acts as an explicit precision/recall control. Drawing on four publicly documented production failures, we argue that the same formulation extends to operational artifacts (images, metrics, dashboards, data schemas) that code-only analysis cannot reach.
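
The fusion described in this abstract (cosine-similarity prior, decayed multi-hop propagation over a row-normalized matrix, and a single blending weight lambda) can be sketched as follows. This is a rough illustration, not the paper's formulation: the exact propagation formula, the direction of the `lam` weight, the hop count, the decay value, and the four toy artifacts are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def row_normalize(adj):
    """Scale each row of an adjacency matrix to sum to 1 (all-zero rows stay zero)."""
    return [[x / sum(row) if sum(row) else 0.0 for x in row] for row in adj]


def propagate(matrix, seed, hops, decay):
    """Accumulate decay**h times the h-hop reach of the seed, for h = 1..hops."""
    n = len(matrix)
    frontier = [1.0 if i == seed else 0.0 for i in range(n)]
    impact = [0.0] * n
    for hop in range(1, hops + 1):
        frontier = [
            sum(frontier[i] * matrix[i][j] for i in range(n)) for j in range(n)
        ]
        for j in range(n):
            impact[j] += (decay ** hop) * frontier[j]
    return impact


def fused_scores(embeddings, adj, seed, lam=0.5, hops=2, decay=0.5):
    """Blend the semantic prior and structural propagation with weight lam."""
    semantic = [cosine(embeddings[seed], e) for e in embeddings]
    structural = propagate(row_normalize(adj), seed, hops, decay)
    return [lam * s + (1 - lam) * p for s, p in zip(semantic, structural)]


# Hypothetical artifacts: 0 requirement (changed), 1 config, 2 service, 3 helper.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
adj = [
    [0, 1, 0, 0],  # requirement -> config
    [0, 0, 1, 0],  # config -> service
    [0, 0, 0, 0],
    [0, 0, 0, 0],  # helper has no edges at all
]
scores = fused_scores(embeddings, adj, seed=0)
print(scores)
```

In this toy graph the service (index 2) shares no vocabulary with the change but is still scored through propagation, while the helper (index 3) has no edges and is reached only through the semantic prior. Setting `lam` to 1.0 or 0.0 removes one of the two, which mirrors the blind spots the abstract describes.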

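2606.17510 2026-06-18 cs.SE cs.SY eess.SY New submission 70%

OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem

I-Ling Yen, Akeem Mohammed, Farokh Bastani, San-Yih Hwang

Topic match (Software Agents): LLMs for service composition and code generation.

AI summary: Proposes OmniDroneX, a unified Drone-as-a-Service ecosystem that connects low-level physical primitives to high-level mission intent through the libUAV interface and the PT-SOA abstraction model. LLMs assist with function identification, service composition, and natural-language mission specification, and multiple composition techniques support scalable, self-evolving UAV systems.

Comments: This manuscript is a full version of a paper accepted in shortened form by IEEE International Conference on Joint Cloud Computing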

Abstract

Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically composable entities that can be integrated with external infrastructures to offer omni-capabilities. OmniDroneX bridges low-level physical primitives with high-level mission intent through a unified vendor-agnostic interface (libUAV) and a formal physical-service abstraction model (PT-SOA). A core innovation is the diverse application of large language models (LLMs) across multiple layers of the OmniDroneX architecture. LLMs are used to assist in identifying and formalizing primitive device functions and abstract service definitions, supporting automated service composition and workflow generation, and enabling interactive, natural-language mission specification and refinement. OmniDroneX also incorporates important categories of composition techniques that are essential in dynamic UAV systems, including physical layer composition for drone capability augmentation, as well as spatiotemporal, functional, collaborative, exception-aware, and QoS-based service compositions. Collectively, these features allow OmniDroneX to serve as a foundation for scalable, resilient, and self-evolving UAV ecosystems operating in complex and dynamic environments.

2. Code Generation (8 papers)

2606.06133 2026-06-18 cs.SE cs.AI cs.LG cs.LO Version update 90%

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

Eric Spencer, Arslan Bisharat, Brian Ortiz, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

Institutions: Department of Computer Science, Loyola University Chicago

Topic match (Code Generation): TLA+ formal specification synthesis; preference optimization improves pass rate.

AI summary: Proposes TLA-Prover, which combines supervised fine-tuning with repair-based group-relative policy optimization and uses the TLC model checker as the reward signal for TLA+ specification synthesis. It reaches a 30% pass rate at the Gold and Diamond tiers, roughly 3.5x the 8.6% untuned baseline.

Comments: 12 pages, 5 tables, 3 figures. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026)

Abstract

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.
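
The four-tier grading in this abstract can be made concrete with a small Python sketch. It does not invoke TLC or any real tool: the `CheckResult` fields are hypothetical stand-ins for checker outcomes, and treating the tiers as strictly cumulative is an assumption. The last lines reproduce the arithmetic behind the reported 30% and 3.5x figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    parses: bool                     # specification parses
    warning_free: bool               # no warnings reported
    passes_tlc: bool                 # model check succeeds
    mutant_violation_detected: bool  # checker flags the slightly altered property


def grade(result):
    """Map checker outcomes to a tier, assuming each tier requires the previous one."""
    if not result.parses:
        return "Ungraded"
    if not result.warning_free:
        return "Bronze"
    if not result.passes_tlc:
        return "Silver"
    if not result.mutant_violation_detected:
        # The property still holds after being altered, so it was trivially
        # true and the output stops at Gold.
        return "Gold"
    return "Diamond"


trivial_property = CheckResult(True, True, True, False)
meaningful_property = CheckResult(True, True, True, True)
print(grade(trivial_property), grade(meaningful_property))

# Arithmetic behind the reported numbers: 9 of 30 problems, versus an 8.6% baseline.
pass_at_1 = 9 / 30
ratio_to_baseline = pass_at_1 / 0.086
print(f"pass@1 = {pass_at_1:.0%}, ratio = {ratio_to_baseline:.2f}")
```

The Diamond check acts like a mutation test on the specification: a property that cannot be violated by a small alteration contributes nothing, so passing TLC alone is not enough.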

2606.18286 2026-06-18 cs.LG New submission 85%

CODEBLOCK: Learning to Supervise Code at the Right Granularity

Zhijie Deng, Ling Li, Jinlong Pang, Kaiqin Hu, Qi Xuan, Zhaowei Zhu, Jiaheng Wei

Institutions: Hong Kong University of Science and Technology (Guangzhou); UC Santa Cruz; Ant Group; BAIA, ZJUT; D5Data.ai

Topic match (Code Generation): Proposes the CodeBlock framework; structure-aware sparse supervision improves code-generation fine-tuning.

AI summary: Proposes CodeBlock, which applies sparse supervision to structure-complete code blocks rather than isolated tokens. Using only 1.9% of supervised response tokens, it achieves stronger average pass@1 than full-token SFT across six code-generation benchmarks.

Abstract

Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.
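
The idea of keeping the full response as context while applying loss only to selected code items can be sketched in plain Python. This is not the CodeBlock implementation: the per-token loss values, the half-open `(start, end)` span format, and the selected spans are invented for illustration, and the selection and reranking steps from the abstract are not modeled.

```python
def build_loss_mask(num_tokens, selected_spans):
    """Return a 0/1 mask with 1 for every token inside a selected span.

    Spans are half-open (start, end) token ranges; this format is an
    assumption made for the sketch.
    """
    mask = [0] * num_tokens
    for start, end in selected_spans:
        for i in range(start, end):
            mask[i] = 1
    return mask


def masked_mean_loss(token_nll, mask):
    """Average the per-token negative log-likelihood over masked-in tokens only."""
    kept = [nll for nll, keep in zip(token_nll, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Per-token losses for one response (toy numbers). Every token is still fed
# to the model as context; only the masked-in tokens contribute to the loss.
token_nll = [0.2, 1.0, 3.0, 0.4, 2.0, 0.1]
mask = build_loss_mask(len(token_nll), [(1, 3), (4, 5)])
loss = masked_mean_loss(token_nll, mask)
supervised_fraction = sum(mask) / len(mask)
print(mask, loss, supervised_fraction)
```

Masking whole spans rather than individual tokens is what keeps each supervised unit syntactically complete, which is the property the abstract argues token-level selection can break.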

2511.00802 2026-06-18 cs.SE cs.CL cs.LG Version update 85%

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

Institutions: Michigan Technological University, Houghton; Birmingham City University; University of British Columbia, Kelowna

Topic match (Code Generation): LLM agents automatically modify code to optimize off-policy evaluation.

AI summary: Proposes the GrowthHacker benchmark, in which LLM agents iteratively modify code to optimize off-policy evaluation (OPE) implementations. Multiple frameworks are evaluated on Open Bandit Pipeline and Scope-RL, showing that LLM-based agents can act as automated growth hackers that continuously improve OPE systems.

Comments: Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026

Abstract

With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment, and may negatively impact users (e.g., unsafe or unethical outcomes) while requiring weeks of data collection. To address this, the growing research area of off-policy evaluation (OPE), or offline A/B testing, assesses new technologies offline using previously collected logged data. OPE is also a fundamental problem in reinforcement learning and is important where online testing is expensive or risky, such as healthcare, recommender systems, education, and robotics. Despite advances in code-generation large language models (LLMs) and agentic workflows, little is known about whether and how LLMs and LLM-based agents can automatically optimize OPE implementations. We propose GrowthHacker, a benchmark that evaluates baseline LLMs and LLM-based agents on large-scale public datasets. GrowthHacker autonomously and iteratively modifies code, runs OPE, and uses the metrics to guide subsequent optimization. We evaluate methods on Open Bandit Pipeline (OBP) and Scope-RL, and develop a two_agent framework that addresses limitations of existing frameworks while reducing complexity. Across both libraries, two_agent shows the highest reliability (98.1%-100% success rate) and positive-outcome rate (78%), with a median improvement of 4.4% among positive outcomes; CrewAI achieves the highest average improvement (37.9%) and is the only framework with zero extreme-value failures. AutoGen and Default each reach 65% positive-outcome rates. These results establish the feasibility of using LLM-based agents as automated "growth hackers" to continuously improve OPE systems, with implications for scaling data-driven decision-making where manual optimization is expensive.
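
As background on what an OPE implementation computes, here is a minimal Python sketch of inverse propensity weighting (IPW), a standard OPE estimator. The abstract does not say which estimators GrowthHacker optimizes, so choosing IPW is an assumption; the sketch is context-free for brevity, uses no Open Bandit Pipeline or Scope-RL APIs, and the logged data is invented.

```python
def ipw_estimate(logged, target_policy):
    """Estimate the target policy's average reward from logged data.

    `logged` holds (action, reward, behavior_prob) triples, where
    behavior_prob is the probability the logging policy gave to the action
    it took. `target_policy` maps each action to the probability the new
    policy would assign to it.
    """
    total = 0.0
    count = 0
    for action, reward, behavior_prob in logged:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        total += reward * target_policy[action] / behavior_prob
        count += 1
    return total / count if count else 0.0


# Toy log collected under a uniform policy over two actions.
logged = [
    ("a", 1.0, 0.5),
    ("b", 0.0, 0.5),
    ("a", 1.0, 0.5),
    ("b", 1.0, 0.5),
]
target_policy = {"a": 0.8, "b": 0.2}
estimate = ipw_estimate(logged, target_policy)
print(round(estimate, 3))
```

The estimate is obtained without deploying the new policy, which is the appeal of OPE when online A/B tests are slow or risky.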

2606.19315 2026-06-18 cs.LG 新提交 80%

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation

Diffusion-Proof:超越自回归生成的形式化定理证明方案

Ruida Wang, Rui Pan, Pengcheng Wang, Shizhe Diao, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) NVIDIA(英伟达)

专题命中 代码生成 :扩散语言模型用于形式定理证明

AI总结 提出Diffusion-Proof框架,首次将扩散语言模型应用于形式定理证明,通过全证明生成和局部校正方法,在ProofNet和MiniF2F上分别提升1.61%和6.14%,并解决了一个DeepSeek-Prover-V2-7B无法解决的IMO问题。

详情
AI中文摘要

近年来,增强大型语言模型(LLMs)的形式数学推理能力已成为数学和计算机科学社区的关键焦点。虽然在使用最先进的自回归(AR)LLMs进行形式定理证明方面取得了显著进展,但这些模型存在固有局限性。它们的下一个词预测生成方法可能因长程连贯性挑战和长序列错误累积而导致次优性能。最近,扩散LLMs(dLLMs)通过多词块的迭代去噪生成文本,提供了一种有前景的替代方案。然而,dLLMs在形式数学中的应用(其中保持长程连贯性至关重要)仍然研究不足。为解决上述挑战,我们提出了**Diffusion-Proof**,据我们所知,这是第一个训练和应用dLLMs进行形式定理证明的框架。我们的框架包含两种模型的训练和推理方法。第一个是*dLLM-Prover-7B*,它执行具有长程连贯策略使用的全证明写作。第二个是*dLLM-Corrector-7B*,这是一种新颖的大块扩散校正模型。它利用dLLMs的填充能力,使用双向信息进行局部证明校正。大量实验表明,**Diffusion-Proof**相对显著优于在同一数据集上训练的AR LLM基线。与基线相比,**Diffusion-Proof**在ProofNet-Test和MiniF2F-Test基准上分别实现了**1.61%**和**6.14%**的绝对提升。值得注意的是,**Diffusion-Proof**成功解决了一个更先进的思考模型DeepSeek-Prover-V2-7B无法解决的IMO问题,展示了dLLMs在形式定理证明中的独特优势。

英文摘要

Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both mathematical and computer science communities in recent years. While significant progress has been made in using state-of-the-art Auto-Regressive (AR) LLMs for formal theorem proving, these models suffer from inherent limitations. Their next-token prediction generation methods may yield suboptimal performance due to the challenges of long-range coherence and the compounding of errors over long sequences. Recent advancements in diffusion LLMs (dLLMs), which generate text through iterative denoising of a multi-token block, offer a promising alternative. However, the application of dLLMs to formal mathematics, where maintaining long-range coherence is critical, remains largely understudied. To address the challenges above, we propose **Diffusion-Proof**, to the best of our knowledge the first framework to train and apply dLLMs for formal theorem proving. Our framework contains training and inference methods for two models. The first is *dLLM-Prover-7B*, which performs whole-proof writing with long-range coherent tactic usage. The second is *dLLM-Corrector-7B*, a novel large-block diffusion-based correction model. It leverages the in-filling capabilities of dLLMs to perform local proof correction using bi-directional information. Extensive experiments demonstrate that **Diffusion-Proof** outperforms the AR LLM baseline trained on the same dataset by a significant relative margin. Compared to the baseline, **Diffusion-Proof** achieves an absolute improvement of **1.61%** on the ProofNet-Test benchmark and **6.14%** on the MiniF2F-Test benchmark. Notably, **Diffusion-Proof** successfully resolves one IMO problem that the more advanced thinking model DeepSeek-Prover-V2-7B could not solve, showcasing the unique advantage of dLLMs in formal theorem proving.

2606.19042 2026-06-18 cs.SE cs.AI 新提交 80%

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

可变性去哪了?从氛围编码到基于再生的产品线

Xhevahire Tërnava

发表机构 * LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France(LTCI,巴黎电信学院,巴黎理工学院,Palaiseau,法国)

专题命中 代码生成 :AI驱动编程,可变性再生。

AI总结 研究AI驱动编程(氛围编码)中可变性缺失问题,提出通过再生实现可变性(VbR)方法,让LLM作为推导引擎生成无死代码的变体二进制。

Comments VARIABILITY 2026

详情
AI中文摘要

在氛围编码这一新兴的AI驱动范式中,LLM根据自然语言提示生成整个程序,但传统软件工程精心构建到代码中的可变性会发生什么?为了回答这个问题,我们对10个氛围编码的C/C++项目进行了探索性分析,结果表明在编译和运行时,工件内可变性几乎为零。所有可变性决策都在一个新的绑定时间——生成时间(即LLM生成源代码的时刻)得到解决。我们不将其视为需要修复的缺陷,而是提出了通过再生实现可变性(VbR),据我们所知,这是第一种产品线方法,其中LLM充当推导引擎,根据声明性规范为每个变体生成无死代码的专用二进制,同时变体调度器透明地将用户请求路由到匹配的二进制。我们形式化了VbR,将其与经典SPL推导进行对比,并在wc产品家族上演示了其完整流程。对于SPL工程,AI生成软件中的可变性应属于规范,而非代码。

英文摘要

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis on 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability at compile time and runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built, dead-code-free binary for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.

2606.18293 2026-06-18 cs.SE cs.AI 新提交 80%

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

Vibe Coding 吃掉了我的作业:全新软件工程与编程中 AI 方法的评估

Callum Barbour

发表机构 * OpenAI

专题命中 代码生成 :评估AI编程(vibe coding)在软件工程中的可行性。

AI总结 本文评估了“氛围编码”(用自然语言提示编程)在全新软件工程任务中的可行性,并分析了现有基准,通过开发 Python 简单独立编程任务评估套件提供见解。

Comments 10 pages, 2 figures

详情
AI中文摘要

得益于生成式 AI 的快速发展,我们正处于一个可能永远改变我们与计算机交互方式的范式转变之中。我们观察到,在没有领域基础知识的情况下,使用自然语言提示来构建应用程序和编码基础设施的做法日益增长,这种做法被称为“氛围编码”。可以说,随着每一层更高抽象的提出,这正是编程领域自诞生以来一直努力的方向。就输入方式而言,氛围编码有望成为高级编程演进的终点:完全消除人类对代码语法的使用,转而用母语进行编程。本文旨在评估氛围编码在全新软件工程任务中的可行性,并分析用于衡量其软件工程能力的基准。为此,我们开发了一个评估套件,用于分析 LLM 在 Python 中执行简单、独立的全新编程任务的熟练程度,以提供对此问题的有范围限制的见解。

英文摘要

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed "vibe coding". It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

2606.19257 2026-06-18 cs.CL 新提交 70%

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

DreamReasoner-8B:面向扩散推理模型的块大小课程学习

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

发表机构 * The University of Hong Kong(香港大学) Peking University(北京大学)

专题命中 代码生成 :在代码推理基准上评估

AI总结 提出块大小课程学习,通过从细粒度到粗粒度的渐进训练,解决块扩散语言模型在长链推理中性能差距问题,DreamReasoner-8B在数学和代码推理上达到与Qwen3-8B相当的水平。

详情
AI中文摘要

块扩散语言模型通过并行块级去噪加速解码,但其能否可靠地扩展到长思维链(CoT)推理仍未解决。为此,我们开发了开源块扩散推理模型DreamReasoner-8B,并系统研究了训练和推理块大小如何影响长CoT推理。我们的分析揭示了显著的性能差距:使用大块大小训练会导致推理性能极差,而小块大小则能保持有效的推理。为了弥合这一粒度差距,我们提出了块大小课程学习,逐步从细粒度块大小过渡到粗粒度块大小进行训练,从而克服了这一限制,并实现了在多种推理块大小上泛化的强大推理性能。在数学和代码推理基准测试中,DreamReasoner-8B取得了与领先的开源自回归模型(如Qwen3-8B)相竞争的结果。这项工作为高效、具备推理能力的扩散语言模型奠定了实践基础。我们在以下网址发布模型:https://github.com/DreamLM/DreamReasoner。

英文摘要

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
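The fine-to-coarse idea behind block-size curriculum learning can be pictured as a schedule that maps a training step to a block size. The sketch below is only an illustration under stated assumptions: the abstract does not give the actual block sizes or stage lengths, so the sizes `(4, 8, 16, 32)`, the equal-length stages, and the function name `block_size_at` are all hypothetical.

```python
from typing import Sequence


def block_size_at(
    step: int,
    total_steps: int,
    sizes: Sequence[int] = (4, 8, 16, 32),  # hypothetical fine-to-coarse sizes
) -> int:
    """Return the training block size for a given step.

    Training is split into equal-length stages, one per block size, so the
    model sees small (fine-grained) blocks first and large (coarse-grained)
    blocks last.
    """
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if list(sizes) != sorted(sizes):
        raise ValueError("sizes must be ordered from fine to coarse")

    stage = step * len(sizes) // total_steps  # index of the current stage
    return sizes[stage]
```

With `total_steps=1000`, steps 0 to 249 use block size 4, steps 250 to 499 use 8, and the final quarter of training uses 32.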

2606.18425 2026-06-18 cs.SE cs.AI cs.DC 新提交 70%

From Specification to Execution: AI Assisted Scientific Workflow Management

从规范到执行:AI辅助的科学工作流管理

Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

发表机构 * RENCI, University of North Carolina at Chapel Hill, NC, USA(RENCI,北卡罗来纳大学教堂山分校) Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA(南加州大学信息科学研究所,美国加利福尼亚州马里纳德尔雷)

专题命中 代码生成 :利用LLM生成工作流代码

AI总结 提出一种AI辅助方法,通过规范驱动的工作流生成、自动化调试和分布式执行,结合Pegasus与MCP层,实现从自然语言到大规模科学工作流的端到端管理。

详情
AI中文摘要

科学工作流管理系统(WMS)支持复杂管道的可扩展和可重复执行,但工作流的设计、实现和调试仍然主要依赖人工,需要大量专业知识。最近使用大型语言模型(LLM)的方法在从自然语言生成工作流方面显示出潜力,但通常依赖于直接的代码合成,这限制了透明度、可重复性以及与工作流系统的集成。我们提出了一种AI辅助的科学工作流管理方法,结合了规范驱动的工作流生成、自动化调试和分布式执行。该方法引入了一个结构化的规范阶段,将工作流意图、设计和实现分离,允许在代码生成之前进行验证。我们还开发了一个基于LLM的调试代理,用于诊断和解决跨多个系统层的故障。为了支持分布式执行和用户交互,我们将广泛使用的WMS Pegasus与模型上下文协议(MCP)层集成,为工作流提交、监控和控制提供统一接口。我们使用一个用于医学影像的联邦学习工作流来评估该方法,该工作流具有并行、迭代和依赖密集的结构。该系统生成并执行了包含数千个作业的大规模工作流,减少了调试工作量,并允许非专家用户使用专家级设计模式构建工作流。这些结果表明,端到端的AI辅助工作流生成和执行是可行的,并指向了用于管理科学工作流生命周期的AI驱动平台。

英文摘要

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

3. 程序修复 1 篇

2606.18619 2026-06-18 cs.CR cs.AI cs.SE 新提交 85%

Code-Augur: Agentic Vulnerability Detection via Specification Inference

Code-Augur:通过规约推断的智能体漏洞检测

Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

发表机构 * National University of Singapore(新加坡国立大学)

专题命中 程序修复 :智能体漏洞检测,通过规约推断发现漏洞

AI总结 提出安全规约优先范式,通过显式化智能体假设并运行时反证,结合引导式模糊测试提升漏洞检测能力,在真实项目中比现有智能体检测更多漏洞。

详情
AI中文摘要

智能体漏洞检测的出现已成为软件安全的分水岭。完全由自主LLM智能体进行的审计正在发现数字社会基础软件中的关键漏洞。许多漏洞多年来一直隐藏,直到现在才被AI智能体发现。然而,这些发现背后的推理仍然令人担忧地不透明且未经验证。当智能体认为某个函数安全时,它对函数输入做了哪些假设?推理失败和错误假设可能导致遗漏漏洞,并降低对智能体分析的信任。我们提出了一种安全规约优先范式,该范式(1)将智能体的隐性假设明确暴露为安全规约,并(2)通过运行时反证持续细化这些规约。我们在Code-Augur中实现了我们的方法,这是一种用于智能体漏洞检测的新型框架。给定一个代码库,Code-Augur分析系统的每个组件以查找漏洞代码。当它认为某个组件安全时,它会将该判断背后的局部不变量作为源代码中的断言提交。同时,Code-Augur利用引导式模糊测试器尝试反证这些假设。当模糊测试器触发断言时,要么揭示一个真实漏洞,要么揭示一个需要细化的有缺陷规约。在这两种情况下,这一过程都夯实了智能体的理解,使其对代码意图的看法与代码实际行为保持一致。在真实世界的被测项目上,Code-Augur有效利用安全规约检测到比其他最先进智能体更多的漏洞。此外,Code-Augur在关键开源项目中发现了22个新漏洞。与精心策划的专用模型(如Claude Mythos)相比,Code-Augur提供了基于广泛可用的LLM(如Sonnet和DeepSeek)构建的有效智能体漏洞检测。

英文摘要

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
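The loop described above, committing an assumed invariant as an in-source assertion and then trying to falsify it at runtime, can be sketched in a few lines. This is not Code-Augur's implementation: the function `copy_prefix`, its invariant, and the helper `fuzz_for_counterexample` are hypothetical, and plain random input generation stands in for the guided fuzzer the abstract mentions.

```python
import random
from typing import Callable, Optional, Tuple


def copy_prefix(data: bytes, n: int) -> bytes:
    """Return the first n bytes of data."""
    # Hypothetical invariant an agent might commit after judging this
    # function safe: callers never request more bytes than exist.
    assert 0 <= n <= len(data), "assumed: n never exceeds len(data)"
    return data[:n]


def fuzz_for_counterexample(
    target: Callable[[bytes, int], bytes],
    rng: random.Random,
    trials: int = 200,
) -> Optional[Tuple[bytes, int]]:
    """Search for an input that violates the target's asserted invariant.

    Uses unguided random inputs (an assumption made for brevity). Returns
    the first (data, n) pair that triggers an AssertionError, or None if
    no violation is found within the trial budget.
    """
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(9)))
        n = rng.randrange(17)
        try:
            target(data, n)
        except AssertionError:
            return data, n  # the assumption was falsified
    return None
```

A returned pair is a signal for triage: either callers really can pass an oversized `n` (a candidate vulnerability), or the assertion was too strict and the specification needs refining.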

4. 代码评测 6 篇

2602.06774 2026-06-18 cs.AI 版本更新 85%

Towards Understanding What State Space Models Learn About Code

理解状态空间模型在代码中学到了什么

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

发表机构 * TU Darmstadt(达姆施塔特工业大学) Hessian Center for Artificial Intelligence(黑森人工智能中心) National Research Center for Applied Cybersecurity ATHENE(应用网络安全国家研究中心ATHENE)

专题命中 代码评测 :SSM代码理解机制分析

AI总结 本文首次系统分析状态空间模型(SSM)在代码理解中的学习机制,发现SSM在预训练时比Transformer更有效捕获语法和语义结构,但微调时会遗忘某些关系,并提出SSM-Interpret框架和架构改进,将NLCodeSearch的MRR提升高达6。

详情
AI中文摘要

状态空间模型(SSM)已成为Transformer架构的高效替代方案。先前工作表明,在可比条件下训练时,SSM在代码理解任务上可以匹配或超越Transformer。然而,其内部机制仍是一个黑箱。我们首次系统分析了基于SSM的代码模型所学到的内容,并在此领域直接比较了SSM和Transformer模型。我们的分析表明,SSM在预训练期间比Transformer更有效地捕获了语法和语义结构,但在某些任务的微调过程中会遗忘某些关系。为了研究这种行为,我们引入了SSM-Interpret,一个频域框架,揭示了微调期间向短程依赖的频谱偏移。在这些发现的指导下,我们提出了架构修改,将基于SSM的代码模型在NLCodeSearch上的性能显著提升了高达+6 MRR。这表明我们的分析不仅解释了模型行为,而且直接导致了更好的设计。

英文摘要

State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.
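For readers unfamiliar with the metric behind the "+6 MRR" figure, the following minimal sketch computes mean reciprocal rank for a code-search style evaluation. The function name and the toy data are hypothetical, and the sketch returns a value in the range 0 to 1; reading "+6" as six points on a 0 to 100 scale is an assumption, since the abstract does not state the scale.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[Hashable]],
    relevant: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the relevant item over all queries.

    ranked_results[i] is the ranked candidate list for query i and
    relevant[i] is its single correct item. A query whose correct item
    is missing from the list contributes 0.
    """
    if len(ranked_results) != len(relevant) or not relevant:
        raise ValueError("need one relevant item per non-empty query set")

    total = 0.0
    for results, target in zip(ranked_results, relevant):
        for rank, item in enumerate(results, start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(relevant)
```

For three queries whose correct snippets appear at rank 1, rank 2, and not at all, the score is (1 + 0.5 + 0) / 3 = 0.5.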

2606.18284 2026-06-18 cs.LG cs.AI cs.CL 新提交 75%

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

打破求解器瓶颈:在可学习前沿训练任务生成器

Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent

发表机构 * Vmax Goodfire AI

专题命中 代码评测 :提出PROPEL框架,优化任务生成器用于代码和软件工程。

AI总结 提出PROPEL框架,通过训练轻量级激活探针作为求解率代理,在无需重复求解器评估的情况下优化任务生成器,使生成任务集中在可学习前沿,提升数学、代码和软件工程任务的有效性。

Comments 30 pages, 9 figures, 12 tables

详情
AI中文摘要

通过强化学习训练智能体的限制资源日益成为前沿任务供给:有效、可求解且刚好足够困难以训练当前模型的任务。随着推理和智能体模型的改进,固定任务分布趋于饱和,而天真的合成生成产生琐碎、不可能或不适定的任务。用强化学习训练任务生成器以优化有效性和可学习性可以解决这一瓶颈,但直接优化需要对每个候选任务进行重复求解器评估。对于软件工程任务,单次评估可能耗时数十分钟;求解器在环的生成器训练是不可行的。我们提出PROPEL,一个求解器摊销框架,用于在目标求解率下训练任务生成器。PROPEL在一次性标注的生成任务和求解器结果语料库上训练一个轻量级激活探针。该探针从冻结的生成器参考模型预测目标求解器的通过率,并在生成器优化期间作为求解率的代理,将生成器评估简化为单次前向传播。在多种模型规模下的数学、代码和软件工程任务中,PROPEL将生成任务转向目标求解率:对于编程,在可学习前沿生成的任务比例提升为$10.1\% \rightarrow 20.0\%$(Qwen2.5-3B-Instruct求解器)和$5.3\% \rightarrow 12.6\%$(Qwen2.5-7B-Instruct求解器)。对于软件工程,在探针和生成器训练期间未见过的仓库上,PROPEL将Qwen3.5-27B在目标求解率下的生成份额提升为$9.8\% \rightarrow 19.6\%$。

英文摘要

The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\% \rightarrow 19.6\%$ for Qwen3.5-27B on repositories not seen during training of probe and generator.
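The role of the probe as a cheap stand-in for solver rollouts can be sketched as follows. This is an illustration only, not PROPEL's method: the linear-plus-sigmoid probe, the triangular reward shape, the default target of `0.5`, and the names `probe_pass_rate` and `frontier_reward` are all assumptions, since the abstract does not specify them.

```python
import math
from typing import Sequence


def probe_pass_rate(
    activations: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Predict a solver pass rate from generator activations.

    A hypothetical linear probe followed by a sigmoid, so the output lies
    in (0, 1) and costs one dot product instead of a solver rollout.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    z = sum(a * w for a, w in zip(activations, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def frontier_reward(pass_rate: float, target: float = 0.5, width: float = 0.25) -> float:
    """Reward tasks whose predicted pass rate is near the targeted solve rate.

    The reward is 1 at the target and falls linearly to 0 once the
    prediction is `width` away, so trivial and impossible tasks score 0.
    """
    if width <= 0.0:
        raise ValueError("width must be positive")
    return max(0.0, 1.0 - abs(pass_rate - target) / width)
```

Under these assumptions, a task predicted to be solved every time (`pass_rate=1.0`) earns no reward, while one predicted at `0.375` earns `0.5`, which is the kind of signal a generator could be optimized against without running the solver.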

2604.00730 2026-06-18 cs.CY cs.AI cs.LG cs.SE 版本更新 75%

A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

基于CEFR启发的模糊C均值分类框架:自动化评估Scratch编程技能

Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles

发表机构 * Universidad Rey Juan Carlos(胡安·卡洛斯国王大学)

专题命中 代码评测 :模糊C均值聚类评估Scratch编程技能

AI总结 提出一种基于CEFR的Scratch项目评估框架,使用模糊C均值聚类对200万+项目分级,识别B2瓶颈并引入分类确定性指标以平衡自动反馈与人工审核。

Comments Best Paper Award CSEDU 2026 -Minor change FPC fix-

详情
AI中文摘要

背景:学校、培训平台和技术公司日益需要以透明、可重复的方法大规模评估编程能力,以支持个性化学习路径。目标:本研究引入一个与欧洲共同语言参考标准(CEFR)一致的Scratch项目评估教学框架,为学生和教师提供通用能力等级,并为课程设计提供可行见解。方法:我们对通过Dr.Scratch评估的2008246个Scratch项目应用模糊C均值聚类,实施序数准则将聚类映射到CEFR等级(A1-C2),并引入增强分类指标,识别过渡学习者,实现持续进度跟踪,量化分类确定性以平衡自动反馈与教师评审。影响:该框架能够诊断系统性课程缺口,特别是“B2瓶颈”(由于逻辑同步和数据表示的认知负荷,仅13.3%的学习者处于该等级),同时提供基于确定性的触发机制以进行人工干预。

英文摘要

Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic Synchronization, and Data Representation, while providing certainty-based triggers for human intervention.
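To make the clustering step concrete, the sketch below computes Fuzzy C-Means membership degrees for one project from its distances to the cluster centres, using the standard FCM membership formula. The `certainty` helper (top membership minus the runner-up) is a hypothetical stand-in: the abstract does not define the paper's classification-certainty metric, and the function names are illustrative.

```python
from typing import List, Sequence


def fcm_memberships(distances: Sequence[float], m: float = 2.0) -> List[float]:
    """Standard Fuzzy C-Means membership of one point in each cluster.

    distances[k] is the distance from the point to cluster centre k and
    m > 1 is the fuzzifier. Memberships are non-negative and sum to 1.
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    zeros = [i for i, d in enumerate(distances) if d == 0.0]
    if zeros:
        # The point coincides with one or more centres: share membership
        # among them and give every other cluster zero.
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(distances))]

    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in distances)
        for d_k in distances
    ]


def certainty(memberships: Sequence[float]) -> float:
    """Hypothetical certainty score: gap between the two largest memberships."""
    top, second = sorted(memberships, reverse=True)[:2]
    return top - second
```

A project at distances `[1.0, 3.0]` from two level centres gets memberships `[0.9, 0.1]` and a certainty of `0.8`; a project equidistant from both gets `[0.5, 0.5]` and a certainty of `0`, which is the kind of transitional case that could be routed to an instructor.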

2606.16000 2026-06-18 cs.CL cs.LG 新提交 70%

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

GRACE-DS:数据科学中的受保护奖励引导智能体修正环境

Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasiya Palienko

发表机构 * ITMO University(ITMO大学) HSE University(高等经济学院)

专题命中 代码评测 :评估代码生成和AutoML智能体性能

AI总结 提出GRACE-DS,一个用于评估LLM驱动的AutoML智能体在部署前性能的隔离环境,通过隐藏的可执行验证器衡量预测性能、泄漏避免、可重复性等指标,实验证明其灵活迭代交互模式优于基线方法。

详情
AI中文摘要

我们介绍了GRACE-DS,一个数据科学中的受保护奖励引导智能体修正环境,用于对LLM驱动的AutoML智能体进行部署前评估。GRACE-DS是一组在隔离环境中的评估指标,可应用于特定组织的表格ML任务。它将智能体暴露于现实的工作流阶段,从规划和数据检查到特征工程、模型开发、验证、代码修复直至最终提交,同时隐藏的可执行验证器不仅衡量最终预测性能,还衡量泄漏避免、可重复性、协议有效性、修正行为和奖励对齐。最强的结构化机制——灵活迭代交互(我们的方法)——实现了比单次生成、非结构化交互和基于重启的基线更高的端到端归一化隐藏测试质量,同时提高了协议有效完成率。经过7000多个回合的验证,这些结果确立了GRACE-DS作为评估基于LLM的AutoML智能体在生产类条件下按照组织特定要求执行机器学习工作流能力的稳健平台。

英文摘要

We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

2606.18536 2026-06-18 stat.AP cs.SE 新提交 60%

Analytics for Quality Assurance for Item Pools (AQuAP): Monitoring and Maintaining Item Bank Health in AI-Driven Assessment Systems

题库质量保证分析(AQuAP):AI驱动评估系统中题库健康的监控与维护

Alina A. von Davier, Xiaowan Zhang, Yigal Attali, Yena Park, Jacqueline Church, Andrew Runge, Geoff T. LaFlair, Alexander Tsigler

专题命中 代码评测 :AI评估系统中题库质量监控

AI总结 提出AQuAP仪表盘环境,通过有效题库规模等指标监控题库质量,支持大规模自动与人工结合的试题开发,确保高利害测试的题库健康。

Comments 11 pages, 4 figures

详情
AI中文摘要

教育评估的大规模数字化使得题库的持续监督既必要又复杂。本文提出了题库质量保证分析(AQuAP),一个用于监控试题质量和题库健康的仪表盘环境。AQuAP支持高利害测试中大规模试题生成程序的操作实施,这些程序包含在试题工厂(一个自动化和人工支持的测试开发框架)中。本文描述了AQuAP与试题开发过程的关系,概述了题库质量保证的更广泛度量框架,并强调了有效题库规模(EBS)作为题库活力的核心指标。EBS量化了在内容重复发生之前可以构建的独立测试会话数量,当与曝光度和使用度量结合时,它提供了对题库安全性、多样性和效率的洞察。我们进一步引入了题库健康度量,如最大曝光度、最大条件曝光度、调整后的有效题库规模和极少施测比例,所有这些都扩展了试题利用情况的图景。AQuAP展示了操作分析如何将心理测量概念转化为高容量、AI驱动的测试程序的质量保证工具。本文以多邻国英语测试(DET)流程为例进行说明。

英文摘要

The large-scale digitization of educational assessment has made the continuous oversight of item banks both essential and complex. This paper presents Analytics for Quality Assurance for Item Pools (AQuAP), a dashboard environment for monitoring item quality and item bank health. AQuAP supports the operational implementation of the large scale item generation procedures for high-stakes tests as included in the Item Factory, a framework for automated and human-supported test development. The paper describes AQuAP in relationship with the process of item development, outlines the broader metric framework for item-pool quality assurance, and highlights the Effective Bank Size (EBS) as one central indicator of pool vitality. EBS quantifies how many independent test sessions can be constructed before content repetition occurs and, when coupled with exposure and usage metrics, provides insight into item bank security, diversity, and efficiency. We further introduce bank-health metrics, such as maximum exposure, maximum conditional exposure, adjusted effective bank size, and the rarely-administered fraction, all of which extend this picture of item utilization. AQuAP illustrates how operational analytics can translate psychometric concepts into quality assurance tools for high-volume, AI-enabled testing programs. This work is illustrated with the Duolingo English Test (DET) processes.

2606.18421 2026-06-18 cs.SE 新提交 60%

Finding Compiler-Platform Interaction Bugs in Deep Learning Pipelines via Cross-Layer Constraints

通过跨层约束发现深度学习流水线中的编译器-平台交互错误

Yuxin Qiu, Jiyuan Wang, Ronak Badhe, Ben Limpanukorn, Miryung Kim, Qian Zhang

专题命中 代码评测 :测试深度学习编译器与平台交互错误

AI总结 提出一种自动化框架XCheck,通过提取全栈约束生成测试模型,发现编译器与硬件平台交互导致的错误,并在三个编译器上发现2034个错误案例。

详情
AI中文摘要

人工智能的日益部署需要鲁棒的深度学习编译器,如TVM和ONNX-MLIR。这些编译器以高级AI模型为输入,通过多层变换降低它们,并将其专门化到不同的硬件。测试此类编译器具有独特的挑战性,因为正确性取决于嵌入在整个编译栈中的隐式约束。现有的测试方法主要采用类型约束来限制输入模型生成,因此强调类型验证并监控编译崩溃或覆盖率增益。这种关注忽略了由编译和执行环境之间的交错效应引起的编译器-平台交互错误。在这项工作中,我们提出了一个可扩展的自动化DL编译器测试框架,用于同时(1)发现编译器-平台交互错误和(2)实现行为等价划分。我们的关键见解是,这些错误是由跨编译通道和硬件平台的交互引起的违反假设导致的。因此,我们超越了约束输入生成,并推导出全栈约束。我们的方法分为三步。首先,我们设计了一种自动化方法来提取全栈约束,这些约束共同指导模型生成并表征编译行为。其次,我们优先考虑暴露交互敏感行为的约束,以便我们生成的模型能够执行深度编译逻辑。第三,我们通过自动插入断言来监控覆盖率或通过/失败信号遗漏的不同编译症状,从而实现行为等价划分。我们在三个广泛使用的DL编译器上评估了我们的工具XCheck,发现了2034个揭示错误的案例,包括内存溢出、整数溢出以及根源于编译器-平台交互的静默意外编译。

英文摘要

The growing deployment of artificial intelligence (AI) necessitates robust deep learning (DL) compilers, such as TVM and ONNX-MLIR. These compilers take as input high-level AI models, lower them through multi-layer transformations, and specialize them to diverse hardware. Testing such compilers is uniquely challenging as correctness depends on implicit constraints embedded throughout the compilation stack. Existing testing approaches largely take type constraints to restrict input model generation and therefore emphasize type validation and monitor compilation crashes or coverage gains. This focus overlooks compiler-platform interaction bugs that arise from interleaved effects across compilation and execution environments. In this work, we propose a scalable, automated DL compiler testing framework for, in tandem, (1) finding compiler-platform interaction bugs and (2) enabling behavior equivalence partitioning. Our key insight is that these bugs are caused by violated assumptions arising from interactions across compilation passes and hardware platforms. Therefore, we move beyond constraining input generation and derive full-stack constraints. Our approach is three-fold. First, we design an automated approach to extract full-stack constraints that jointly guide model generation and characterize compilation behaviors. Second, we prioritize constraints that expose interaction-sensitive behaviors, so our generated models are capable of exercising deep compilation logic. Third, we enable behavior equivalence partitioning by automatically inserting assertions to monitor distinct compilation symptoms that coverage or pass/fail signals miss. We evaluated our tool, XCheck, on three widely-used DL compilers and found 2,034 bug-revealing cases, including memory overflows, integer overflows, and silent unexpected compilations that were rooted in compiler-platform interactions.

5. 其他AI编程 1 篇

2602.15149 2026-06-18 cs.CE cs.NA math.NA 版本更新 60%

SoliDualSPHysics: An extension of DualSPHysics for solid mechanics with hyperelasticity, plasticity, and fracture

SoliDualSPHysics:一种用于固体力学的DualSPHysics扩展,支持超弹性、塑性及断裂

Mohammad Naqib Rahimi, George Moutsanidis

专题命中 其他AI编程 :开源软件扩展,涉及代码但非AI编程核心

AI总结 本文提出SoliDualSPHysics,一种基于SPH的开源软件,扩展DualSPHysics以模拟超弹性、有限应变塑性及脆性断裂行为,采用总拉格朗日格式,支持动态加载下的裂纹萌生与扩展,验证了其准确性和可扩展性。

详情
AI中文摘要

我们介绍了SoliDualSPHysics,一种新颖的开源且基于GPU加速的软件,扩展DualSPHysics以实现超弹性、有限应变塑性及脆性断裂行为的数值模拟。该软件实现了总拉格朗日格式,允许直接应用外部载荷和边界条件,支持独立的固体力学模拟。脆性断裂通过相场方法与SPH耦合,允许在动态加载下实现裂纹萌生、扩展和分叉,无需额外标准或局部细化。框架还支持用户定义的数学表达式来规定时间与空间相关的量,补充了固体力学和断裂扩展,并增强了现有和未来DualSPHysics应用的灵活性。利用DualSPHysics原生的CPU/GPU并行架构,该软件在大规模模拟中实现了显著的计算加速,且通过基准数值问题和实验数据验证了其准确性、鲁棒性和良好的扩展性能。提供了全面的实现细节和用户文档,以确保可重复性和支持社区进一步开发。框架和源代码通过公共GitHub仓库免费提供。

英文摘要

We introduce SoliDualSPHysics, a novel open-source and GPU-accelerated software that extends DualSPHysics to enable the numerical simulation of hyperelastic, finite-strain plastic, and brittle fracture behavior in deformable solids within a unified smoothed particle hydrodynamics (SPH) software framework. The software implements a total Lagrangian formulation for solid mechanics that allows direct application of external loads and boundary conditions, enabling independent solid mechanics simulations. Brittle fracture is modeled through a phase-field approach coupled with SPH, allowing crack initiation, propagation, and branching under dynamic loading without explicit crack tracking, ad hoc crack-path criteria, or local refinement. The framework also supports user-defined mathematical expressions to prescribe time- and space-dependent quantities, complementing the solid and fracture extensions and enhancing flexibility across existing and future DualSPHysics applications. Leveraging DualSPHysics' native CPU/GPU parallel architecture, the software achieves substantial computational acceleration for large-scale simulations, and the implementation is verified and validated against benchmark numerical problems and experimental data, demonstrating accuracy, robustness, and favorable scaling performance. Comprehensive implementation details and user documentation are provided to ensure reproducibility and to support further development by the community. The framework and source code are freely available through a public GitHub repository.