Code LLMs / AI Programming - arXivDaily Topic

2606.18286 2026-06-18 cs.LG New submission 85%

CODEBLOCK: Learning to Supervise Code at the Right Granularity

Zhijie Deng, Ling Li, Jinlong Pang, Kaiqin Hu, Qi Xuan, Zhaowei Zhu, Jiaheng Wei

Affiliations: Hong Kong University of Science and Technology (Guangzhou); UC Santa Cruz; Ant Group; BAIA, ZJUT; D5Data.ai

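Topic match (code generation): proposes the CodeBlock framework, which uses structure-aware sparse supervision to improve fine-tuning for code generation.

AI summary: Proposes CodeBlock, a framework that applies sparse supervision to structure-complete code blocks rather than isolated tokens. Using only 1.9% of supervised response tokens, it achieves a stronger average pass@1 than full-token fine-tuning across six code-generation benchmarks.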

Abstract

Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.
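
The training-time idea in this abstract (the full response stays in context, while loss is applied only to selected code items) can be illustrated with a small sketch. This is not the paper's implementation: the function names are hypothetical, splitting at top-level statements with Python's `ast` module merely stands in for the paper's partitioning into coding items, and the utility scoring and data-flow reranking steps are not shown.

```python
import ast
import math

def partition_blocks(source):
    """Split Python source into top-level statements as (first_line, last_line).

    Assumption: top-level statements approximate "syntactically coherent" items.
    """
    tree = ast.parse(source)
    return [(node.lineno, node.end_lineno) for node in tree.body]

def span_mask(num_tokens, selected_spans):
    """Build a 0/1 loss mask; spans are (start, end) token indices, end exclusive."""
    mask = [0] * num_tokens
    for start, end in selected_spans:
        for i in range(start, end):
            mask[i] = 1
    return mask

def masked_cross_entropy(target_probs, mask):
    """Mean negative log-likelihood over the masked-in positions only."""
    losses = [-math.log(p) for p, m in zip(target_probs, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0

# Toy program with three top-level statements.
source = "import os\n\ndef f(x):\n    y = x + 1\n    return y\n\nprint(f(1))\n"
blocks = partition_blocks(source)

# Toy per-token probabilities of the correct target; only tokens 1..2 are supervised.
probs = [0.5, 0.25, 0.5, 1.0]
mask = span_mask(len(probs), [(1, 3)])
loss = masked_cross_entropy(probs, mask)
```

Masking whole spans rather than individual tokens keeps each supervised unit intact, which is the property the abstract argues token-level masking can break.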

2606.19315 2026-06-18 cs.LG New submission 80%

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation

Ruida Wang, Rui Pan, Pengcheng Wang, Shizhe Diao, Tong Zhang

Affiliations: University of Illinois Urbana-Champaign; NVIDIA

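Topic match (code generation): diffusion language models applied to formal theorem proving.

AI summary: Proposes Diffusion-Proof, described by the authors as the first framework to apply diffusion language models to formal theorem proving. Combining whole-proof generation with local correction, it improves over the AR baseline by 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test (absolute), and solves one IMO problem that DeepSeek-Prover-V2-7B could not.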

Abstract

Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both the mathematical and computer science communities in recent years. While significant progress has been made in using state-of-the-art Auto-Regressive (AR) LLMs for formal theorem proving, these models suffer from inherent limitations. Their next-token prediction may yield suboptimal performance because long-range coherence is hard to maintain and errors compound over long sequences. Recent diffusion LLMs (dLLMs), which generate text through iterative denoising of a multi-token block, offer a promising alternative. However, the application of dLLMs to formal mathematics, where maintaining long-range coherence is critical, remains largely understudied. To address these challenges, we propose **Diffusion-Proof**, to the best of our knowledge the first framework to train and apply dLLMs for formal theorem proving. Our framework contains training and inference methods for two models. The first is *dLLM-Prover-7B*, which performs whole-proof writing with long-range coherent tactic usage. The second is *dLLM-Corrector-7B*, a novel large-block diffusion-based correction model. It leverages the in-filling capabilities of dLLMs to perform local proof correction using bi-directional information. Extensive experiments demonstrate that **Diffusion-Proof** significantly outperforms the AR LLM baseline trained on the same dataset. **Diffusion-Proof** achieves an absolute improvement of **1.61%** on the ProofNet-Test benchmark and **6.14%** on the MiniF2F-Test benchmark compared to the baseline. Notably, **Diffusion-Proof** successfully resolves one IMO problem that the more advanced thinking model DeepSeek-Prover-V2-7B could not solve, showcasing the unique advantage of dLLMs in formal theorem proving.

2606.19042 2026-06-18 cs.SE cs.AI New submission 80%

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

Xhevahire Tërnava

Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

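Topic match (code generation): AI-driven programming and variability by regeneration.

AI summary: Studies the loss of variability in AI-driven programming (vibe coding) and proposes Variability by Regeneration (VbR), in which the LLM acts as the derivation engine and generates a dead-code-free binary for each variant.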

Comments VARIABILITY 2026

Abstract

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis of 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability at compile time and runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built, dead-code-free binary for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.
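
The variant dispatcher mentioned in this abstract can be pictured as a lookup from a requested feature set to a purpose-built binary. The sketch below is only an illustration under assumptions: the feature names, binary paths, and the `dispatch` function are hypothetical and do not come from the paper.

```python
# Hypothetical variant table for a wc product family: feature set -> binary path.
VARIANTS = {
    frozenset({"lines"}): "bin/wc_lines",
    frozenset({"words"}): "bin/wc_words",
    frozenset({"lines", "words", "bytes"}): "bin/wc_full",
}

def dispatch(requested_features, variants=VARIANTS):
    """Return the binary built for exactly the requested feature set."""
    key = frozenset(requested_features)
    if key not in variants:
        # In this sketch an unknown variant is simply an error.
        raise LookupError(f"no binary for variant {sorted(key)}")
    return variants[key]

chosen = dispatch(["lines"])
```

Because each binary contains only the features of its variant, the choice between variants happens in the dispatcher rather than through compile-time or runtime switches inside the code.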

2606.18293 2026-06-18 cs.SE cs.AI New submission 80%

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

Callum Barbour

Affiliations: OpenAI

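Topic match (code generation): evaluates the viability of AI programming (vibe coding) for software engineering.

AI summary: Evaluates the viability of vibe coding (programming through natural-language prompts) for greenfield software engineering tasks, analyses existing benchmarks, and develops an evaluation suite of simple, isolated Python programming tasks to provide insight.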

Comments 10 pages, 2 figures

Abstract

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed "vibe coding." It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

2606.19257 2026-06-18 cs.CL New submission 70%

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

Affiliations: The University of Hong Kong; Peking University

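Topic match (code generation): evaluated on code reasoning benchmarks.

AI summary: Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long chain-of-thought reasoning. DreamReasoner-8B reaches results comparable to Qwen3-8B on mathematical and code reasoning.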

Abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
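
A block-size curriculum of the kind described in this abstract could be expressed as a step-dependent schedule. The sketch below is an assumption-laden illustration: the function name, the block sizes (4, 8, 16, 32), and the equal-length stages are hypothetical, since the abstract does not state the actual schedule.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for a step, moving from fine to coarse.

    Assumption: training is split into equal-length stages, one per block size.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = step * len(sizes) // total_steps
    return sizes[stage]

# Block size used at each of 8 toy training steps.
schedule = [block_size_at(s, 8) for s in range(8)]
```

With these toy settings the schedule spends two steps at each block size, starting small and ending large.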

2606.18425 2026-06-18 cs.SE cs.AI cs.DC New submission 70%

From Specification to Execution: AI Assisted Scientific Workflow Management

Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

Affiliations: RENCI, University of North Carolina at Chapel Hill, NC, USA; Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

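Topic match (code generation): uses LLMs to generate workflow code.

AI summary: Proposes an AI-assisted approach that combines specification-driven workflow generation, automated debugging, and distributed execution, integrating Pegasus with an MCP layer to manage large-scale scientific workflows end to end, starting from natural language.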

Abstract

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.
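
The point of a specification phase that allows validation prior to code generation can be illustrated with a dependency check on a declarative job list. This is a sketch under assumptions, not the paper's system: the spec format, job names, and `validate_spec` are hypothetical, and it uses Python's standard `graphlib` module rather than any Pegasus or MCP API.

```python
from graphlib import CycleError, TopologicalSorter

def validate_spec(jobs):
    """Check a workflow spec before any code is generated.

    `jobs` maps each job name to the list of jobs it depends on.
    Returns one valid execution order, or raises ValueError.
    """
    for name, deps in jobs.items():
        missing = [d for d in deps if d not in jobs]
        if missing:
            raise ValueError(f"{name} depends on undefined jobs: {missing}")
    try:
        return list(TopologicalSorter(jobs).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc

# Hypothetical one-round federated learning spec with two parallel sites.
spec = {
    "init_model": [],
    "train_site_a": ["init_model"],
    "train_site_b": ["init_model"],
    "aggregate": ["train_site_a", "train_site_b"],
}
order = validate_spec(spec)
```

Catching undefined or cyclic dependencies at the specification level means such errors surface before any workflow code is generated or jobs are submitted.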
