Code LLMs / AI Programming - arXivDaily Topic

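2606.20158 2026-06-19 cs.SE New submission 90%

N-Version Programming with Coding Agents

Javier Ron, Benoit Baudry, Martin Monperrus

Topic match (code generation): uses coding agents to generate implementations and evaluates how diversity affects failure modes.

AI summary: This paper revisits N-version programming in the context of contemporary AI coding agents. By revisiting the Knight-Leveson experiment, it evaluates how diversity across agent systems, models, and implementation languages affects failure modes. It finds common-mode failures, but majority-voting three-version units substantially reduce the failure count, supporting the strategy's engineering practicality.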

Abstract

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using Knight and Leveson's Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.
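The majority-voting setup in this abstract can be illustrated with a minimal Python sketch. Everything here is hypothetical: the three toy versions, the oracle, and the helper names `majority_vote` and `count_failures` are not from the paper, and the real experiment uses 48 implementations and 1,000,000 inputs. The toy versions are chosen so that two of them share a fault on one input, which is the kind of common-mode failure that voting cannot mask.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by a strict majority of versions, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def count_failures(versions: Sequence[Callable], oracle: Callable, inputs: Iterable) -> int:
    """Count inputs where the voted output is missing or disagrees with the oracle."""
    failures = 0
    for x in inputs:
        voted = majority_vote([version(x) for version in versions])
        if voted is None or voted != oracle(x):
            failures += 1
    return failures


# Hypothetical oracle and versions for a boolean decision (not the real LIP spec).
def oracle(x: int) -> bool:
    return x % 2 == 0

def version_a(x: int) -> bool:
    return True if x == 3 else x % 2 == 0        # wrong at 3

def version_b(x: int) -> bool:
    return True if x == 5 else x % 2 == 0        # wrong at 5

def version_c(x: int) -> bool:
    return True if x in (3, 7) else x % 2 == 0   # wrong at 3 and 7


inputs = range(10)
single = [count_failures([v], oracle, inputs) for v in (version_a, version_b, version_c)]
triple = count_failures([version_a, version_b, version_c], oracle, inputs)
print(single, triple)
```

In this toy run the independent faults at 5 and 7 are outvoted, while the fault at 3, shared by two versions, survives the vote.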

2606.19988 2026-06-19 cs.SE New submission 90%

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

Shi Chen, Rongcun Wang, Yuan Tian, Xiaoyuan Xie, Wei Song, Rubing Huang

Topic match (code generation): evaluates LLM performance on Solidity code generation.

AI summary: Introduces the SolidityBench benchmark and the SolidityScore metric, evaluates several LLM approaches to repository-level Solidity code generation, and finds supervised fine-tuning the most effective.

Comments: 33 pages

Abstract

Large Language Models (LLMs) have shown strong capabilities in general-purpose code generation, but their effectiveness in specialized software domains remains underexplored. Solidity smart contracts represent a high-stakes domain where generated code must satisfy strict language-level, security, and software-engineering constraints. Existing benchmarks and metrics remain insufficient for repository-level Solidity generation, where models must synthesize complete contracts from natural language requirements. To address this gap, we introduce SolidityBench, a benchmark of 5,470 repository-level Solidity smart contracts paired with natural language descriptions. We also propose SolidityScore, a Solidity-aware semantic metric that emphasizes domain-critical constructs such as security modifiers, contract declarations, and Solidity-specific keywords. Using this benchmark, we evaluate representative code LLMs, including Qwen2.5-Coder, DeepSeek-Coder, and CodeLlama, across zero-shot prompting, Chain-of-Thought reasoning, in-context learning, retrieval-augmented generation, and supervised fine-tuning. The results show that general-purpose models exhibit systematic structural deficiencies in repository-level Solidity generation. Among non-parametric methods, retrieval-augmented generation performs best, while in-context learning degrades beyond two examples due to context saturation. Supervised fine-tuning achieves the largest improvement by internalizing Solidity-specific constraints into model parameters. Overall, our study provides a comprehensive benchmark for repository-level Solidity code generation and shows that high-quality domain data combined with supervised fine-tuning is the most effective strategy for improving the reliability of LLM-generated smart contracts.

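2606.19387 2026-06-19 cs.SE cs.AI New submission 90%

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

You Li, Samuel Mandell, David Z. Pan

Affiliations: The University of Texas at Austin; Fudan University; USA

Topic match (code generation): uses LLMs to generate RTL hardware code, combined with formal methods.

AI summary: Proposes a hardware generation framework that combines the creativity of LLMs with the interpretability of formal methods, iteratively applying transformation rules to turn a design specification into an RTL program with correctness guarantees.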

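2606.19347 2026-06-19 cs.CL cs.AI cs.PL New submission 90%

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

Affiliations: NVIDIA Research

Topic match (code generation): analyzes how LLMs fail and generalize in RTL coding.

AI summary: Proposes an error taxonomy grounded in problem solvability and finds that LLM RTL coding is bounded by pretraining knowledge: alignment techniques merely teach models to compile, while reasoning ability is the key bottleneck.

Comments: Preview, under submission for EMNLP 2026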

Abstract

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

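2604.11556 2026-06-19 cs.SE cs.AI Updated version 90%

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

Haoran Ding, Zhaoguo Wang, Haibo Chen

Affiliations: Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University

Topic match (code generation): LLMs automatically generate function specifications to enable formal reasoning.

AI summary: Proposes the FM-Agent framework, which uses LLMs to generate function-level specifications automatically and enable compositional reasoning about large systems; it found 522 new bugs within 2 days in systems of up to 143k lines of code.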

Abstract

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.
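A Hoare triple {P} f {Q} says that if precondition P holds before f runs, postcondition Q holds afterwards. The Python sketch below illustrates the idea of taking the specification from what a caller expects rather than from the implementation, and then hunting for a bug-triggering test case. It is a testing stand-in, not a proof procedure, and not FM-Agent's method: `check_triple`, `buggy_abs`, and the hand-written predicates are hypothetical, whereas the paper derives natural-language specifications with LLMs.

```python
from typing import Callable, Iterable, List


def check_triple(
    pre: Callable[[int], bool],
    func: Callable[[int], int],
    post: Callable[[int, int], bool],
    inputs: Iterable[int],
) -> List[int]:
    """Return the inputs that satisfy the precondition but violate the postcondition."""
    counterexamples = []
    for x in inputs:
        if pre(x):
            result = func(x)
            if not post(x, result):
                counterexamples.append(x)
    return counterexamples


# Hypothetical buggy callee: meant to return the absolute value.
def buggy_abs(x: int) -> int:
    return x if x > 0 else x  # bug: negative inputs are returned unchanged


# Specification written from the caller's expectation, independent of the code above.
def pre(x: int) -> bool:
    return isinstance(x, int)

def post(x: int, result: int) -> bool:
    return result >= 0 and result in (x, -x)


bugs = check_triple(pre, buggy_abs, post, range(-3, 4))
print(bugs)
```

Because the postcondition encodes intent rather than behavior, the buggy implementation cannot satisfy it by construction, and each counterexample doubles as a concrete test that triggers the bug.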

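2606.20373 2026-06-19 cs.SE cs.AI New submission 85%

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

Affiliations: Shaanxi Normal University; Northwest University; University of Leeds

Topic match (code generation): LLMs generate compiler options to optimize code performance.

AI summary: Proposes AutoPass, a multi-agent framework that queries compiler-internal state and the intermediate representation and uses runtime feedback to refine compiler options iteratively. It needs no training and achieves speedups of 1.043x on x86-64 and 1.117x on ARM64.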

Abstract

Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
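The headline numbers are geometric-mean speedups over a baseline. As a reminder of how such a figure is computed, here is a small Python sketch; the function name and the timing values are made up for illustration and are not AutoPass measurements.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline_times: Sequence[float], tuned_times: Sequence[float]) -> float:
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if not baseline_times or len(baseline_times) != len(tuned_times):
        raise ValueError("need two non-empty sequences of equal length")
    # Averaging logs is numerically safer than multiplying many ratios.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_times, tuned_times))
    return math.exp(log_sum / len(baseline_times))


# Hypothetical runtimes in seconds: one benchmark gets 2x faster, one is
# unchanged, one gets 2x slower, so the geometric mean is exactly 1.0.
baseline = [2.0, 4.0, 1.0]
tuned = [1.0, 4.0, 2.0]
speedup = geometric_mean_speedup(baseline, tuned)
print(round(speedup, 3))
```

The geometric mean is the usual choice for ratios because a 2x gain and a 2x regression cancel out, which an arithmetic mean of the same ratios would not show.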

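2606.19814 2026-06-19 cs.SE New submission 85%

CoRaCommit: A VS Code Extension for Commit Message Generation with Exemplar Retrieval

Chaoran Cai, Bo Xiong, Chong Wang, Lulu He, Peng Liang

Topic match (code generation): a VS Code extension that generates commit messages using retrieved exemplars.

AI summary: Proposes the CoRaCommit VS Code extension, which retrieves similar commit exemplars as prompt context, invokes multiple LLMs in parallel to generate candidate messages, and recommends LLMs dynamically based on user feedback; it outperforms existing extensions on the ApacheCM dataset.

Comments: 17 pages, 6 images, 3 tables, Manuscript submitted to a Journal (2026)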

Abstract

Commit messages are essential textual artifacts that describe the intent behind code changes, and play a critical role in version control, code review, and historical tracking. However, in practice, commit messages are primarily authored manually, which is time-consuming and often results in inconsistent quality and non-uniform expression. Existing VS Code extensions for commit message generation typically directly invoke large language models based on the code diff, without leveraging similar commit exemplars as references, and rarely support user feedback-driven LLM recommendation. To address these limitations, this paper presents CoRaCommit, a VS Code extension that enhances commit message generation by retrieving similar commit exemplars as prompt context, invoking multiple LLMs in parallel for candidate commit message comparison, and dynamically recommending LLMs based on user feedback. Experimental results on 945 commits from the ApacheCM dataset show that CoRaCommit outperforms existing VS Code extensions across BLEU, CIDEr, METEOR, and ROUGE-L metrics, demonstrating the effectiveness of retrieval-augmented context for commit message generation.
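The abstract does not say how similar commits are retrieved, so the following Python sketch simply assumes token-level Jaccard similarity over diffs. The commit history, the helper names, and the prompt wording are all hypothetical; the sketch only shows the general shape of using retrieved exemplars as prompt context.

```python
import re
from typing import List, Set, Tuple

Commit = Tuple[str, str]  # (diff text, commit message)


def tokens(text: str) -> Set[str]:
    """Lower-cased identifier-like tokens of a diff."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0


def retrieve_exemplars(diff: str, history: List[Commit], k: int = 2) -> List[Commit]:
    """Return the k past commits whose diffs are most similar to the new diff."""
    query = tokens(diff)
    ranked = sorted(history, key=lambda c: jaccard(query, tokens(c[0])), reverse=True)
    return ranked[:k]


def build_prompt(diff: str, exemplars: List[Commit]) -> str:
    """Assemble a few-shot prompt: exemplars first, the new diff last."""
    parts = ["Write a commit message for the final diff. Similar past commits:"]
    for past_diff, message in exemplars:
        parts.append(f"Diff:\n{past_diff}\nMessage: {message}")
    parts.append(f"Diff:\n{diff}\nMessage:")
    return "\n\n".join(parts)


# Hypothetical commit history.
history = [
    ("- return cache.get(key)\n+ return cache.get(key, default)",
     "Fix cache lookup to honor default value"),
    ("- timeout = 30\n+ timeout = 60", "Increase request timeout to 60 seconds"),
    ("+ import logging\n+ logger = logging.getLogger(__name__)", "Add module-level logger"),
]
new_diff = "- value = cache.get(name)\n+ value = cache.get(name, fallback)"
exemplars = retrieve_exemplars(new_diff, history)
prompt = build_prompt(new_diff, exemplars)
print(prompt)
```

A production retriever would more likely use embeddings or BM25; Jaccard is used here only because it needs no dependencies.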

2606.11537 2026-06-19 cs.AI cs.CE New submission 85%

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

Affiliations: University of Innsbruck; University of British Columbia; Toronto Metropolitan University

Topic match (code generation): the system generates executable Python programs to answer tabular questions.

AI summary: Proposes MoCA-Agent, which addresses numerical reasoning errors in financial table question answering through claim-level verification and code generation, achieving strong performance on ten benchmarks.

Abstract

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.
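To make the "clearing" step concrete, the Python sketch below turns buy and sell orders on atomic claims into accept/reject decisions. The clearing rule is an assumption made for illustration (net signed confidence compared against a threshold); the abstract does not specify the actual rule, and the `Order` type, claim identifiers, and confidences are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Order:
    claim_id: str
    side: str          # "buy" = trader backs the claim, "sell" = trader disputes it
    confidence: float  # assumed to lie in [0, 1]


def clear_market(orders: List[Order], threshold: float = 0.0) -> Dict[str, bool]:
    """Accept a claim when its net signed confidence exceeds the threshold.

    Assumed rule for illustration only, not the paper's clearing mechanism.
    """
    net: Dict[str, float] = {}
    for order in orders:
        if order.side not in ("buy", "sell"):
            raise ValueError(f"unknown side: {order.side}")
        signed = order.confidence if order.side == "buy" else -order.confidence
        net[order.claim_id] = net.get(order.claim_id, 0.0) + signed
    return {claim_id: score > threshold for claim_id, score in net.items()}


# Hypothetical orders from three trader agents on two atomic claims.
orders = [
    Order("revenue_2023_is_120", "buy", 0.9),
    Order("revenue_2023_is_120", "buy", 0.7),
    Order("revenue_2023_is_120", "sell", 0.4),
    Order("growth_is_percent", "buy", 0.3),
    Order("growth_is_percent", "sell", 0.8),
]
decisions = clear_market(orders)
print(decisions)
```

Only accepted claims would then feed program synthesis, so one confidently disputed cell reading is filtered out before it reaches the generated code.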

2602.00510 2026-06-19 cs.AI cs.LG cs.SE Updated version 85%

PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification

Huanghaohe Zou, Peng Han, Emad Nazerian, Mafu Zhang, Zhicheng Guo, Alex Q. Huang

Affiliations: Semiconductor Power Electronics Center (SPEC); The University of Texas at Austin; Arizona State University

Topic match (code generation): LLM code synthesis for PCB schematics.

AI summary: Proposes the PCBSchemaGen framework, which uses a structured verifier to guide a frozen LLM toward repairable PCB schematics, achieving high accuracy in a domain without unit tests.

Abstract

Most LLM code-synthesis benchmarks rely on unit tests as the reward oracle, but PCB schematic design has none: correctness is defined by structured physical constraints over real IC packages and pin-level assignments, per-task golden references are unavailable, and SPICE simulation does not validate schematic-level correctness. We introduce PCBSchemaGen, a training-free inference-time framework that turns a frozen LLM into a verifiable, repairable PCB schematic generator. The framework induces a domain schema from IC datasheets to ground LLM decoding, pairs it with a deterministic 5-layer continuous-reward verifier with pin-level error localization, and refines candidates through a Thompson Sampling arm-acquiring bandit. We evaluate on 2 PCB benchmarks covering 227 real-IC tasks across 22 unified circuit domains, including a public-schematic-derived suite that serves as a fully held-out generalization test (verifier, KG library, and prompts frozen before any evaluation). Under our framework, an open-weight 31B model (Gemma-4-31B) passes 81.3% of PCBBench tasks on average, and the same framework transfers across both benchmarks with zero verifier code changes; a Circuitron-style inference-time prompting baseline on the same Gemma-4-31B backbone collapses on hard system-level designs. This suggests inference-time refinement under a deterministic structural verifier is a general recipe for reference-free LLM code synthesis in domains without unit-test oracles. Our benchmarks and deterministic verifier are publicly available at https://github.com/HZou9/PCBSchemaGen_v2.
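For readers unfamiliar with Thompson Sampling, here is a minimal Python sketch over a fixed set of arms with rewards in [0, 1]. It is a simplification, not the paper's algorithm: the arm-acquiring aspect (adding new candidates over time) is not modeled, the fractional Beta update for continuous rewards is an assumed design choice, and the verifier scores are invented.

```python
import random
from typing import Callable, List


def thompson_sampling(
    reward_fn: Callable[[int], float], n_arms: int, rounds: int, seed: int = 0
) -> List[int]:
    """Beta-posterior Thompson Sampling; returns how often each arm was pulled."""
    rng = random.Random(seed)
    alpha = [1.0] * n_arms  # Beta(1, 1) prior: pseudo-count of "success" mass
    beta = [1.0] * n_arms   # pseudo-count of "failure" mass
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Sample a plausible mean reward per arm and play the best sample.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = min(1.0, max(0.0, reward_fn(arm)))
        # Fractional update for a continuous reward in [0, 1] (an assumption).
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        pulls[arm] += 1
    return pulls


# Hypothetical verifier scores for three refinement strategies ("arms").
verifier_scores = [0.2, 0.5, 0.9]
pulls = thompson_sampling(lambda arm: verifier_scores[arm], n_arms=3, rounds=500)
print(pulls)
```

Sampling from the posterior, rather than always picking the current best mean, is what lets the search keep trying under-explored strategies while spending most of the budget on the one the verifier rewards.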

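2606.20173 2026-06-19 cs.SE New submission 80%

Qiskit Code Migration with LLMs

Jose Manuel Suarez, Luis Mariano Bibbo, Joaquin Bogado, Alenandro Fernandez

Topic match (code generation): LLM plus RAG for automated Qiskit code migration.

AI summary: To address code maintenance problems caused by version evolution in quantum software development kits, proposes a hybrid approach combining LLMs with retrieval-augmented generation (RAG). An automatically generated taxonomy of migration scenarios guides the models in migrating Qiskit code across versions, reducing hallucinations and improving the quality of migration suggestions.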

Abstract

The rapid evolution of Quantum Development Kits (QDKs) introduces a specific form of technical debt that compromises code maintainability and hinders software reuse. In the specialized domain of Quantum Software Engineering (QSE), this challenge is intensified by the scarcity of high-quality training data and the high volatility of emerging frameworks, which often lead general-purpose Large Language Models (LLMs) to produce unreliable or hallucinated results. This paper proposes a hybrid approach integrating LLMs with Retrieval-Augmented Generation (RAG) to automate the migration of Qiskit code across versions. The proposed methodology enhances the precision and reliability of migration suggestions by leveraging an automatically generated taxonomy of migration scenarios as the structured, version-specific knowledge source to guide the models. The approach is implemented through an automated, extensible workflow evaluating LLMs (Google Gemini Flash-2.5 and OpenAI Gpt-oss-20b) under different retrieval schemes (unconstrained and restrictive). Results demonstrate that the taxonomy-based RAG architecture, particularly under the restrictive scheme, significantly reduces hallucinations and improves descriptive quality, with Google Gemini Flash-2.5 showing superior performance in detecting complex refactoring scenarios. These findings confirm the potential of this data-centric methodology to foster technological independence and provide robust, intelligent assistants that mitigate API obsolescence, ensuring the long-term availability of quantum algorithms within a rapidly shifting ecosystem and flattening the learning curve within Quantum Software Engineering (QSE).

2606.19474 2026-06-19 cs.CR cs.AI cs.SE New submission 80%

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

Affiliations: University of Moratuwa; University of Ruhuna; RMIT University

Topic match (code generation): studies secure coding drift in LLM-assisted post-quantum cryptography development.

AI summary: Proposes a model of secure coding drift in LLM-assisted PQC development and a gamified framework that turns LLMs into active security collaborators, mitigating the security degradation caused by long-term reliance on LLMs.

Comments: Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

Abstract

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.
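The constant-time requirement mentioned here is easy to violate with code that looks correct. The Python sketch below contrasts an early-exit byte comparison, whose running time depends on where the inputs first differ, with the standard library's `hmac.compare_digest`, which is designed to avoid that leak. The example is not from the paper; the function names and the tag values are illustrative, and real PQC implementations are typically written in lower-level languages where timing is easier to control.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: returns as soon as a byte differs (timing leak)."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # the position of the first mismatch affects runtime
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison whose timing does not depend on where the inputs differ."""
    return hmac.compare_digest(a, b)


# Hypothetical authentication tags.
expected_tag = bytes.fromhex("a1b2c3d4e5f60718")
forged_tag = bytes.fromhex("a1b2c3d4e5f60719")

print(leaky_equal(expected_tag, forged_tag), constant_time_equal(expected_tag, forged_tag))
```

Both functions return the same answers, which is why this class of defect survives functional testing and tends to need review or dedicated tooling to catch.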

2606.01338 2026-06-19 cs.CL Updated version 80%

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta, Ambika Baniya Bhandari

Affiliations: Department of Computer Science, University of the Cumberlands; Department of Computer Science, DePaul University; Youngstown State University

Topic match (code generation): evaluates NL2SQL performance of local LLMs in biopharmaceutical manufacturing.

AI summary: Evaluates four locally deployed open-source LLMs on natural-language-to-SQL generation over a biopharmaceutical manufacturing database, finding that code-tuned general-purpose models outperform a domain-specific model but that current performance still requires human oversight.

Abstract

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.
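One of the listed metrics, SQL extraction rate, can be illustrated with a short Python sketch. The definition used here (share of responses containing a `SELECT ... ;` statement) is an assumption, since the abstract does not define the metric. The responses, the `batches` table, and its columns are hypothetical, and SQLite stands in for the Microsoft SQL Server database used in the study.

```python
import re
import sqlite3
from typing import List, Optional

# Assumed extraction rule: the first SELECT statement ending in a semicolon.
SELECT_STMT = re.compile(r"\bSELECT\b.*?;", re.DOTALL | re.IGNORECASE)


def extract_sql(response: str) -> Optional[str]:
    """Pull the first SELECT statement out of a free-text model response."""
    match = SELECT_STMT.search(response)
    return match.group(0).strip() if match else None


def extraction_rate(responses: List[str]) -> float:
    """Fraction of responses from which a SQL statement could be extracted."""
    if not responses:
        return 0.0
    return sum(extract_sql(r) is not None for r in responses) / len(responses)


# Hypothetical model responses.
responses = [
    "Here is the query:\nSELECT COUNT(*) FROM batches WHERE status = 'released';",
    "I cannot answer that question.",
    "SELECT batch_id FROM batches WHERE yield_pct < 90;",
]
rate = extraction_rate(responses)

# Execute the extracted queries against a tiny in-memory stand-in database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batches (batch_id TEXT, status TEXT, yield_pct REAL)")
conn.executemany(
    "INSERT INTO batches VALUES (?, ?, ?)",
    [("B001", "released", 95.2), ("B002", "rejected", 81.0), ("B003", "released", 88.4)],
)
released_count = conn.execute(extract_sql(responses[0])).fetchone()[0]
low_yield = [row[0] for row in conn.execute(extract_sql(responses[2]))]
conn.close()
print(rate, released_count, low_yield)
```

Extraction only shows that a query was produced; whether it runs and returns the right rows is a separate check, which is why the study reports compliance and factual consistency alongside it.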

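2606.19644 2026-06-19 cs.SE New submission 75%

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development

Richard Sserunjogi, Daniel Ogenrwot, John Businge

Topic match (code generation): studies how prompt quality affects LLM-assisted code generation and PR outcomes.

AI summary: Analyzes 265 developer-ChatGPT interactions to study how prompt structure (context, specificity, verification) affects code generation, adoption, and integration depth in LLM-assisted development, finding that each dimension matters at a different stage.

Comments: 48 pages, 2 figures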

Abstract

Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.

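2606.20072 2026-06-19 cs.CL New submission 70%

Source-Grounded Data Generation for Text-to-JSON Learning

Sunghee Ahn, Guijin Son, Youngjae Yu

Affiliations: Seoul National University

Topic match (code generation): text-to-JSON data generation.

AI summary: Proposes STAGE, which uses spreadsheets as source data, generates reports and JSON schemas with LLMs, and validates ground-truth values, substantially improving training data quality for text-to-JSON tasks.

Comments: Preprint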

Abstract

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
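The two reported metrics can be sketched in Python under assumed definitions: exact match means the parsed prediction equals the gold JSON, and value accuracy is the fraction of gold leaf values reproduced at the same path. The paper may define them differently, and the helper names and the sample record below are hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def flatten(obj: Any, prefix: Tuple = ()) -> Dict[Tuple, Any]:
    """Map every leaf value in nested dicts/lists to its path."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    leaves: Dict[Tuple, Any] = {}
    for key, value in items:
        leaves.update(flatten(value, prefix + (key,)))
    return leaves


def score(pred_text: str, gold: Any) -> Tuple[bool, float]:
    """Return (exact match, value accuracy) under the assumed definitions."""
    try:
        pred = json.loads(pred_text)
    except json.JSONDecodeError:
        return False, 0.0  # unparseable output scores zero on both metrics
    gold_leaves, pred_leaves = flatten(gold), flatten(pred)
    if not gold_leaves:
        return pred == gold, float(pred == gold)
    correct = sum(
        1 for path, value in gold_leaves.items()
        if path in pred_leaves and pred_leaves[path] == value
    )
    return pred == gold, correct / len(gold_leaves)


# Hypothetical gold record and a model output with one wrong figure.
gold = {"company": "Acme", "revenue": {"2023": 120.5, "2022": 98.0}}
pred_text = '{"company": "Acme", "revenue": {"2023": 120.5, "2022": 89.0}}'
exact, value_acc = score(pred_text, gold)
print(exact, round(value_acc, 3))
```

The example shows why the two numbers diverge: a single transposed digit fails exact match outright while still earning partial credit on value accuracy.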

2606.19419 2026-06-19 cs.RO cs.AI 新提交 65%

Playful Agentic Robot Learning

趣味性具身机器人学习

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

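Affiliations: University of California, Berkeley; Impossible Research

Topic match (code generation): a robot coding agent generates executable code policies.

AI summary: Proposes RATs, a framework in which robots acquire reusable skills through self-directed play before downstream tasks arrive, with gains of 20.6 and 17.0 percentage points over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively.

Comments: Project page: https://playful-rats.github.io/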

AI中文摘要

当前的智能体机器人系统可以编写可执行的代码即策略（Code-as-Policy）程序、观察反馈并在多次尝试中修正行为，但它们仍然主要是任务驱动的：可复用技能只有在收到明确指令后才能获得。我们研究趣味性智能体机器人学习（Playful Agentic Robot Learning），其中具身编码智能体在下游任务到来之前，将自主游玩作为持续技能学习阶段。我们引入RATs（Robotics Agent Teams），即专为游玩阶段技能获取设计的机器人智能体团队。在游玩阶段，RATs提出新颖且可学习的探索性任务，规划并执行机器人代码策略，验证中间进展，诊断失败，借助密集的步骤级反馈进行重试，并将成功的执行提炼到持久的代码技能库中。在测试时，智能体从该冻结的技能库中复用相关技能，以帮助解决新任务。在LIBERO-PRO和MolmoSpaces上的实验表明，与无游玩和随机游玩基线相比，通过游玩学到的技能提升了保留下游任务的表现，相对于CaP-Agent0分别提升20.6和17.0个百分点。此外，只需将学到的技能检索到上下文中，即可将其接入其他推理时代码即策略智能体，无需微调底层模型，就能在RoboSuite和真实世界迁移上分别提升8.9和8.8个百分点。

英文摘要

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.
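The play-then-reuse loop described above (distill successful executions into a persistent library, freeze it, then retrieve relevant skills into the agent's context at test time) can be sketched roughly as follows. This is an illustration only: the `Skill` and `SkillLibrary` classes, the keyword-overlap retrieval, and the example skills are all hypothetical stand-ins, since the abstract does not say how RATs stores or retrieves skills.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Skill:
    name: str
    description: str
    code: str  # source of a robot-code policy that succeeded during play


@dataclass
class SkillLibrary:
    skills: List[Skill] = field(default_factory=list)
    frozen: bool = False

    def add(self, skill: Skill) -> None:
        """Store a skill distilled from a successful play-time execution."""
        if self.frozen:
            raise RuntimeError("library is frozen; skills are read-only at test time")
        self.skills.append(skill)

    def freeze(self) -> None:
        """Mark the end of play; the library is no longer modified."""
        self.frozen = True

    def retrieve(self, task: str, k: int = 2) -> List[Skill]:
        """Return up to k skills ranked by word overlap with the task text.

        Keyword overlap is a placeholder for whatever retrieval the real
        system uses.
        """
        task_words = set(task.lower().split())
        scored = [
            (len(task_words & set(s.description.lower().split())), s)
            for s in self.skills
        ]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [skill for _, skill in scored[:k]]


def build_context(task: str, library: SkillLibrary) -> str:
    """Place retrieved skill code ahead of the task in the agent's prompt."""
    blocks = [f"# skill: {s.name}\n{s.code}" for s in library.retrieve(task)]
    return "\n\n".join(blocks + [f"# task: {task}"])


library = SkillLibrary()
library.add(Skill("open_drawer", "open the drawer by pulling its handle",
                  "def open_drawer(robot): ..."))
library.add(Skill("pick_place", "pick up an object and place it in a target location",
                  "def pick_place(robot, obj, target): ..."))
library.freeze()
print(build_context("place the mug in the drawer", library))
```

Because the skills enter through the prompt rather than through weights, a sketch like this is compatible with the abstract's claim that the library can be plugged into other Code-as-Policy agents without finetuning the underlying model.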

2606.05017 2026-06-19 cs.AR cs.MS updated version 60%

GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity

GoldenFloat: 从GF4到GF256的基于Phi的静态拆分浮点系列及其Lucas精确整数恒等式

Dmitrii Vasilev

Topic match (code generation): presents an RTL generator for the GoldenFloat floating-point family.

AI summary: Presents GoldenFloat, a static-split floating-point family generated by a single closed-form rule, together with three concrete artifacts: a multi-width RTL generator, a Lucas-exact accumulator path, and an FPGA codec.

Comments: 20 pages, single-file LaTeX, ASCII source. v2: peer-anchor updates. Adds Sarnoff P3109 (arXiv:2606.04028), AMD MXFP4 silicon (arXiv:2605.09825), NVIDIA GB10 NVFP4 measurement, companion catalog (arXiv:2606.09686), MixFP4 (arXiv:2605.31035). FL-002 expanded: (c1) GF256 bias, (c2) count drift, (g) static-split vs micro-mixing. TTSKY26a regeneration timeline added. No mathematical claims revised.

AI中文摘要

我们提出一种面向硬件的GoldenFloat（GF）描述，这是一个由单一闭式规则生成的静态拆分浮点系列，以及三个具体成果：（i）一个开放的多宽度RTL生成器，覆盖GF4-GF256，并带有针对正确舍入参考实现的持续集成差分扫描；（ii）一个基于整数的Lucas精确累加器路径，在n=1,...,256时以500位数字精度验证；（iii）一个GF16 FPGA编解码器，在Artix-7（Xilinx XC7A35T）上以323 MHz通过35/35的测试台。对于每个总宽度N>=4，指数宽度为e=round((N-1)/phi^2)，小数位宽为f=N-1-e，其中phi=(1+sqrt(5))/2。该规则复现了九种格式（9/9）的已实现指数宽度，并一致地扩展到GF128、GF512、GF1024。该规则与posit、takum、OCP-MX以及IEEE P3109多宽度浮点草案并列。我们不对其中任何一种提出每级精度或优越性声明。广度/工具链一致性框架被记录为一个开放猜想，并带有预注册的证伪路径。证伪台账（FL-002）记录了开放问题及用于解决它们的实验。报告了日期为2026-05-31的RTL正确性勘误；已制造的TTSKY26b芯片带有存在缺陷的乘法器组合，修正后的生成器是再生基线。

英文摘要

We present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). A format-conformance oracle (Corona) ships in the same repository and is used as the blackbox check in our continuous-integration audit. The rule and its scope. For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats GF4, GF8, GF12, GF16, GF20, GF24, GF32, GF64, GF256 (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit (2022 Posit Standard), takum (Hunhold 2024, 2025), OCP-MX (Rouhani et al. 2023), and the IEEE P3109 multi-width float draft, all of which are width-spanning families under a parameterised rule. We make no per-rung accuracy or superiority claim against any of them. What is open. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path: a matched-substrate FPGA experiment and a matched-budget software ablation. A falsification ledger (FL-002) records the open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported in Section 5.5; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline.
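The split rule stated in the abstract, e = round((N-1)/phi^2) with f = N-1-e, is simple enough to evaluate directly. The sketch below applies it to the nine listed widths; the function name `golden_split` is made up for illustration, and the reading that the one remaining bit is a sign bit is an assumption, since the abstract only gives the N-1 term.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, as defined in the abstract


def golden_split(total_width: int) -> tuple:
    """Return (exponent_bits, fraction_bits) for a total width N >= 4.

    Implements e = round((N - 1) / phi^2) and f = N - 1 - e. The bit left
    over is presumably the sign bit (an assumption, not stated in the abstract).
    """
    if total_width < 4:
        raise ValueError("the rule is stated for total widths N >= 4")
    # Python's round() resolves exact .5 ties to the nearest even integer,
    # but (N - 1) / phi^2 is irrational, so no mathematical tie can occur.
    exponent_bits = round((total_width - 1) / PHI**2)
    fraction_bits = total_width - 1 - exponent_bits
    return exponent_bits, fraction_bits


# Widths named in the abstract as the nine realised formats.
for n in (4, 8, 12, 16, 20, 24, 32, 64, 256):
    e, f = golden_split(n)
    print(f"GF{n}: exponent={e} fraction={f}")
```

Since phi^2 = phi + 1, the rule allots roughly 38% of the N-1 non-sign bits to the exponent at every width. The printed values are only what the formula yields; whether they match the shipped formats is the paper's 9/9 claim and is not checked here.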
