代码大模型 / AI 编程 - arXivDaily 专题

2606.06133 2026-06-18 cs.SE cs.AI cs.LG cs.LO 版本更新 90%

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

TLA-Prover: 通过偏好优化低秩适配实现可验证的 TLA+ 规范合成

Eric Spencer, Arslan Bisharat, Brian Ortiz, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

发表机构 * Department of Computer Science, Loyola University Chicago（洛约拉芝加哥大学计算机科学系）

专题命中代码生成：TLA+形式化规范合成，偏好优化提升通过率

AI总结提出 TLA-Prover 模型，结合监督微调和基于修复的组相对策略优化，在 TLC 模型检查器上实现 TLA+ 规范合成，Gold/Diamond 级别通过率达 30%，约为未调优基线的 3.5 倍。

Comments 12 pages, 5 tables, 3 figures. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026)

详情

AI中文摘要

TLA+ 是一种用于验证分布式系统和安全关键协议的正式规范语言。大型语言模型（LLM）生成的 TLA+ 规范常常因语义原因无法通过 TLC 模型检查器。在 25 个 LLM 中，最佳公开基线的语法解析成功率为 26.6%，语义模型检查通过率为 8.6%。我们提出了 TLA-Prover，一个 200 亿参数的 TLA+ 规范合成模型。训练结合了在已验证示例上的监督微调（SFT）和基于修复的组相对策略优化（GRPO）。在 GRPO 阶段，模型学习修复自身被拒绝的规范。我们还从相同的 SFT 检查点训练了一个直接偏好优化（DPO）变体作为消融实验。TLC 直接提供奖励信号，无需学习奖励模型。每个输出分为四个等级：青铜（解析通过）、银（无警告）、金（通过 TLC）和钻石。要达到钻石级，模型的正确性属性会被自动微小修改；TLC 必须检测到违反。如果 TLC 仍然通过，则该属性始终为真且无贡献；输出无法达到钻石级。在一个保留的 30 问题基准上，TLA-Prover 在金级和钻石级均达到 9/30（即 pass@1 = 30%）。这大约是未调优基线 8.6% 的 3.5 倍。DPO 变体在钻石级达到 20%。金级和钻石级在每个检查点都一致；这防止了平凡属性失败模式。

英文摘要

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.

URL PDF HTML ☆

赞 0 踩 0

2511.00802 2026-06-18 cs.SE cs.CL cs.LG 版本更新 85%

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

GrowthHacker: 使用代码修改型LLM代理的自动离线策略评估优化

Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

发表机构 * Michigan Technological University, Houghton（密歇根技术大学）； Birmingham City University（伯明翰城市大学）； University of British Columbia, Kelowna（不列颠哥伦比亚大学, 肯洛纳）

专题命中代码生成：利用LLM代理自动修改代码优化离线策略评估。

AI总结提出GrowthHacker基准，利用LLM代理自动迭代修改代码以优化离线策略评估（OPE）实现，在Open Bandit Pipeline和Scope-RL上评估多种框架，证明基于LLM的代理可作为自动增长黑客持续改进OPE系统。

Comments Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026

详情

DOI: 10.1145/3815588

AI中文摘要

随着数据驱动开发的广泛采用，在线A/B测试已成为衡量新技术效果的既定方法。然而，部署在线实验需要设计、实现和部署资源，并可能对用户产生负面影响（例如，不安全或不道德的结果），同时需要数周的数据收集。为了解决这一问题，离线策略评估（OPE）或离线A/B测试这一日益增长的研究领域，使用先前收集的日志数据离线评估新技术。OPE也是强化学习中的一个基本问题，在在线测试昂贵或风险高的领域（如医疗保健、推荐系统、教育和机器人技术）中非常重要。尽管代码生成大语言模型（LLM）和代理工作流取得了进展，但关于LLM和基于LLM的代理是否以及如何自动优化OPE实现，我们知之甚少。我们提出了GrowthHacker，这是一个基准测试，用于在大规模公共数据集上评估基线LLM和基于LLM的代理。GrowthHacker自主迭代修改代码，运行OPE，并使用指标指导后续优化。我们在Open Bandit Pipeline（OBP）和Scope-RL上评估方法，并开发了一个双代理框架，该框架解决了现有框架的局限性，同时降低了复杂性。在两个库中，双代理显示出最高的可靠性（98.1%-100%成功率）和正向结果率（78%），正向结果的中位改进为4.4%；CrewAI实现了最高的平均改进（37.9%），并且是唯一没有极端值失败的框架。AutoGen和Default各达到65%的正向结果率。这些结果证明了使用基于LLM的代理作为自动“增长黑客”持续改进OPE系统的可行性，对在手动优化成本高昂的情况下扩展数据驱动决策具有重要意义。

英文摘要

With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment, and may negatively impact users (e.g., unsafe or unethical outcomes) while requiring weeks of data collection. To address this, the growing research area of off-policy evaluation (OPE), or offline A/B testing, assesses new technologies offline using previously collected logged data. OPE is also a fundamental problem in reinforcement learning and is important where online testing is expensive or risky, such as healthcare, recommender systems, education, and robotics. Despite advances in code-generation large language models (LLMs) and agentic workflows, little is known about whether and how LLMs and LLM-based agents can automatically optimize OPE implementations. We propose GrowthHacker, a benchmark that evaluates baseline LLMs and LLM-based agents on large-scale public datasets. GrowthHacker autonomously and iteratively modifies code, runs OPE, and uses the metrics to guide subsequent optimization. We evaluate methods on Open Bandit Pipeline (OBP) and Scope-RL, and develop a two_agent framework that addresses limitations of existing frameworks while reducing complexity. Across both libraries, two_agent shows the highest reliability (98.1%-100% success rate) and positive-outcome rate (78%), with a median improvement of 4.4% among positive outcomes; CrewAI achieves the highest average improvement (37.9%) and is the only framework with zero extreme-value failures. AutoGen and Default each reach 65% positive-outcome rates. These results establish the feasibility of using LLM-based agents as automated "growth hackers" to continuously improve OPE systems, with implications for scaling data-driven decision-making where manual optimization is expensive.

URL PDF HTML ☆

赞 0 踩 0