代码大模型 / AI 编程

2606.01338 2026-06-19 cs.CL 版本更新 80%

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

在生物制药制造中本地LLM的自然语言到SQL查询基准测试：消费级硬件上的实证基准

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta, Ambika Baniya Bhandari

发表机构 * Department of Computer Science, University of the Cumberlands（大学的计算机科学系）； Department of Computer Science, DePaul University（德保罗大学计算机科学系）； Youngstown State University（亚当斯州立大学）

专题命中代码生成：评估本地LLM在生物制药制造中的NL2SQL性能。

AI总结本研究评估了四种本地部署的开源大语言模型在生物制药制造数据库上的自然语言到SQL生成性能，发现代码调优的通用模型优于领域特定模型，但当前性能仍需人工监督。

详情

AI中文摘要

生物制药制造组织在FDA指南、欧盟良好生产规范（GMP）和欧盟AI法案等监管框架下运营，这些框架可能限制基于云的人工智能系统的使用。本地部署的大语言模型（LLM）提供了一种保护隐私的替代方案，但它们在制药制造任务中的适用性仍未得到充分探索。本研究评估了四种通过Ollama本地部署的开源LLM（Qwen 2.5 Coder 7B、Llama 3.1 8B、Mistral 7B和Meditron 7B）在制药制造数据库上的自然语言到SQL生成能力。开发了一个基于FastAPI的评估平台PharmaBatchDB AI，使用一个包含约63,000条记录的合成Microsoft SQL Server数据库，涵盖批次、制造执行系统（MES）和在线清洗（CIP）模块。模型在60个领域特定的自然语言问题上进行了基准测试，使用的指标包括SQL提取率、SQL合规性、事实一致性、ROUGE-L、幻觉率、吞吐量和延迟。Qwen 2.5 Coder 7B、Llama 3.1 8B和Mistral 7B为所有评估任务生成了SQL，而Meditron 7B由于上下文窗口限制和SQL生成能力差，几乎在所有任务上失败。Llama 3.1 8B实现了最高的SQL合规性，而Qwen 2.5 Coder 7B在整体文本相似性和事实一致性方面最强。两个领先模型之间的性能差异在统计上不显著。结果表明，代码调优的通用LLM在制药制造数据的结构化查询生成上优于领域特定的生物医学模型。尽管完全本地化、符合GxP的NLQ系统在消费级硬件上是可行的，但当前性能水平在监管使用中仍需人工监督和下游验证。

英文摘要

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

URL PDF HTML ☆

赞 0 踩 0

2606.19644 2026-06-19 cs.SE 新提交 75%

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development

提示质量与拉取请求结果：基于阶段的LLM辅助开发实证研究

Richard Sserunjogi, Daniel Ogenrwot, John Businge

专题命中代码生成：研究提示质量对LLM辅助代码生成和PR结果的影响。

AI总结通过分析265个开发者与ChatGPT的交互，研究提示结构（上下文、具体性、验证）对LLM辅助开发中代码生成、采纳和集成深度的影响，发现不同维度在不同阶段有不同作用。

Comments 48 pages, 2 figures

详情

AI中文摘要

大型语言模型（LLM）驱动的工具（如ChatGPT）越来越多地用于协作软件工程工作流，但提示结构如何影响下游拉取请求（PR）结果尚不清楚。先前的研究主要考察对话帮助性、生产力或粗粒度的采用指标，对提示结构在协作集成行为中的作用理解不足。我们分析了来自开源拉取请求中自我承认的ChatGPT使用的265个手动验证的开发者-ChatGPT交互。基于先前关于开发者面向工件和提示工程的研究，我们使用三个维度操作化提示结构：上下文、具体性和验证。我们首先评估LLM辅助注释是否能可靠地再现人类对提示结构的判断，发现在不同维度和工作流上下文中存在显著差异。具体性与人类判断的一致性最稳定；上下文被LLM系统性地低估；验证仍然难以一致评估，这促使采用人类-LLM混合注释策略。使用这个经过验证的框架，我们然后检查提示结构如何影响AI辅助PR工作流中的可操作代码生成、代码采纳和集成深度。具体性和上下文与可操作代码生成关联最强；验证成为代码采纳的主要预测因子；集成深度与上下文关联最强。总体而言，我们的发现表明，提示特征在AI辅助软件工程工作流中表现出不同的、阶段依赖的影响，通过上下文基础、任务具体性和可评估性线索影响下游采纳和集成。

英文摘要

Large language model (LLM)-powered tools such as ChatGPT are increasingly used in collaborative software engineering workflows, yet little is known about how prompt structure influences downstream pull request (PR) outcomes. Prior studies primarily examine conversational helpfulness, productivity, or coarse-grained adoption metrics, leaving the role of prompt structure in collaborative integration behavior insufficiently understood. We analyze 265 manually validated developer-ChatGPT interactions derived from self-admitted ChatGPT usage in open-source pull requests. Building on prior research on developer-facing artifacts and prompt engineering, we operationalize prompt structure using three dimensions: Context, Specificity, and Verification. We first evaluate whether LLM-assisted annotation can reliably reproduce human judgments of prompt structure, finding substantial variation across dimensions and workflow contexts. Specificity shows the most stable agreement with human judgments; Context is systematically under-scored by the LLM; and Verification remains difficult to assess consistently, motivating a hybrid human-LLM annotation strategy. Using this validated framework, we then examine how prompt structure influences actionable code generation, code adoption, and integration depth across AI-assisted PR workflows. Specificity and Context are most strongly associated with actionable code generation; Verification emerges as the primary predictor of code adoption; and integration depth is most strongly associated with Context. Overall, our findings show that prompt characteristics exert distinct, stage-dependent effects across AI-assisted software engineering workflows, influencing downstream adoption and integration through contextual grounding, task specificity, and evaluability cues.

URL PDF HTML ☆

赞 0 踩 0

2606.20072 2026-06-19 cs.CL 新提交 70%

Source-Grounded Data Generation for Text-to-JSON Learning

基于源数据的文本到JSON学习数据生成

Sunghee Ahn, Guijin Son, Youngjae Yu

发表机构 * Seoul National University（首尔大学）

专题命中代码生成：文本到JSON数据生成

AI总结提出STAGE方法，利用电子表格作为源数据，通过LLM生成报告和JSON模式，并验证真实值，显著提升文本到JSON任务的训练数据质量。

Comments Preprint

详情

AI中文摘要

从财务文件到临床记录，传统行业严重依赖冗长、非结构化的文档来存储高价值信息。将这些信息可靠地提取为结构化的、机器可读的表示形式，是使自动化系统能够访问这些内容的关键前提。JSON是这种结构化提取的自然目标，然而构建可靠且可扩展的文本到JSON训练数据仍然具有挑战性。为了解决这一差距，我们提出了STAGE（电子表格基础的文本到JSON工件生成），一种基于源数据的数据生成管道，通过使用LLM进行可扩展合成，同时根据底层电子表格验证真实值，来构建报告和JSON模式。在STAGE-Eval（我们的基于源数据的基准测试，包含851个示例的测试集）上的评估表明，STAGE生成的训练数据优于现有方法。这使Qwen3-4B的精确匹配从31.37%提高到74.27%，值准确率从45.46%提高到90.69%。

英文摘要

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.

URL PDF HTML ☆

赞 0 踩 0

2606.19419 2026-06-19 cs.RO cs.AI 新提交 65%

Playful Agentic Robot Learning

趣味性具身机器人学习

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Impossible Research

专题命中代码生成：机器人编码智能体生成可执行代码策略。

AI总结提出RATs框架，让机器人通过自主探索学习可复用技能，在LIBERO-PRO和MolmoSpaces上分别提升20.6和17.0个百分点。

Comments Project page: https://playful-rats.github.io/

详情

AI中文摘要

当前的具身机器人系统可以编写可执行的代码即策略程序、观察反馈并在多次尝试中修正行为，但它们仍然主要是任务驱动的：可复用技能仅在明确指令后获得。我们研究趣味性具身机器人学习，其中具身编码代理在下游任务到来之前，将自主导向的趣味性作为持续技能学习阶段。我们引入RATs，即专为趣味性技能获取设计的机器人代理团队。在趣味性阶段，RATs提出新颖且可学习的探索性任务，规划并执行机器人代码策略，验证中间进展，诊断失败，通过密集的步骤级反馈进行重试，并将成功执行提炼到持久代码技能库中。在测试时，代理从该冻结库中重用相关技能以帮助解决新任务。在LIBERO-PRO和MolmoSpaces上的实验表明，与无趣味性和随机趣味性基线相比，趣味性学习技能在保留的下游任务上分别提升了20.6和17.0个百分点（相对于CaP-Agent0）。此外，学习到的技能可以通过简单地检索到上下文中插入到其他推理时代码即策略代理中，无需微调基础模型，即可在RoboSuite和真实世界迁移中分别提升8.9和8.8个百分点。

英文摘要

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

URL PDF HTML ☆

赞 0 踩 0

2606.05017 2026-06-19 cs.AR cs.MS 版本更新 60%

GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity

GoldenFloat: 从GF4到GF256的基于Phi的静态拆分浮点系列及其Lucas精确整数恒等式

Dmitrii Vasilev

专题命中代码生成：提出GoldenFloat浮点系列RTL生成器。

AI总结提出一种由单一闭式规则生成的静态拆分浮点系列GoldenFloat，并给出多宽度RTL生成器、Lucas精确累加器路径和FPGA编解码器三个具体实现。

Comments 20 pages, single-file LaTeX, ASCII source. v2: peer-anchor updates. Adds Sarnoff P3109 (arXiv:2606.04028), AMD MXFP4 silicon (arXiv:2605.09825), NVIDIA GB10 NVFP4 measurement, companion catalog (arXiv:2606.09686), MixFP4 (arXiv:2605.31035). FL-002 expanded: (c1) GF256 bias, (c2) count drift, (g) static-split vs micro-mixing. TTSKY26a regeneration timeline added. No mathematical claims revised

详情

AI中文摘要

我们提出一种面向硬件的GoldenFloat（GF）描述，这是一个由单一闭式规则生成的静态拆分浮点系列，以及三个具体成果：（i）一个开放的多宽度RTL生成器，覆盖GF4-GF256，并带有针对正确舍入参考的连续积分差分扫描；（ii）一个整数支持的Lucas精确累加器路径，在n=1,...,256时以500位精度验证；（iii）一个GF16 FPGA编解码器，在Artix-7（Xilinx XC7A35T）上以323 MHz通过35/35测试台。对于每个总宽度N>=4，指数宽度e=round((N-1)/phi^2)，其中小数部分f=N-1-e，phi=(1+sqrt(5))/2。该规则复现了九种格式（9/9）的已实现指数宽度，并一致扩展到GF128、GF512、GF1024。该规则与posit、takum、OCP-MX以及IEEE P3109多宽度浮点草案并列。我们不对其中任何一种提出每级精度或优越性声明。广度/工具链一致性框架被记录为一个开放猜想，并带有预注册的证伪路径。证伪分类账（FL-002）记录了开放问题及解决它们的实验。报告了日期为2026-05-31的RTL正确性勘误；制造的TTSKY26b芯片带有缺陷的乘法器组合，修正后的生成器是再生基线。

英文摘要

We present a hardware-oriented description of GoldenFloat (GF), a static-split floating-point family generated by a single closed rule, and three concrete artefacts: (i) an open multi-width RTL generator covering GF4-GF256 with a continuous-integration differential sweep against a correctly-rounded reference; (ii) an integer-backed Lucas-exact accumulator path verified at 500-digit precision for n = 1, ..., 256; and (iii) a GF16 FPGA codec passing a 35-of-35 testbench at 323 MHz on Artix-7 (Xilinx XC7A35T). A format-conformance oracle (Corona) ships in the same repository and is used as the blackbox check in our continuous-integration audit. The rule and its scope. For each total width N >= 4, the exponent width is e = round((N-1)/phi^2) with fraction f = N-1-e and phi = (1+sqrt(5))/2. The rule reproduces the realised exponent widths of nine formats GF4, GF8, GF12, GF16, GF20, GF24, GF32, GF64, GF256 (9/9) and extends consistently to GF128, GF512, GF1024. The rule is positioned alongside posit (2022 Posit Standard), takum (Hunhold 2024, 2025), OCP-MX (Rouhani et al. 2023), and the IEEE P3109 multi-width float draft, all of which are width-spanning families under a parameterised rule. We make no per-rung accuracy or superiority claim against any of them. What is open. The breadth/toolchain-coherence framing is recorded as an open conjecture with a pre-registered falsification path: a matched-substrate FPGA experiment and a matched-budget software ablation. A falsification ledger (FL-002) records the open questions and the experiments that would settle them. An RTL-correctness erratum dated 2026-05-31 is reported in Section 5.5; the fabricated TTSKY26b dies carry the defective multiplier portfolio, and the corrected generator is the regeneration baseline.

URL PDF HTML ☆

赞 0 踩 0

2512.00560 2026-06-19 cs.SE 版本更新 80%

SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models

SAGE: 基于语义的灰盒游戏回归测试与大型语言模型

Jinyu Cai, Jialong Li, Nianyu Li, Zhenyu Mao, Mingyue Zhang, Kenji Tei

专题命中软件智能体：利用LLM引导强化学习自动生成游戏测试套件。

AI总结提出SAGE框架，利用LLM引导强化学习自动生成测试套件，通过语义多目标优化精简测试，并基于更新日志语义分析优先排序，在Overcooked Plus和Minecraft中实现高效回归测试。

Comments This paper has been accepted by Automated Software Engineering journal

详情

AI中文摘要

现代实时服务游戏的快速迭代周期使得回归测试对于维持质量和稳定性不可或缺。然而，现有的回归测试方法面临关键限制，特别是在无法完全访问源代码的常见灰盒设置中：它们严重依赖手动构建测试用例，难以维护因冗余而日益庞大的测试套件，并且缺乏有效的机制来优先排序相关测试。这些挑战导致测试成本过高、自动化程度有限以及缺陷检测不足。为了解决这些问题，我们提出了SAGE，一个面向灰盒游戏环境的语义感知回归测试框架。SAGE系统地解决了测试生成、维护和选择的核心挑战。它采用LLM引导的强化学习进行高效、目标导向的探索，以自动生成多样化的基础测试套件。随后，它应用基于语义的多目标优化，通过平衡成本、覆盖率和稀有性，将该套件精炼为紧凑、高价值的子集。最后，它利用基于LLM的更新日志语义分析，优先排序与版本变更最相关的测试用例，从而实现跨迭代的高效适应。我们在两个代表性环境Overcooked Plus和Minecraft上评估了SAGE，并与自动化基线和人工记录的测试用例进行了比较。在所有环境中，SAGE以显著更低的执行成本实现了更优的缺陷检测，并展现出对版本更新的强大适应性。

英文摘要

The rapid iteration cycles of modern live-service games make regression testing indispensable for maintaining quality and stability. However, existing regression testing approaches face critical limitations, especially in common gray-box settings where full source code access is unavailable: they heavily rely on manual effort for test case construction, struggle to maintain growing suites plagued by redundancy, and lack efficient mechanisms for prioritizing relevant tests. These challenges result in excessive testing costs, limited automation, and insufficient bug detection. To address these issues, we propose SAGE, a semanticaware regression testing framework for gray-box game environments. SAGE systematically addresses the core challenges of test generation, maintenance, and selection. It employs LLM-guided reinforcement learning for efficient, goal-oriented exploration to automatically generate a diverse foundational test suite. Subsequently, it applies a semantic-based multi-objective optimization to refine this suite into a compact, high-value subset by balancing cost, coverage, and rarity. Finally, it leverages LLM-based semantic analysis of update logs to prioritize test cases most relevant to version changes, enabling efficient adaptation across iterations. We evaluate SAGE on two representative environments, Overcooked Plus and Minecraft, comparing against both automated baselines and human-recorded test cases. Across all environments, SAGE achieves superior bug detection with significantly lower execution cost, while demonstrating strong adaptability to version updates.

URL PDF HTML ☆

赞 0 踩 0

2606.20487 2026-06-19 cs.CL 新提交 70%

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

超越全局重规划：跨设备智能体系统的分层恢复

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University（上海交通大学人工智能学院）； Shanghai Innovation Institute（上海创新研究院）； Southeast University（东南大学）； Tsinghua University（清华大学）

专题命中软件智能体：涉及API-CLI-GUI执行和失败恢复

AI总结提出分层重规划框架H-RePlan，通过统一API-CLI-GUI执行和跨层失败抽象，区分设备本地策略恢复与全局重规划，在HeraBench基准上显著提升跨设备任务完成率和指令遵循度。

详情

AI中文摘要

现实世界中的计算机使用任务通常跨越多个应用程序和设备，要求智能体在动态运行时故障下协调异构环境。现有的多设备智能体系统支持任务分解和跨设备分配，但恢复仍然粗粒度：当执行失败时，它们通常重试相同策略、重新分配子任务或修改全局计划，而没有系统地建模设备本地策略空间。这限制了它们区分可在当前设备内修复的故障与需要跨设备重规划的故障的能力。我们提出\textbf{H-RePlan}，一个用于具有统一API-CLI-GUI执行的多设备智能体的分层重规划框架。H-RePlan为每个设备配备可互换的执行策略，并通过紧凑的跨层失败抽象将设备本地策略恢复与编排器级全局重规划分离。为了评估这一能力，我们引入\textbf{HeraBench}，一个故障注入基准，它在Linux和Android设备上构建跨设备工作流，并注入策略级和设备级故障。实验表明，H-RePlan显著优于单策略和粗粒度多设备基线，实现了更高的完成率、指令遵循率和完美通过率，同时降低了可靠端到端成功所需的令牌成本。这些结果表明，范围感知的分层恢复对于鲁棒的多设备智能体执行至关重要。

英文摘要

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose \textbf{H-RePlan}, a hierarchical replanning framework for multi-device agents with unified API--CLI--GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce \textbf{HeraBench}, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.

URL PDF HTML ☆

赞 0 踩 0

2606.20134 2026-06-19 cs.LO cs.PL 新提交 70%

An MSO Framework for Weak-Memory Verification and Robustness

弱内存验证与鲁棒性的MSO框架

Giovanna Kobus Conrado, Andreas Pavlogiannis

专题命中代码评测：弱内存验证与鲁棒性的MSO框架。

AI总结本文研究单子二阶逻辑作为弱内存元理论，证明顺序一致性执行有界树宽而TSO无界，展示多种模型可MSO公理化，并引入读自鲁棒性概念，实现统一验证算法。

Comments Accepted at CONCUR 2026

详情

AI中文摘要

内存模型是并发程序执行的形式化规范，解释了编译器和架构优化引入的弱行为。其数量和复杂性的增加促使人们通过在适当的元理论中公理化模型来统一验证整个模型类别。本文正式研究单子二阶逻辑（MSO）作为弱内存的元理论，通过证明各种流行弱内存模型的树宽和MSO可表达性结果，使得我们能够统一处理多个验证问题。总结如下：首先，我们证明顺序一致性（$\mathsf{SC}$）下的执行具有有界树宽，而总存储顺序（$\mathsf{TSO}$）下的执行则无界。其次，我们证明包括Release/Acquire和完整RC20在内的广泛模型是MSO可公理化的，而其他模型如Strong Release/Acquire和$\mathsf{TSO}$则不可，除非正交向量问题（在SETH下需要二次时间）可以在线性时间内解决。最后，我们引入读自鲁棒性概念，作为对近期粗粒度鲁棒性准则工作的扩展。我们证明树宽界限（上界和下界）对任何MSO可公理化模型$\mathsf{MM}$具有深远的算法意义：存在一个算法，对于每个程序$\mathsf{P}$，要么验证$\mathsf{P}$在$\mathsf{MM}$下的正确性，要么报告$\mathsf{P}$对$\mathsf{MM}$不是读自鲁棒的。总体而言，我们的结果为弱内存验证和鲁棒性建立了一个丰富且多功能的理论框架。

英文摘要

Memory models are formal specifications of concurrent-program executions, accounting for weak behaviors introduced by compiler and architectural optimizations. The increase of their number and complexity has spawned efforts for uniform verification across whole classes of models, by axiomatizing the models in an adequate metatheory that admits a uniform treatment. In this work, we formally study Monadic Second-Order logic (MSO) as a metatheory for weak memory, by proving results on the treewidth and MSO-expressibility of various popular weak-memory models, as this combination allows us to uniformly tackle several verification problems. In summary, our results are as follows. First, we prove that executions under Sequential Consistency ($\mathsf{SC}$) have bounded treewidth, while already those under Total Store Order ($\mathsf{TSO}$) do not. Second, we prove that a broad range of models, including Release/Acquire and the full RC20, are MSO-axiomatizable, while others, such as Strong Release/Acquire and $\mathsf{TSO}$, are not, unless the Orthogonal Vectors problem $\unicode{x2013}$ which requires quadratic time under SETH $\unicode{x2013}$ can be solved in linear time. Finally, we introduce the notion of reads-from robustness, as an extension to recent work on coarse robustness criteria. We show that our treewidth bounds (both upper and lower) have far-reaching algorithmic implications for any of our MSO-axiomatizable models $\mathsf{MM}$: there is an algorithm that, for every program $\mathsf{P}$, either verifies $\mathsf{P}$ under $\mathsf{MM}$ or reports that $\mathsf{P}$ is not reads-from robust against $\mathsf{MM}$. Overall, our results establish a rich and versatile theoretical framework for weak-memory verification and robustness.

URL PDF HTML ☆

赞 0 踩 0

2606.19654 2026-06-19 cs.CR cs.SE 新提交 70%

PUFFERDOS: Efficient and Effective Attack String Generation for Regular Expression Denial of Service Vulnerabilities

PUFFERDOS：针对正则表达式拒绝服务漏洞的高效攻击字符串生成

Shangzhi Xu, Ziqi Ding, Xiao Cheng, Yuekang Li, Nan Sun, Benjamin Turnbull, Shuangxiang Kan, Siqi Ma

专题命中代码评测：生成正则表达式拒绝服务攻击字符串，涉及程序分析

AI总结提出PUFFERDOS方法，通过定义三种脆弱模式并利用合成技术与组合符号执行，生成在现实长度预算内且经程序验证有效的ReDoS攻击字符串。

Comments Accepted by S&P'26

详情

AI中文摘要

ReDoS攻击构成了一类关键的资源耗尽漏洞。在此类攻击中，攻击者利用正则表达式引擎的病态最坏情况执行行为，诱导高度不对称的计算工作负载，最终耗尽系统资源并降低服务可用性。为了保护系统免受ReDoS攻击，研究人员提出了许多检测技术，这些技术通过生成攻击字符串来模拟攻击过程，以便在早期开发阶段主动利用ReDoS漏洞并促进修复。现有技术大致分为两类：搜索病态正则表达式结构的静态分析，以及合成候选攻击字符串的动态探索方法。然而，生成的攻击字符串通常不适用于实际利用，因为它们往往假设不切实际的输入长度预算，并且未在程序级别验证攻击的有效性和效率。因此，许多生成的字符串在应用于实际程序时无法触发易受攻击的正则表达式，进一步限制了其实用性。为了解决这些不足，我们引入了一种有效且高效的攻击字符串生成器PUFFERDOS，旨在合成在现实长度预算内可行且经程序级别验证的攻击输入，从而实现对实际程序中ReDoS漏洞的有效利用。具体来说，我们首先基于观察和形式化验证定义了三种脆弱模式。根据这些模式，PUFFERDOS采用合成技术生成攻击字符串，然后通过针对ReDoS的组合符号执行对字符串进行细化和验证，以确保现实世界中的可利用性。

英文摘要

ReDoS attacks constitute a critical class of resource-exhaustion vulnerabilities. In such attacks, adversaries exploit the pathological worst-case execution behavior of regular expression (regex) engines to induce highly asymmetric computational workloads, ultimately exhausting system resources and degrading service availability. To protect systems against ReDoS attacks, numerous detection techniques have been proposed that simulate the attack process by generating attack strings to proactively exploit ReDoS vulnerabilities at the early development stage and facilitate remediation. Existing techniques broadly fall into two classes: static analyses that search for pathological regex structures, and dynamic exploration methods that synthesize candidate attack strings. However, the generated attack strings are often impractical for real-world exploitation because they usually assume unrealistic input-length budgets and do not validate the effectiveness and efficiency of the attack at the program level. Therefore, many generated strings fail to trigger vulnerable regexes when applied to real-world programs, further limiting the practical utility. To address these shortcomings, we introduce an effective and efficient attack string generator, PUFFERDOS, designed to synthesize attack inputs that are both feasible within realistic length budgets and validated at the program level, enabling effective exploitation of ReDoS vulnerabilities in real-world programs. Specifically, we first define three vulnerable patterns based on our observation and formal verification. According to the patterns, PUFFERDOS conducts a synthesis technique to generate attack strings, and then refines and validates the strings with ReDoS-specific compositional concolic execution to guarantee real-world exploitability.

URL PDF HTML ☆

赞 0 踩 0

2606.20129 2026-06-19 cs.SE 新提交 60%

Learning Critical Testing Literacy Through Puzzles: an Experience Report

通过谜题学习关键测试素养：经验报告

Niels Doorn, Bart Th. Knaack, Tanja E. J. Vos, Beatriz Marín

专题命中代码评测：通过谜题学习软件测试素养。

AI总结本文报告了使用谜题教授关键测试素养（CTL）的13次工作坊经验，发现参与者通过解谜、汇报和反思的完整序列学习效果显著，并开发了开源分析工具。

详情

AI中文摘要

在本文中，我们报告了使用谜题学习CTL的工作坊经验和收获。背景：软件测试重要但难以教授。我们引入了一个基于谜题的学习活动知识体系来教授CTL，该体系基于关键测试者认知模型，形成了P4TEST教学框架。我们与学生、测试人员、教师和小学生共举办了13次工作坊，评估基于谜题的关键测试素养教学。经验：在11次工作坊中，我们采用半结构化方法，变化谜题、材料和时长。在另外两次工作坊中，我们引入了工作手册和出声思考环节，以收集更多关于学习体验的数据。观察：参与者普遍认为自己在解谜时进行实验。学生倾向于收敛于解决方案，而专业人员继续探索。情绪在行为中可见，但难以通过书面反思单独浮现。出声思考环节揭示了即时推理；书面反思引发了更多元认知反思。主题“意义建构/行动中反思”捕捉了参与者如何构建问题、应对死胡同和转变策略。反思：谜题本身并非干预手段；解谜、汇报和反思的完整序列才是。更刻意地设计这一序列是未来的工作。我们还开发了一个带有内置分析功能的开源网络应用程序，用于定制工作坊。

英文摘要

In this paper, we report our experiences and takeaways from workshops using puzzles to learn CTL. Background: Software testing is important yet difficult to teach. We introduced a BoK of puzzle-based learning activities to teach CTL, based on a model of critical tester's cognition, leading to the pedagogical framework P4TEST. We conducted thirteen workshops with students, testers, teachers, and primary school pupils to assess puzzle-based teaching of critical testing literacy. Experience: Across eleven workshops, we used a semi-structured approach, varying puzzles, materials, and timing. In two additional workshops, we introduced workbooks and think-aloud sessions to gather more data on the learning experience. Observations: Participants consistently perceived themselves as experimenting while solving puzzles. Students tended to converge on solutions, while professionals continued exploring. Emotions were visible in behaviour but hard to surface through written reflection alone. Think-aloud sessions revealed immediate reasoning; written reflections elicited more meta-cognitive reflection. The theme Sensemaking / reflection-in-action captured how participants framed problems, navigated dead ends, and shifted strategies. Reflections: Puzzles are not the intervention: the entire sequence of solving, debriefing, and reflecting is. Designing that sequence more deliberately is the work ahead. We also developed an open-source web application with built-in analytics to customise workshops.

URL PDF HTML ☆

赞 0 踩 0

2606.20370 2026-06-19 astro-ph.IM astro-ph.GA 新提交 60%

ELMA: ELlipse-based bar MAjor axis estimator

ELMA：基于椭圆的棒主轴估计器

Bruna R. Bragança de Lima, Andressa Wille, Rafael S. de Souza, Ana L. Chies-Santos

专题命中代码评测：Python包用于星系棒长度自动估计

AI总结提出ELMA Python包，通过迭代椭圆等照度线拟合自动估计星系棒长度，在GOODS-South的JWST/NIRCam图像上验证。

Comments 4 pages, 1 figure, published in RNAAS

Journal ref Research Notes of the AAS, Volume 10, Number 6, 2026

详情

DOI: 10.3847/2515-5172/ae7d2d

AI中文摘要

星系棒是盘星系中关键的非轴对称结构，驱动角动量重新分布，并促进长期演化、中心质量积累以及核结构的形成。然而，对棒长度的稳健且均匀的测量仍然具有挑战性，特别是在大型成像巡天中，人工估计耗时且对方法选择敏感。我们推出了elma，一个独立的、可通过pip安装的Python包，用于自动估计已被识别为候选棒状系统的星系中的棒长度。该方法直接对二维成像数据进行操作，使用迭代椭圆等照度线拟合来追踪径向椭圆率轮廓，并从与椭圆率局部最大值对应的半长轴中识别出投影棒长度估计值。利用图像的WCS信息和用户提供的红移，elma将角度测量值转换为投影物理长度。我们在GOODS-South天区的JWST/NIRCam成像的棒状星系上演示了该包。代码在MIT许可下发布在Github仓库中。

英文摘要

Galactic bars are key non-axisymmetric structures in disk galaxies, driving angular-momentum redistribution and contributing to secular evolution, central mass build-up, and the formation of nuclear structures. Robust and homogeneous measurements of bar length, however, remain challenging, particularly for large imaging surveys, where manual estimates are time-consuming and sensitive to methodological choices. We introduce elma, a standalone, pip-installable Python package for automated bar-length estimation in galaxies already identified as candidate barred systems. The method operates directly on two-dimensional imaging data, using iterative elliptical-isophote fitting to trace the radial ellipticity profile and identify a projected bar-length estimate from the semi-major axis associated with the local maximum in ellipticity. Using the image WCS information and a user-supplied redshift, elma converts angular measurement into a projected physical length. We demonstrate the package on JWST/NIRCam imaging of barred galaxies in the GOODS-South field. The code is released under the MIT license at a repository in Github.

URL PDF HTML ☆

赞 0 踩 0

2606.18941 2026-06-19 cs.PL cs.CL 新提交 70%

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC：使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester（计算机科学，曼彻斯特大学）； Electrical Engineering, Federal University of Amazonas (UFAM)（电气工程，亚马逊联邦大学（UFAM））

专题命中程序修复：形式验证PLC程序，属于程序修复

AI总结针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题，提出基于DFS的图形LD解析器，将连接图转换为布尔触点合取，并采用三级I/O推断方案，成功实现完整GOTO IR转换，验证了3个图形LD程序。

Comments 18 pages

详情

AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式：一种使用<rung>元素的文本编码，另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式，但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示，导致空洞的验证成功。本文提出Graph-ESBMC-PLC，通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈，将梯形路径提取为布尔触点合取，并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序，确保SET线圈在RESET线圈之前处理，匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明，所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR，而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留，无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo（DantasCordeiro2026graphical，doi: https://doi.org/10.5281/zenodo.20699856）。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

URL PDF HTML ☆

赞 0 踩 0

2601.22978 2026-06-19 cs.CR cs.PL 版本更新 60%

Triosecuris: Formally Verified Protection Against Speculative Control-Flow Hijacking

Triosecuris：针对推测控制流劫持的形式化验证防御

Jonathan Baumann, Yonghyun Kim, Yan Farba, Catalin Hritcu, Julay Leatherman-Brooks

专题命中程序修复：形式化验证防御推测控制流劫持

AI总结提出Triosecuris，结合CET风格硬件辅助控制流完整性与编译器插入的推测加载硬化，通过形式化证明实现相对安全性，确保任意程序在推测执行下不泄露比源程序无推测时更多的信息。

Comments To appear at CSF'26; extended version with appendices. W.r.t. first revision: extended with concrete protection against Spectre RSB and renamed to Triosecuris

Journal ref 39th IEEE Computer Security Foundations Symposium (CSF) (2026) 544-559

详情

DOI: 10.1109/CSF68417.2026.00036

AI中文摘要

本文介绍了Triosecuris，一种针对Spectre BTB、RSB和PHT的形式化验证防御，它结合了CET风格的硬件辅助控制流完整性与编译器插入的推测加载硬化（SLH）。Triosecuris基于一个新颖的观察：在CET风格保护存在的情况下，我们可以精确检测间接调用的BTB误推测和返回的RSB误推测，并设置SLH误推测标志。我们在Rocq中将Triosecuris形式化为一种变换，并提供了机器检查的证明，表明它实现了相对安全性：任何经过变换的程序在推测执行下泄露的信息不超过源程序在无推测执行下泄露的信息。这一强安全保证适用于任意程序，即使那些不遵循密码学常数时间编程规范的程序。

英文摘要

This paper introduces Triosecuris, a formally verified defense against Spectre BTB, RSB, and PHT that combines CET-style hardware-assisted control-flow integrity with compiler-inserted speculative load hardening (SLH). Triosecuris is based on the novel observation that in the presence of CET-style protection, we can precisely detect BTB misspeculation for indirect calls and RSB misspeculation for returns and set the SLH misspeculation flag. We formalize Triosecuris as a transformation in Rocq and provide a machine-checked proof that it achieves relative security: any transformed program running with speculation leaks no more than what the source program leaks without speculation. This strong security guarantee applies to arbitrary programs, even those not following the cryptographic constant-time programming discipline.

URL PDF HTML ☆

赞 0 踩 0

1. 代码生成 5 篇

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development

Source-Grounded Data Generation for Text-to-JSON Learning

Playful Agentic Robot Learning

GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF256 with a Lucas-Exact Integer Identity

2. 软件智能体 2 篇

SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

3. 代码评测 4 篇

An MSO Framework for Weak-Memory Verification and Robustness

PUFFERDOS: Efficient and Effective Attack String Generation for Regular Expression Denial of Service Vulnerabilities

Learning Critical Testing Literacy Through Puzzles: an Experience Report

ELMA: ELlipse-based bar MAjor axis estimator

4. 程序修复 2 篇

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Triosecuris: Formally Verified Protection Against Speculative Control-Flow Hijacking