arXivDaily: arXiv daily academic digest, updated Monday through Friday

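AI Large Models

Code LLMs / AI Programming

Code generation, software engineering agents, program repair, test generation, and developer tools.

Entries for today/current date: 1. Sources: cs.SE, cs.CL, cs.AI, cs.LG, cs.PL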
2511.18288 2026-06-19 cs.SE Version update 90%

Can Large Language Models Reason About Complex Execution Paths? An Empirical Study on Python

Wenhan Wang, Kaibo Liu, Zeyu Sun, An Ran Chen, Ge Li, Gang Huang, Lei Ma

Topic match: Code evaluation. An empirical study of LLMs' ability to reason about execution paths in Python.

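AI summary: This paper empirically studies the feasibility of using large language models for execution path reasoning in Python. It sets up generation tasks for test case generation and classification tasks for bug detection, and finds that LLMs can improve test coverage, although models with stronger reasoning abilities do not necessarily outperform weaker ones.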

Comments: Accepted by ACM Transactions on Software Engineering and Methodology (TOSEM)

English abstract

Execution path reasoning is a key step towards understanding program semantics. It is crucial for generating test cases that cover certain branches/paths, or for detecting bugs that are triggered by some paths, without actually executing the program. Traditionally, execution path reasoning can be achieved by symbolic execution techniques, but existing SMT-based symbolic execution approaches struggle with complex data structures and external API calls. This challenge is even more pronounced in languages with highly flexible syntax, such as Python, resulting in a lack of widely adopted tools for reasoning about execution paths. Therefore, reasoning about execution paths with AI-based approaches becomes a promising direction. In this paper, we investigate the feasibility of adopting large language models (LLMs) for execution path reasoning on Python, where traditional path-based symbolic execution tools are unavailable. We conduct an empirical study on two types of path reasoning tasks: generation tasks for test case generation and classification tasks for bug detection. We build new evaluation pipelines and benchmarks from both competition-level programs and real-world repositories. Our results show that state-of-the-art LLMs can perform correct reasoning on execution paths and improve test coverage on real-world software, though models with stronger reasoning abilities do not always outperform weaker ones. These findings highlight the potential of utilizing LLMs as a complementary heuristic for path-aware code reasoning, especially in programming languages lacking mature symbolic execution tools. We have released our benchmark and evaluation scripts at https://github.com/jacobwwh/llm-path-study.
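
To make the notion of an "execution path" in the abstract concrete, here is a minimal sketch of how one might check whether a proposed input actually follows a target path in Python. It is an illustration only and is not taken from the paper or its released scripts: the toy function `classify` and the helpers `trace_path` and `find_input_for_path` are hypothetical names, and the brute-force search stands in for whatever a model or tool would propose. The only library call used is the standard `sys.settrace` hook.

```python
import sys


def classify(n):
    # Toy function under test (hypothetical, not from the paper's benchmark).
    if n < 0:
        label = "negative"
    elif n == 0:
        label = "zero"
    else:
        label = "positive"
    return label


def trace_path(func, *args):
    """Run func(*args); return (result, executed line offsets inside func)."""
    code = func.__code__
    first_line = code.co_firstlineno
    path = []

    def tracer(frame, event, arg):
        # Only follow the frame that belongs to the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            path.append(frame.f_lineno - first_line)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, path


def find_input_for_path(func, candidates, target_path):
    """Return the first candidate whose executed path equals target_path.

    A brute-force check over given candidates; an assumption for
    illustration, not the method studied in the paper.
    """
    for candidate in candidates:
        _, path = trace_path(func, candidate)
        if path == target_path:
            return candidate
    return None


if __name__ == "__main__":
    # Record the path taken by one known input, then look for another
    # input among the candidates that follows the same path.
    _, zero_path = trace_path(classify, 0)
    print("path for 0:", zero_path)
    print("match:", find_input_for_path(classify, [-3, 7, 0], zero_path))
```

Each distinct branch outcome yields a different sequence of line offsets, so comparing sequences is a simple way to confirm that a generated test case covers the intended path. Line-level tracing is chosen here for brevity; a real evaluation would more likely rely on a dedicated coverage tool.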