Code LLMs / AI Programming - arXivDaily Topic

2512.00560 2026-06-19 cs.SE Version update 80%

SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models

Jinyu Cai, Jialong Li, Nianyu Li, Zhenyu Mao, Mingyue Zhang, Kenji Tei

Topic match (software agents): Uses LLM-guided reinforcement learning to automatically generate game test suites.

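AI summary: Proposes the SAGE framework, which uses LLM-guided reinforcement learning to automatically generate test suites, trims them through semantic multi-objective optimization, and prioritizes tests based on semantic analysis of update logs, achieving efficient regression testing in Overcooked Plus and Minecraft.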

Comments: This paper has been accepted by the Automated Software Engineering journal.

Abstract

The rapid iteration cycles of modern live-service games make regression testing indispensable for maintaining quality and stability. However, existing regression testing approaches face critical limitations, especially in common gray-box settings where full source code access is unavailable: they rely heavily on manual effort for test case construction, struggle to maintain growing suites plagued by redundancy, and lack efficient mechanisms for prioritizing relevant tests. These challenges result in excessive testing costs, limited automation, and insufficient bug detection. To address these issues, we propose SAGE, a semantic-aware regression testing framework for gray-box game environments. SAGE systematically addresses the core challenges of test generation, maintenance, and selection. It employs LLM-guided reinforcement learning for efficient, goal-oriented exploration to automatically generate a diverse foundational test suite. Subsequently, it applies semantic-based multi-objective optimization to refine this suite into a compact, high-value subset by balancing cost, coverage, and rarity. Finally, it leverages LLM-based semantic analysis of update logs to prioritize the test cases most relevant to version changes, enabling efficient adaptation across iterations. We evaluate SAGE on two representative environments, Overcooked Plus and Minecraft, comparing against both automated baselines and human-recorded test cases. Across all environments, SAGE achieves superior bug detection with significantly lower execution cost, while demonstrating strong adaptability to version updates.
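
To make the reduction and prioritization stages more concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract gives no implementation details, so everything below is an assumption made for illustration. The `Case` type, the `reduce_suite` and `prioritize` functions, the example tests, and the update-log text are all hypothetical. The sketch swaps in a greedy, scalarized heuristic (rarity-weighted new coverage per unit cost) for the multi-objective optimization, and plain word overlap for the LLM-based semantic analysis of update logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical test-case record; field names are illustrative only."""
    name: str
    cost: float            # execution cost (must be > 0), e.g. seconds
    covered: frozenset     # semantic events the test exercises
    description: str = ""  # short natural-language summary


def rarity_weights(tests):
    """Weight each covered item by 1 / (number of tests that cover it)."""
    counts = {}
    for t in tests:
        for item in t.covered:
            counts[item] = counts.get(item, 0) + 1
    return {item: 1.0 / n for item, n in counts.items()}


def reduce_suite(tests, budget):
    """Greedy stand-in for multi-objective reduction: repeatedly pick the
    affordable test with the best rarity-weighted new coverage per cost."""
    weights = rarity_weights(tests)
    selected, seen, spent = [], set(), 0.0
    remaining = list(tests)

    def gain(t):
        return sum(weights[i] for i in t.covered - seen) / t.cost

    while remaining:
        affordable = [t for t in remaining if spent + t.cost <= budget]
        if not affordable:
            break
        best = max(affordable, key=gain)
        if gain(best) == 0:  # nothing new left to cover
            break
        selected.append(best)
        seen |= best.covered
        spent += best.cost
        remaining.remove(best)
    return selected


def prioritize(tests, update_log):
    """Keyword-overlap stand-in for LLM-based relevance scoring."""
    log_words = set(update_log.lower().split())

    def score(t):
        return len(log_words & set(t.description.lower().split()))

    return sorted(tests, key=score, reverse=True)


# Made-up example data, loosely themed on a cooking game.
suite = [
    Case("chop_onion", 2.0, frozenset({"chop", "pickup"}),
         "chop onion on cutting board"),
    Case("serve_soup", 6.0, frozenset({"cook", "serve", "pickup"}),
         "cook and serve soup to counter"),
    Case("chop_twice", 4.0, frozenset({"chop"}), "chop onion twice"),
    Case("wash_plate", 4.0, frozenset({"wash"}), "wash dirty plate in sink"),
]

kept = reduce_suite(suite, budget=12.0)
ranked = prioritize(kept, "Fixed soup cooking timer and serve counter collision")
print([t.name for t in kept])
print([t.name for t in ranked])
```

In this toy run the redundant `chop_twice` case is dropped because it adds no new coverage, and `serve_soup` moves to the front of the ranking because its description shares the most words with the update log. A real system would likely need a proper Pareto-based optimizer and a semantic similarity model in place of these two shortcuts.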
