Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification
Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra
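AI summary: To address functionally incorrect LLM-generated code, the paper proposes uncertainty quantification methods based on functional equivalence (including functional entropy), which outperform existing methods across multiple programming languages and models.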
Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
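To make the clustering idea concrete, the following is a minimal sketch of how an entropy over functional-equivalence clusters could be computed. It is not the authors' implementation. It assumes the discrete form of semantic entropy (entropy over the proportion of samples in each cluster), and the names `cluster_by_equivalence`, `functional_entropy`, and `toy_judge` are hypothetical. The abstract describes an LLM-based equivalence assessment; here `toy_judge` is a stand-in that compares outputs on a few probe inputs so the example runs without a model.

```python
import math
from typing import Callable, List


def cluster_by_equivalence(
    samples: List[str], are_equivalent: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group samples; each sample joins the first cluster whose
    representative (first member) is judged equivalent to it."""
    clusters: List[List[str]] = []
    for sample in samples:
        for cluster in clusters:
            if are_equivalent(cluster[0], sample):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def functional_entropy(
    samples: List[str], are_equivalent: Callable[[str, str], bool]
) -> float:
    """Shannon entropy (in nats) of the cluster-size distribution.
    Assumption: cluster probability = fraction of samples in the cluster."""
    clusters = cluster_by_equivalence(samples, are_equivalent)
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_judge(code_a: str, code_b: str) -> bool:
    """Hypothetical stand-in for an LLM judge: run both snippets on a few
    probe inputs and compare outputs. Only for trusted, toy code."""
    def run(code: str) -> list:
        namespace: dict = {}
        exec(code, namespace)  # each snippet is assumed to define f(x)
        return [namespace["f"](x) for x in (-2, 0, 3)]
    return run(code_a) == run(code_b)


# Four sampled "responses" to the same prompt; three double x, one squares it.
samples = [
    "def f(x): return x * 2",
    "def f(x): return x + x",
    "def f(x): return 2 * x",
    "def f(x): return x ** 2",
]

clusters = cluster_by_equivalence(samples, toy_judge)
entropy = functional_entropy(samples, toy_judge)
# A judge that merges everything, as the abstract reports for NLI on code.
collapsed = functional_entropy(samples, lambda a, b: True)
```

With the toy judge, the three doubling functions fall into one cluster and the squaring function into another, so the entropy is positive. With a judge that treats every pair as equivalent, all samples land in a single cluster and the entropy is zero. That matches the failure mode the abstract attributes to NLI-based clustering, where the uncertainty signal disappears.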