Evaluating Non-English Developer Support in Machine Learning for Software Engineering
Jonathan Katzy, Yongcheng Huang, Gopal-Raj Panchu, Maksym Ziemlewski, Paris Loizides, Sander Vermeulen, Arie van Deursen, Maliheh Izadi
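AI summary: The study evaluates non-English code comment generation and the reliability of existing evaluation methods. It finds that generation performance drops markedly in non-English settings and that automatic evaluation methods cannot reliably distinguish correct comments from incorrect ones.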
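Large language models are increasingly used in software engineering, but code generation and its evaluation remain English-centric, leaving a major gap in our understanding of how well current tools support multilingual development. This paper studies non-English code comment generation and the reliability of existing evaluation methods. It evaluates five code LLMs across five natural languages, analyzes 12,500 generated comments, and releases a public annotated dataset together with a taxonomy of 26 error types. Generation performance drops sharply in non-English settings: linguistic errors increase by up to 15.1 times, accompanied by incoherent generations and semantic errors. Automatic methods cannot reliably assess non-English comments. Neither neural metrics nor LLM-as-a-judge methods effectively capture linguistic and semantic errors, so human judgment remains indispensable.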
Large Language Models are increasingly used in software engineering, but both code generation and its evaluation remain predominantly English-centric. This leaves a major gap in our understanding of how well current tools support multilingual development, where code contains non-English natural language. In this paper, we investigate non-English code comment generation and the reliability of current methods for evaluating such outputs. We evaluate five code LLMs (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Dutch, English, Greek, Polish, and Chinese. We further conduct an open-coding study of 12,500 generated comments, from which we derive a publicly released human-annotated dataset and a taxonomy of 26 error types. We use these human annotations to evaluate the performance of neural metrics and LLM-as-a-judge pipelines. Our findings show that generative performance deteriorates substantially outside English, with linguistic errors increasing by up to 15.1$\times$, alongside frequent incoherent generations and a rise in semantic errors. More critically, we show that automatic detection of errors in non-English comments is unreliable. Across classical overlap-based metrics, off-the-shelf neural metrics, extended neural metrics using newer multilingual, language-specific, and code-specific models, and LLM-as-a-judge pipelines, no automatic approach provides reliable and consistent assessment. Neural metrics fail to distinguish correct comments from incorrect outputs or even random noise, and tend to overestimate quality in non-English settings. LLM-as-a-judge methods achieve the highest agreement with human annotations but still fail to reliably capture important language-related and semantic errors. Overall, our results show that both generation and evaluation remain key barriers to multilingual tooling, and that human judgment remains indispensable.
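To make "agreement with human annotations" concrete, the sketch below computes Cohen's kappa between human labels and the labels of an automatic judge. This is only an illustration: the abstract does not say which agreement statistic the authors use, and the function name `cohen_kappa`, the label names, and the toy data are all hypothetical.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of items on which the two raters give the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy data, not taken from the paper's dataset:
# a judge that labels almost every comment "correct".
human = ["correct", "correct", "error", "error", "correct", "error"]
judge = ["correct", "correct", "correct", "error", "correct", "correct"]
kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

In this toy case the judge matches the human label on four of six comments, yet kappa is only about 0.33 because a judge that mostly answers "correct" agrees often by chance. That is one way an evaluator can appear to agree with humans while still missing errors.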