Valid Inference with Synthetic Data via Task Exchangeability
通过任务可交换性实现基于合成数据的有效推断
Lezhi Tan, Tijana Zrnic
AI总结 提出任务可交换性条件,确保在科学研究中使用合成数据进行统计推断的有效性,并给出在民意调查和AI评估中的应用。
详情
越来越多的工作主张在科学研究中使用合成数据。例如,社会科学家主张在试点研究中使用LLM生成的“硅样本”;AI评估越来越依赖“LLM作为裁判”的输出;蛋白质组学研究通过生成合成蛋白质结构的生成模型加速。这些发展引发了一个有趣的可能性:合成数据可以帮助研究人员提出更多问题、进行更多研究并加速发现。但它们也引发了一个根本性的担忧:合成数据可能有偏、有噪声且设定错误。在这项工作中,我们提出了在科学研究中使用合成数据的统计原则,并具有可证明的有效性保证。关键见解是一个我们称为任务可交换性的新技术条件。非正式地说,这是一个要求,即研究人员可以识别出有真实数据可用的历史任务,使得他们当前感兴趣的任务与历史任务在适当的数学意义上可交换。我们开发了在任务可交换性下进行有效推断的方法,以及即使在可交换性之外也能提供保证的扩展。我们通过硅样本的民意调查和自动评分器的AI评估来展示该框架。
There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.