arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Entries for today/current date: 1. Sources: cs.CV, cs.GR, cs.MM
2606.19259 2026-06-18 cs.CV cs.AI New submission 70%

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

Affiliation: College of Computer Science

Topic match: Other image generation: detecting images generated by GPT-Image-2

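AI summary: To address the lack of text-rich image detection in existing benchmarks, the authors build a multi-domain benchmark of 8,602 images across 6 categories and evaluate 5 detectors, finding that performance is highly domain-dependent and vulnerable to JPEG compression.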

Abstract

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall performance, category-wise performance, and robustness to post-processing. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail in others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.
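
The abstract reports both overall and category-wise detector performance. As a rough illustration of how such a breakdown might be computed, here is a minimal sketch. It assumes each evaluated image is reduced to a (category, true label, predicted label) tuple. The function name `per_category_accuracy`, the category strings, and the toy records are hypothetical and are not taken from the paper, which may also use metrics other than plain accuracy.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Compute overall and per-category accuracy.

    `records` is an iterable of (category, is_ai_generated, predicted_ai_generated)
    tuples. This record layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, label, prediction in records:
        total[category] += 1
        correct[category] += int(label == prediction)
    if not total:
        raise ValueError("records must contain at least one entry")
    by_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, by_category


# Toy, made-up records purely for illustration (not results from the paper).
toy_records = [
    ("receipts", True, True),
    ("receipts", False, False),
    ("ui_screenshots", True, False),
    ("ui_screenshots", False, False),
]

overall, by_category = per_category_accuracy(toy_records)
print(overall)       # 0.75
print(by_category)   # {'receipts': 1.0, 'ui_screenshots': 0.5}
```

In this toy case the overall figure of 0.75 hides a much weaker result on one category, which is the kind of domain dependence a category-wise breakdown is meant to expose.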