arXivDaily arXiv每日学术速递 周一至周五更新

视觉与机器人

图像生成

图像生成、文生图、图像编辑、扩散模型和可控生成。

2026-06-18 至 2026-06-18 收录 24 信号源:cs.CV, cs.GR, cs.MM

1. 图像修复 4 篇

2606.19195 2026-06-18 cs.CV 新提交 95%

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Moebius: 0.2B轻量级图像修复框架,性能达10B级别

Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学) VIVO AI Lab(VIVO人工智能实验室)

专题命中 图像修复 :轻量级图像修复框架,属于图像修复

AI总结 提出Moebius轻量级图像修复框架,通过局部-λ混合交互模块和自适应多粒度蒸馏策略,以0.22B参数实现与10B级模型FLUX.1-Fill-Dev相当甚至更优的生成质量,推理速度提升15倍以上。

详情
AI中文摘要

尽管10B级别的工业基础模型推动了图像修复的边界,但其高昂的计算成本严重阻碍了实际部署。构建高度优化的任务特定专家模型是一个有前景的解决方案,然而极端的结构压缩不可避免地引发了严重的表示瓶颈。为解决这一问题,我们提出了Moebius,一个高效的轻量级修复框架。我们通过引入局部-λ混合交互($L\lambda MI$)模块系统地重构了扩散主干。该模块由局部-λ和交互-λ子模块组成,巧妙地将空间上下文和全局语义先验总结为固定大小的线性矩阵,在保留复杂潜在交互的同时大幅减少参数。此外,为了释放这种高度紧凑架构的全部表示能力,我们将其与自适应多粒度蒸馏策略协同配对。该策略严格在潜在空间内操作以避免昂贵的像素空间解码,动态平衡多个基于梯度的损失以实现高保真对齐。在自然和肖像基准上的大量实验表明,这种最优协同使Moebius能够媲美甚至超越10B级工业通用模型FLUX.1-Fill-Dev的生成质量。值得注意的是,Moebius仅使用不到2%的参数(0.22B vs. 11.9B)就实现了这一点,同时总推理时间加速超过15倍,为高保真修复设立了新的效率标准。项目页面见此https URL。

英文摘要

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-$λ$ Mix Interaction ($LλMI$) block. Comprising Local-$λ$ and Interactive-$λ$ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2\% of the parameters (0.22B vs. 11.9B) while delivering a $>15\times$ acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.

2603.05010 2026-06-18 cs.CV 版本更新 90%

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

生成式图像恢复进展:能力、局限性与评估实践研究

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

发表机构 * Fudan University(复旦大学) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) University of the Chinese Academy of Sciences(中国科学院大学) Multimedia Laboratory, The Chinese University of Hong Kong(香港中文大学多媒体实验室) Shenzhen University of Advanced Technology(深圳先进技术大学)

专题命中 图像修复 :研究生成式图像恢复,包括扩散和GAN模型

AI总结 通过多维度评估管道系统比较扩散、GAN等生成式模型与PSNR导向模型,揭示从细节不足到细节质量与语义控制的范式转变,并训练了更符合人类感知的IQA模型。

Comments Accepted by CVPR 2026 Findings

详情
AI中文摘要

生成式图像恢复(GIR)在感知真实感方面取得了显著进展,但与先前方法相比,其实际能力究竟有多大提升?为回答这一问题,我们基于新的多维度评估管道开展大规模研究,该管道从细节、清晰度、语义正确性和整体质量四个维度评估模型。我们的分析涵盖多种架构,包括基于扩散的、基于GAN的、PSNR导向的以及通用生成模型,揭示了关键的性能差异。此外,我们的分析揭示了失败模式的演变,这标志着以感知为导向的低层视觉领域发生了范式转变。核心挑战正从先前的细节稀缺(欠生成)问题演变为细节质量和语义控制(防止过生成)的新前沿。我们还利用我们的基准训练了一个新的IQA模型,该模型更符合人类感知判断。最终,本工作对现代生成式图像恢复模型进行了系统研究,提供了关键见解,重新定义了对其真实状态的理解,并为未来发展指明了方向。

英文摘要

Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

2602.00176 2026-06-18 cs.CV cs.AI 版本更新 70%

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

基于噪声条件频率暴露的扩散逆问题后验延续

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

专题命中 图像修复 :提出后验延续框架用于扩散逆问题,包括图像修复。

AI总结 提出后验延续框架,根据扩散噪声水平逐步暴露测量频率,结合稳定采样器实现超分辨率、修复和去模糊的先进性能。

详情
AI中文摘要

扩散后验采样通过将预训练的扩散先验与测量一致性指导相结合来解决逆问题。然而,在高噪声水平下,全频带指导可能不可靠,因为干净估计包含分数诱导误差,且高频测量方向弱可识别。我们认为后验指导应根据瞬时扩散噪声水平暴露测量频率。基于这一原则,我们提出一个后验延续框架,构建一系列中间后验,其似然强调当前可靠频带并逐渐恢复全频带一致性。我们通过一个稳定采样器实例化该框架,该采样器结合了扩散预测器、频率受限似然细化以及Haar域承诺规则,该规则提交可靠粗校正同时推迟弱可识别细节。在超分辨率、修复和去模糊任务中,我们的方法实现了具有竞争力乃至最先进的恢复性能,包括在FFHQ和ImageNet评估中,运动去模糊相比强基线PSNR提升高达5 dB。

英文摘要

Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance. However, full-band guidance can be unreliable at high noise levels, where clean estimates contain score-induced errors and high-frequency measurement directions are weakly identifiable. We argue that posterior guidance should expose measurement frequencies according to the instantaneous diffusion noise level. Based on this principle, we propose a posterior continuation framework that constructs a family of intermediate posteriors whose likelihood emphasizes currently reliable frequency bands and gradually returns to full-band consistency. We instantiate this framework with a stabilized sampler that combines a diffusion predictor, frequency-limited likelihood refinement, and a Haar-domain commitment rule that commits reliable coarse corrections while deferring weakly identifiable details. Across super-resolution, inpainting, and deblurring, our method achieves competitive-to-state-of-the-art restoration performance, including up to 5 dB PSNR improvement on motion deblurring over strong baselines in evaluations on FFHQ and ImageNet.

2204.14224 2026-06-18 cs.CV cs.LG eess.IV 版本更新 65%

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information

不完全信息条件下纹理图像重建与分类的神经网络方法研究

Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Darkhan Kurmangaliyev, Daniyar Nurseitov, Tatyana Dedova, Larissa Balakay, Serik Nurakynov

发表机构 * Satbayev University(萨特巴耶夫大学) Institute of Ionosphere LLP(电离层研究所) Information Technology Department(信息技术部门) Assiut University(阿西乌特大学)

专题命中 图像修复 :使用GAN进行图像修复,重建缺失细节。

AI总结 提出结合目标检测、GAN(CRA)修复和Transformer/CNN分类的端到端框架,发现重建质量高(PSNR 28.7dB)但分类准确率仅53%,通过置信度混合集成将MCA从48%提升至58%,揭示生成模型产生语义模糊特征的问题。

Comments IEEE ACCESS

详情
AI中文摘要

异质自然纹理的自动化分析常因物理损伤和数据丢失而受阻,这对计算机视觉构成了重大挑战。虽然深度学习在受控环境中已显示出成功,但其在信息不完全条件下对复杂地质材料的应用仍未被充分探索。本研究提出了一个用于高分辨率岩心样本图像修复和分类的集成框架。我们设计了一个端到端流水线,利用目标检测进行样本分割,随后使用具有上下文残差聚合(CRA)的生成对抗网络(GAN)进行图像修复,以重建缺失的高频细节。接着,我们在重建数据上评估了现代基于Transformer(Swin、ViT)和CNN架构的性能。实验揭示了重建质量与下游效用之间的关键分歧:尽管结构保真度高(PSNR 28.7 dB,FID 74.01),分类准确率却停滞在53%。为了改善少数类检测,我们提出了一种基于置信度的混合集成方法,将MCA从48%提升至58%。这些结果凸显了当前最先进生成模型的局限性,它们可能产生视觉上合理但语义模糊的特征(“幻觉”),从而混淆分类器。本工作深入探讨了图像重建质量与分类性能之间的依赖关系,为无损检测和材料科学领域的未来研究提供了可复现的基线。鉴于井间准确率仍处于49-53%范围,我们将所得到的系统定位为岩相解释的决策支持和筛选工具,而非完全自主的分类器。代码可在以下网址获取:https://github.com/your-repo(注:原文URL未提供,此处为示例)

英文摘要

The automated analysis of heterogeneous natural textures is frequently hindered by physical damage and data loss, presenting a significant challenge to computer vision. While deep learning has shown success in controlled environments, its application to complex geological materials under conditions of incomplete information remains underexplored. This study presents an integrated framework for the inpainting and classification of high-resolution core sample images. We propose an end-to-end pipeline that utilizes object detection for sample segmentation, followed by image inpainting using Generative Adversarial Networks (GANs) with Contextual Residual Aggregation (CRA) to reconstruct missing high-frequency details. Subsequently, we evaluate the performance of modern Transformer-based (Swin, ViT) and CNN architectures on the reconstructed data. Our experiments revealed a critical divergence between reconstruction quality and downstream utility: despite high structural fidelity (PSNR 28.7~dB, FID 74.01), classification accuracy plateaued at 53\%. To improve minority-class detection, we propose a confidence-based hybrid ensemble that raises MCA from 48\% to 58\%. These results highlight the limitations of current state-of-the-art generative models, which may produce visually plausible but semantically ambiguous features ("hallucinations") that confound classifiers. This work provides insights into the dependencies between image reconstruction quality and classification performance, offering a reproducible baseline for future research in non-destructive testing and material science. Given that cross-well accuracy remains in the 49--53\% range, we position the resulting system as a decision-support and screening tool for lithofacies interpretation rather than as a fully autonomous classifier. The code is available at https://github.com/GalymzhanAbdimanap/Lithology_recognition

2. 图像编辑 5 篇

2606.19103 2026-06-18 cs.CV cs.AI 新提交 90%

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

ProductConsistency:通过SFT和RL改进基于指令的图像编辑中的产品身份保持

Mukund Khanna, Raj Singh Yadav, Kunal Singh

发表机构 * Fractal Analytics

专题命中 图像编辑 :基于指令的图像编辑,保持产品身份。

AI总结 针对基于指令的图像编辑中产品特征保持不足的问题,提出ProductConsistency数据集和循环一致性奖励,结合监督微调与强化学习,显著提升产品一致性、文本渲染和视觉质量。

Comments CVPR HiGen 2026

详情
AI中文摘要

近期基于指令的图像编辑的进展使模型能够根据自然语言指令执行复杂的视觉编辑。然而,在以产品为中心的场景中,保留产品特征、品牌和文本元素至关重要,当前的开源和闭源模型往往难以维持这种细粒度的对象身份。这一问题因缺乏具有文本保真度约束的基于指令的产品图像编辑数据集而进一步加剧,导致该能力在很大程度上被视为基于指令的图像编辑模型的隐式能力。在这项工作中,我们引入了ProductConsistency数据集,旨在改进以产品为中心的图像编辑。我们的方法包括一个用于产品编辑的包含87k样本的监督微调(SFT)数据集、一个包含869张独特产品图像的强化学习(RL)数据集,以及一个新的基准数据集ProductConsistency Benchmark,以允许对编辑模型进行严格和标准化的评估。为了指导RL训练,我们提出了一种循环一致性奖励,通过使用原始产品描述与从编辑图像生成的描述之间的字幕相似性来强制保持产品身份的语义。我们使用我们的数据集对Qwen-Image-Edit-2511和Flux.1-Kontext-dev进行了微调,并在OCR和感知指标以及基于MLLM的评估中展示了相对于基线模型的一致改进,表明更强的产品一致性、文本渲染和整体视觉质量;其中Qwen-Image-Edit-2511模型实现了字符错误率降低5倍。代码和流程可在此https URL获取。

英文摘要

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements are critical, current open and closed source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and Perceptual metrics, and MLLM-based evaluations as well, indicating stronger product consistency, text rendering, and overall visual quality; with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline is available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md

2606.18906 2026-06-18 cs.CV 新提交 90%

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

BindEdit: 驯服注意力泄漏以实现精确的多目标图像编辑

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

发表机构 * Sookmyung Women’s University(成均女性大学) Yonsei University(延世大学) Samsung Research(三星研究院)

专题命中 图像编辑 :提出多目标图像编辑方法抑制注意力泄漏

AI总结 针对多目标图像编辑中的语义混合和对象重复问题,提出BindEdit方法,通过联合正则化交叉注意力和自注意力、交叉注意力重平衡机制及区域保真项,在单次扩散轨迹内抑制注意力泄漏,实现精确编辑。

Comments Preprint

详情
AI中文摘要

真实图像编辑能够精确操作视觉内容,但现有方法在复杂的多目标场景中常常失败,导致语义混合、对象重复或编辑不完整。我们将这些失败归因于注意力泄漏,即在去噪过程中,跨空间区域和文本标记的信号变得纠缠。具体来说,我们识别出两种不同形式的泄漏:编辑-标记泄漏,其中模糊的标记-区域对齐导致对象混合;以及源主导泄漏,其中未改变的源对象的标记压倒了目标实体应有的注意力。为了解决这些泄漏,我们提出了\textbf{BindEdit},它在单次扩散轨迹内强制执行注意力级别的约束。为了抑制编辑-标记泄漏,BindEdit联合正则化交叉注意力和自注意力,使得每个目标标记组绑定到其对应的空间区域,同时保持实例级别的分离。为了抑制源主导泄漏,一种交叉注意力重平衡机制放大目标标记的影响,并减弱可编辑区域内残留的源语义。此外,区域保真项确保每个目标概念在整个编辑掩码中连贯表达。另外,我们提出了一个全面的多目标基准,涵盖不同的对象数量和类别。大量实验表明,BindEdit在单次扩散轨迹内始终优于现有方法,在单目标和多目标编辑场景中均保持稳健性能。

英文摘要

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose \textbf{BindEdit}, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

2606.19073 2026-06-18 cs.CV 新提交 85%

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

驯服I2V模型用于图像HOI编辑:认知基准与智能体自校正框架

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

发表机构 * Wangxuan Institute of Computer Technology, Peking University, Beijing, China(王轩计算机技术研究所,北京大学,北京,中国) National Institute of Health Data Science, Peking University, Beijing, China(国家健康数据科学研究院,北京大学,北京,中国)

专题命中 图像编辑 :图像HOI编辑,利用I2V模型。

AI总结 提出HOI-Edit基准和SCPE框架,利用I2V模型的时间生成能力进行动态人-物交互编辑,通过自校正提示迭代优化,实现与SOTA竞争的性能。

详情
AI中文摘要

当前的图像编辑方法在静态属性上表现出色,但在复杂的人-物交互(HOI)上失败,这是一个关键挑战,现有基准将HOI与静态属性混淆,依赖无法同时评估动态交互有效性和纠缠的人-物对保留的全局指标。因此,我们首先引入HOI-Edit,一个包含三个渐进认知层次的综合基准,其特点是自动化指标HOI-Eval,通过让VLM在思考后对包含基础人-物对的图像进行问答,可靠地评估实例级交互。考虑到任务本质是重塑动态关系,我们对图像到视频(I2V)模型进行基准测试,发现它们由于其时间生成能力而天生适合动态编辑。关键的是,除了优越的性能,这种能力提供了“失败过程的重放”,为错误原因提供了独特的可诊断性。因此,我们提出SCPE(自校正过程编辑),一种新颖的智能体自校正框架,通过迭代优化的提示约束I2V模型的生成,使生成的视频更准确地呈现目标HOI。从这些视频中提取的帧是最终的编辑结果。在HOI-Edit上,SCPE在交互上达到了与最先进(SOTA)编辑模型(如Nano Banana)竞争的性能。代码可在该https URL获取。

英文摘要

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI), a critical challenge unaddressed by existing benchmarks that conflate HOI with static attributes, relying on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric HOI-Eval that reliably evaluates instance-level interaction by letting VLM Q&A after thinking with images containing grounded Human-Object pairs. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at https://github.com/oceanflowlab/HOI-Edit.

2605.21431 2026-06-18 cs.CV 版本更新 85%

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

iTryOn: 通过空间-语义引导掌握交互式视频虚拟试穿

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

发表机构 * Shenzhen Campus of Sun Yat-sen University Taobao \& Tmall Group of Alibaba

专题命中 图像编辑 :交互式视频虚拟试穿,属于图像生成与编辑。

AI总结 本文提出iTryOn框架,通过空间-语义引导解决交互式视频虚拟试穿中的语义模糊和复杂服装变形问题,实现了更动态可控的虚拟试穿体验。

Comments Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026

详情
AI中文摘要

视频虚拟试穿(VVT)旨在无缝替换视频中人物身上的衣物。尽管现有方法在保持时间一致性方面取得了显著进展,但它们主要局限于非交互场景,其中模型仅展示衣物。这种限制忽略了现实世界服装展示中的关键方面:主动的人-衣物互动。为弥合这一差距,我们引入并正式化了一个新的挑战性任务:交互式视频虚拟试穿(Interactive VVT),其中视频中的主体主动与衣物互动。该任务引入了超出简单纹理保留的独特挑战,包括:(1)从标准姿态信息中解决交互的语义模糊性,以及(2)从视频中学习复杂的衣物变形,其中交互时刻稀少且短暂。为了解决这些挑战,我们提出了iTryOn,一种基于大规模视频扩散Transformer的新型框架。iTryOn首创多级交互注入机制,以引导复杂动态的生成。在空间层面,我们引入了服装无关的3D手先验,以提供精细的指导,精确的手-服装接触,有效解决空间模糊性。在语义层面,iTryOn利用全局描述词提供整体上下文,并利用时间戳动作描述词提供局部交互,通过我们新颖的Action-aware Rotational Position Embedding(A-RoPE)进行同步。广泛的实验表明,iTryOn不仅在传统VVT基准上实现了最先进的性能,还在新的交互设置中建立了显著的领先优势,标志着更动态和可控的虚拟试穿体验的重要一步。

英文摘要

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

2604.03156 2026-06-18 cs.CV 版本更新 85%

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

CAMEO: 一种条件感知与质量驱动的多智能体图像编辑编排器

Yuhan Pu, Hao Zheng, Ziqian Mo, Zirui Pang, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Harbin Institute of Technology(哈尔滨工业大学) Shenzhen University(深圳大学) Claremont McKenna College(克莱蒙特麦肯纳学院) Research Institute of Petroleum Exploration and Development, CNPC(石油勘探开发研究院,中石油)

专题命中 图像编辑 :多智能体框架进行条件图像编辑,含质量评估

AI总结 提出CAMEO多智能体框架,将条件图像编辑重构为质量感知的反馈驱动过程,通过分解编辑阶段、嵌入评估循环,在异常插入和人体姿态切换任务中平均胜率提升20%。

详情
AI中文摘要

条件图像编辑旨在根据文本提示和可选的参考指导修改源图像。这种编辑在需要严格结构控制的场景中至关重要(例如,驾驶场景中的异常插入和复杂人体姿态变换)。尽管近期大规模编辑模型(如Seedream、Nano Banana等)取得了进展,但大多数方法依赖单步生成。这种范式通常缺乏显式质量控制,可能引入与原始图像的过度偏差,并经常产生结构伪影或环境不一致的修改,通常需要手动调整提示才能获得可接受的结果。我们提出\textbf{CAMEO},一个结构化的多智能体框架,将条件编辑重构为质量感知、反馈驱动的过程,而非一次性生成任务。CAMEO将编辑分解为协调的阶段:规划、结构化提示、假设生成和自适应参考定位,仅在任务复杂度需要时才调用外部指导。为克服现有方法缺乏内在质量控制的不足,评估直接嵌入编辑循环中。通过结构化反馈迭代优化中间结果,形成闭环过程,逐步纠正结构和上下文不一致性。我们在异常插入和人体姿态切换任务上评估CAMEO。在多个强编辑骨干网络和独立评估模型上,CAMEO相比多个最先进模型平均胜率提升20%,展示了在条件图像编辑中更强的鲁棒性、可控性和结构可靠性。

英文摘要

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (i.e., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (i.e., Seedream, Nano Banana, etc), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose \textbf{CAMEO}, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves 20\% more win rate on average compared to multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.

3. 文生图 5 篇

2508.03483 2026-06-18 cs.CV cs.AI 版本更新 90%

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

当汽车有刻板印象:审计文本到图像模型中对象的群体偏见

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

发表机构 * AIM Intelligence(AIM智能研究院) Yonsei University(延世大学)

专题命中 文生图 :审计文本到图像模型中的群体偏见,涉及图像生成。

AI总结 提出SODA框架,通过三个指标系统测量文本到图像模型在生成对象中的群体偏见,发现中性提示隐含偏向中年和白人,且人口统计线索导致高度偏斜的刻板输出。

详情
AI中文摘要

虽然先前关于文本到图像生成的研究主要集中在人类描绘中的偏见,但生成对象中的群体偏见仍然相对未被充分探索。我们引入了SODA(刻板对象诊断审计),这是一个新颖的框架,通过自动属性发现和三个标准化指标系统地测量这些偏见:基础与群体差异(BDS)、跨群体差异(CDS)和视觉属性集中度(VAC)。将SODA应用于五个最先进模型和八个对象类别(例如汽车)的8000张图像,我们发现“中性”提示产生的输出在视觉上最接近中年和白人,表明这些群体在模型默认设置中被隐含地过度代表。此外,人口统计线索触发了高度偏斜的刻板输出:26.6%的对象-模型-群体组合产生的结果中,所有20张生成图像共享完全相同的属性值(例如,为女性生成玫瑰金笔记本电脑)。最后,提示级别的去偏减少了群体间差异,但矛盾地压缩了群体内多样性,用一种刻板印象取代了另一种。SODA提供了一个实用的流程,使这些隐含关联变得可测量,作为迈向更负责任的人工智能发展的一步。

英文摘要

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.

2606.11615 2026-06-18 cs.CV cs.CR cs.LG 新提交 85%

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Adv-TGD:面向人脸识别冒充攻击的对抗性文本引导扩散

Omid Ahmadieh, Nima Karimian

发表机构 * University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing(南佛罗里达大学贝利尼人工智能、网络安全与计算学院)

专题命中 文生图 :文本引导扩散生成对抗人脸

AI总结 提出Adv-TGD框架,利用Stable Diffusion和LoRA微调生成逼真对抗人脸,在保持视觉质量的同时实现高成功率身份冒充攻击,平均ASR达85.90%。

详情
AI中文摘要

人脸识别(FR)技术的广泛普及引发了严重的隐私担忧,因为面部数据可能在未经同意的情况下被利用。为了解决这一挑战,我们提出了Adv-TGD,一个生成式对抗攻击框架,能够合成逼真的人脸,冒充目标身份并欺骗人脸识别系统。基于Stable Diffusion,Adv-TGD对每个样本进行LoRA微调,以简洁的文本提示为条件,生成自然但具有对抗性操控的身份。与传统的身份攻击方法不同,我们的方法在单步去噪过程中为每个源-目标对优化轻量级交叉注意力适配器。潜在混合受到面部局部热图掩码的约束,以确保空间精确的身份操控,同时保留非敏感区域。我们引入了一个复合目标,结合了掩码epsilon-MSE重建、FR嵌入空间中的阈值化身份差异、方向特征对齐和源相似性抑制,以平衡对抗攻击和视觉真实性。可选地,LLaVA生成的属性提示增强了细粒度语义细节,而不会重新引入身份线索。在黑盒评估协议下,Adv-TGD在IR152、IRSE50、MobileFace和FaceNet上平均攻击成功率(ASR)达到85.90%,超过语义SOTA基线Adv-CPG +6.25个百分点、基于扩散的化妆方法DiffAIM +3个百分点以及基于噪声的P3-Mask +16个百分点。尽管攻击效果强劲,Adv-TGD仍保持了高视觉保真度(PSNR = 27.15 dB,SSIM = 0.981)。此外,我们通过成功将其扩展到野外数据集(LADN)、通用对象分类(ImageNet)和基于Transformer的扩散模型(FLUX.1),展示了我们框架的灵活性。

英文摘要

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion v2.1, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a fixed-timestep denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by 6.25 points, the diffusion-based makeup method DiffAIM by 3 points, and the noise-based P3-Mask by 16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 28.18 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

2605.14877 2026-06-18 cs.CV 版本更新 85%

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

HeatKV:针对视觉自回归建模的头部调制KV缓存压缩

Jonathan Cederlund, Axel Berg, William Isaksson, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson

发表机构 * Dept. of Automatic Control, Lund University(自动控制系,吕勒欧大学) Arm(Arm公司)

专题命中 文生图 :提出HeatKV压缩方法用于视觉自回归图像生成。

AI总结 本文提出HeatKV方法,通过根据每个头部对先前生成尺度的注意力进行调整,实现更高效的KV缓存压缩,提升内存利用率并保持图像生成质量。

Comments 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

详情
AI中文摘要

视觉自回归(VAR)模型最近在保持低延迟的同时展示了出色的图像生成质量。然而,它们受到严重的KV缓存内存限制,通常需要每个生成图像数吉字节的内存。我们引入了HeatKV,一种新的压缩方法,该方法根据每个头部对先前生成尺度的注意力来调整缓存分配。使用一个小的离线校准集,注意力头部根据其在先前尺度上的注意力分数进行排序。基于此排序,我们构建了一个针对给定内存预算定制的静态剪枝计划。应用于Infinity-2B模型时,HeatKV在KV缓存内存分配的压缩比上比现有方法高2倍,同时保持相似或更好的图像保真度、提示对齐度和人类感知分数。我们的方法在VAR模型的KV缓存压缩中达到了新的最先进的水平,展示了细粒度、特定头部的缓存分配的有效性。

英文摘要

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation. Code and calibration script available at https://github.com/arm-research/heatkv.

2606.18555 2026-06-18 cs.CV 新提交 70%

Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

重新思考文本到图像作为室内场景识别的语义感知数据增强

Trong-Vu Hoang, Quang-Binh Nguyen, Dinh-Khoi Vo, Hoai-Danh Vo, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM, Vietnam(越南国立大学胡志明市理科大学) Vietnam National University, Ho Chi Minh City, Vietnam(越南国立大学胡志明市分校)

专题命中 文生图 :利用稳定扩散生成合成图像

AI总结 针对室内图像数据不足,提出利用稳定扩散生成合成图像进行数据增强,并通过扩散重建误差防止滥用,在MIT室内场景数据集上验证了有效性。

Comments MAPR 2024

详情
AI中文摘要

在计算机视觉领域,室内图像识别由于光照条件、遮挡以及有限空间内多样化物体排列的复杂相互作用而面临挑战。为了解决训练室内图像缺乏的问题,我们引入了一种新颖的方法,利用稳定扩散(SD)生成合成图像,作为强大的数据增强工具。SD的使用提供了一个原则性框架,用于合成多样且逼真的室内场景,从而丰富训练数据池,以构建鲁棒的室内图像识别模型。在MIT室内场景数据集上的实验结果表明,当真实数据有限时,我们提出的方法在增强深度模型训练方面具有潜力。此外,为了防止SD合成图像的滥用,我们引入了一种基于扩散重建误差(DIRE)的应对措施。强大的DIRE表示使得仅使用轻量级深度模型就能训练鲁棒的分类器。实验表明,我们的方法能够完美识别SD生成的图像,使用MobilenetV3的准确率达到100%。

英文摘要

In the realm of computer vision, indoor image recognition presents challenges due to the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lacks of training indoor images, we introduce a novel approach leveraging Stable Diffusion (SD) for the generation of synthetic images, which serve as a powerful data augmentation tool. The utilization of SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a counter measure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE presentation enables training robust classifiers only using lightweight deep models. Experiments show that our approach can perfectly recognize SD generated images with the accuracy of 100% using MobilenetV3.

2606.18554 2026-06-18 cs.CV 新提交 60%

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

伪造灾难:扩散时代跨域合成灾难检测基准

Duc-Manh Phan, Quoc-Duy Tran, Duy-Khang Do, Anh-Tuan Vo, Hai-Dang Nguyen, Trong Le Do, Mai-Khiem Tran, Vinh-Tiep Nguyen, Tam V. Nguyen, Isao Echizen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM(胡志明市国家大学下属理科大学) Vietnam National University, Ho Chi Minh(胡志明市国家大学) University of Information Technology, VNU-HCM(胡志明市国家大学下属信息技术大学) University of Dayton(代顿大学) National Institute of Informatics(国立信息学研究所)

专题命中 文生图 :检测扩散模型生成的合成图像

AI总结 针对扩散模型生成的逼真灾难图像难以检测的问题,提出包含30000张图像(6000张真实、24000张合成)的基准数据集,实验发现微调检测器在未知生成器上准确率下降50%,零样本检测器也不稳定,凸显了跨域检测的迫切需求。

Comments SOICT 2025

详情
AI中文摘要

文本到图像扩散模型的快速进步使得创建高度逼真的合成图像成为可能,这些图像与真实照片极为相似,使得区分真实内容与AI生成的伪造品越来越困难。这对网络安全、数字取证和灾难响应构成了挑战,其中洪水、火灾或地震的虚假图像可能传播错误信息或扰乱应急行动。为此,我们引入了Forged Calamity,一个用于合成灾难检测的基准数据集,包含30000张图像,其中包括6000张真实样本和由四种扩散模型生成的24000张合成样本。在微调和零样本设置下的全面实验揭示了当前取证方法的一致弱点。微调检测器在分布内表现良好,但在未见过的生成器或灾难类型上准确率下降高达50%,显示出对模型特定伪影的过拟合。零样本通用检测器也难以保持稳定的准确率,只有少数具有鲁棒表示能力的模型表现出有限的韧性。这些发现凸显了持续存在的泛化差距,以及在扩散时代确保视觉真实性迫切需要领域和模型无关的检测方法。

英文摘要

The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50\% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

4. 扩散模型 5 篇

2606.19162 2026-06-18 cs.LG cs.CV 新提交 85%

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

奖励一直就在你的数据中:用判别器引导的强化学习纠正流匹配

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

发表机构 * FAIR at Meta Columbia University Mila -- Qu\' e bec AI Institute McGill University Canada CIFAR AI Chair

专题命中 扩散模型 :用RL纠正流匹配模型视觉缺陷,提升生成质量

AI总结 针对流匹配模型因损失函数与样本质量不匹配导致的视觉缺陷,提出判别器引导的强化学习(DRL),利用预训练空间中判别器的logit作为奖励,显著提升无引导FID和语义FD,并改善偏好对齐。

Comments 84 pages, including appendices

详情
AI中文摘要

得分匹配和流匹配模型通常依赖基于偏好的强化学习来实现两个目的:与主观偏好对齐,以及令人惊讶地恢复视觉真实性和连贯对象结构等属性——而这些属性本应通过匹配训练从数据本身学习。我们认为这反映了结构上的不匹配。匹配损失衡量训练时边缘分布下速度或得分场的$\ell_2$回归误差,这一代理指标与决定推理时样本质量的视觉和语义属性对齐不良。给定一个与这些属性对齐的奖励,强化学习通过评估模型自身生成的样本并直接遵循奖励景观来规避不匹配。挑战在于如何在不依赖人类偏好的情况下获得这样的奖励,因为人类偏好昂贵且会将数据真实性与标注者倾向混为一谈。我们提出判别器引导的强化学习(DRL)。DRL训练一个判别器,在预训练表示空间中区分数据样本和基础模型样本,并将其logit作为KL正则化强化学习中的奖励。预训练空间将判别器限制在感知有意义的方向上,而logit估计数据与模型之间的对数似然比,这是针对数据分布的最优奖励。在SiT、JiT、REPA和RAE上,DRL降低了无引导FID(例如,SiT上从9.38降至2.62)和语义空间FD(例如,SiT上DINOv3从88.2降至19.3),在所有骨干网络上均有一致提升,并且在没有经过偏好奖励训练的情况下改善了人类偏好奖励。在后续基于偏好的后训练中,DRL还在偏好奖励与图像保真度之间产生了更好的帕累托前沿,在提高对齐度的同时减少了过饱和和过亮等低级伪影。

英文摘要

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.

2606.18765 2026-06-18 cs.CV 新提交 85%

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

SpectralDiT:流匹配DiT的时间步条件谱残差校正

Jiayu Tian

发表机构 * Peking University(北京大学)

专题命中 扩散模型 :改进流匹配DiT,谱残差校正提升生成质量。

AI总结 提出SpectralDiT,通过时间步条件谱残差校正模块,在CIFAR-10和ImageNet-100上以极少额外计算和参数提升流匹配DiT的生成质量,FID分别降低5.1%和8.7%。

详情
AI中文摘要

我们提出SpectralDiT,一种对流匹配扩散变换器(Diffusion Transformers)的轻量级修改,它在MLP残差分支中添加了时间步条件谱校正。该模块将每个残差更新分解为补丁-令牌网格上的低频和高频分量,然后学习一个零初始化的加法门,使得模型最初与基线DiT匹配。在CIFAR-10像素空间生成中,SpectralDiT在补丁大小为1时将FID从20.78提升至19.71,并缩小了径向傅里叶谱差距。此外,我们将方法扩展到ImageNet-100上的潜在扩散。在额外理论FLOPs增加0.6%和参数增加1.36%的情况下,SpectralDiT改进了潜在流匹配,在无分类器引导(CFG 2.0)下实现了8.7%的相对FID降低。所有报告结果均为五个种子的平均值。在CIFAR-10上的消融实验和门控可视化揭示了稳定的块特定谱校正模式。

英文摘要

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.

2606.05883 2026-06-18 cs.CV 版本更新 85%

Geometry-Aware Dataset Condensation for Diffusion Model Training

面向扩散模型训练的几何感知数据集压缩

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

发表机构 * GitHub

专题命中 扩散模型 :面向扩散模型训练的几何感知数据集压缩

AI总结 针对扩散模型训练,提出基于几何感知分布对齐的真实子集选择方法,利用单侧部分最优传输保持几何结构,并辅以轻量级特征统计与语义一致性正则化,通过两阶段离散优化实现高效压缩。

Comments ICML 2026

详情
AI中文摘要

数据集压缩旨在通过合成或选择从真实数据中构建紧凑数据集。然而,现有方法不适用于扩散模型训练:合成数据生成通常产生不适合真实建模的低保真样本,而真实子集选择通常无法保留扩散似然目标所需的分布几何结构。为解决此问题,我们提出将真实子集选择重新表述为几何感知分布对齐问题。通过引入单侧部分最优传输,我们的方法选择性地将紧凑子集与完整数据分布对齐,同时允许低密度区域中的未匹配质量,确保保留扩散模型训练所需的有效几何结构。为进一步保证分布保真度,我们用轻量级特征统计和语义一致性正则化补充几何对齐。提出了一种高效的两阶段离散优化策略来实现该对齐目标。在扩散变体、子集大小、图像分辨率和训练轮次上的大量实验表明,我们的方法在扩散模型训练中实现了优越的保真度和分布覆盖。代码可在 https://github.com/2018cx/GADC 获取。

英文摘要

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, ensuring the preserved geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Codes are available at https://github.com/2018cx/GADC.

2606.19163 2026-06-18 cs.DC 新提交 75%

Pulse: Training Acceleration for Large Diffusion Models with Automatic Pipeline Parallelism

Pulse: 面向大规模扩散模型的自动流水线并行训练加速

Boran Sun, Guoyong Jiang, Lin Zhang, Chen Chen, Yuechen Tao, Zhishu Che, Jieling Yu, Shan Chang, Huaxi Gu, Fangming Liu, Bo Li

专题命中 扩散模型 :针对扩散模型训练加速,优化UNet流水线并行

AI总结 提出PULSE自动流水线并行策略,通过将跳跃连接层同设备放置、局部缓存激活值,消除跨流水线通信,结合动态规划分区器、ILP调度合成器和混合并行调优器,在通信受限硬件上实现最高2.3倍吞吐提升。

Comments Accepted by International Conference on Distributed Computing Systems(ICDCS'26)

详情
AI中文摘要

扩散模型目前是高保真图像和视频生成的主流方法,但在GPU集群上扩展其训练仍具挑战。与仅含Transformer的架构不同,扩散骨干通常采用具有异构层和长距离跳跃连接的UNet风格编码器-解码器结构。在传统流水线并行下,这些非局部依赖迫使大型跳跃激活值及其梯度穿越多个流水线边界,使得点对点(P2P)通信成为主要瓶颈,并显著降低流水线效率。本文提出PULSE,一种自动流水线并行训练策略,将跳跃局部性作为首要优化目标。PULSE通过将跳跃连接的编码器-解码器层放置在同一设备上,并在本地缓存跳跃激活值以供反向传播使用,从而消除跳跃引起的通信。为了在保持高流水线利用率的同时实现这种放置,PULSE协同设计了:(1)一种跳跃感知的动态规划分区器,在对称共置约束下平衡异构阶段负载;(2)一种基于ILP的调度合成器,为生成的阶段到设备映射生成气泡高效的波调度;(3)一种混合并行调优器,在内存和网络约束下选择流水线/数据并行度及微批次大小。大量实验表明,与最先进的并行策略相比,通信量可减少89%,在通信受限硬件上训练吞吐量可提升高达2.3倍。

英文摘要

Diffusion models are now a dominant approach for high-fidelity image and video generation, yet scaling their training across GPU clusters remains challenging. Unlike transformer-only architectures, diffusion backbones commonly adopt UNet-style encoder-decoder structures with heterogeneous layers and long-range skip connections. Under conventional pipeline parallelism, these non-local dependencies force large skip activations and their gradients to traverse multiple pipeline boundaries, making peer-to-peer (P2P) communication a dominant bottleneck and substantially reducing pipeline efficiency. In this paper, we present PULSE, an automatic pipeline-parallel training strategy that makes skip locality a first-class optimization objective. PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for later use in backpropagation. To realize this placement while maintaining high pipeline utilization, PULSE co-designs: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer that generates bubble-efficient wave schedules for the resulting stage-to-device mapping, and (3) a hybrid parallelism tuner that selects pipeline/data-parallel degrees and microbatch sizes under memory and network constraints. Our extensive experiments show that the volume of communication can be reduced by 89 percent, and the training throughput can be increased by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.

2606.19151 2026-06-18 cs.CY cs.CV 新提交 70%

The Market in the Model: Latent Diffusion as Neural Economy

模型中的市场:潜在扩散作为神经经济

Eryk Salvaggio

发表机构 * Cambridge Digital Humanities(剑桥数字人文研究中心) University of Cambridge(剑桥大学) Machine Visual Culture Research Group(机器视觉文化研究组) Max Planck Institute(马克斯·普朗克研究所)

专题命中 扩散模型 :分析潜在扩散模型机制,属于图像生成理论

AI总结 本文从计算机视觉工程问题出发,分析潜在扩散模型的机制,论证其作为神经经济运作,将社会交流抽象为可通约向量,并警示仅关注版权与商品防御的批评可能强化模型产生的拜物教。

详情
AI中文摘要

在视觉文化和人文学科中,对生成图像模型的有价值批评强调了数据集在塑造其生成图像中的作用。然而,对嵌入模型机制的意识形态立场的细致研究一直被忽视,使得它们被想象为“黑箱”。为了扩展而非取代数据集批评,本文从潜在扩散模型被引入以解决计算机视觉工程师问题的角度,以及每个组件被赋予自动化决策的任务,审视了其机制。我通过其各部分的历史以及系统刻入每个生成图像中的视觉理论来解释这个集成。借鉴Impett和Offert的神经交换价值概念,我提出这一分析以论证该模型作为神经经济运作:一个封闭的符号系统,将社会交流抽象为可通约向量,同时将社会领域转化为待售包裹。逐组件追踪训练和生成流程揭示了每个操作取代了什么,以及它如何进一步巩固平台经济和注意力经济对社会交流的逻辑。本文警告,任何只关注版权和商品防御的批评都可能重申模型所产生的拜物教,并主张以社会交换为中心。

英文摘要

Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.

5. 可控生成 3 篇

2606.16849 2026-06-18 cs.NE cs.GR cs.HC 新提交 80%

Evolution & Foundation: AI Shares Creative Control

进化与基础模型:AI共享创意控制

Dylan Banarse, Stephen Todd, William Latham, Frederic Fol Leymarie

专题命中 可控生成 :遗传算法与多模态AI生成3D有机形态

AI总结 提出一种结合遗传算法与多模态AI基础模型的框架,实现自动化设计3D有机形态,将艺术家角色从直接选择转变为系统设计,加速创意探索。

详情
AI中文摘要

本文研究使用进化系统进行自动化设计和艺术评估的创意过程。我们考虑多模态人工智能(AI)模型如何与组合生成和进化计算系统进行通信和引导。通过将遗传算法与大规模AI基础模型的视觉推理能力相结合,创建了一个用于进化美观的复杂3D有机形态的框架。该框架将艺术家的角色从密集的直接选择转变为系统设计;将详细的逐步策划转移给能够进行多模态审美判断的AI代理。该框架使人类艺术家/设计师能够快速穿越多维进化参数空间的大片区域,基于其语义目标找到创意结果。为每个实验生成AI审美推理的详细审计轨迹。交互式可视化工具,连同AI生成的摘要和进化叙事,使得能够深入探索每个进化实验,并提供对AI引导过程的透明洞察。

英文摘要

This paper investigates the creative process of automated design and artistic evaluation using an evolutionary system. We consider how a multimodal artificial intelligence (AI) model can communicate and guide a combined generative and evolutionary computational system. This creates a framework for the evolution of aesthetically pleasing complex 3D organic forms by integrating genetic algorithms with the visual reasoning capabilities of large-scale AI foundation models. The framework shifts the artist role from that of intensive direct selection to one of system design; transferring detailed step-by-step curation to an AI agent capable of multimodal aesthetic judgement. This framework enables the human artist/designer to rapidly traverse large areas of multi-dimensional evolutionary parameter space to find creative outcomes based on their semantic targets. Detailed audit trails of the AI's aesthetic reasoning are generated for each experiment. Interactive visualisation tools, together with AI-generated summaries and evolutionary narratives, enable deep exploration into each evolutionary experiment and providing a transparent insight into the AI-guided process.

2606.13768 2026-06-18 cs.CV cs.AI 新提交 80%

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

CineOrchestra:面向电影视频生成的统一实体中心条件控制

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

发表机构 * Snap Inc.(Snap公司) UC Merced(加州大学默塞德分校)

专题命中 可控生成 :扩散模型实现细粒度条件控制

AI总结 提出CineOrchestra,一种统一控制主体、事件、相机和镜头切换的视频扩散模型,通过实体中心条件原语和参数无关的旋转位置编码实现多轴联合控制,在密集描述跟随和镜头切换时序上超越六种专用方法。

Comments Project page: https://snap-research.github.io/CineOrchestra

详情
AI中文摘要

电影视频描绘了多个主体在特定时刻行动或互动,通过有意的相机运动捕捉,并由镜头切换拼接而成。这些元素共同要求比当前文本到视频模型更细粒度的控制。现有工作分别处理每个轴:多主体个性化、时间控制、多镜头合成或相机控制;没有先前的框架能联合集成所有四个轴。我们提出CineOrchestra,一种统一的视频扩散模型,同时控制主体、事件、相机和镜头切换。我们的关键洞察是,这些异构的电影元素共享一个基本结构:每个元素都是在特定时间间隔内行动的实体,因此都可以通过一个共享的实体中心条件原语结构来表达,并辅以视觉实体的参考图像。这种表述将架构挑战简化为单个位置编码问题,我们通过两个参数无关的协调旋转嵌入来解决:(a) 间隔采样的时间RoPE,在持续时间差异巨大的事件上产生一致注意力行为;(b) 2D实体-时间交叉注意力RoPE,消除每个实体条件的歧义,并将其路由到对应的时空区域。在两个新基准上,CineOrchestra在密集描述跟随和镜头切换时序上优于六种每轴专家方法,在成对用户研究和组件消融中持续获得增益。

英文摘要

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations. Project page: https://snap-research.github.io/CineOrchestra

2606.18788 2026-06-18 cs.CV cs.CL 新提交 75%

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

HandwritingAgent: 语言驱动的可缩放矢量空间手写合成

Jaward Sesay, Yue Yu, Börje F. Karlsson

发表机构 * Beijing Institute of Technology(北京理工大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

专题命中 可控生成 :语言驱动的手写笔画序列生成

AI总结 提出HandwritingAgent,利用大推理模型在SVG格式中自动回归生成手写笔画序列,无需风格特定训练,通过自然语言和参考图像控制风格,在模仿、识别、多语言及复杂数学表达式合成等任务上达到或超越现有最优方法。

详情
AI中文摘要

教会机器模仿自然手写风格仍然是一个开放挑战,因为它需要合成在形状、纹理、压力和字体上动态变化的笔画序列——不仅在不同个体之间,而且在同一个人的手写中也是如此。针对这一挑战的尝试主要探索了在线和离线环境下的深度学习方法。然而,这些方法通常受到风格特定架构选择、对大型数据集的严重依赖、高计算成本以及缺乏通过自然语言灵活控制书写风格的限制。为此,我们引入了HandwritingAgent,一个语言驱动的智能体,它可以直接在可缩放矢量图形(SVG)格式中合成自然手写序列,无需风格特定训练。该智能体利用大型推理模型在离散网格画布环境中对目标手写字形进行几何分析并自回归生成笔画序列。生成过程以对话或非对话模式提供的文本以及参考手写风格图像为条件。在涵盖模仿、识别、多语言手写合成以及复杂手写数学和科学表达式生成等多样化手写任务上的实验表明,性能有显著提升,HandwritingAgent匹配或超越了最先进的生成式手写模型,同时提供了一种更高效、可控且泛化能力更强的合成方法。

英文摘要

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

6. 其他图像生成 2 篇

2606.19259 2026-06-18 cs.CV cs.AI 新提交 70%

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

一个用于检测 GPT-Image-2 生成的含丰富文本图像的多领域基准

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

发表机构 * College of Computer Science(计算机科学学院)

专题命中 其他图像生成 :检测GPT-Image-2生成的图像

AI总结 针对现有基准缺乏文本丰富图像检测的问题,构建了包含8602张图像、覆盖6个类别的多领域基准,评估5种检测器,发现性能高度依赖领域且易受JPEG压缩影响。

详情
AI中文摘要

含丰富文本的图像通常包含隐私敏感、交易或决策相关信息。随着最近多模态图像生成模型合成逼真文本内容和结构化视觉设计的能力越来越强,检测AI生成的含丰富文本图像已成为数字信任和内容真实性的重要挑战。然而,现有基准主要关注以物体为中心的图像,对文本语义和布局组织至关重要的场景覆盖有限。在本文中,我们引入了一个用于检测OpenAI的GPT Image 2生成的含丰富文本图像的多领域基准。该基准包含8602张图像,涵盖六个代表性类别:商业海报、信息图表、学术海报、收据、表格和UI截图。利用该基准,我们在零样本设置下评估了五种代表性AI生成图像检测器,并分析了它们的整体性能、类别性能和后处理鲁棒性。我们的结果表明,检测器性能高度依赖于领域:在某些类别上表现良好的方法往往在其他类别上失败,即使最强的传统检测器也对JPEG压缩表现出严重敏感性。我们进一步使用多模态视觉语言模型进行了探索性评估,揭示了其在结构化格式上的潜力和局限性。这些发现突显了针对现代AI生成图像需要文本和布局感知的检测方法。我们的数据集发布于XXX。

英文摘要

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

2605.08189 2026-06-18 eess.AS 版本更新 55%

DiffVQE: Hybrid Diffusion Voice Quality Enhancement Under Acoustic Echo and Noise

DiffVQE:声学回声和噪声下的混合扩散语音质量增强

Haljan Lugo, Ernst Seidel, Pejman Mowlaee, Ziyue Zhao, Tim Fingscheidt

专题命中 其他图像生成 :提出扩散模型用于语音质量增强,非图像生成。

AI总结 提出首个基于扩散的声学回声控制模型DiffVQE,在回声和噪声控制性能、计算复杂度和模型大小上均优于判别式DeepVQE模型。

Comments 6 pages, 4 figures, accepted at Interspeech 2026

详情
AI中文摘要

声学回声和背景噪声对免提系统和免提电话中的语音增强提出了挑战。判别式训练的端到端方法为联合声学回声控制(AEC)和去噪提供了强大的解决方案。然而,随着生成方法的出现,基于扩散的方法在语音增强任务中表现出卓越的性能。在这项工作中,据我们所知,我们提供了第一个(仍然是非因果的)基于扩散的AEC模型(DiffVQE),该模型在拓扑结构、训练数据和训练框架方面是可复现的。到目前为止,在不使用扩散的情况下,微软的判别式DeepVQE模型已被证明优于ICASSP 2023 AEC挑战赛的任何参赛作品,取得了卓越的性能。使用来自Interspeech 2025 URGENT挑战赛的数据作为多样化、高质量的训练数据集,我们的DiffVQE在回声和噪声控制性能以及计算复杂度和模型大小方面均优于DeepVQE。

英文摘要

Acoustic echo and background noise pose challenges on speech enhancement in hands-free systems and speakerphones. Discriminatively trained end-to-end methods represent a powerful solution for joint acoustic echo control (AEC) and denoising. However, with the advent of generative methods, diffusion-based approaches have seen remarkable performance in speech enhancement tasks. In this work, to the best of our knowledge, we provide the first (still non-causal) diffusion-based AEC model (DiffVQE) that is reproducible in terms of topology, training data, and training framework. So far, without employing diffusion, Microsoft's discriminative DeepVQE model has been shown to excel any of the ICASSP 2023 AEC Challenge entries achieving remarkable performance. Using data from the Interspeech 2025 URGENT Challenge for a diverse, high-quality training dataset, our DiffVQE excels DeepVQE both in echo and noise control performance, as well as in computational complexity and model size.