arXivDaily arXiv每日学术速递 周一至周五更新

视觉与机器人

图像生成

图像生成、文生图、图像编辑、扩散模型和可控生成。

今日/当前日期收录 2 信号源:cs.CV, cs.GR, cs.MM
2605.21431 2026-06-18 cs.CV 版本更新 85%

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

iTryOn: 通过空间-语义引导掌握交互式视频虚拟试穿

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

发表机构 * Shenzhen Campus of Sun Yat-sen University Taobao \& Tmall Group of Alibaba

专题命中 图像编辑 :交互式视频虚拟试穿,属于图像生成与编辑。

AI总结 本文提出iTryOn框架,通过空间-语义引导解决交互式视频虚拟试穿中的语义模糊和复杂服装变形问题,实现了更动态可控的虚拟试穿体验。

Comments Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026

详情
AI中文摘要

视频虚拟试穿(VVT)旨在无缝替换视频中人物身上的衣物。尽管现有方法在保持时间一致性方面取得了显著进展,但它们主要局限于非交互场景,其中模型仅展示衣物。这种限制忽略了现实世界服装展示中的关键方面:主动的人-衣物互动。为弥合这一差距,我们引入并正式化了一个新的挑战性任务:交互式视频虚拟试穿(Interactive VVT),其中视频中的主体主动与衣物互动。该任务引入了超出简单纹理保留的独特挑战,包括:(1)从标准姿态信息中解决交互的语义模糊性,以及(2)从视频中学习复杂的衣物变形,其中交互时刻稀少且短暂。为了解决这些挑战,我们提出了iTryOn,一种基于大规模视频扩散Transformer的新型框架。iTryOn首创多级交互注入机制,以引导复杂动态的生成。在空间层面,我们引入了服装无关的3D手先验,以提供精细的指导,精确的手-服装接触,有效解决空间模糊性。在语义层面,iTryOn利用全局描述词提供整体上下文,并利用时间戳动作描述词提供局部交互,通过我们新颖的Action-aware Rotational Position Embedding(A-RoPE)进行同步。广泛的实验表明,iTryOn不仅在传统VVT基准上实现了最先进的性能,还在新的交互设置中建立了显著的领先优势,标志着更动态和可控的虚拟试穿体验的重要一步。

英文摘要

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

2604.03156 2026-06-18 cs.CV 版本更新 85%

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

CAMEO: 一种条件感知与质量驱动的多智能体图像编辑编排器

Yuhan Pu, Hao Zheng, Ziqian Mo, Zirui Pang, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Harbin Institute of Technology(哈尔滨工业大学) Shenzhen University(深圳大学) Claremont McKenna College(克莱蒙特麦肯纳学院) Research Institute of Petroleum Exploration and Development, CNPC(石油勘探开发研究院,中石油)

专题命中 图像编辑 :多智能体框架进行条件图像编辑,含质量评估

AI总结 提出CAMEO多智能体框架,将条件图像编辑重构为质量感知的反馈驱动过程,通过分解编辑阶段、嵌入评估循环,在异常插入和人体姿态切换任务中平均胜率提升20%。

详情
AI中文摘要

条件图像编辑旨在根据文本提示和可选的参考指导修改源图像。这种编辑在需要严格结构控制的场景中至关重要(例如,驾驶场景中的异常插入和复杂人体姿态变换)。尽管近期大规模编辑模型(如Seedream、Nano Banana等)取得了进展,但大多数方法依赖单步生成。这种范式通常缺乏显式质量控制,可能引入与原始图像的过度偏差,并经常产生结构伪影或环境不一致的修改,通常需要手动调整提示才能获得可接受的结果。我们提出\textbf{CAMEO},一个结构化的多智能体框架,将条件编辑重构为质量感知、反馈驱动的过程,而非一次性生成任务。CAMEO将编辑分解为协调的阶段:规划、结构化提示、假设生成和自适应参考定位,仅在任务复杂度需要时才调用外部指导。为克服现有方法缺乏内在质量控制的不足,评估直接嵌入编辑循环中。通过结构化反馈迭代优化中间结果,形成闭环过程,逐步纠正结构和上下文不一致性。我们在异常插入和人体姿态切换任务上评估CAMEO。在多个强编辑骨干网络和独立评估模型上,CAMEO相比多个最先进模型平均胜率提升20%,展示了在条件图像编辑中更强的鲁棒性、可控性和结构可靠性。

英文摘要

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (i.e., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (i.e., Seedream, Nano Banana, etc), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose \textbf{CAMEO}, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves 20\% more win rate on average compared to multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.