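arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Listed for today/current date: 5. Sources: cs.CV, cs.GR, cs.MM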
2606.19718 2026-06-19 cs.CV New submission 90%

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

Affiliations: PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology; Advanced Laser Technology Laboratory of Anhui Province, Electronic Engineering Institute, National University of Defense Technology, and Jianghuai Advance Technology Center

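Topic match: Controllable generation. Synthesizes human images in novel views and poses with a diffusion model.

AI summary: Proposes a method based on a conditional denoising diffusion model that uses 3D human priors (a normal map and a color prompt) as geometry and color conditions to synthesize high-quality human images in arbitrary poses and views from a single reference image, including occluded parts.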

Comments 30 pages, 10 figures

Abstract

This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints, or synthesize human images with a generalizable human NeRF that uses human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may fail to recover occluded/invisible human parts accurately when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., a 3D normal map and a color prompt, as geometry and color conditions in the generation process. By transferring the reference human into the target human through a series of diffusion steps, our diffusion model enables high-quality synthesis, including the occluded/invisible parts. Further, we propose a self-reconstruction-based customized refinement to enhance fine details when tested on novel persons. Experimental results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at https://github.com/Yankeegsj/3DPGDM.
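
To make "a sequence of conditional denoising steps" concrete, here is a minimal sketch of a deterministic (DDIM-style) sampling loop in which every step receives the same conditions. It is not the authors' implementation: the sampler choice, the noise schedule `alpha_bar`, the dictionary keys `normal_map` and `color_prompt`, and the toy oracle denoiser are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]
Condition = Dict[str, Vector]


def alpha_bar(t: int, num_steps: int) -> float:
    """Assumed cosine-style signal level: exactly 1.0 at t=0, small but positive at t=num_steps."""
    return math.cos(0.5 * math.pi * t / (num_steps + 1)) ** 2


def conditional_ddim_sample(
    x_start: Vector,
    denoiser: Callable[[Vector, int, Condition], Vector],
    cond: Condition,
    num_steps: int,
) -> Vector:
    """Run num_steps deterministic denoising steps, each conditioned on `cond`."""
    x = list(x_start)
    for t in range(num_steps, 0, -1):
        a_t = alpha_bar(t, num_steps)
        a_prev = alpha_bar(t - 1, num_steps)
        eps = denoiser(x, t, cond)  # predicted noise given the conditions
        # Estimate the clean sample implied by the predicted noise.
        x0 = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
        # Move to the next, less noisy level along the same noise direction.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei for x0i, ei in zip(x0, eps)]
    return x


def make_toy_denoiser(num_steps: int) -> Callable[[Vector, int, Condition], Vector]:
    """Hypothetical oracle: treats normal_map * color_prompt as the clean target."""
    def denoiser(x: Vector, t: int, cond: Condition) -> Vector:
        a_t = alpha_bar(t, num_steps)
        target = [n * c for n, c in zip(cond["normal_map"], cond["color_prompt"])]
        return [(xi - math.sqrt(a_t) * ti) / math.sqrt(1.0 - a_t) for xi, ti in zip(x, target)]
    return denoiser
```

In a real system the oracle would be replaced by a trained network; the loop structure, where the geometry and color conditions are fed into every denoising step, is the part this sketch is meant to show.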

2606.20110 2026-06-19 cs.CV New submission 80%

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

Yuhwan Jeong, Hyeonseong Kim, Daehyun We, Seonkyu Song, Jinnyeong Yang, Hyun-Kurl Jang, Youngho Yoon, Kuk-Jin Yoon

Affiliations: KAIST, Visual Intelligence Lab

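Topic match: Controllable generation. Text-guided driving scene generation.

AI summary: Proposes FrozenDrive, a framework that uses a frozen pretrained diffusion model with knowledge-preserving spatio-temporal attention to achieve multi-view consistency and temporal coherence, generating driving scenes under adverse weather without fine-tuning and improving the robustness of autonomous driving models.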

Comments Accepted to ECCV 2026

Abstract

Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain, as enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pretrained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pretrained diffusion model's knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive-augmented data significantly improves AD models' performance, especially at night and in rain, demonstrating stronger robustness when trained with our scenario-targeted data.
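
The abstract does not say how the attention is made consistent without new parameters. One common training-free approach is extended attention, where an existing attention operation is reused and only the set of keys and values is widened to include tokens from neighbouring views. The sketch below illustrates that idea under stated assumptions; it is not FrozenDrive's mechanism, and `attend`, `cross_view_attention`, and the one-neighbour window are hypothetical choices.

```python
import math
from typing import List

Token = List[float]


def attend(query: Token, keys: List[Token], values: List[Token]) -> Token:
    """Scaled dot-product attention for a single query (no learned weights)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def cross_view_attention(views: List[List[Token]]) -> List[List[Token]]:
    """Each view attends over its own tokens plus those of adjacent views.

    Assumption: views are ordered so that index neighbours overlap spatially.
    """
    outputs = []
    for i, tokens in enumerate(views):
        neighbours = [j for j in (i - 1, i, i + 1) if 0 <= j < len(views)]
        # Extended keys/values: concatenate tokens from the neighbouring views.
        shared = [tok for j in neighbours for tok in views[j]]
        outputs.append([attend(tok, shared, shared) for tok in tokens])
    return outputs
```

Because no weights are added, identical views produce exactly the single-view result, while differing views pull each other's features together. The same widening could be applied along the time axis by treating adjacent frames as neighbours.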

2606.20083 2026-06-19 cs.CV New submission 80%

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

Affiliations: University of Science and Technology of China; Li Auto; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

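Topic match: Controllable generation. Joint control of camera, objects, and weather.

AI summary: Proposes Holo-World, a unified video world model that jointly controls camera, object motion, and weather from a single image, achieving world preservation and weather transfer through a scene adapter and decomposed CFG.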

Comments Project Page: https://xiangchenyin.github.io/Holo-World; Code: https://github.com/XiangchenYin/Holo-World

Abstract

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
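
Guiding "scene and weather residuals separately" can be read as a two-scale form of classifier-free guidance. The sketch below shows one plausible formulation, assuming three noise predictions (unconditional, scene-only, scene plus weather) and two independent guidance weights. The function name, argument names, and the exact decomposition are assumptions; the abstract does not give the paper's formula.

```python
from typing import List, Sequence


def decomposed_cfg(
    eps_uncond: Sequence[float],
    eps_scene: Sequence[float],
    eps_full: Sequence[float],
    w_scene: float,
    w_weather: float,
) -> List[float]:
    """Combine noise predictions with separate scene and weather guidance scales.

    scene residual   = eps_scene - eps_uncond  (effect of the scene controls)
    weather residual = eps_full  - eps_scene   (extra effect of the weather instruction)
    """
    return [
        u + w_scene * (s - u) + w_weather * (f - s)
        for u, s, f in zip(eps_uncond, eps_scene, eps_full)
    ]
```

With `w_scene == w_weather == w` this reduces to standard classifier-free guidance on the full condition, `u + w * (f - u)`. Raising `w_weather` alone strengthens only the weather residual, which matches the stated goal of avoiding over-amplification of the full condition.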

2606.19736 2026-06-19 cs.CV New submission 80%

VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

Affiliations: State Key Laboratory of Intelligent Vehicle Safety Technology; School of Cyber Science and Engineering, Huazhong University of Science and Technology; School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Engineering, Huazhong University of Science and Technology; Hebei Energy College of Vocation And Technology

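Topic match: Controllable generation. Generates adversarial patterns with a diffusion texture generator.

AI summary: Proposes an end-to-end framework that combines UV-volume rendering with a diffusion texture generator and introduces an illumination color consistency estimator and a multi-scale dynamic training strategy to generate wearable adversarial patterns, achieving stable physical attacks under dynamic viewpoints and lighting changes such as those in UAV reconnaissance.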

Comments Accepted by ICME 2026

Abstract

Adversarial camouflage in the physical world remains highly challenging, particularly under UAV reconnaissance where targets undergo continuous geometric changes and extreme illumination variations. Existing methods either optimize 2D digital perturbations that fail to generalize to dynamic viewpoints or produce visually unnatural textures that cannot be deployed in real scenarios. Therefore, we propose an end-to-end framework for adversarial camouflage generation that automatically produces wearable adversarial patterns and maintains stable attack performance in real physical environments with changing viewpoints, poses, and lighting conditions. Our method integrates UV-volume rendering with a diffusion-based texture generator, enabling consistent appearance under varying scales, poses, and lighting conditions. To ensure environmental realism, we propose an illumination color consistency estimator that extracts dominant background attributes and guides a natural texture loss to align the generated UV texture with the surrounding environment. A multi-scale dynamic training strategy further enhances robustness against viewpoint shifts and body deformation. Extensive experiments across multiple mainstream detectors demonstrate that our method achieves strong and stable physical attack performance while maintaining high perceptual naturalness, reducing human detection rates without introducing unnatural artifacts.

2606.15015 2026-06-19 cs.CV cs.AI New submission 70%

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

Affiliations: University of Oxford; University of Cambridge

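Topic match: Controllable generation. Guides physically consistent video generation.

AI summary: Proposes NEXUS, a neural energy-field framework that models conservative and non-conservative dynamics through scalar energy and dissipation terms, improving long-horizon trajectory accuracy in contact-rich 3D scenes and guiding video generation.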

Comments 18 pages, 4 figures, 6 tables. Preprint

Abstract

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.
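
The recipe in the abstract (additive energy terms, a Rayleigh-style dissipation function, forces obtained by differentiation, and a multi-substep semi-implicit rollout) can be illustrated on a single damped point mass. The sketch below is a hand-written toy, not NEXUS: the energy and dissipation functions are fixed formulas rather than learned networks, finite differences stand in for automatic differentiation, and all constants and function names are assumptions.

```python
from typing import Callable, List, Tuple

Vector = List[float]

# Illustrative constants (assumed): stiffness, mass, gravity, damping coefficient.
K, MASS, GRAVITY, DAMPING = 10.0, 1.0, 9.81, 0.5


def energy(q: Vector) -> float:
    """Conservative energy as a sum of terms: elastic spring + gravity on the last axis."""
    elastic = 0.5 * K * sum(x * x for x in q)
    gravity = MASS * GRAVITY * q[-1]
    return elastic + gravity


def dissipation(v: Vector) -> float:
    """Rayleigh dissipation function R(v) = 0.5 * c * |v|^2."""
    return 0.5 * DAMPING * sum(x * x for x in v)


def gradient(f: Callable[[Vector], float], x: Vector, eps: float = 1e-5) -> Vector:
    """Central finite differences, standing in for automatic differentiation."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def step(q: Vector, v: Vector, dt: float, substeps: int = 4) -> Tuple[Vector, Vector]:
    """One time step split into substeps of semi-implicit (symplectic) Euler."""
    h = dt / substeps
    for _ in range(substeps):
        d_energy = gradient(energy, q)
        d_dissipation = gradient(dissipation, v)
        # Force = -dE/dq - dR/dv; update velocity first, then position with the new velocity.
        v = [vi - h * (e + r) / MASS for vi, e, r in zip(v, d_energy, d_dissipation)]
        q = [qi + h * vi for qi, vi in zip(q, v)]
    return q, v


def rollout(q: Vector, v: Vector, dt: float, steps: int) -> Tuple[Vector, Vector]:
    """Roll the state forward for a number of time steps."""
    for _ in range(steps):
        q, v = step(q, v, dt)
    return q, v
```

Called as `rollout([0.3, 0.0, 0.5], [0.0, 0.0, 0.0], dt=0.02, steps=2000)`, the mass should settle near the static equilibrium where the spring force balances gravity, at height `-MASS * GRAVITY / K`, because the dissipation term removes energy on every substep.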