Large Video Models - arXivDaily Topic

2606.20083 2026-06-19 cs.CV New submission 90%

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

Affiliations: University of Science and Technology of China; Li Auto; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Topic match (video generation): controllable video world model generation

AI summary: Proposes Holo-World, a unified video world model that jointly controls camera, object motion, and weather from a single image, using a scene adapter and decomposed CFG to achieve world preservation and weather transfer.

Comments: Project page: https://xiangchenyin.github.io/Holo-World; Code: https://github.com/XiangchenYin/Holo-World

Abstract

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
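
The abstract does not give the formula for Scene-Weather Decomposed CFG, so the following is only a minimal sketch of one plausible reading: the scene residual and the weather residual each get their own guidance scale. The three-pass setup (`eps_uncond`, `eps_scene`, `eps_full`), the function name, and the scale values are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def decomposed_cfg(
    eps_uncond: Sequence[float],
    eps_scene: Sequence[float],
    eps_full: Sequence[float],
    scene_scale: float,
    weather_scale: float,
) -> List[float]:
    """Combine three denoiser outputs with separate guidance scales.

    eps_uncond: prediction with all conditions dropped
    eps_scene:  prediction with scene controls only (assumed extra pass)
    eps_full:   prediction with scene controls plus the weather instruction
    """
    if not (len(eps_uncond) == len(eps_scene) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    guided = []
    for u, s, f in zip(eps_uncond, eps_scene, eps_full):
        scene_residual = s - u    # what the scene controls add
        weather_residual = f - s  # what the weather instruction adds on top
        guided.append(
            u + scene_scale * scene_residual + weather_scale * weather_residual
        )
    return guided


# Toy values standing in for flattened noise predictions.
u, s, f = [0.0, 1.0], [1.0, 1.0], [1.5, 3.0]
same_as_full = decomposed_cfg(u, s, f, scene_scale=1.0, weather_scale=1.0)
stronger_weather = decomposed_cfg(u, s, f, scene_scale=2.0, weather_scale=4.0)
```

With both scales set to 1 the result equals the full-condition prediction. Raising `weather_scale` alone amplifies only the weather residual, which is the behaviour the abstract describes as strengthening weather effects without over-amplifying the full condition.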

2606.20310 2026-06-19 cs.CV New submission 85%

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

Haoxuan Wu, Lai Man Po, Mengyang Liu, Kun Li, Hongzheng Yang, Wei Liu

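Affiliations: City University of Hong Kong; Video Rebirth; The Chinese University of Hong Kong

Topic match (video generation): decoding preferences from intermediate states of video diffusion models

AI summary: Proposes PRISM, which uses a frozen video diffusion backbone and a lightweight query-based aggregation head to decode preference signals from noisy latents, achieving high preference-prediction accuracy and noise robustness and supporting early-stage Best-of-N sampling that reduces compute and improves video quality.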

Abstract

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-N sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.
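
Early-stage Best-of-N sampling, as described above, can be sketched as follows: denoise all candidates for a few steps, score the still-noisy latents, and finish denoising only the best ones. `denoise_step` and `score_fn` are hypothetical callables standing in for a diffusion sampler step and a latent-space preference scorer; the abstract does not specify PRISM's actual interfaces.

```python
from typing import Callable, List, Sequence, TypeVar

Latent = TypeVar("Latent")


def early_best_of_n(
    candidates: Sequence[Latent],
    denoise_step: Callable[[Latent, int], Latent],
    score_fn: Callable[[Latent], float],
    total_steps: int,
    early_steps: int,
    keep: int = 1,
) -> List[Latent]:
    """Prune candidates after `early_steps`, then finish only the survivors."""
    if not 0 <= early_steps <= total_steps:
        raise ValueError("early_steps must lie in [0, total_steps]")
    if not 1 <= keep <= len(candidates):
        raise ValueError("keep must lie in [1, number of candidates]")
    latents = list(candidates)
    for step in range(early_steps):
        latents = [denoise_step(z, step) for z in latents]
    # Score the noisy latents directly; no decoding to pixels is involved.
    survivors = sorted(latents, key=score_fn, reverse=True)[:keep]
    for step in range(early_steps, total_steps):
        survivors = [denoise_step(z, step) for z in survivors]
    return survivors


def denoiser_calls(n: int, total_steps: int, early_steps: int, keep: int = 1) -> int:
    """Number of denoiser evaluations used by the pruned schedule."""
    return n * early_steps + keep * (total_steps - early_steps)


# Toy stand-ins: a "latent" is a float, a step halves it, higher score wins.
result = early_best_of_n(
    [4.0, -2.0, 8.0],
    denoise_step=lambda z, step: z * 0.5,
    score_fn=lambda z: z,
    total_steps=3,
    early_steps=1,
)
pruned_cost = denoiser_calls(3, total_steps=3, early_steps=1)
full_cost = denoiser_calls(3, total_steps=3, early_steps=3)
```

In this toy run, pruning after one step needs 5 denoiser calls instead of the 9 needed to finish all three candidates. The saving grows with the number of candidates and with how early the scorer can be trusted.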

2606.20233 2026-06-19 cs.CV New submission 85%

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

Tianyi Xiang, Mingming He, Li Ma, Jing Liao

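Affiliations: City University of Hong Kong; Independent Researcher

Topic match (video generation): end-to-end video diffusion framework for compositing

AI summary: Proposes an end-to-end video diffusion framework that models the bidirectional physical and lighting interactions between character and environment through tri-mask guidance and RGB-D joint denoising, producing high-quality dynamic video composites.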

Abstract

Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.

2606.19958 2026-06-19 cs.CV New submission 85%

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li

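Affiliations: Beijing University of Posts and Telecommunications

Topic match (video generation): proposes the SketchKeyAnime framework for controllable animation generation

AI summary: Proposes SketchKeyAnime, a video diffusion framework that uses a dual-branch conditioning mechanism and sketch cross-attention with learnable gating to generate structurally controllable, appearance-consistent, and temporally coherent animation from a single reference RGB image and sparse key sketches, significantly outperforming baselines on the Sakuga-42M dataset.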

Abstract

Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.
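
The abstract says only that the Adaptive Weighted Loss strengthens supervision on key-sketch frames and line-art regions. One simple way to express that idea is a weighted mean squared error, sketched below. The multiplicative weighting scheme and the `key_boost` and `line_boost` parameters are assumptions for illustration, not the paper's definition.

```python
from typing import List, Sequence, Set


def weighted_mse(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    line_mask: Sequence[Sequence[int]],
    key_frames: Set[int],
    key_boost: float = 1.0,
    line_boost: float = 1.0,
) -> float:
    """Weighted MSE over frames stored as flat pixel lists.

    line_mask marks line-art pixels with 1; key_frames holds the indices of
    frames that have a key sketch. Both boosts are hypothetical knobs.
    """
    total = 0.0
    weight_sum = 0.0
    for t, (p_frame, y_frame, m_frame) in enumerate(zip(pred, target, line_mask)):
        frame_weight = 1.0 + (key_boost if t in key_frames else 0.0)
        for p, y, m in zip(p_frame, y_frame, m_frame):
            weight = frame_weight * (1.0 + line_boost * m)
            total += weight * (p - y) ** 2
            weight_sum += weight
    if weight_sum == 0.0:
        raise ValueError("no pixels to average over")
    return total / weight_sum


# Two frames of two pixels; the only error sits on a line pixel of key frame 0.
pred: List[List[float]] = [[1.0, 0.0], [0.0, 0.0]]
target: List[List[float]] = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 0]]
boosted = weighted_mse(pred, target, mask, key_frames={0})
plain = weighted_mse(pred, target, mask, key_frames={0}, key_boost=0.0, line_boost=0.0)
```

The same pixel error counts twice as much in the boosted loss (0.5 versus 0.25) because it falls on a line-art pixel of a key-sketch frame.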

2606.19676 2026-06-19 cs.CV cs.AI New submission 85%

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

Haengbok Chung

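Topic match (video generation): diffusion-based simultaneous editing of motion and location in video

AI summary: Proposes TeleMorpher, a diffusion-based one-shot framework that edits a protagonist's motion and location simultaneously through a motion prior, pose warping, and injection into a baseline motion editor, performing well in both quantitative and qualitative evaluation.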

Abstract

Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors (a target motion-centric video generated by an off-the-shelf model, used as motion-editing guidance) and the ground-truth motion to enable more controllable and precise motion-location editing. The framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) Then, we introduce a training-free pose warping that edits the protagonist's motion with the motion prior as guidance. (3) The resulting warped motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics: one measures background consistency before and after motion editing, and the other measures motion-editing fidelity via the difference between the protagonist skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.

2606.19495 2026-06-19 cs.CV New submission 85%

LooseControlVideo: Directorial Video Control using Spatial Blocking

Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

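Affiliations: Adobe Research

Topic match (video generation): 3D-box control of multi-object scenes in text-to-video generation

AI summary: Proposes LooseControlVideo, a framework that uses sparse oriented 3D boxes as a "blocking" proxy for intuitive layout and trajectory control of multi-object scenes in text-to-video generation, significantly outperforming existing 2D-box and flow-based methods.

Comments: Project page at https://shariqfarooq123.github.io/LooseControlVideo/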

Abstract

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
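
To make the idea of sparse, oriented 3D boxes concrete, the sketch below stores a box as center, size, and yaw, fills in the frames between sparse keyframes by linear interpolation, and computes the eight corners. The `Box3D` class and both functions are hypothetical illustrations; the DNOCS encoding and the model's real input format are not described in the abstract and are not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box3D:
    center: Vec3  # (x, y, z)
    size: Vec3    # (length, width, height)
    yaw: float    # rotation about the vertical z axis, in radians


def box_corners(box: Box3D) -> List[Vec3]:
    """Return the 8 corners of a yaw-rotated box."""
    cx, cy, cz = box.center
    length, width, height = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-width / 2, width / 2):
            for dz in (-height / 2, height / 2):
                corners.append((cx + c * dx - s * dy, cy + s * dx + c * dy, cz + dz))
    return corners


def interpolate_boxes(keyframes: Dict[int, Box3D], num_frames: int) -> List[Box3D]:
    """Linearly interpolate center and yaw between sparse keyframes.

    Frames outside the keyframe range hold the nearest keyframe. Size is taken
    from the earlier keyframe, and yaw wrap-around is ignored for brevity.
    """
    times = sorted(keyframes)
    if not times:
        raise ValueError("at least one keyframe is required")
    boxes = []
    for t in range(num_frames):
        lo = max((k for k in times if k <= t), default=times[0])
        hi = min((k for k in times if k >= t), default=times[-1])
        if lo == hi:
            boxes.append(keyframes[lo])
            continue
        a = (t - lo) / (hi - lo)
        b0, b1 = keyframes[lo], keyframes[hi]
        cx, cy, cz = ((1 - a) * p + a * q for p, q in zip(b0.center, b1.center))
        boxes.append(Box3D((cx, cy, cz), b0.size, (1 - a) * b0.yaw + a * b1.yaw))
    return boxes


# Two keyframes five frames apart: the box slides 4 units along x.
track = interpolate_boxes(
    {0: Box3D((0.0, 0.0, 0.0), (2.0, 1.0, 1.0), 0.0),
     4: Box3D((4.0, 0.0, 0.0), (2.0, 1.0, 1.0), 0.0)},
    num_frames=5,
)
corners = box_corners(Box3D((0.0, 0.0, 0.0), (2.0, 2.0, 2.0), 0.0))
```

The user supplies only the two keyframe boxes and the in-between poses are derived from them, which is what keeps the authoring effort low compared with dense, frame-accurate guidance.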

2606.20101 2026-06-19 cs.SD cs.AI cs.MM New submission 80%

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Fisheries College, Ocean University of China; College of Information and Electrical Engineering, China Agricultural University

Topic match (video generation): audio editing rather than video, but involves diffusion models

AI summary: Proposes a hybrid two-stage diffusion transformer architecture whose coarse-to-fine strategy balances global semantic alignment with local detail editing, improving performance and efficiency on tasks with overlapping audio events and complex instructions.

Abstract

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at the low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at the high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
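
The quadratic-cost argument can be illustrated by counting query-key score pairs per block. The sketch below is back-of-the-envelope only. It assumes a cross-attention block scores just audio queries against text keys and ignores any self-attention the real blocks may contain. The token counts and block layout are invented for the example and are not taken from the paper.

```python
from typing import Sequence


def joint_pairs(audio_tokens: int, text_tokens: int) -> int:
    """Score pairs when audio and text tokens attend jointly to each other."""
    n = audio_tokens + text_tokens
    return n * n


def cross_pairs(audio_tokens: int, text_tokens: int) -> int:
    """Score pairs when audio queries attend only to text keys (assumed)."""
    return audio_tokens * text_tokens


def stack_pairs(block_kinds: Sequence[str], audio_tokens: int, text_tokens: int) -> int:
    """Total score pairs for a stack of 'joint' and 'cross' blocks."""
    cost = {"joint": joint_pairs, "cross": cross_pairs}
    return sum(cost[kind](audio_tokens, text_tokens) for kind in block_kinds)


# Hypothetical sizes: 100 audio tokens, 20 text tokens, 4 blocks.
all_joint = stack_pairs(["joint"] * 4, audio_tokens=100, text_tokens=20)
alternating = stack_pairs(["joint", "cross"] * 2, audio_tokens=100, text_tokens=20)
```

Under these assumptions the alternating stack scores 32,800 pairs against 57,600 for the all-joint stack. The gap widens as the audio token count grows, which is why restricting joint attention matters most at the high-resolution stage.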
