Image Generation - arXivDaily Topic

2606.20416 2026-06-19 cs.LG cs.CV New submission 90%

On the Redundancy of Timestep Embeddings in Diffusion Models

José A. Chávez

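Affiliation: Independent Researcher, Lima, Peru

Topic match: diffusion models. Studies the redundancy of timestep embeddings in diffusion models, which bears on image generation.

AI summary: Through theory and experiments on U-Net and Diffusion Transformer architectures, the paper argues that, under certain conditions, the global minimizer of the diffusion training objective can be reached without explicit timestep embeddings, and that time-agnostic models can even surpass their conditioned counterparts on some metrics.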

Comments 17 pages

Abstract

Diffusion models rely heavily on explicit timestep embeddings to modulate the denoising process across various noise scales. In this work, we challenge the necessity of these temporal signals by analyzing their impact on U-Net and Diffusion Transformer architectures. Beyond empirical evidence, we provide a theoretical framework demonstrating that, under certain conditions, the global minimizer of the diffusion training objective can be achieved without explicit timestep conditioning. Our findings reveal a surprising robustness when timestep embeddings are completely removed. Extensive ablation studies on the CelebA and CIFAR-10 datasets show that these time-agnostic models can maintain high structural fidelity and even surpass their conditioned counterparts in competitive metrics, including FID, precision, and recall. Our analysis suggests these architectures can implicitly infer noise scales from the corrupted input under specific assumptions, rendering explicit temporal conditioning redundant. This study challenges long-standing temporal conditioning paradigms and paves the way for more efficient and structurally focused generative architectures.
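
The claim that a network can infer the noise scale from the corrupted input alone can be illustrated with a toy estimator. The sketch below is not the paper's method. It assumes a variance-exploding corruption `x = clean + sigma * noise` and a smooth clean signal, so differences between neighbouring samples are dominated by noise; the function name `estimate_noise_std` and the sine-wave signal are hypothetical choices made for illustration.

```python
import numpy as np

def estimate_noise_std(x_noisy: np.ndarray) -> float:
    """Estimate the Gaussian noise std from differences of neighbouring samples.

    Assumption: the clean signal is smooth, so adjacent differences are
    almost pure noise with variance close to 2 * sigma**2.
    """
    diffs = np.diff(x_noisy)
    return float(np.sqrt(np.mean(diffs ** 2) / 2.0))

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 4096))  # toy smooth signal
sigma_true = 0.5
noisy = clean + sigma_true * rng.standard_normal(clean.shape)

# No timestep or noise level is passed in; it is recovered from the input.
sigma_hat = estimate_noise_std(noisy)
print(f"true sigma = {sigma_true}, estimated sigma = {sigma_hat:.3f}")
```

The estimate gets worse when the clean signal itself has strong high-frequency content, which is one reason the abstract's "under specific assumptions" qualifier matters.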

2606.19970 2026-06-19 cs.CV New submission 90%

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

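Affiliations: Institute for Artificial Intelligence, Peking University; Tencent; Fudan University

Topic match: diffusion models. Proposes a cross-space flow model for one-step generation.

AI summary: Proposes CrossFlow, a cross-space flow model that maps noisy latent inputs directly to pixel-space images. A velocity-free one-step objective enables latent-to-pixel generation and lets the model replace the decoder in latent diffusion pipelines; CrossFlow-XL reaches 1.62 FID on class-conditional ImageNet-1k at 256×256 with one function evaluation.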

Comments Preprint, Under Review

Abstract

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.

2606.19662 2026-06-19 cs.CV New submission 90%

Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

Bingshuo Qian, Xiang Cheng

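Affiliation: Department of Electrical and Computer Engineering

Topic match: diffusion models. Learns an asynchronous schedule that optimizes the denoising order in multi-representation diffusion models.

AI summary: Proposes learning the asynchronous schedule of multi-representation diffusion models with a schedule-corrected objective, at less than 1% additional training compute. On ImageNet 256x256 with AutoGuidance, the 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training; training to 600 epochs reaches FID 1.02.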

Comments 25 pages, 9 figures, 4 tables

Abstract

Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation's local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at https://github.com/bsq532087/LWD
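
A schedule family that is "convex and monotone by construction" can be built in several ways; the abstract does not say which one the authors use. The sketch below shows one hypothetical option: a softmax-weighted mixture of power functions `t**p` with every exponent above 1. Each term is increasing and convex on [0, 1], so any such mixture is too, whatever the unconstrained parameters are. The function name, the parameter names, and the reading of the schedule as a map from a global time `t` to one representation's local time are assumptions.

```python
import numpy as np

def convex_monotone_schedule(t, logits, raw_exponents):
    """s(t) = sum_k w_k * t**p_k with softmax weights w_k and exponents p_k > 1."""
    t = np.asarray(t, dtype=float)
    weights = np.exp(logits - np.max(logits))
    weights = weights / weights.sum()                   # softmax: w_k > 0, sum to 1
    exponents = 1.0 + np.log1p(np.exp(raw_exponents))   # 1 + softplus keeps p_k > 1
    return (weights * t[..., None] ** exponents).sum(axis=-1)

t_grid = np.linspace(0.0, 1.0, 101)
logits = np.array([0.3, -1.2, 0.8])          # hypothetical unconstrained parameters
raw_exponents = np.array([-2.0, 0.0, 1.5])   # hypothetical unconstrained parameters
s = convex_monotone_schedule(t_grid, logits, raw_exponents)

print("s(0) =", s[0], " s(1) =", s[-1])
print("monotone:", bool(np.all(np.diff(s) >= 0)))
print("convex:", bool(np.all(np.diff(s, n=2) >= -1e-12)))
```

Because the constraints hold for every parameter value, the parameters can be optimized with plain gradient descent and no projection step.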

2606.20112 2026-06-19 cs.CV eess.IV New submission 85%

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

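Affiliation: School of Computing and Information Systems, The University of Melbourne

Topic match: diffusion models. 3D image generation with a diffusion transformer.

AI summary: Proposes the Pixel-Level Residual Diffusion Transformer (PRDiT), which generates high-fidelity 3D CT volumes through two-stage training: a local MLP-based blind estimator separates low-frequency structures, and a global residual diffusion transformer models high-frequency residuals. It outperforms existing methods on the LIDC-IDRI and RAD-ChestCT datasets.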

Comments Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

Abstract

Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.

2606.20076 2026-06-19 cs.CV cs.AI New submission 85%

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

Dong Hoon Lee, Seunghoon Hong

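Affiliations: Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea; School of Computing, KAIST, Daejeon, South Korea

Topic match: diffusion models. Variable-length tokenization for diffusion transformers.

AI summary: To address the fixed compression ratio that constrains the quality-compute trade-off of diffusion models, proposes a variable-length tokenizer based on learnable global merging. Merging tokens aligns representations across lengths and yields a better gFID-compute trade-off on ImageNet 256×256 generation.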

Abstract

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256$\times$256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at https://github.com/movinghoon/lgm
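
What "data-independent" merging means in practice can be shown with a small sketch. This is not the authors' implementation: the fixed `assignment` array stands in for whatever parameters their learnable global merging produces, and the names `merge_tokens` and `assignment` are hypothetical. The point is only that the merge pattern depends on token position, not on token content, so the same pattern is known in advance at generation time.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, assignment: np.ndarray, num_groups: int) -> np.ndarray:
    """Average tokens that share a group id.

    tokens:     (num_tokens, dim) latent tokens of one image
    assignment: (num_tokens,) group id per token position, shared by all images
    Assumption: every group id in range(num_groups) appears at least once.
    """
    merge = np.zeros((num_groups, tokens.shape[0]))
    merge[assignment, np.arange(tokens.shape[0])] = 1.0
    merge /= merge.sum(axis=1, keepdims=True)   # row-normalise: mean over each group
    return merge @ tokens

# Hypothetical fixed pattern: 8 token positions merged into 4 groups.
assignment = np.array([0, 0, 1, 1, 2, 2, 3, 3])

rng = np.random.default_rng(0)
image_a = rng.standard_normal((8, 16))
image_b = rng.standard_normal((8, 16))

# The same pattern is applied to both images, regardless of their content.
merged_a = merge_tokens(image_a, assignment, num_groups=4)
merged_b = merge_tokens(image_b, assignment, num_groups=4)
print(merged_a.shape, merged_b.shape)
```

A content-based scheme would instead compute `assignment` from the tokens themselves, which is exactly what is unavailable when the tokens are still being generated.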

2606.19894 2026-06-19 cs.LG New submission 80%

Score Approximation for Diffusion Models on Arbitrary Low-Dimensional Structures

Xinhe Mu, Zaijiu Shang, Zhaoqi Zhou, Chuan Zhou, Qi Meng, Guiying Yan, Zhiming Ma

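Affiliations: Shanghai Institute for Mathematics and Interdisciplinary Sciences; Huawei Technologies Co., Ltd.

Topic match: diffusion models. Score approximation theory for diffusion models that covers non-smooth data.

AI summary: For arbitrary compactly supported distributions, proposes a score approximation approach based on a discrete-mixture formulation and proves that the required ReLU network complexity grows exponentially only with the upper Minkowski dimension d of the support, not with the ambient dimension, which helps explain why diffusion models work on non-smooth data.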

Abstract

The remarkable success of score-based diffusion models has spurred significant efforts to establish their theoretical foundations. However, existing complexity bounds for score approximation rely heavily on restrictive assumptions like Lipschitz continuous densities or smooth manifold supports, which are routinely violated by the singularities, sharp boundaries, and disjoint clusters inherent to real-world perceptual data. This work establishes a universal score approximation theorem that works for any distribution supported on any compact set of upper Minkowski dimension $d$. Using a novel discrete-mixture formulation, we prove that the score function can be approximated with a ReLU network whose complexity grows exponentially only with $d$, thus breaking the exponential curse of ambient dimensionality. Combined with existing theories on accurately solving the backward diffusion SDE for arbitrary compact distributions, our work shows that diffusion models readily adapt to irregular, non-smooth data structures, explaining their competence in real-world generative tasks.
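
The abstract does not spell out its discrete-mixture formulation, but the basic object is easy to write down. If the data distribution is a finite set of weighted atoms x_i and the forward process is x_t = alpha * x_0 + sigma * eps, then the noised density is a Gaussian mixture and its score has a closed form: a softmax-weighted average of (alpha * x_i - x) / sigma^2. The sketch below implements that standard identity on toy data; the function names and the values of `alpha` and `sigma` are illustrative assumptions, and nothing here reproduces the paper's network construction or bounds.

```python
import numpy as np

def log_density(x, points, weights, alpha, sigma):
    """Log density, up to an additive constant, of sum_i w_i N(x; alpha*x_i, sigma^2 I)."""
    sq_dist = np.sum((x - alpha * points) ** 2, axis=1)
    logits = np.log(weights) - sq_dist / (2.0 * sigma ** 2)
    m = logits.max()
    return m + np.log(np.sum(np.exp(logits - m)))   # stable log-sum-exp

def score(x, points, weights, alpha, sigma):
    """Closed-form gradient of log_density with respect to x."""
    sq_dist = np.sum((x - alpha * points) ** 2, axis=1)
    logits = np.log(weights) - sq_dist / (2.0 * sigma ** 2)
    resp = np.exp(logits - logits.max())
    resp /= resp.sum()                               # posterior weight of each atom
    return (resp[:, None] * (alpha * points - x)).sum(axis=0) / sigma ** 2

rng = np.random.default_rng(0)
points = rng.standard_normal((5, 3))   # 5 atoms in R^3 (toy data)
weights = np.full(5, 0.2)
x = rng.standard_normal(3)
alpha, sigma = 0.8, 0.6                # hypothetical noise-schedule values

analytic = score(x, points, weights, alpha, sigma)
print("score at x:", analytic)
```

As sigma shrinks, the softmax weights concentrate on the nearest atom, so the score becomes sharply varying near the boundaries between atoms; that is the regime in which approximation by a network becomes hard.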

2606.19397 2026-06-19 cs.RO New submission 80%

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

Hongkang Cui, Rui He, Haoyao Chen

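Topic match: diffusion models. Generates camera velocities with Diffusion Policy via conditional denoising.

AI summary: Proposes a visual servoing method based on Diffusion Policy that generates camera velocity through conditional denoising and uses online training to improve generalization, reaching a success rate of nearly 100% in simulation and 93% in physical experiments.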

Comments 8 pages, 4 figures, 7 tables

Abstract

Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
