VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation
Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao
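AI summary: Proposes VPG, a training-free, inference-time guidance method that improves next-step prediction in autoregressive image and video generation by contrasting the model's output under the generated prefix with its output under a corrupted prefix, raising generation quality.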
Details
Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
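The contrast-and-extrapolate step described above can be illustrated with a minimal sketch. The abstract does not give the exact update rule, the corruption scheme, or the guidance scale, so the linear extrapolation in log-probability space below (similar in form to classifier-free guidance) is an assumption, and the names `prefix_guided_logits`, `sample_token`, and `scale` are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def prefix_guided_logits(logits_prefix, logits_corrupted, scale=1.0):
    """Push the prediction away from what a corrupted prefix would predict.

    ASSUMPTION: a linear extrapolation, lp + scale * (lp - lc). The paper's
    actual rule may differ. scale=0 returns the unguided log-probabilities.
    """
    lp = log_softmax(np.asarray(logits_prefix, dtype=np.float64))
    lc = log_softmax(np.asarray(logits_corrupted, dtype=np.float64))
    return lp + scale * (lp - lc)


def sample_token(logits, rng):
    """Sample one token index from a 1-D vector of (guided) logits."""
    probs = np.exp(log_softmax(np.asarray(logits, dtype=np.float64)))
    return int(rng.choice(probs.shape[-1], p=probs))


# Toy example: token 1 is favored under the generated prefix but not under
# the corrupted one, so guidance should raise its probability.
logits_prefix = np.array([2.0, 1.0, 0.0])     # model output, generated prefix
logits_corrupted = np.array([2.0, 0.0, 0.0])  # model output, corrupted prefix

guided = prefix_guided_logits(logits_prefix, logits_corrupted, scale=1.0)
token = sample_token(guided, np.random.default_rng(0))
```

In this sketch, candidates whose likelihood drops when the prefix is corrupted gain probability mass, which is one way to read "strengthening the posterior support of the generated prefix". In practice the two logit vectors would come from two forward passes of the same base model, one per prefix.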