arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

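AI Large Models

Video Large Models

Video understanding, video generation, video-language models, and temporal visual reasoning.

Included today (current date): 8 papers. Sources: cs.CV, eess.IV, cs.MM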

1. Video Generation (6 papers)

2605.21028 2026-06-18 cs.CV cs.AI Version update Topic 95

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

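Topic match (Video Generation): Proposes the DySink framework for autoregressive long video generation; video generation is the core focus.

AI summary: This paper proposes DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks, improving the dynamics and temporal quality of autoregressive long video generation.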

2502.07531 2026-06-18 cs.CV cs.AI cs.LG cs.MM Version update Topic 95

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu

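Topic match (Video Generation): Controllable image-to-video generation with control over camera, objects, and lighting.

AI summary: Proposes the VidCRAFT3 framework, which explicitly models cross-factor interactions among geometry, motion, and lighting to enable independent or joint control of camera motion, object motion, and lighting direction, achieving state-of-the-art control precision and visual consistency.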

Comments Accepted to TVCG 2026

2606.06361 2026-06-18 cs.CV Version update Topic 90

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang

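Topic match (Video Generation): Improving the physical consistency of image-to-video diffusion models.

AI summary: This paper finds that image-to-video diffusion models show better physical consistency with two-step generation than with multi-step generation, and uses spectral analysis to attribute this to phase erosion during denoising. It proposes PhaseLock, a training-free framework that extracts motion priors from two-step inference and enforces them in high-fidelity generation via latent delta guidance. PhaseLock mitigates phase degradation and improves physical consistency by 6.2 points on average, while preserving visual fidelity at minimal overhead.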

Comments ICML 2026

2605.15824 2026-06-18 cs.CV Version update Topic 90

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao

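Topic match (Video Generation): Proposes a real-time, interactive framework for human-garment video customization.

AI summary: This paper proposes the FashionChameleon framework, which enables interactive multi-garment video customization from single-garment video data while preserving motion consistency. It generates in real time at 23.8 FPS, 30-180x faster than existing methods.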

Comments Project Page: https://quanjiansong.github.io/projects/FashionChameleon/

2510.21615 2026-06-18 cs.CV Version update Topic 90

Epipolar Geometry Improves Video Generation Models

Orest Kupyn, Théo Uscidda, Marta Tintore Gazulla, Fabian Manhardt, Federico Tombari, Christian Rupprecht

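Topic match (Video Generation): Uses epipolar geometry constraints to improve the geometric consistency of video generation models.

AI summary: To address geometric inconsistency and motion artifacts in video generation models, proposes a preference optimization method based on epipolar geometry constraints. It reduces epipolar error by 31% while maintaining visual quality, and raises agreement with human ratings from 54% to 72%.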

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Version update Topic 85

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

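Topic match (Video Generation): Video generation capability; world simulator.

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified hybrid Transformer architecture that jointly processes language, images, video, audio, and action sequences. It sets a new state of the art on understanding and generation tasks and provides a scalable, general-purpose backbone for embodied agents.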

2. Video Understanding (2 papers)

2602.08355 2026-06-18 cs.CV Version update Topic 90

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

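Topic match (Video Understanding): An e-commerce short-video understanding benchmark that evaluates the video understanding ability of multimodal large models.

AI summary: Proposes E-VAds, an e-commerce short-video understanding benchmark. The work quantifies domain complexity with a multimodal information density assessment framework, builds a question-answering dataset generated by multiple agents, and develops E-VAds-R1, a reinforcement-learning-based reasoning model that achieves a 109.2% performance improvement on commercial intent reasoning.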

Comments Accepted by ICML 2026

2601.13836 2026-06-18 cs.CL cs.CV cs.MM Version update Topic 70

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

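Topic match (Video Understanding): A video future-forecasting benchmark involving temporal reasoning.

AI summary: Proposes the FutureOmni benchmark to evaluate how well multimodal large models forecast the future from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.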

Comments Accepted by ICML 2026