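AI Large Models

Video Large Models

Video understanding, video generation, video-language models, and temporal visual reasoning.

Indexed today / on the current date: 2 papers. Signal sources: cs.CV, eess.IV, cs.MM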

2602.08355 2026-06-18 cs.CV Version update 90%

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

Affiliations: Alimama Tech, Taobao & Tmall Group of Alibaba; Huazhong University of Science and Technology; Vin University

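Topic match (video understanding): an e-commerce short-video understanding benchmark that evaluates the video understanding ability of multimodal large models.

AI summary: Proposes E-VAds, a benchmark for e-commerce short-video understanding. It quantifies domain complexity with a multimodal information density assessment framework, builds a Q&A dataset generated by a multi-agent system, and develops E-VAds-R1, a reinforcement-learning-based reasoning model that achieves a 109.2% performance gain in commercial intent reasoning.

Comments: Accepted by ICML 2026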

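AI Chinese abstract (translated)

E-commerce short videos are a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multimodal signals. Current models often struggle with these videos because existing benchmarks focus mainly on general-purpose tasks and overlook reasoning about commercial intent. In this work, we first propose a multimodal information density assessment framework to quantify the complexity of this domain. Our evaluation shows that, compared with mainstream datasets, e-commerce content has substantially higher density across the visual, audio, and textual modalities, establishing a more challenging frontier for video understanding. To close this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark designed specifically for e-commerce short-video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. The questions are organized into two main dimensions, perception, and cognition and reasoning, comprising five distinct tasks. Finally, we develop E-VAds-R1, a reinforcement-learning-based reasoning model with a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results show that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning using only a few hundred training samples.

English abstract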

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.
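
The abstract describes MG-GRPO only at a high level: smooth guidance early on, and a non-linear incentive near expert-level precision. The sketch below shows one hypothetical way a reward with that shape could feed into the group-relative advantage normalization used by GRPO-style methods. The function names, the 0.8 threshold, and the quadratic bonus are illustrative assumptions, not the authors' design.

```python
import statistics

def multi_grained_reward(score: float, threshold: float = 0.8) -> float:
    """Hypothetical reward shaping: linear below `threshold`, convex bonus above it.

    `score` is an answer-quality score in [0, 1] (for example from a judge model).
    The threshold and the quadratic bonus are assumptions for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score <= threshold:
        return score  # smooth, linear guidance for early exploration
    # Extra non-linear incentive for high-precision answers.
    return score + ((score - threshold) / (1.0 - threshold)) ** 2

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group (zero mean, roughly unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to the same prompt, with made-up quality scores.
scores = [0.2, 0.5, 0.85, 1.0]
rewards = [multi_grained_reward(s) for s in scores]
advantages = group_relative_advantages(rewards)
```

With this shaping, answers above the threshold are pulled further apart from the rest of the group, so they receive proportionally larger advantages than a purely linear reward would give them.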

2601.13836 2026-06-18 cs.CL cs.CV cs.MM Version update 70%

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

Affiliations: Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

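Topic match (video understanding): a video future-forecasting benchmark involving temporal reasoning.

AI summary: Proposes the FutureOmni benchmark, which evaluates how well multimodal large models forecast the future from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.

Comments: Accepted by ICML 2026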

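AI Chinese abstract (translated)

Although multimodal large language models (MLLMs) show strong omni-modal perception, their ability to forecast future events from audio-visual cues remains underexplored, because existing benchmarks focus mainly on retrospective understanding. To close this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. Evaluated models must perform cross-modal causal and temporal reasoning and make effective use of internal knowledge to predict future events. FutureOmni is built with a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations of 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, especially in speech-heavy scenarios, with Gemini 3 Flash reaching the best accuracy of 64.8%. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose the Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and on popular audio-visual and video-only benchmarks show that OFF improves future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

English abstract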

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).
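
The abstract reports results as accuracy over multiple-choice QA pairs spread across several domains. A minimal sketch of how overall and per-domain accuracy could be computed is shown below. It assumes each record carries hypothetical `domain`, `answer`, and `prediction` fields; the actual FutureOmni data schema and evaluation script may differ.

```python
from collections import defaultdict

def accuracy_by_domain(records):
    """Return (overall_accuracy, {domain: accuracy}) for multiple-choice predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["domain"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["domain"]] += 1
    n = sum(total.values())
    if n == 0:
        raise ValueError("no records to score")
    overall = sum(correct.values()) / n
    per_domain = {d: correct[d] / total[d] for d in total}
    return overall, per_domain

# Toy records with made-up domains and answers, for illustration only.
records = [
    {"domain": "cooking", "answer": "A", "prediction": "a"},
    {"domain": "cooking", "answer": "C", "prediction": "B"},
    {"domain": "sports", "answer": "D", "prediction": "D "},
    {"domain": "sports", "answer": "B", "prediction": "B"},
]
overall, per_domain = accuracy_by_domain(records)
```

Reporting the per-domain breakdown next to the overall number makes it easier to spot weaknesses such as the speech-heavy scenarios the abstract mentions.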
