Video Large Models - arXivDaily Topic

2606.19849 2026-06-19 cs.CV New submission 90%

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

Yang Tan, Junlong Tong, Linan Yue, Hao Wu, Pengfei Fang, Xiaoyu Shen

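Affiliations: Southeast University; Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University

Topic match (video understanding): proposes an inference framework for streaming VideoLLMs that improves video throughput and latency.

AI summary: Proposes ViCoStream, a stage-wise coordinated pipeline (chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, query-side retrieval) for high-throughput, low-latency streaming VideoLLM inference. It reaches 134 FPS video throughput and under 50 ms time to first token (TTFT) on a single A100, with accuracy close to full-history baselines.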

Comments 19 pages, 7 figures, 13 tables

Abstract

Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.
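
The chunk-wise execution and bounded visual attention named in the abstract can be pictured with a small sketch. This is not the authors' implementation: `stream_chunks`, `BoundedVisualCache`, and the `stride` parameter are hypothetical names, uniform striding is only a stand-in for the paper's visual token control, and CUDA-stream overlap and query-side retrieval are not modeled. The point is that the number of visible tokens stays bounded however long the stream runs.

```python
from collections import deque


def stream_chunks(frames, chunk_size):
    """Yield consecutive fixed-size chunks of an incoming frame sequence."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]


class BoundedVisualCache:
    """Keep visual tokens from only the most recent chunks (hypothetical helper)."""

    def __init__(self, max_chunks, stride):
        # deque(maxlen=...) evicts the oldest chunk automatically.
        self.window = deque(maxlen=max_chunks)
        self.stride = stride

    def add_chunk(self, tokens):
        # Uniform striding is an assumed stand-in for visual token control.
        self.window.append(tokens[::self.stride])

    def visible_tokens(self):
        # Tokens that a new chunk or query is allowed to attend to.
        return [tok for chunk in self.window for tok in chunk]


cache = BoundedVisualCache(max_chunks=2, stride=2)
for chunk in stream_chunks(list(range(10)), chunk_size=4):
    cache.add_chunk(chunk)  # one "token" per frame, for illustration only

print(cache.visible_tokens())  # prints [4, 6, 8]
```

With these toy settings at most 2 chunks of 2 tokens each are ever visible, so per-chunk attention cost does not grow with stream length.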

2606.19706 2026-06-19 cs.CV cs.CL New submission 90%

NEST: Narrative Event Structures in Time for Long Video Understanding

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

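Affiliations: Department of Computer Science, Virginia Tech

Topic match (video understanding): introduces a dataset of narrative event structures in long videos for evaluating video understanding.

AI summary: Introduces the NEST dataset (1005 full-length movies), which uses multimodal narrative event annotations and relation links to evaluate whether models understand event structure, temporal order, and long-range dependencies in long videos. Experiments show that tasks such as event detection are highly challenging.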

Abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress: for example, whether a model can connect an early setback, such as a job loss, to a later relationship breakup despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

2606.09547 2026-06-19 cs.CV cs.LG New submission 90%

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

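Affiliations: Qualcomm AI Research; York University; Vector Institute for AI

Topic match (video understanding): evaluates the real-time intervention ability of video LLMs in cooking scenarios.

AI summary: Introduces the Ego-MC-Bench benchmark to evaluate the real-time intervention ability of video LLMs in cooking scenarios, and builds Ego-CoMist, a counterfactual synthetic dataset that improves the performance of smaller models.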

Comments The project page is available at https://apratimbh.github.io/livecookv2/

Abstract

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

2606.20561 2026-06-19 cs.CV New submission 85%

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das

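Affiliations: University of North Carolina, Charlotte

Topic match (video understanding): temporal reasoning and question answering over long videos, combined with VLMs.

AI summary: Proposes the TimeProVe framework, which first generates action-grounded candidate hypotheses with lightweight modules and then invokes an expensive VLM for verification. In long video question answering it cuts VLM calls by 75% and inference cost by 93% while outperforming the strongest baseline by 7.3%.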

Abstract

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer-evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.
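
The propose-then-verify control flow described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's ACE module: `Hypothesis`, `propose_then_verify`, the `score` field, and the `max_calls` budget are hypothetical, and the verifier is a stub standing in for an expensive VLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Hypothesis:
    answer: str                   # candidate answer from the cheap proposer
    window: Tuple[float, float]   # (start_s, end_s) of supporting evidence
    score: float                  # proposer confidence (assumed field)


def propose_then_verify(
    hypotheses: List[Hypothesis],
    verify: Callable[[Hypothesis], bool],
    max_calls: int = 2,
) -> Tuple[Optional[Hypothesis], int]:
    """Check candidates in descending score order, spending at most
    max_calls verifier calls; return the first accepted one and the call count."""
    calls = 0
    for hyp in sorted(hypotheses, key=lambda h: h.score, reverse=True):
        if calls >= max_calls:
            break
        calls += 1
        if verify(hyp):  # the expensive step runs only on a short window
            return hyp, calls
    return None, calls


candidates = [
    Hypothesis("drinks water", (120.0, 135.0), 0.4),
    Hypothesis("takes medication", (610.0, 640.0), 0.9),
    Hypothesis("opens fridge", (30.0, 40.0), 0.2),
]
# Stub verifier: pretend the VLM only confirms the "drinks water" window.
best, calls = propose_then_verify(candidates, lambda h: h.answer == "drinks water")
print(best.answer, calls)  # prints: drinks water 2
```

The saving comes from the verifier seeing a few short evidence windows instead of the whole video; how candidates are scored and how many are verified are design choices the sketch leaves open.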

2606.19682 2026-06-19 cs.CV New submission 85%

Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

Duc-Tho Nguyen, Hieu-Hoc Tran-Minh, Khanh-Hoa Lam, Hoang-Nhut Ly, Huu-Phuc Huynh, Thanh-Tien Tran, Trung-Nghia Le

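Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

Topic match (video understanding): a multimodal video retrieval system that fuses CLIP and SigLIP2.

AI summary: Presents the Vortex system, which combines adaptive keyframe extraction, multimodal metadata generation, and a hybrid retrieval strategy (Reciprocal Rank Fusion of CLIP and SigLIP2) with Rocchio-based relevance feedback and multi-stage temporal search, and achieved strong results in the competition.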

Comments SOICT 2025

Abstract

This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team's system achieved a score of 79.6/88 (90.5%) in the Preliminary Round and was further evaluated in the Final Round, achieving an 'Excellent' overall performance with 'Outstanding' results in the question-answering (QA) task. This demonstrates the complementary strengths of CLIP and SigLIP2 and confirms the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.
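
Reciprocal Rank Fusion, which the abstract names as the way the CLIP and SigLIP2 results are combined, scores each item by summing 1 / (k + rank) over the ranked lists it appears in. A minimal sketch follows; the constant `k = 60` is a commonly used default assumed here, the frame ids are made up, and nothing below reflects how Vortex itself is configured.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids into a single ranking.

    Each item scores sum(1 / (k + rank)) over the lists containing it,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical top-3 keyframes from two embedding models.
clip_ranking = ["frame_12", "frame_07", "frame_31"]
siglip_ranking = ["frame_07", "frame_44", "frame_12"]

fused = reciprocal_rank_fusion([clip_ranking, siglip_ranking])
print(fused)  # prints ['frame_07', 'frame_12', 'frame_44', 'frame_31']
```

Because only ranks are used, the two models' similarity scores never need to be calibrated against each other, which is the usual reason to pick rank fusion for a hybrid retriever.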

2606.20559 2026-06-19 cs.CV cs.LG New submission 70%

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

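Affiliations: University of North Carolina at Charlotte

Topic match (video understanding): focuses on egocentric video representation learning to improve video understanding.

AI summary: Proposes UNIEGO, a hierarchical multi-teacher distillation framework in which proxy models translate heterogeneous teacher knowledge into a homogeneous egocentric space and Selective Proxy Distillation adaptively picks reliable supervision, achieving state-of-the-art results on three egocentric video understanding tasks.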

Abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
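
The per-sample selection rule described for SPD, keeping only proxies that are both correct and confident, can be illustrated with a small sketch. The abstract does not say how confidence is measured, so the fixed `min_confidence` threshold on the top class probability is an assumption, and `select_proxies` and the proxy names are hypothetical.

```python
def select_proxies(proxy_probs, label, min_confidence=0.6):
    """Return names of proxies whose top prediction equals the label
    and whose probability for it is at least min_confidence."""
    selected = []
    for name, probs in proxy_probs.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(name)
    return selected


# Class probabilities from three hypothetical proxies for one training sample.
proxy_probs = {
    "depth": [0.10, 0.80, 0.10],     # correct and confident
    "skeleton": [0.50, 0.40, 0.10],  # wrong prediction
    "exo_rgb": [0.30, 0.45, 0.25],   # correct but below the threshold
}

print(select_proxies(proxy_probs, label=1))  # prints ['depth']
```

In a training loop the distillation loss for this sample would then be computed against the selected proxies only, so wrong or uncertain proxies contribute no gradient.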

2606.20545 2026-06-19 cs.CV New submission 65%

Current World Models Lack a Persistent State Core

Jinpeng Lu, Dexu Zhu, Haoyuan Shi, Linghan Cai, Guo Tang, Yinda Chen, Jie Cao, Duyu Tang, Yi Zhang, Yong Dai, Xiaozhu Ju

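Affiliations: University of Science and Technology of China; Beijing Innovation Center of Humanoid Robotics (X-Humanoid); NLPR, Institute of Automation, Chinese Academy of Sciences; Independent Researcher; Dresden University of Technology; Peking University

Topic match (video understanding): evaluates how world models evolve state when observation is interrupted.

AI summary: Introduces the WRBench benchmark and finds that existing world models fail to keep the world state evolving when observation is interrupted, arguing that the stability of the physical state kernel should be a primary objective of world-model design.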

Comments 39 pages, 16 figures

Abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.
