Video Large Models - arXivDaily Topic

2606.19341 2026-06-18 cs.CV cs.CL cs.SD New submission 90%

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

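Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

Topic match: video understanding (long video understanding, POMDP active-perception framework)

AI summary: Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle; active perception decouples reasoning complexity from video duration, and the agent reaches state-of-the-art performance among open-source models on multiple benchmarks.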

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
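
The Observation-Thought-Action cycle with a persistent textual memory can be pictured with a minimal sketch. Everything named here (`AgentState`, `run_agent`, the `"inspect"`/`"answer"` actions, the toy policy and perception functions) is a hypothetical stand-in chosen for illustration, not OmniAgent's actual interface; the point is only that each turn inspects one requested window and carries forward text notes rather than raw frames.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    question: str
    # Persistent textual memory: notes distilled from inspected windows.
    memory: List[str] = field(default_factory=list)


def run_agent(question, policy, perceive, max_turns=8):
    """Hypothetical loop: the policy either inspects a window or answers."""
    state = AgentState(question)
    for _ in range(max_turns):
        action, arg = policy(state)      # thought -> action
        if action == "answer":
            return arg, state.memory
        note = perceive(arg)             # on-demand look at one (start, end) window
        state.memory.append(note)        # only text is carried to the next turn
    return None, state.memory            # turn budget exhausted without an answer


# Toy stand-ins for the model and the video (assumptions for illustration only).
def toy_perceive(window):
    start, end = window
    event = "a red car honks" if start == 60 else "empty street"
    return f"[{start:.0f}-{end:.0f}s] {event}"


def toy_policy(state):
    if any("red car" in note for note in state.memory):
        return "answer", "red car"
    start = 30.0 * len(state.memory)     # scan forward in 30-second windows
    return "inspect", (start, start + 30.0)


answer, memory = run_agent("Which vehicle honks?", toy_policy, toy_perceive)
print(answer, memory)
```

In this sketch the context passed between turns grows with the number of notes taken, not with the length of the video, which is the decoupling the abstract describes.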

2606.18943 2026-06-18 cs.CV New submission 85%

Physics-IQ Verified

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

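Affiliations: Anates Labs; Technical University of Munich; University of Technology Nuremberg; Tuebingen AI Center, University of Tuebingen; Helmholtz AI, Munich; Google DeepMind research

Topic match: video understanding (evaluating how well video generative models understand physical reality)

AI summary: Presents the Physics-IQ Verified benchmark, which improves prompt and ground-truth quality and introduces a sample-level scoring system to better evaluate how well video generative models understand physical reality; the resulting benchmark refines 57.6% of all samples and improves over 34.8% of prompts.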

Abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
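
For readers unfamiliar with Kendall's τ as a measure of ranking change, the following sketch computes the tie-free variant (tau-a) for two rankings of six models. The model names and ranks are invented for illustration and are not results from the paper, and the abstract does not say which τ variant the authors use.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings given as dicts item -> rank."""
    items = list(rank_a)
    concordant = discordant = 0
    for m, n in combinations(items, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of six models under two scoring protocols.
original = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
verified = {"A": 3, "B": 2, "C": 1, "D": 5, "E": 4, "F": 6}

# 11 concordant and 4 discordant pairs out of 15, so tau = 7/15.
tau = kendall_tau(original, verified)
print(round(tau, 3))
```

A value of 1 means the two protocols rank the models identically and 0 means the orderings are unrelated, so a value near the middle of that range indicates that several pairs of models swap places.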

2606.18586 2026-06-18 cs.CV cs.AI New submission 85%

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

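Affiliations: Northwestern University; Dolby Laboratories

Topic match: video understanding (APTs represent causal state changes in video to improve VLM understanding)

AI summary: Proposes Atomic Physical Transitions (APTs) as an explicit representation of causal state changes in video, builds a mixed-source dataset, and introduces the APT-Tune fine-tuning recipe so that VLMs learn physical transitions without forgetting event-level knowledge.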

Abstract

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.
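
One way to picture an APT chain is as an ordered list of small records, each tying a visible cue to a mechanism and a before/after regime. The sketch below is an assumption-laden illustration: the field names, the "bounce" example values, and the matching rule used for recall (same mechanism, start time within a tolerance) are all hypothetical and are not taken from the paper's annotation schema or evaluation protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class APT:
    t_start: float   # seconds
    t_end: float
    cue: str         # what is visible
    mechanism: str   # active physical mechanism
    before: str      # dynamical regime before the transition
    after: str       # dynamical regime after the transition


def transition_recall(reference: List[APT], predicted: List[APT],
                      tolerance: float = 0.5) -> float:
    """Share of reference transitions matched by a distinct prediction.

    Hypothetical rule: same mechanism and start times within `tolerance`.
    """
    used = set()
    hits = 0
    for ref in reference:
        for i, pred in enumerate(predicted):
            if i in used:
                continue  # each prediction may match one reference only
            if (pred.mechanism == ref.mechanism
                    and abs(pred.t_start - ref.t_start) <= tolerance):
                used.add(i)
                hits += 1
                break
    return hits / len(reference) if reference else 0.0


# Illustrative chain for a clip labeled "bounce" (values are made up).
reference = [
    APT(0.00, 0.10, "hand releases ball", "gravity", "held", "free fall"),
    APT(0.80, 0.85, "ball touches floor", "contact", "free fall", "compression"),
    APT(0.85, 0.95, "ball leaves floor", "contact", "compression", "ascent"),
    APT(3.00, 3.40, "ball stops moving", "friction", "rolling", "rest"),
]
predicted = [
    APT(0.82, 0.90, "ball hits floor", "contact", "falling", "deformed"),
]

recall = transition_recall(reference, predicted)
print(recall)  # one of four reference transitions is matched
```

A model that outputs only the event label "bounce", or a single contact transition as here, leaves most of the chain unmatched, which is the kind of missed-transition error the abstract reports.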

2606.18441 2026-06-18 cs.CV New submission 85%

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

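Affiliations: School of Information Science and Engineering, Lanzhou University; Beijing University of Posts and Telecommunications; Cloud and AI BU, Huawei; School of Computing, National University of Singapore

Topic match: video understanding (a reward framework for video reasoning that improves the reasoning ability of Video-MLLMs)

AI summary: Proposes CF-GRPO, a temporal-annotation-free process-level reward framework that builds a consensus frame prior from intrinsic video cues and uses a Consensus Frame Reward to align the model's frame usage with that prior, improving video reasoning performance.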

Abstract

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.14702 2026-06-18 cs.CV New submission 85%

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

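Affiliations: Nanjing University; CASIA

Topic match: video understanding (video question answering and long-horizon reasoning)

AI summary: Proposes the OmniVideo-100K dataset, built with Entity-Anchored Video Scripting and Clue-Guided QA Generation to address cross-segment entity inconsistency and weak long-horizon reasoning in audio-visual QA; fine-tuned models show notable gains on multiple benchmarks.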

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

2606.15632 2026-06-18 cs.CV New submission 80%

Open-World Video Segmentation

Qing Su, Kaiyang Li, Yuan Zhuang, Fei Miao, Shihao Ji

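Affiliations: University of Connecticut

Topic match: video understanding (long-horizon video segmentation and object discovery)

AI summary: Proposes Savvy, a system that combines hierarchical mask discovery, deferred admission, and track consolidation for zero-shot open-world long-horizon video segmentation, and designs OGA, a granularity-aware evaluation suite with an n:1 matching protocol that addresses the unfair penalty rigid 1:1 matching imposes on open-world methods.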

Abstract

While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos with dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and structural diagnostics: identity persistence (IP) and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.

2606.18610 2026-06-18 cs.RO cs.CV New submission 60%

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

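Affiliations: University of Toronto; Vector Institute; NVIDIA; Physical Intelligence; Stanford University; UC Berkeley; Allen Institute for AI

Topic match: video understanding (using video foundation models to simulate policy rollouts)

AI summary: Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to turn a pre-trained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across seven real-world policies.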

Abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
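
The Pearson correlation reported above measures how closely success rates estimated from simulated rollouts track success rates measured on real robots. A minimal sketch of that computation follows; the seven pairs of success rates are invented for illustration and are not the paper's data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical success rates for seven policies (made-up numbers).
real_success = [0.10, 0.25, 0.40, 0.55, 0.60, 0.80, 0.90]
simulated_success = [0.15, 0.20, 0.45, 0.50, 0.65, 0.70, 0.95]

r = pearson(real_success, simulated_success)
print(round(r, 3))
```

A coefficient close to 1 means the simulated evaluator preserves the relative spacing of the policies' real-world success rates, not just their order.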
