Multimodal Large Models - arXivDaily Topic

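2606.18249 2026-06-19 cs.CV New submission 90%

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; Qwen Team, Alibaba Inc.

Topic match (image-text multimodal): unified multimodal autoregressive modeling that bridges visual understanding and generation.

AI summary: Proposes UniAR, a framework that bridges visual understanding and generation through a single discrete visual tokenizer, using parallel bitwise prediction and diffusion decoding; it reaches state-of-the-art results on image generation and editing while remaining competitive on multimodal understanding.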

Comments ICML2026. Project page https://sharelab-sii.github.io/uniar-web

Abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
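
The lookup-free bitwise quantization mentioned in the abstract can be illustrated with a minimal sketch. It assumes each latent channel is binarized by its sign, as in common lookup-free schemes; the abstract does not specify UniAR's exact quantizer, so the function names and the sign rule below are illustrative assumptions rather than the authors' implementation.

```python
from typing import List


def quantize_bits(latent: List[float]) -> List[int]:
    """Binarize each channel by its sign (assumed rule): 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in latent]


def bits_to_index(bits: List[int]) -> int:
    """Read the bit vector as an integer token id, most significant bit first."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def vocabulary_size(num_bits: int) -> int:
    """The effective vocabulary is 2**num_bits, with no codebook to store or search."""
    return 2 ** num_bits


# Hypothetical 4-channel latent for one visual token.
latent = [0.7, -1.2, 0.1, -0.3]
bits = quantize_bits(latent)      # [1, 0, 1, 0]
token_id = bits_to_index(bits)    # 0b1010
```

Under this assumption, adding one channel doubles the vocabulary, and a model that predicts bits in parallel makes `num_bits` binary decisions per token instead of scoring all `2**num_bits` entries.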

2504.11171 2026-06-19 cs.CV cs.AI Version update 90%

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

Affiliations: IBM Research – Europe; ETH Zurich; Forschungszentrum Jülich; European Space Agency; Φ-Lab; NASA IMPACT; University of Iceland

Topic match (image-text multimodal): proposes an any-to-any multimodal foundation model covering nine geospatial modalities.

AI summary: Proposes TerraMind, the first any-to-any generative multimodal foundation model for Earth observation, pretrained on dual-scale (token-level and pixel-level) representations; it enables zero-shot and few-shot applications, introduces a "Thinking-in-Modalities" capability, and achieves leading performance on benchmarks such as PANGAEA.

Comments Accepted at ICCV'25

Abstract

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

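2606.19534 2026-06-19 cs.CV cs.AI cs.CL New submission 85%

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

Affiliations: Peking University; MSALab; ByteDance

Topic match (image-text multimodal): a multimodal diffusion language model for parallel region perception.

AI summary: Proposes PerceptionDLM, which exploits the parallel decoding of diffusion language models and uses efficient prompting and structured attention masking to perceive multiple regions in parallel, significantly improving inference efficiency; it also builds the ParaDLC-Bench benchmark for evaluation.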

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
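
One plausible reading of "structured attention masking" is a block-structured mask in which every region's caption slots see the shared image tokens but not each other, so several captions can be denoised in one pass. The sketch below builds such a mask. The layout (shared tokens first, then one segment per region) and the function name are assumptions for illustration, not details taken from the paper.

```python
from typing import List


def build_region_mask(num_shared: int, region_lengths: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: `num_shared` shared (image/prompt) tokens, followed by
    one contiguous segment of caption tokens per region.
    """
    # Segment id per position: -1 for shared tokens, k for region k.
    segment = [-1] * num_shared
    for k, length in enumerate(region_lengths):
        segment.extend([k] * length)

    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Everyone may read the shared tokens; region tokens may also
            # read their own segment (bidirectionally), never another region.
            if segment[j] == -1 or segment[i] == segment[j]:
                mask[i][j] = True
    return mask


# Two shared tokens, then a 2-token region and a 1-token region.
mask = build_region_mask(num_shared=2, region_lengths=[2, 1])
```

With this layout, adding a region adds one more block to the mask, and no region's caption can read another region's tokens.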

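2606.05833 2026-06-19 cs.CV cs.AI Version update 85%

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

Affiliations: University of California, Davis

Topic match (image-text multimodal): proposes the GeoVR framework to strengthen spatial understanding in multimodal large models.

AI summary: Proposes GeoVR, which trains on 2D video sequences only and distills 3D geometric knowledge (camera poses, depth maps, scale factors, and multi-scale 3D features) from pretrained 3D foundation models, reshaping the internal representations of multimodal large language models to give them spatial intelligence; it achieves state-of-the-art performance on spatial reasoning benchmarks.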

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

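2606.19706 2026-06-19 cs.CV cs.CL New submission 80%

NEST: Narrative Event Structures in Time for Long Video Understanding

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

Affiliations: Department of Computer Science, Virginia Tech

Topic match (image-text multimodal): multimodal narrative event annotation spanning visual content, dialogue, and audio.

AI summary: Introduces the NEST dataset (1005 full-length movies) with multimodal narrative event annotations and relation links, used to evaluate how well models understand event structure, temporal ordering, and long-range dependencies in long videos; experiments show that tasks such as event detection are highly challenging.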

Abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early setback, such as a job loss to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

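2606.19413 2026-06-19 cs.LG New submission 80%

Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

Affiliations: Applied Artificial Intelligence Initiative

Topic match (image-text multimodal): fusion of text and numerical data in multimodal time series forecasting.

AI summary: Addresses "text collapse" in multimodal time series forecasting, where the text branch is neglected, by proposing REST-TS: the text branch is supervised exclusively to predict the residual that the numerical backbone cannot explain, which forces it to extract genuine content and yields state-of-the-art performance.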

Abstract

Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing frameworks that we term text collapse: the text branch converges to a content-independent transformation, contributing negligible discriminative signal regardless of the input description. We argue that text collapse is a consequence of a fundamental asymmetry in time series forecasting: the numerical input is strongly autocorrelated with the output, making the numerical backbone inherently dominant, while the text branch, despite carrying complementary and often critical information, is insufficiently utilized, leading to its systematic underexploitation. To address this, we propose REST-TS (Residual-Exclusive Supervision for Text in Time Series), which turns the asymmetry into a design principle: the numerical backbone produces its own independent numerical forecast, and the text branch is exclusively supervised to predict the structured components of the residual, the prediction gap that numbers cannot explain. Because no numerical pathway can reduce these losses, the text branch must extract genuine content from the input description. Evaluated across diverse real-world domains and backbone architectures, REST-TS achieves state-of-the-art performance and consistently demonstrates greater text-branch utilization than existing frameworks, providing strong empirical evidence that supervising the text branch on the residual compels it to extract genuine content from the input.
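
The residual-exclusive supervision idea can be sketched with two separate losses. This is a simplified illustration: the abstract says the text branch predicts "structured components" of the residual, whereas the sketch uses the raw residual, and summing the two outputs into a final forecast is also an assumption. All names are hypothetical.

```python
from typing import List, Tuple


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def residual_exclusive_losses(
    numeric_forecast: List[float],
    text_output: List[float],
    target: List[float],
) -> Tuple[float, float, List[float]]:
    """Return (numeric_loss, text_loss, combined_forecast).

    The numerical backbone is scored on its standalone forecast. The text
    branch is scored only on the gap the backbone leaves behind; in a real
    training framework the residual would be treated as a fixed target.
    """
    numeric_loss = mse(numeric_forecast, target)
    residual = [t - n for t, n in zip(target, numeric_forecast)]
    text_loss = mse(text_output, residual)
    # Assumed combination rule: add the text correction to the numeric forecast.
    combined = [n + r for n, r in zip(numeric_forecast, text_output)]
    return numeric_loss, text_loss, combined


# Toy values: the backbone misses the first and last steps,
# and the text branch recovers exactly that gap.
numeric_loss, text_loss, combined = residual_exclusive_losses(
    numeric_forecast=[9.0, 12.0, 13.0],
    text_output=[1.0, 0.0, 2.0],
    target=[10.0, 12.0, 15.0],
)
```

Because `text_loss` depends only on the residual, improving the numeric forecast cannot lower it, which is the property the abstract relies on.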

2606.20527 2026-06-19 cs.CL cs.CV New submission 70%

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

Affiliations: Technical University of Munich; Munich Center for Machine Learning; Princeton Center for Information and Technology Policy

Topic match (image-text multimodal): studies visual bias in multimodal large language models.

AI summary: Introduces the StylisticBias benchmark, which changes one visual attribute at a time; it finds that age and body type dominate identity-level bias, while about 15 attributes, fashion style among them, explain nearly 80% of the variation, so bias is concentrated in a small set of visual cues.

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

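2606.19882 2026-06-19 cs.CV cs.LG New submission 70%

Multimodal Concept Bottleneck Models

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

Affiliations: UC San Diego

Topic match (image-text multimodal): a multimodal model that combines images and text.

AI summary: Proposes the Multimodal Concept Bottleneck Model (MM-CBM), which uses dual concept bottleneck layers to align image and text embeddings, enabling interpretable zero-shot classification and image retrieval, with an average accuracy improvement of up to 51.26% across four benchmarks.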

Comments Presented at NeurIPS 2025 Mechanistic Interpretability Workshop

Abstract

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are constrained in their ability to generalize beyond a fixed set of predefined classes and the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.
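
The dual-bottleneck idea can be sketched as follows: both the image embedding and each class-name text embedding are mapped to activations over the same named concepts, and classification compares them in that concept space. The sketch assumes each bottleneck layer is a plain dot product with fixed concept vectors; the actual MM-CBM layers are presumably learned, and all names and values here are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def concept_layer(embedding: List[float], concept_vectors: List[List[float]]) -> List[float]:
    """Map an embedding to one activation per concept (assumed: dot product)."""
    return [dot(embedding, c) for c in concept_vectors]


def cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def zero_shot_classify(
    image_emb: List[float],
    class_text_embs: Dict[str, List[float]],
    concept_vectors: List[List[float]],
) -> Tuple[str, List[float]]:
    """Pick the class whose text-side concept activations best match the image's."""
    image_concepts = concept_layer(image_emb, concept_vectors)
    scores = {
        name: cosine(image_concepts, concept_layer(text_emb, concept_vectors))
        for name, text_emb in class_text_embs.items()
    }
    return max(scores, key=scores.get), image_concepts


# Toy 2-D embedding space with two concepts: "has stripes" and "has wings".
concepts = [[1.0, 0.0], [0.0, 1.0]]
label, activations = zero_shot_classify(
    image_emb=[0.9, 0.1],
    class_text_embs={"zebra": [1.0, 0.0], "bird": [0.0, 1.0]},
    concept_vectors=concepts,
)
```

The returned `activations` are what make the prediction inspectable: each entry corresponds to a named concept, so a reader can see which concepts drove the chosen label.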

2606.19727 2026-06-19 cs.CL cs.AI New submission 70%

NRITYAM: Language Models Meet Art and Heritage of Dance

Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang

Affiliations: Shenzhen Technology University; New Delhi Institute of Management; Technische Universität Dresden; Ramakrishna Mission Vivekananda Educational and Research Institute; Indian Institute of Technology; Swami Vivekananda Institute of Technology; GuangDong Engineering Technology Research Center of Edge Intelligence

Topic match (image-text multimodal): includes evaluation of multimodal models, involving vision and language.

AI summary: Introduces the NRITYAM benchmark of 9,260 cultural question-answer pairs across 12 languages to evaluate how well language models understand global dance traditions, covering several model types.

Comments 18 pages, 12 figures, in ECML_PKDD'26

Abstract

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at https://github.com/niladrighosh03/NRITYAM.

2506.06952 2026-06-19 cs.CV Version update 70%

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

Affiliations: University of Illinois Urbana-Champaign; University of Maryland; Nvidia; Salesforce AI Research; Intuit AI Research

Topic match (image-text multimodal): a unified multimodal model that combines understanding and generation.

AI summary: Proposes LaTtE-Flow, an efficient unified architecture built on pretrained vision-language models; it uses layerwise timestep experts and a timestep-conditioned residual attention mechanism to handle both image understanding and generation, with roughly 6x faster generation.

Comments Unified multimodal model, Flow-matching

Abstract

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
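
The routing idea behind layerwise timestep experts can be sketched as a lookup from sampling timestep to layer group. The sketch assumes contiguous, equally sized groups and a uniform split of the timestep range [0, 1]; the paper's actual partition is not given in the abstract, and the function names are hypothetical.

```python
from typing import List


def build_layer_groups(num_layers: int, num_groups: int) -> List[List[int]]:
    """Split layer indices into contiguous groups (assumes even divisibility)."""
    size = num_layers // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def active_layers(t: float, groups: List[List[int]]) -> List[int]:
    """Return the layer group responsible for timestep t in [0, 1]."""
    # Clamp so that t == 1.0 still maps to the last group.
    index = min(int(t * len(groups)), len(groups) - 1)
    return groups[index]


groups = build_layer_groups(num_layers=12, num_groups=4)
early = active_layers(0.3, groups)   # second group: layers 3, 4, 5
late = active_layers(1.0, groups)    # last group: layers 9, 10, 11
```

In this toy configuration each sampling step runs 3 of 12 generation layers. How that translates into wall-clock speedup depends on the rest of the model, so the sketch makes no claim about the reported 6x figure.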

2305.14985 2026-06-19 cs.CV cs.CL Version update 70%

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

Affiliations: Columbia University; HKUST; University of California, Los Angeles

Topic match (image-text multimodal): combines an LLM and a VLM for multi-step reasoning.

AI summary: Proposes IdealGPT, a framework that uses large language models to iteratively decompose vision-and-language reasoning through a loop of sub-question generation, sub-answer collection, and final-answer reasoning, significantly improving multi-step reasoning in the zero-shot setting.

Comments 13 pages, 5 figures

Abstract

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
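
The three-module loop described in the abstract can be sketched with plain callables standing in for the models. This is a minimal control-flow sketch, not the released IdealGPT code: the callable signatures, the use of `None` to mean "not yet confident", and the `max_rounds` cap are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

Evidence = List[Tuple[str, str]]  # (sub-question, sub-answer) pairs


def iterative_answer(
    question: str,
    ask_subquestions: Callable[[str, Evidence], List[str]],  # stands in for the questioner LLM
    answer_subquestion: Callable[[str], str],                # stands in for the VLM
    reason: Callable[[str, Evidence], Optional[str]],        # stands in for the reasoner LLM
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Repeat decompose -> answer -> reason until the reasoner commits to an answer."""
    evidence: Evidence = []
    for round_index in range(1, max_rounds + 1):
        for sub_q in ask_subquestions(question, evidence):
            evidence.append((sub_q, answer_subquestion(sub_q)))
        answer = reason(question, evidence)
        if answer is not None:  # the reasoner is confident
            return answer, round_index
    return None, max_rounds


# Stub models: the reasoner commits once it has seen two pieces of evidence.
result = iterative_answer(
    question="Why is the person smiling?",
    ask_subquestions=lambda q, ev: [f"sub-question {len(ev) + 1}"],
    answer_subquestion=lambda sub_q: "yes",
    reason=lambda q, ev: "final" if len(ev) >= 2 else None,
)
```

Letting the reasoner decline to answer is what separates this loop from a single-pass pipeline, which the abstract criticizes for forcing a final answer from insufficient evidence.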

2504.02885 2026-06-19 cs.CL Version update 70%

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

Affiliations: The School of Computer Science, The University of Sydney; The School of Computing, Macquarie University; Doubao Medical Group, ByteDance

Topic match (image-text multimodal): uses image-text pairs for medical report generation.

AI summary: Proposes Med-R2, a fine-tuning strategy that adds a perception-driven long reasoning process guided by radiology knowledge, plus a reflection mechanism that corrects perceptual errors, improving the pathological feature perception and diagnostic accuracy of LVLMs in medical report generation.

Comments 28 pages, 3 figures, 1 table

Abstract

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

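2606.20559 2026-06-19 cs.CV cs.LG New submission 60%

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

Affiliations: University of North Carolina at Charlotte

Topic match (image-text multimodal): fuses knowledge from multimodal teachers through distillation.

AI summary: Proposes UNIEGO, a hierarchical multi-teacher distillation framework in which proxy models translate heterogeneous teacher knowledge into a homogeneous egocentric space and Selective Proxy Distillation adaptively picks reliable supervision; it achieves state-of-the-art results on three egocentric video understanding tasks.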

Abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks - action recognition, video retrieval, and action segmentation on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
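
Two steps from the abstract can be sketched in a few lines: selecting, per sample, the proxies that are both correct and confident, and initializing the student as a convex combination of proxy parameters. The confidence threshold, the fixed (rather than learned) weights, and all names are assumptions made for illustration.

```python
from typing import Dict, List


def select_proxies(
    proxy_probs: Dict[str, List[float]],
    label: int,
    min_confidence: float = 0.6,  # hypothetical threshold
) -> List[str]:
    """Keep proxies whose top prediction matches the label and is confident enough."""
    selected = []
    for name, probs in proxy_probs.items():
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(name)
    return selected


def convex_init(proxy_params: List[List[float]], weights: List[float]) -> List[float]:
    """Combine proxy parameter vectors with non-negative weights that sum to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    size = len(proxy_params[0])
    return [sum(w * p[i] for w, p in zip(weights, proxy_params)) for i in range(size)]


# One training sample with true class 1 and three hypothetical proxies.
reliable = select_proxies(
    {
        "depth": [0.1, 0.9],      # correct and confident: kept
        "skeleton": [0.55, 0.45], # wrong class: dropped
        "exo_rgb": [0.45, 0.55],  # correct but below threshold: dropped
    },
    label=1,
)
student_init = convex_init([[1.0, 2.0], [3.0, 4.0]], weights=[0.5, 0.5])
```

In the paper the combination weights are learned; fixing them here keeps the sketch short while still showing that the student starts inside the span of its proxies.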
