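arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

Multimodal Large Models

Large models and learning methods that span text, image, video, audio, and other modalities.

2026-06-19 to 2026-06-19. Indexed: 34. Sources: cs.CV, cs.CL, cs.AI, cs.MM, eess.AS

1. Image-Text Multimodal (3 papers)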

2305.14985 2026-06-19 cs.CV cs.CL Version update 70%

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

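Affiliations: Columbia University; HKUST; University of California, Los Angeles

Topic match: Image-text multimodal. Combines an LLM and a VLM for multi-step reasoning.

AI summary: Proposes the IdealGPT framework, which uses large language models to iteratively decompose vision-and-language reasoning tasks. A loop of sub-question generation, sub-answer collection, and final-answer reasoning substantially improves multi-step reasoning performance in the zero-shot setting.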

Comments 13 pages, 5 figures

Abstract

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
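
The loop described in the abstract (one LLM proposes sub-questions, a VLM answers them, a second LLM either commits to a final answer or asks for another round) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the callable signatures, the `max_rounds` budget, and the use of `None` to mean "not confident yet" are all assumptions, and the toy functions merely stand in for real model calls.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; the abstract does not specify any of them.
Evidence = List[Tuple[str, str]]                  # (sub-question, sub-answer) pairs
Questioner = Callable[[str, Evidence], List[str]]
Answerer = Callable[[str], str]                   # assumed to close over the image
Reasoner = Callable[[str, Evidence], Optional[str]]


def iterative_decompose(
    main_question: str,
    ask: Questioner,
    answer: Answerer,
    reason: Reasoner,
    max_rounds: int = 4,                          # assumed round budget
) -> Tuple[Optional[str], Evidence]:
    """Repeat ask -> answer -> reason until the reasoner commits to an answer."""
    evidence: Evidence = []
    for _ in range(max_rounds):
        # Step 1: generate sub-questions given what is known so far.
        for sub_q in ask(main_question, evidence):
            # Step 2: obtain a sub-answer for each sub-question.
            evidence.append((sub_q, answer(sub_q)))
        # Step 3: try to conclude; None stands for "not confident yet".
        final = reason(main_question, evidence)
        if final is not None:
            return final, evidence
    return None, evidence


# Toy stand-ins for the model calls, only to make the sketch runnable.
def toy_ask(question: str, evidence: Evidence) -> List[str]:
    return [f"sub-question {len(evidence) + 1}"]


def toy_answer(sub_question: str) -> str:
    return f"answer to {sub_question}"


def toy_reason(question: str, evidence: Evidence) -> Optional[str]:
    return "yes" if len(evidence) >= 2 else None


result, trace = iterative_decompose(
    "Is the person cooking?", toy_ask, toy_answer, toy_reason
)
```

With these toy stand-ins the reasoner declines after the first round and commits after the second, so `result` is `"yes"` and `trace` holds two sub-question/sub-answer pairs.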

2504.02885 2026-06-19 cs.CL Version update 70%

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

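Affiliations: The School of Computer Science, The University of Sydney; The School of Computing, Macquarie University; Doubao Medical Group, ByteDance

Topic match: Image-text multimodal. Uses image-text pairs for medical report generation.

AI summary: Proposes Med-R2, a fine-tuning strategy that introduces a perception-driven long reasoning process guided by radiology knowledge and adds a reflection mechanism to correct perceptual errors, improving the pathological feature perception and diagnostic accuracy of LVLMs in medical report generation.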

Comments 28 pages, 3 figures, 1 table

Abstract

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

2606.20559 2026-06-19 cs.CV cs.LG New submission 60%

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

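Affiliations: University of North Carolina at Charlotte

Topic match: Image-text multimodal. Distills knowledge fused from multimodal teachers.

AI summary: Proposes UNIEGO, a hierarchical multi-teacher distillation framework that uses proxy models to translate heterogeneous teacher knowledge into a homogeneous egocentric space and applies Selective Proxy Distillation to adaptively pick reliable supervision, reaching state-of-the-art results on three egocentric video understanding tasks.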

Abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks - action recognition, video retrieval, and action segmentation on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
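
Two ideas from the abstract lend themselves to a small illustration: picking, per training sample, the proxies that are "both correct and confident", and initializing the student as a convex combination of proxy parameters. The sketch below uses plain Python lists and rests on assumptions the abstract does not confirm: the confidence threshold `min_conf`, reading "correct" as "top prediction equals the label", and parameterizing the convex weights with a softmax. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def select_proxies(
    proxy_probs: Sequence[Sequence[float]],
    label: int,
    min_conf: float = 0.6,  # hypothetical threshold, not from the paper
) -> List[int]:
    """Return indices of proxies whose top prediction equals the label
    and whose probability for that prediction is at least min_conf."""
    selected = []
    for i, probs in enumerate(proxy_probs):
        top = max(range(len(probs)), key=lambda c: probs[c])
        if top == label and probs[top] >= min_conf:
            selected.append(i)
    return selected


def convex_init(
    proxy_params: Sequence[Sequence[float]],
    logits: Sequence[float],
) -> List[float]:
    """Blend proxy parameter vectors with softmax weights. Softmax weights
    are non-negative and sum to one, so the blend is a convex combination."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(proxy_params[0])
    return [
        sum(w * params[d] for w, params in zip(weights, proxy_params))
        for d in range(dim)
    ]


# Toy values: three proxies on a two-class sample whose true label is 0.
probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
chosen = select_proxies(probs, label=0)

# Equal logits give equal weights, so the result is the plain average.
init = convex_init([[0.0, 2.0], [4.0, 6.0]], logits=[0.0, 0.0])
```

In the toy data only the first proxy is kept: the second is correct but falls below the assumed threshold, and the third predicts the wrong class. In a real training setup the selected proxies would supply the distillation targets and the logits would be learned, neither of which this sketch attempts.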

2. Audio-Visual Multimodal (1 paper)

2606.20478 2026-06-19 eess.AS New submission 60%

Beyond Speaker Independence: Evaluating Cross-Lingual Acoustic-to-Articulatory Inversion Across Finnish and Russian

Ruchi Pandey, Tomi Kinnunen

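Topic match: Audio-visual multimodal. Cross-language acoustic-to-articulatory mapping involving multimodal features.

AI summary: This study systematically evaluates acoustic-to-articulatory inversion (AAI) under cross-speaker and cross-language domain shifts. Using FROST-EMA, a newly built Finnish-Russian bilingual EMA corpus, it compares articulatory targets, acoustic front-ends, and inversion back-ends. Cross-gender mismatch produces a moderate drop in Pearson correlation (about 0.05 to 0.10), and cross-language mismatch a larger one (about 0.10 to 0.20).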

Abstract

Acoustic-to-articulatory inversion (AAI) remains challenging under domain shifts where changes in speaker attributes and cross-language conditions often degrade performance. We conduct a systematic evaluation under such shifts and establish baseline benchmarks on FROST-EMA, a Finnish-Russian bilingual EMA corpus. FROST-EMA addresses the English bias and limited speaker diversity of existing resources. We benchmark (i) articulatory targets (raw EMA coordinates vs tract variables), (ii) acoustic front-ends (MFCC vs SSL features), and (iii) inversion back-ends (BiLSTM vs a lightweight attention-based sequence model). We further define evaluation protocols for cross-gender transfer (within language) and cross-language transfer (within gender). The results indicate that cross-gender mismatch introduces moderate Pearson correlation declines (approximately 0.05 to 0.10) relative to the in-domain baseline, whereas cross-language mismatch causes larger drops (approximately 0.10 to 0.20).
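
The reported declines are differences in Pearson correlation between predicted and measured articulatory trajectories, in-domain versus shifted. A minimal sketch of that arithmetic follows. The trajectories are made-up toy values, not data from FROST-EMA, and the paper's actual evaluation pipeline (per-channel averaging, alignment, and so on) is not described in the abstract.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy articulator trajectory and two hypothetical model predictions.
measured = [0.0, 1.0, 2.0, 3.0]
in_domain_pred = [0.1, 0.9, 2.1, 2.9]  # model tested on matched speakers
shifted_pred = [0.5, 0.4, 2.5, 1.9]    # model tested under a domain shift

r_in = pearson(measured, in_domain_pred)
r_shift = pearson(measured, shifted_pred)

# A positive value means the shifted condition tracks the trajectory worse.
drop = r_in - r_shift
```

A drop computed this way is an absolute difference on the correlation scale, which is how the ranges quoted in the abstract (approximately 0.05 to 0.10 and 0.10 to 0.20) should be read.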