视觉大模型 / VLM - arXivDaily 专题

2606.19646 2026-06-19 cs.IR cs.CV 新提交 85%

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

SAFE-Cascade: 面向图表问答的成本自适应视觉语言路由

Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu

发表机构 * University of Arkansas（亚拉巴马大学）

专题命中视觉问答：提出成本自适应路由系统，用于图表问答，涉及VLM调用决策。

AI总结提出SAFE-Cascade系统，通过OCR和轻量语言模型先给出答案，再由学习路由器决定是否调用VLM，在ChartQA上以73.1%的VLM调用率达到69.1%准确率，减少26.9%的VLM调用和9.3%的成本。

Comments Demo paper submitted at CIKM 2026. 4 pages, 2 figures

详情

AI中文摘要

视觉语言模型（VLM）在图表问答中表现出色，但若每个查询都调用VLM，当许多问题可通过OCR文本和轻量语言推理回答时，成本会不必要地高昂。我们展示了SAFE-Cascade，一个用于成本自适应图表问答的交互系统。给定图表图像和自然语言问题，SAFE-Cascade首先通过OCR提取图表文本，从纯文本语言模型获得临时答案，然后使用学习路由器决定接受文本答案还是升级到VLM。该演示向用户展示这一决策过程：OCR证据、纯文本答案、路由概率、升级决策、最终答案、估计成本和估计延迟并排显示。SAFE-Cascade被设计为一个透明界面，用于理解何时实际需要视觉基础。用户可以上传或选择图表、提问、检查每条路径使用的证据、比较纯文本和VLM答案，并调整升级阈值以探索准确率-成本边界。该系统使用Azure Document Intelligence进行OCR，gpt-5-mini作为纯文本模型，gemini-2.5-flash-image作为VLM，以及基于推理时特征训练的随机森林路由器。在从2500个样本实验中留出的375个ChartQA测试集上，SAFE-Cascade实现了69.1%的统一准确率和73.1%的VLM调用率，而全VLM基线为67.7%准确率和100% VLM调用率。观察到的+1.4个百分点差异在统计上不确定，因此我们将SAFE-Cascade解释为匹配全VLM性能，同时减少26.9%的VLM调用和9.3%的估计成本。该演示展示了选择性模态路由如何使多模态知识系统更加透明、可调优和成本感知。

英文摘要

Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.

URL PDF HTML ☆

赞 0 踩 0

2603.28387 2026-06-19 cs.AI cs.LG 版本更新 85%

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

脚手架效应：提示框架如何驱动临床VLM评估中的表面多模态增益

Doan Nam Long Vu, Simone Balloccu

发表机构 * Technical University of Darmstadt（达姆施塔特技术大学）

专题命中视觉问答：揭示临床VLM评估中提示框架的脚手架效应

AI总结研究发现，在临床VLM评估中，提示中提及MRI可用性即可解释70-80%的性能提升，与图像数据是否存在无关，这种“脚手架效应”揭示了表面评估无法反映真实多模态推理能力。

详情

AI中文摘要

可信的临床AI要求性能提升反映真实的证据整合而非表面伪影。我们在两个临床神经影像队列\textsc{FOR2107}（情感障碍）和\textsc{OASIS-3}（认知衰退）上评估了12个开源视觉语言模型（VLM）的二分类性能。两个数据集都包含结构MRI数据，但这些数据不携带可靠的个体级诊断信号。在这些条件下，较小的VLM在引入神经影像上下文后F1分数提升高达58%，蒸馏模型变得与规模大一个数量级的模型相当。对比置信度分析显示，仅仅在任务提示中\textit{提及}MRI可用性就解释了70-80%的转变，与影像数据是否存在无关，这是模态坍塌的一个领域特定实例，我们称之为\textit{脚手架效应}。专家评估揭示了在所有条件下捏造基于神经影像的正当理由，而偏好对齐虽然消除了引用MRI的行为，却使两种条件都退化为随机基线。我们的发现表明，表面评估不足以作为多模态推理的指标，这对VLM在临床环境中的部署有直接影响。

英文摘要

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, \textsc{FOR2107} (affective disorders) and \textsc{OASIS-3} (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58\% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely \emph{mentioning} MRI availability in the task prompt accounts for 70-80\% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the \emph{scaffold effect}. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

URL PDF HTML ☆

赞 0 踩 0

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 新提交 80%

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany（德国弗莱堡大学计算机视觉组）； Department of Radiology, Medical Center -- University of Freiburg, Germany（德国弗莱堡大学医学中心放射科）； CRIION-AI Lab, Freiburg, Germany（德国弗莱堡CRIION-AI实验室）

专题命中视觉问答：联合报告生成、VQA和空间定位

AI总结提出RefRad2D大规模双语数据集，通过LLM和自动分割生成空间定位数据，训练RadGrounder模型联合完成报告生成、VQA和空间定位，在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情

AI中文摘要

我们研究了如何在没有手动空间标注的情况下，为放射学训练具有视觉定位能力的视觉-语言模型（VLM）。我们引入了RefRad2D，这是一个大规模的双语（德语/英语）数据集，包含来自临床实践的120万对CT和MR图像-文本对，并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准（Slake，VQA-RAD）上，RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集，相比于仅在下游数据集上微调，提高了开放式VQA的性能，显示了数据集的迁移性。关键在于，添加定位监督不会降低语言质量，从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

URL PDF HTML ☆

赞 0 踩 0

2606.20561 2026-06-19 cs.CV 新提交 70%

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

TimeProVe: 先提出后验证，实现日常活动中的高效长视频时间推理

Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das

发表机构 * University of North Carolina, Charlotte（北卡罗来纳大学夏洛特分校）

专题命中视觉问答：使用VLM进行长视频问答验证

AI总结提出TimeProVe框架，先通过轻量模块生成基于动作的候选假设，再调用昂贵VLM验证，在长视频问答中降低75%VLM调用和93%推理成本，性能提升7.3%。

详情

AI中文摘要

长视频问答（LVQA）需要在数小时未修剪的视频中识别稀疏的、与查询相关的证据。现有方法要么使用大型视觉语言模型（VLM）密集处理视频，导致计算成本过高，要么依赖稀疏的基于字幕的推理，这往往会遗漏时间局部化和以运动为中心的证据。我们提出TimeProVe，一种用于长视频中时间基础推理的高效混合框架。TimeProVe首先使用轻量模块生成基于动作的答案-证据假设，随后仅调用昂贵的VLM进行针对性验证。我们框架的核心在于基于动作的候选证据（ACE）模块，该模块通过轻量级LLM推理将时间局部化的动作转换为查询条件化的候选答案和支持证据窗口。我们进一步引入OpenTSUBench（OTB），一个开放基准测试，旨在评估真实世界日常活动（ADL）场景中的时间基础推理。实验表明，TimeProVe在OTB上比最强基线高出7.3%，同时减少了75%的VLM调用和93%的推理成本。此外，在没有显式时间基础训练的情况下，TimeProVe在Charades-STA上取得了竞争性性能，并在结合基础VLM增强时达到了最先进的结果。

英文摘要

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer--evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.

URL PDF HTML ☆

赞 0 踩 0

2606.19684 2026-06-19 cs.CV 新提交 70%

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

探索多模态大语言模型与两阶段微调在时尚图像检索中的应用

Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM（胡志明市国家大学下属理科大学）； Vietnam National University, Ho Chi Minh（胡志明市国家大学）

专题命中视觉问答：利用LLaVA生成属性感知三元组进行时尚图像检索

AI总结提出融合多模态大语言模型（LLaVA）生成属性感知三元组，并采用两阶段微调策略增强对比学习，以解决时尚图像检索中标注数据稀缺和负采样简单的问题。

Comments SOICT 2025

2506.06952 2026-06-19 cs.CV 版本更新 70%

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

LaTtE-Flow: 基于层间时间步专家流的Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Maryland（马里兰大学）； Nvidia（英伟达）； Salesforce AI Research（Salesforce AI研究）； Intuit AI Research（Intuit AI研究）

专题命中视觉问答：统一图像理解与生成，基于预训练VLM。

AI总结提出LaTtE-Flow，一种基于预训练视觉语言模型的高效统一架构，通过层间时间步专家流和条件残差注意力机制，实现图像理解与生成，生成速度提升约6倍。

Comments Unified multimodal model, Flow-matching

详情

AI中文摘要

多模态基础模型在统一图像理解与生成方面取得了最新进展，为在单一框架内处理广泛的视觉-语言任务开辟了令人兴奋的途径。尽管取得了进展，现有的统一模型通常需要大量的预训练，并且与专门针对每项任务的模型相比，难以达到相同的性能水平。此外，许多这些模型存在图像生成速度慢的问题，限制了它们在实时或资源受限环境中的实际部署。在这项工作中，我们提出了基于层间时间步专家流的Transformer（LaTtE-Flow），一种新颖且高效的架构，可在单个多模态模型中统一图像理解与生成。LaTtE-Flow建立在强大的预训练视觉语言模型（VLM）之上，以继承强大的多模态理解能力，并通过新颖的层间时间步专家流架构扩展它们，以实现高效的图像生成。LaTtE-Flow将流匹配过程分布到专门的Transformer层组中，每组负责不同的时间步子集。这种设计通过在每个采样时间步仅激活一小部分层，显著提高了采样效率。为了进一步提升性能，我们提出了一种时间步条件残差注意力机制，用于跨层高效的信息重用。实验表明，LaTtE-Flow在多模态理解任务上取得了强劲的性能，同时与最近的统一多模态模型相比，实现了具有竞争力的图像生成质量，推理速度提高了约6倍。

英文摘要

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.

URL PDF HTML ☆

赞 0 踩 0

2504.02885 2026-06-19 cs.CL 版本更新 70%

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Med-R2：面向医学报告生成的感知与反思驱动复杂推理

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

发表机构 * The School of Computer Science, The University of Sydney（悉尼大学计算机科学学院）； The School of Computing, Macquarie University（麦考瑞大学计算机学院）； Doubao Medical Group, ByteDance（字节跳动 doubao 医疗集团）

专题命中视觉问答：使用视觉语言模型进行医学报告生成

AI总结提出Med-R2微调策略，通过引入感知驱动的长推理过程和放射学知识指导，并加入反思机制修正感知错误，提升LVLMs在医学报告生成中的病理特征感知和诊断准确性。

Comments 28 pages, 3 figures, 1 table

详情

AI中文摘要

自动化医学报告生成（MRG）越来越多地被用于减轻人工报告负担和辅助决策。大型视觉语言模型（LVLMs）因其细粒度的图像-文本对齐和先进的文本生成能力，在自动化MRG中展现出巨大潜力。目前，最先进的MRG主要专注于通过直接监督微调（SFT）来适应预训练的LVLMs，这是一种使用医学图像-报告对的微调策略。然而，有几个因素限制了这些LVLMs的性能。首先，直接SFT使LVLMs能够直接生成医学报告，而无需经过病理特征感知和诊断推理的中间思考过程。这导致可能无法感知病理特征，从而引起误诊。其次，直接SFT缺乏放射学特定知识的指导，导致LVLMs误解感知到的病理特征并做出错误诊断。为了解决这些问题，我们提出了一种名为Med-R2的新型微调策略。我们引入了一个感知驱动的长推理过程，该过程在报告生成之前进行，并融入放射学特定知识作为指导。此外，为了减轻复杂推理中潜在的感知错误，引入了一种反思机制来细化病理特征的感知和生成的报告。我们的实验表明，Med-R2通过微调LVLMs有效增强了MRG的病理特征感知能力和诊断准确性。

英文摘要

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

URL PDF HTML ☆

赞 0 踩 0