Multimodal Large Models

2606.14702 2026-06-18 cs.CV New submission Topic 90

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

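Topic match (audio-visual multimodal): audio-visual reasoning dataset and question answering

AI summary: Proposes the OmniVideo-100K dataset, which uses entity-anchored video scripts and a clue-guided QA generation mechanism to address cross-segment entity inconsistency and insufficient long-range reasoning in audio-visual question answering; fine-tuned models achieve significant gains on multiple benchmarks.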

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

2606.19157 2026-06-18 eess.AS cs.CL New submission Topic 85

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

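Topic match (audio-visual multimodal): evaluates context utilisation in audio large language models

AI summary: Proposes the IndicContextEval benchmark, comprising 56 hours of natural speech from 555 speakers across 8 Indic languages, and uses a 7-level prompting framework to evaluate whether audio large language models genuinely use context rather than relying on parametric knowledge.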

Comments Accepted at Interspeech 2026

2606.18924 2026-06-18 cs.SD New submission Topic 85

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

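Topic match (audio-visual multimodal): analyzes the mechanism of text bias in audio LLMs

AI summary: Through mechanistic analysis, this paper reveals a text-dominance bias in audio LLMs, finds that the text pathway actively suppresses the full audio representation, and proposes back-patching, a training-free intervention that strengthens audio representations and mitigates text dominance.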

Comments Preprint

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS New submission Topic 85

Continuous Audio Thinking for Large Audio Language Models

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

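Topic match (audio-visual multimodal): proposes the CoAT framework to strengthen continuous audio thinking in audio language models.

AI summary: Proposes the Continuous Audio Thinking (CoAT) framework, which organizes acoustic information in a continuous latent space via expert distillation so that audio language models can draw on rich acoustic features before generating a response, with no additional autoregressive decoding cost, improving performance on multiple audio tasks.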

Comments Preprint

2606.19203 2026-06-18 eess.AS New submission Topic 80

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

Jaeeun Baik, Ui-Hyeop Shin, Jiwoon Lee, Woocheol Jeong, Hyung-Min Park

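Topic match (audio-visual multimodal): proposes a self-distillation framework to improve speech recognition robustness; falls under audio processing

AI summary: Proposes the DASH self-distillation framework, which learns clean-noisy consistency from dual views by distilling hidden representations from multiple encoder layers and minimizing the KL divergence between prototype assignment distributions, improving noise robustness while preserving clean accuracy, at an extra cost of only about 4% of fine-tuning time.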

Comments Accepted to Interspeech 2026

2606.19338 2026-06-18 cs.CV New submission Topic 85

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

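Topic match (multimodal evaluation): non-Markov games for evaluating multimodal model memory

AI summary: Proposes the RNG-Bench benchmark suite, which uses two games, a pair-matching memory game and a 3D maze, to evaluate how well multimodal large models reconstruct past observations and act on them in non-Markov environments; finds that errors stem mainly from forgetting rather than decision-making, and that fine-tuning improves performance.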

2606.19120 2026-06-18 cs.LG cs.CV New submission Topic 85

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

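Topic match (image-text multimodal): MLLM post-training framework that decouples perception and reasoning

AI summary: Proposes the ViGOS framework, which decouples perception and reasoning during MLLM post-training to avoid textual shortcuts and strengthen image-dependent behavior.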

Comments 29 pages, 5 figures, 8 tables

2606.18988 2026-06-18 cs.AI New submission Topic 85

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

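Topic match (image-text multimodal): applies multimodal large models to interpretable deception detection, combining vision and audio.

AI summary: Proposes the ThinkDeception framework, which brings multimodal large language models to deception detection and achieves interpretable cognitive reasoning through step-by-step reasoning and Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO), reaching a new state of the art on mainstream benchmarks.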

Comments 10 pages, 4 figures

2606.18780 2026-06-18 cs.CV cs.CL cs.MM New submission Topic 85

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

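Topic match (image-text multimodal): multimodal information extraction, using multi-expert MLLMs to augment data.

AI summary: Proposes SAMA, a semantic anchor-aligned augmentation framework that builds structured semantic anchors to guide multi-expert multimodal large models in generating high-fidelity text, synthesizes images with an anchor-preserving diffusion mechanism, and adds a dual-constraint filtering module, significantly improving performance on low-resource multimodal information extraction.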

Comments Accepted by IEEE Transactions on Multimedia

2606.17030 2026-06-18 cs.CV New submission Topic 85

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

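Topic match (image-text multimodal): multimodal world model that fuses vision and language

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface; with a dual-stream MMDiT, a large-scale embodied world-knowledge corpus, and progressive curriculum training, it predicts physically consistent future visual trajectories in tasks such as robot manipulation and autonomous driving, and achieves the best results on multiple benchmarks.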

2606.15088 2026-06-18 cs.SD cs.CL eess.AS New submission Topic 85

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

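Topic match (image-text multimodal): studies pathway-dependent knowledge forgetting in multimodal models

AI summary: Proposes the Paired Pathway Control Protocol (PPCP) and finds that, in multimodal models, knowledge acquired through the text pathway is forgotten more readily than knowledge acquired through the audio pathway; the effect is unaffected by architectural depth and stems mainly from differences in input representation.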

2606.18974 2026-06-18 cs.CV New submission Topic 80

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

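Topic match (image-text multimodal): cross-modal self-distillation transfers visual reasoning ability to a text-only model.

AI summary: Proposes Visual-OPSD, which uses cross-modal on-policy self-distillation to transfer visual-thinking reasoning generated by multi-step diffusion to a text-only student model, achieving a 14.3x speedup and a 3.40 percentage-point performance gain.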

2606.18893 2026-06-18 cs.CL New submission Topic 80

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

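Topic match (image-text multimodal): multimodal emotion-cause pair extraction, learning robust confidence

AI summary: Proposes the RPCL framework, which uses a confidence-difference margin constraint and adversarial perturbation to make pair confidence more discriminative and stable in multimodal emotion-cause pair extraction, raising Pair F1 by about 2.6 to 2.8 percentage points on three datasets.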

Comments 11 pages, 3 figures, 5 tables

2606.18710 2026-06-18 cs.CR New submission Topic 80

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

Xinjian Luo, Hongyan Chang, Jianxin Wei, Yuncheng Wu, Xiaofeng Gao, Meikang Qiu, Ting Yu, Xue Liu

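Topic match (image-text multimodal): image prompt reconstruction attacks on distributed MLLMs.

AI summary: Studies the risk that intermediate embeddings leak image prompts in distributed MLLM inference, and proposes two passive black-box attacks, MPAA and IEDA, that achieve pixel-level and semantic-level image reconstruction.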

2606.18262 2026-06-18 cs.HC New submission Topic 75

When Prompts Mislead: Textual Dominance and Diagnostic Bias in MLLMs

Inhyuk Park, Doohyun Park

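Topic match (image-text multimodal): studies text-dominance bias of multimodal LLMs in medical diagnosis.

AI summary: Shows that in medical multimodal large language models, text prompts can dominate visual cues and cause diagnostic bias; even when a model has spatial localization ability, the prompting strategy may still be unsafe.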

Comments Accepted to the CVPR 2026 MMFM-BIOMED Workshop

2606.18661 2026-06-18 cs.CV cs.AI New submission Topic 70

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

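Topic match (image-text multimodal): multimodal dataset containing images, masks, and text descriptions

AI summary: Proposes an instruction-driven agent framework comprising the multimodal dataset LandslideBench, the landslide-specific vision-language model LandslideVLM, and the domain-rule-augmented agent LandslideAgent, enabling autonomous landslide identification and analysis.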

2606.18441 2026-06-18 cs.CV New submission Topic 70

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

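Topic match (image-text multimodal): reasoning optimization for video multimodal large language models

AI summary: Proposes CF-GRPO, a process-level reward framework that needs no temporal annotations; it builds a consensus-frame prior from cues intrinsic to the video and uses a consensus-frame reward to align the model's frame usage with that prior, improving video reasoning performance.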

2606.19062 2026-06-18 cs.CV New submission Topic 85

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

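Topic match (cross-modal retrieval): cross-modal retrieval, dual-objective encoding.

AI summary: Proposes the DREAM model, which combines dual-path representation enhancement and alignment with a hierarchical visual encoder and hybrid language modeling to set a new state of the art on video retrieval.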

2606.18885 2026-06-18 cs.CV cs.IR New submission Topic 75

LARE: Low-Attention Region Encoding for Text-Image Retrieval

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

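Topic match (cross-modal retrieval): text-image cross-modal retrieval

AI summary: Proposes the LARE framework, which encodes low-attention regions in parallel with the full image to address visual encoders overlooking key details in crowded scenes, improving retrieval performance on a dense-scene subset.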

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

2606.19140 2026-06-18 cs.LG New submission Topic 55

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis

Hugo Miccinilli, Theo Di Piazza

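Topic match (other multimodal): handles multimodal clinical data, but is not a large model

AI summary: Proposes ChronoSurv, a directed-graph-based framework for multimodal survival analysis that models clinical trajectories through a hierarchical topology and heterogeneous message passing, achieving the best discriminative performance and reliable calibration on a head-and-neck cancer dataset.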

Comments Accepted at MICCAI 2026. Submitted version due to embargo

1. Audio-visual multimodal (5 papers)

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Continuous Audio Thinking for Large Audio Language Models

DASH: Dual-View Self-Distillation with Multi-Layer Hidden Representations for Robust Speech Recognition

2. Multimodal evaluation (1 paper)

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

3. Image-text multimodal (11 papers)

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Image Prompt Reconstruction Attacks on Distributed MLLM Inference Frameworks

When Prompts Mislead: Textual Dominance and Diagnostic Bias in MLLMs

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

4. Cross-modal retrieval (2 papers)

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

LARE: Low-Attention Region Encoding for Text-Image Retrieval

5. Other multimodal (1 paper)

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis