arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

Multimodal Large Models

Large models and learning methods spanning text, image, video, audio, and other modalities.

Collected for today / the current date: 3 Signal sources: cs.CV, cs.CL, cs.AI, cs.MM, eess.AS
2606.01711 2026-06-18 cs.CV Version update 90%

Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

Hyeonwoo Cho, Donghyeon Baek, Yewon Kim, Bumsub Ham

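Affiliation * KAIST (Korea Advanced Institute of Science and Technology)

Topic match Image-text multimodal: visual token reduction for multimodal large models, improving inference efficiency

AI summary Proposes the RESTORE framework, which improves visual token reduction by calibrating positional and attentional distortions, raising multimodal large language model performance while preserving efficiency.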

Comments Accepted to ICML 2026

English abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency. Project page is available at https://cvlab.yonsei.ac.kr/projects/RESTORE
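The abstract mentions token merging by feature averaging around selected anchors. As a rough illustration of that general idea only, the sketch below keeps the highest-scoring tokens as anchors and averages each remaining token into its most similar anchor. This is a generic merging scheme, not the RESTORE method: the function names, the importance scores, and the cosine-similarity assignment are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, scores, keep):
    """Generic token merging (hypothetical helper, not the paper's algorithm).

    Keeps the `keep` highest-scoring tokens as anchors and averages every
    other token into the anchor it is most similar to. Returns the anchors'
    original indices and the merged feature vectors.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    anchor_ids = sorted(order[:keep])  # restore original sequence order
    groups = {i: [tokens[i]] for i in anchor_ids}
    for i in order[keep:]:
        best = max(anchor_ids, key=lambda a: cosine(tokens[i], tokens[a]))
        groups[best].append(tokens[i])
    merged = []
    for a in anchor_ids:
        members = groups[a]
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return anchor_ids, merged

# Toy example: four 2-D "visual tokens" reduced to two.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_scores = [0.9, 0.1, 0.8, 0.2]
kept_ids, kept_features = merge_tokens(toy_tokens, toy_scores, keep=2)
```

Returning the anchors' original indices is a design choice of this sketch: it lets a caller reuse the original position IDs for the reduced sequence instead of renumbering the surviving tokens.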

2604.18109 2026-06-18 cs.CL cs.SD Version update 70%

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz

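Affiliation * Brno University of Technology

Topic match Image-text multimodal: understanding and interpreting multimodal multilingual sentence embeddings

AI summary Proposes factorized linear projection (FLiP) models that recover lexical content from multilingual and multimodal sentence embeddings, revealing the modality and language biases of the encoders.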

Comments Accepted to Interspeech 2026

English abstract

This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is publicly available at https://github.com/BUTSpeechFIT/FLiP.
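To make the idea of recovering lexical content through a linear projection more concrete, here is a minimal sketch. It assumes that "factorized" means the embedding-to-vocabulary map is written as a product of two smaller matrices (a low-rank bottleneck); the abstract does not spell out the factorization, so this is an assumption. The matrices, vocabulary, and function names are hypothetical toy values, not taken from the FLiP repository.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def predict_words(embedding, down, up, vocab, k):
    """Score every vocabulary word through a two-factor linear map.

    `down` (r x d) projects the embedding to a rank-r bottleneck and
    `up` (V x r) maps the bottleneck to one score per vocabulary word.
    Both factors are assumed to be already trained. Returns the k
    highest-scoring words.
    """
    hidden = matvec(down, embedding)
    logits = matvec(up, hidden)
    top = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)[:k]
    return {vocab[i] for i in top}

def lexical_recall(predicted, reference):
    """Fraction of the reference words that appear in the predicted set."""
    reference = set(reference)
    return len(predicted & reference) / len(reference) if reference else 0.0

# Toy example with a 2-D embedding and a 4-word vocabulary.
toy_vocab = ["cat", "dog", "sat", "ran"]
toy_down = [[1.0, 0.0], [0.0, 1.0]]
toy_up = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
toy_words = predict_words([1.0, 0.0], toy_down, toy_up, toy_vocab, k=2)
toy_recall = lexical_recall(toy_words, ["the", "cat", "sat"])
```

The recall function reflects the kind of measurement the abstract reports (the share of a sentence's words recoverable from its embedding), though the exact metric definition used in the paper may differ.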

2601.14968 2026-06-18 cs.LG cs.AI Version update 70%

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

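Affiliation * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

Topic match Image-text multimodal: multimodal inputs combining numerical sequences, textual features, and instructions

AI summary Recasts time series classification as a multimodal generative task, bridges the modality gap with a discretization module and an alignment projection layer, and uses implicit feature modeling to improve language model performance.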

English abstract

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
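The abstract describes a discretization module that turns a continuous series into discrete temporal tokens. One simple way to picture this is nearest-codebook assignment over fixed-length patches, sketched below. The abstract does not say how the module works, so the patching scheme, the fixed codebook, and the `<ts_k>` token format are all assumptions for illustration; a real system would learn the codebook rather than hard-code it.

```python
def discretize(series, codebook, patch_len):
    """Map each non-overlapping patch to the index of its nearest code.

    Uses squared Euclidean distance. A trailing remainder shorter than
    `patch_len` is dropped. The codebook is assumed to be given.
    """
    tokens = []
    for start in range(0, len(series) - patch_len + 1, patch_len):
        patch = series[start:start + patch_len]
        dists = [sum((p - c) ** 2 for p, c in zip(patch, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def to_token_text(tokens):
    """Render token indices as text symbols (hypothetical `<ts_k>` format)."""
    return " ".join(f"<ts_{t}>" for t in tokens)

# Toy example: a 7-point series, patches of length 2, three codes.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_series = [0.1, -0.1, 0.9, 1.2, 0.0, 0.8, 5.0]
toy_tokens = discretize(toy_series, toy_codebook, patch_len=2)
toy_text = to_token_text(toy_tokens)
```

Once a series is expressed as a string of such symbols, it can be placed in the same input sequence as the contextual text and the task instruction, which is the kind of multimodal input the abstract describes.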