Multimodal Information Fusion - arXivDaily Topic

2504.11171 2026-06-19 cs.CV cs.AI Version update 90%

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

Affiliations: IBM Research – Europe; ETH Zurich; Forschungszentrum Jülich; European Space Agency; Φ-Lab; NASA IMPACT; University of Iceland

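Topic match (fusion architectures and evaluation): a multimodal Earth observation foundation model; falls under fusion architectures.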

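AI summary: Proposes TerraMind, the first any-to-any generative multimodal foundation model for Earth observation. It is pretrained on dual-scale representations (token-level and pixel-level), enables zero-shot and few-shot applications, introduces the "Thinking-in-Modalities" capability, and reports leading performance on benchmarks such as PANGAEA.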

Comments: Accepted at ICCV'25

Abstract

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

2506.06952 2026-06-19 cs.CV Version update 85%

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

Affiliations: University of Illinois Urbana-Champaign; University of Maryland; Nvidia; Salesforce AI Research; Intuit AI Research

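Topic match (fusion architectures and evaluation): unified image understanding and generation; falls under fusion architectures.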

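AI summary: Proposes LaTtE-Flow, an efficient unified architecture built on pretrained vision-language models. It uses layerwise timestep experts and timestep-conditioned residual attention to handle both image understanding and generation, with around 6x faster inference than recent unified multimodal models.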

Comments: Unified multimodal model, Flow-matching

Abstract

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
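The layerwise timestep-expert idea in this abstract (each group of Transformer layers handles its own subset of timesteps, so only a few layers run per sampling step) can be illustrated with a minimal Python sketch. The uniform split of the time interval, the equal-sized contiguous layer groups, and every function name below are illustrative assumptions; the abstract does not say how timesteps are assigned to groups.

```python
def group_for_timestep(t: float, num_groups: int) -> int:
    """Map a flow time t in [0, 1] to the index of the layer group handling it.

    Assumption: the interval [0, 1] is split uniformly among the groups.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # t == 1.0 is clamped into the last group.
    return min(int(t * num_groups), num_groups - 1)


def active_layers(t: float, num_layers: int, num_groups: int) -> range:
    """Return the indices of the layers that would run at time t.

    Assumption: groups are contiguous and of equal size.
    """
    if num_layers % num_groups != 0:
        raise ValueError("num_layers must be divisible by num_groups")
    per_group = num_layers // num_groups
    group = group_for_timestep(t, num_groups)
    return range(group * per_group, (group + 1) * per_group)


def layer_calls(num_steps: int, num_layers: int, num_groups: int) -> int:
    """Count layer evaluations over a uniform schedule of sampling steps."""
    total = 0
    for i in range(num_steps):
        t = i / num_steps
        total += len(active_layers(t, num_layers, num_groups))
    return total
```

With 24 layers split into 4 groups, each sampling step in this sketch touches 6 layers instead of 24. The count covers layer evaluations only and should not be read as reproducing the reported 6x speedup, which depends on details the abstract does not give.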

2601.03112 2026-06-19 eess.IV cs.CV Version update 70%

DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations

Kailin Tan, Jincheng Dai, Sixian Wang, Guo Lu, Shuo Shao, Kai Niu, Wenjun Zhang, Ping Zhang

Affiliations: Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University; University of Shanghai for Science and Technology

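Topic match (fusion architectures and evaluation): a fusion framework that jointly learns semantic encoding and diffusion decoding.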

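AI summary: Proposes DiT-JSCC, a framework that jointly learns a semantics-prioritized representation encoder and a diffusion transformer (DiT) generative decoder. Coarse-to-fine conditional decoding and a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity improve semantic consistency and transmission efficiency under extreme channel conditions.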

Comments: 14 pages, 14 figures, 2 tables

Abstract

Generative joint source-channel coding (GJSCC) has emerged as a new Deep JSCC paradigm for achieving high-fidelity and robust image transmission under extreme wireless channel conditions, such as ultra-low bandwidth and low signal-to-noise ratio. Recent studies commonly adopt diffusion models as generative decoders, but they frequently produce visually realistic results with limited semantic consistency. This limitation stems from a fundamental mismatch between reconstruction-oriented JSCC encoders and generative decoders, as the former lack explicit semantic discriminability and fail to provide reliable conditional cues. In this paper, we propose DiT-JSCC, a novel GJSCC backbone that can jointly learn a semantics-prioritized representation encoder and a diffusion transformer (DiT) based generative decoder, our open-source project aims to promote the future research in GJSCC. Specifically, we design a semantics-detail dual-branch encoder that aligns naturally with a coarse-to-fine conditional DiT decoder, prioritizing semantic consistency under extreme channel conditions. Moreover, a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity is introduced to further improve the transmission efficiency, thereby indeed redefining the notion of information value in the era of generative decoding. Extensive experiments demonstrate that DiT-JSCC consistently outperforms existing JSCC methods in both semantic consistency and visual quality, particularly in extreme regimes.
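The abstract mentions a training-free bandwidth allocation strategy inspired by Kolmogorov complexity but gives no details. Kolmogorov complexity itself is uncomputable, so the Python sketch below swaps in a common stand-in: the zlib-compressed length of each block. It is not the authors' strategy. The proportional split, the block granularity, and all names are assumptions made for illustration.

```python
import zlib
from typing import List, Sequence


def complexity_proxy(data: bytes) -> int:
    """Compressed length in bytes, used as a rough upper-bound proxy
    for Kolmogorov complexity (an assumption of this sketch)."""
    return len(zlib.compress(data, 9))


def allocate_bandwidth(blocks: Sequence[bytes], total_budget: int) -> List[int]:
    """Split an integer budget of channel symbols across blocks in
    proportion to their complexity proxy."""
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    if not blocks:
        return []
    scores = [complexity_proxy(b) for b in blocks]
    total = sum(scores)  # always > 0: zlib output is never empty
    # Integer arithmetic keeps the floor shares and remainders exact.
    alloc = [total_budget * s // total for s in scores]
    remainders = [total_budget * s % total for s in scores]
    # Hand out the leftover symbols by largest remainder.
    leftover = total_budget - sum(alloc)
    order = sorted(range(len(blocks)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

In this sketch a flat block (for example, all zeros) compresses to a few bytes and receives a small share of the budget, while a noisy block of the same size receives most of it. The allocation always sums to the requested budget.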
