Multimodal Information Fusion - arXivDaily Topic

2606.19277 2026-06-18 cs.CV New submission 85%

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

Affiliations: Computational Data Science and Engineering; College of Science and Technology

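Topic match (remote sensing fusion and pansharpening): adaptation strategies for multimodal fusion in remote sensing visual question answering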

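AI summary: Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures and performs remote sensing VQA with fewer than 5% trainable parameters; the hybrid FLAVA architecture strikes the best balance between multimodal reasoning and retrieval.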

Comments: 4 pages, 2 figures, accepted and to be presented at the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington, D.C.

English abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi scale object distribution, and semantic complexity of aerial imagery. While general domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine tuning. This study presents a comparative analysis of RS Adapter, a Parameter Efficient Fine Tuning (PEFT) strategy, applied across three distinct Vision Language Model (VLM) architectures: the Dual Encoder CLIP, the Encoder Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of trainable parameters. Experimental results on the high resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource efficient VQA in disaster assessment and urban monitoring.
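
The "less than 5 percent of trainable parameters" claim can be sanity-checked with simple arithmetic. The sketch below is not the paper's implementation: it assumes a ViT-Base-like layer (hidden size 768, MLP ratio 4), a hypothetical bottleneck width of 64, one adapter after the attention block and one after the MLP block, and it counts a single layer only (embeddings and layer norms are ignored). The function names are illustrative.

```python
def adapter_params(d_model: int, bottleneck: int) -> int:
    """Weights and biases of one down-projection plus one up-projection."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def layer_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Attention (Q, K, V, output) plus a two-layer MLP; layer norms ignored."""
    attention = 4 * (d_model * d_model + d_model)
    d_hidden = mlp_ratio * d_model
    mlp = (d_model * d_hidden + d_hidden) + (d_hidden * d_model + d_model)
    return attention + mlp


def trainable_fraction(d_model: int, bottleneck: int) -> float:
    """Share of parameters that train when the layer is frozen and only the
    two adapters (after attention, after MLP) receive gradients."""
    adapters = 2 * adapter_params(d_model, bottleneck)
    return adapters / (layer_params(d_model) + adapters)


def adapter_forward(h, w_down, w_up):
    """Residual bottleneck: h + W_up . relu(W_down . h), biases omitted.

    w_down has shape (bottleneck, d_model); w_up has shape (d_model, bottleneck).
    """
    z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_down]
    delta = [sum(w * v for w, v in zip(row, z)) for row in w_up]
    return [x + d for x, d in zip(h, delta)]


# Assumed sizes: hidden width 768, hypothetical bottleneck width 64.
fraction = trainable_fraction(d_model=768, bottleneck=64)
print(f"trainable share per layer: {fraction:.2%}")
```

Under these assumed sizes the two adapters account for roughly 2.7% of the layer's parameters, which is consistent with the order of magnitude the abstract reports. With an all-zero up-projection, `adapter_forward` returns its input unchanged, a common way to start adaptation from the frozen model's behavior.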

2606.19204 2026-06-18 cs.CV New submission 85%

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

Nengbo Zhang, Chang sheng

Affiliations: Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS)

Topic match (remote sensing fusion and pansharpening): fusing radar and optical time-series data for forest classification

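AI summary: Proposes ROSA-TFormer, a model that integrates SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling, and uses Sentinel-1/2 time series to classify Pinus sylvestris var. mongolica plantations, reaching 99.67% overall accuracy.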

Comments: journal in tree classification

English abstract

Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.
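
The abstract names a sensor-aware gate and temporal attention pooling without giving their formulas. The sketch below shows one generic form of each, as an assumption rather than the authors' design: a per-feature sigmoid gate that blends a SAR embedding with an optical embedding, followed by softmax-weighted pooling over time steps. In a trained model the gate logits and attention scores would be learned; here they are hypothetical constants, as are the toy embeddings.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(sar, optical, gate_logits):
    """Blend two same-length embeddings with a per-feature gate in (0, 1)."""
    gates = [sigmoid(g) for g in gate_logits]
    return [g * s + (1.0 - g) * o for g, s, o in zip(gates, sar, optical)]


def attention_pool(steps, scores):
    """Softmax the per-step scores, then take the weighted sum over time."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(steps[0])
    pooled = [sum(w * step[i] for w, step in zip(weights, steps)) for i in range(dim)]
    return pooled, weights


# Three time steps (for example, three half-month composites), two features each.
sar_series = [[0.2, 0.4], [0.3, 0.1], [0.5, 0.6]]
optical_series = [[0.8, 0.6], [0.7, 0.9], [0.1, 0.2]]
gate_logits = [0.0, 0.0]  # sigmoid(0) = 0.5, i.e. equal trust in both sensors

fused = [gated_fusion(s, o, gate_logits) for s, o in zip(sar_series, optical_series)]
pooled, weights = attention_pool(fused, scores=[0.0, 0.0, 0.0])
print(pooled, weights)
```

With equal scores the pooling reduces to a plain temporal mean; raising the score of one step shifts weight toward that date, which is how such a layer can emphasize seasonally informative acquisitions.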

2606.05368 2026-06-18 cs.CV Version update 80%

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

Affiliations: Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich; School of Engineering and Natural Sciences (SENS), University of Iceland; Global Land Monitoring Group, GFZ Helmholtz Centre for Geosciences

Topic match (remote sensing fusion and pansharpening): fusing multi-sensor predictors for forest structure modeling

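AI summary: Existing methods do not learn forest vertical structure as an ordered profile. To address this, the paper proposes Biomazon, a multimodal benchmark dataset that pairs GEDI RH and AGBD targets with multi-sensor predictors, runs ablation studies with a shared encoder-decoder framework, and establishes a reference benchmark for structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.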

Comments: 32 pages, 21 figures, 8 tables

English abstract

Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks a ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
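
"Physically consistent ordering across RH percentiles" means a lower percentile should never be taller than a higher one. The abstract does not say how the benchmark's baselines handle this, so the sketch below shows one common construction as an assumption: predict the lowest percentile directly, then add strictly positive increments (softplus of unconstrained outputs) so the profile is non-decreasing by design. A small checker counts ordering violations in any profile. All values are made up for illustration.

```python
import math


def softplus(x: float) -> float:
    """Smooth, strictly positive function, written to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordered_profile(base: float, raw_increments):
    """Turn unconstrained outputs into a non-decreasing height profile."""
    profile = [base]
    for raw in raw_increments:
        profile.append(profile[-1] + softplus(raw))
    return profile


def ordering_violations(profile) -> int:
    """Count adjacent percentile pairs where the lower percentile is taller."""
    return sum(1 for lo, hi in zip(profile, profile[1:]) if lo > hi)


# Hypothetical raw outputs of a prediction head for five RH percentiles.
raw_head_output = [1.5, -3.0, 0.2, 2.0]
rh_profile = ordered_profile(base=0.5, raw_increments=raw_head_output)

# A profile predicted as independent scalars can break the ordering.
unconstrained = [0.5, 4.1, 3.8, 9.0, 12.5]

print(rh_profile, ordering_violations(rh_profile))
print(unconstrained, ordering_violations(unconstrained))
```

The cumulative form guarantees zero violations regardless of the raw outputs, at the cost of coupling the errors of higher percentiles to those of lower ones; a violation count like the one above is a simple way to compare it against independent scalar heads.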

2511.20302 2026-06-18 cs.CV Version update 80%

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

Affiliations: Sun Yat-sen University; The Chinese University of Hong Kong; Shanghai Jiao Tong University; National Supercomputing Center in Shenzhen; The Hong Kong University of Science and Technology; Beijing Institute of Technology; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

Topic match (remote sensing fusion and pansharpening): adaptive tuning for cross-domain remote sensing semantic segmentation

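AI summary: Proposes CrossEarth-Gate, which uses a Fisher-information-guided adaptive module selection mechanism to dynamically activate the most critical cross-domain modules, achieving the best performance on 16 of 18 cross-domain benchmarks.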

English abstract

In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance on 16 out of 18 cross-domain benchmarks for RS semantic segmentation.
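
The selection step described above ranks candidate modules by Fisher Information and activates only the most important ones. The sketch below illustrates the general idea with a diagonal empirical Fisher estimate (mean squared gradient per parameter, summed within a module) and a top-k pick. It is an assumption-laden illustration, not the paper's algorithm: the scoring rule, the value of k, and the per-sample gradients are all hypothetical.

```python
def fisher_score(grad_samples) -> float:
    """Diagonal empirical Fisher estimate for one module: squared gradients
    averaged over samples and summed over the module's parameters."""
    n = len(grad_samples)
    return sum(g * g for sample in grad_samples for g in sample) / n


def select_modules(module_grads, k: int):
    """Return the names of the k modules with the highest Fisher score."""
    scores = {name: fisher_score(grads) for name, grads in module_grads.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical per-sample gradients for three candidate modules at one layer
# (two samples, two parameters per module).
module_grads = {
    "spatial": [[0.9, -0.7], [1.1, -0.8]],
    "semantic": [[0.1, 0.0], [-0.1, 0.1]],
    "frequency": [[0.4, 0.5], [-0.5, 0.3]],
}

active, scores = select_modules(module_grads, k=2)
print(active, scores)
```

Summing over parameters favors larger modules; dividing by the parameter count would compare modules on a per-parameter basis instead. Which normalization the paper uses is not stated in the abstract.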
