arXivDaily: arXiv daily academic digest, updated Monday through Friday

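Vision and Robotics

Multimodal Information Fusion

Information fusion for image, video, multi-sensor, and cross-modal perception, including image fusion, infrared-visible fusion, remote sensing, medical imaging, LiDAR/radar/camera fusion, and audio-visual fusion.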

Indexed for today/current date: 2. Sources: cs.CV, eess.IV, eess.SP, cs.RO, cs.MM
2606.19277 2026-06-18 cs.CV New submission 85%

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

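Affiliations: Computational Data Science and Engineering; College of Science and Technology

Topic match: Remote sensing fusion and pansharpening. Adaptation strategies for multimodal fusion in remote sensing visual question answering.

AI summary: Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures and performs remote sensing VQA with fewer than 5% of parameters trainable. The hybrid FLAVA architecture gives the best balance between multimodal reasoning and retrieval.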

Comments 4 pages, 2 figures, accepted and to be presented at 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington D.C

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of parameters trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
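The abstract describes bottleneck adapters injected into frozen attention and MLP layers but gives no implementation details. The following is a minimal, standard-library-only sketch of a generic residual bottleneck adapter and a back-of-envelope count of the trainable fraction. Everything in it is an assumption for illustration, not the paper's method: the class name `BottleneckAdapter`, the ReLU activation, the zero-initialized up-projection, two adapters per layer, and the sizes 768 and 64 are all hypothetical choices.

```python
import random


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


class BottleneckAdapter:
    """Hypothetical residual adapter: h + W_up * relu(W_down * h + b_down) + b_up."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        self.b_down = [0.0] * d_bottleneck
        # Assumption: the up-projection starts at zero, so the adapter is an
        # identity map at initialization and the frozen backbone is unchanged.
        self.w_up = [[0.0] * d_bottleneck for _ in range(d_model)]
        self.b_up = [0.0] * d_model

    def __call__(self, h):
        pre = [a + b for a, b in zip(matvec(self.w_down, h), self.b_down)]
        z = [max(0.0, x) for x in pre]  # ReLU in the bottleneck
        delta = [a + b for a, b in zip(matvec(self.w_up, z), self.b_up)]
        return [x + d for x, d in zip(h, delta)]  # residual connection


def adapter_params(d_model, d_bottleneck):
    """Weights and biases of one down-projection plus one up-projection."""
    return 2 * d_model * d_bottleneck + d_bottleneck + d_model


def transformer_layer_params(d_model, mlp_ratio=4):
    """Parameter count of one standard transformer layer (assumed layout)."""
    attn = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
    d_ff = mlp_ratio * d_model
    mlp = d_model * d_ff + d_ff + d_ff * d_model + d_model
    norms = 2 * 2 * d_model  # two layer norms, scale and shift each
    return attn + mlp + norms


def trainable_fraction(d_model, d_bottleneck, adapters_per_layer=2):
    """Share of per-layer parameters that are trainable when only adapters train."""
    trainable = adapters_per_layer * adapter_params(d_model, d_bottleneck)
    frozen = transformer_layer_params(d_model)
    return trainable / (frozen + trainable)


# Tiny forward pass: with the zero-initialized up-projection the output equals the input.
adapter = BottleneckAdapter(d_model=8, d_bottleneck=2)
h = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
out = adapter(h)

# Hypothetical sizes: hidden width 768, bottleneck width 64.
fraction = trainable_fraction(768, 64)
print(f"trainable fraction per layer: {fraction:.4f}")
```

Under these assumed sizes the per-layer trainable share comes out to roughly 2.7%, which is consistent with the abstract's "less than 5 percent" figure, although the paper's actual bottleneck width and adapter placement are not stated here.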

2606.19204 2026-06-18 cs.CV New submission 85%

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

Nengbo Zhang, Chang sheng

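Affiliation: Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS)

Topic match: Remote sensing fusion and pansharpening. Fusion of radar and optical time-series data for forest classification.

AI summary: Proposes ROSA-TFormer, which integrates SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling, and uses Sentinel-1/2 time-series data to classify Pinus sylvestris var. mongolica plantations, reaching 99.67% overall accuracy.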

Comments journal in tree classification

Abstract

Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.
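The abstract names a sensor-aware gate and temporal attention pooling without giving their formulas. The sketch below shows one common way such components are written, using only the Python standard library. It is an illustration under stated assumptions, not the ROSA-TFormer implementation: the scalar sigmoid gate, the dot-product attention scores, the function names `sensor_gate_fuse` and `temporal_attention_pool`, and all numeric values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sensor_gate_fuse(sar, opt, gate_w, gate_b=0.0):
    """Blend SAR and optical embeddings of one time step.

    Assumption: a single scalar gate g in (0, 1) is computed from the
    concatenated embeddings; the output is g * sar + (1 - g) * opt.
    """
    g = sigmoid(dot(gate_w, sar + opt) + gate_b)
    return [g * s + (1.0 - g) * o for s, o in zip(sar, opt)]


def temporal_attention_pool(sequence, query):
    """Collapse a time series of embeddings into one vector.

    Assumption: each time step is scored by a dot product with a learned
    query vector, and the scores are normalized over time with softmax.
    """
    weights = softmax([dot(query, x) for x in sequence])
    dim = len(sequence[0])
    pooled = [sum(w * x[i] for w, x in zip(weights, sequence)) for i in range(dim)]
    return pooled, weights


# Toy input: three time steps, two-dimensional embeddings per sensor.
sar_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
opt_seq = [[0.0, 2.0], [2.0, 0.0], [1.0, 1.0]]

# With zero gate weights the gate is exactly 0.5, so fusion is a plain average.
gate_w = [0.0, 0.0, 0.0, 0.0]
fused = [sensor_gate_fuse(s, o, gate_w) for s, o in zip(sar_seq, opt_seq)]

# With a zero query every time step gets the same attention weight.
pooled, weights = temporal_attention_pool(fused, query=[0.0, 0.0])
print(fused, weights, pooled)
```

In a trained model the gate weights and the query would be learned, so the gate could favor SAR when optical observations are unreliable and the pooling could emphasize the seasons that best separate the classes. Whether ROSA-TFormer behaves this way is not stated in the abstract.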