Multimodal Information Fusion - arXivDaily Topic

2603.10791 2026-06-19 eess.IV Version update 90%

Semantic Satellite Communications for Synchronized Audiovisual Reconstruction

Fangyu Liu, Peiwen Jiang, Wenjin Wang, Xiao Li, Shi Jin

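Topic match (audio-video / vision-language fusion): Proposes an audiovisual semantic transmission system that achieves cross-modal generation and synchronized reconstruction.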

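AI summary: Proposes an adaptive multimodal semantic transmission system that uses a dual-stream generative architecture and a dynamic keyframe update mechanism to achieve high-quality synchronized audiovisual reconstruction in bandwidth-limited satellite scenarios, significantly reducing bandwidth consumption and improving robustness.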

Abstract

Satellite communications face severe bottlenecks in supporting high-fidelity synchronized audiovisual services, as conventional schemes struggle with cross-modal coherence under fluctuating channel conditions, limited bandwidth, and long propagation delays. To address these limitations, this paper proposes an adaptive multimodal semantic transmission system tailored for satellite scenarios, aiming for high-quality synchronized audiovisual reconstruction under bandwidth constraints. Unlike static schemes with fixed modal priorities, our framework features a dual-stream generative architecture that flexibly switches between video-driven audio generation and audio-driven video generation. This allows the system to dynamically decouple semantics, transmitting only the most important modality while employing cross-modal generation to recover the other. To balance reconstruction quality and transmission overhead, a dynamic keyframe update mechanism adaptively maintains the shared knowledge base according to wireless scenarios and user requirements. Furthermore, a large language model based decision module is introduced to enhance system adaptability. By integrating satellite-specific knowledge, this module jointly considers task requirements and channel factors such as weather-induced fading to proactively adjust transmission paths and generation workflows. Simulation results demonstrate that the proposed system significantly reduces bandwidth consumption while achieving high-fidelity audiovisual synchronization, improving transmission efficiency and robustness in challenging satellite scenarios.
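
The switching behaviour described in the abstract can be pictured with a small rule-based sketch. This is not the paper's method: the abstract says the decision is made by a large language model based module, and the rule below only stands in for it. The names `LinkState` and `choose_mode`, the bitrates, the fade margin, and the `keyframe_only` fallback are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    bandwidth_kbps: float  # currently usable bandwidth (assumed unit)
    rain_fade_db: float    # estimated weather-induced attenuation


# Hypothetical per-modality semantic bitrates; not taken from the paper.
VIDEO_KBPS = 300.0
AUDIO_KBPS = 24.0
FADE_MARGIN_DB = 6.0


def choose_mode(link: LinkState, task_priority: str) -> str:
    """Pick the modality to transmit; the other is generated at the receiver."""
    if task_priority not in ("video", "audio"):
        raise ValueError("task_priority must be 'video' or 'audio'")
    # Assumed linear derating: the usable rate drops to zero at twice the margin.
    derate = max(0.0, 1.0 - link.rain_fade_db / (2 * FADE_MARGIN_DB))
    usable = link.bandwidth_kbps * derate
    if task_priority == "video" and usable >= VIDEO_KBPS:
        return "send_video_generate_audio"
    if usable >= AUDIO_KBPS:
        return "send_audio_generate_video"
    # Assumed fallback: rely on keyframes already held in the shared knowledge base.
    return "keyframe_only"
```

Under these assumed numbers, `choose_mode(LinkState(1000, 9), "video")` falls back to sending audio, because the derated rate of 250 kbps is below the assumed video rate.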

2508.15228 2026-06-19 cs.CV Version update 85%

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory

Topic match (audio-video / vision-language fusion): Collaborative multi-modal coding fuses RGB, RGBD, and point cloud features.

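AI summary: Proposes TriMM, described as the first feed-forward 3D-native generative model that learns from basic multi-modalities. It fuses RGB, RGBD, and point cloud features through collaborative multi-modal coding and combines auxiliary 2D/3D supervision with a triplane latent diffusion model to generate high-quality 3D assets.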

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.
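
For readers unfamiliar with the triplane representation mentioned in the abstract, the following is a minimal sketch of the generic idea, not TriMM's code: features live on three axis-aligned planes, and a 3D point is decoded by projecting it onto each plane and aggregating the three lookups. The scalar features, nearest-neighbour lookup, summation, and the helper names `to_index` and `query_triplane` are simplifying assumptions.

```python
def to_index(coord, res):
    """Map a coordinate in [-1, 1] to the nearest grid index in [0, res - 1]."""
    clamped = min(1.0, max(-1.0, coord))
    t = (clamped + 1.0) / 2.0
    return min(res - 1, int(round(t * (res - 1))))


def query_triplane(planes, point):
    """Sum the features found by projecting `point` onto the XY, XZ, and YZ planes.

    `planes` maps "xy", "xz", "yz" to square grids (lists of lists) of scalars.
    Real systems store feature vectors and interpolate bilinearly; this sketch
    uses scalars and nearest-neighbour lookup to stay short.
    """
    x, y, z = point
    res = len(planes["xy"])
    ix, iy, iz = (to_index(c, res) for c in (x, y, z))
    return planes["xy"][ix][iy] + planes["xz"][ix][iz] + planes["yz"][iy][iz]


# Toy 3x3 planes with easily recognisable values.
planes = {
    "xy": [[10 * i + j for j in range(3)] for i in range(3)],
    "xz": [[100 for _ in range(3)] for _ in range(3)],
    "yz": [[i - j for j in range(3)] for i in range(3)],
}
feature = query_triplane(planes, (-1.0, 0.0, 1.0))
```

In a latent diffusion setting the three planes would be the quantity being generated, with a separate decoder turning the aggregated feature into geometry and texture; those parts are omitted here.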

2508.04424 2026-06-19 cs.CV Version update 85%

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

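Topic match (audio-video / vision-language fusion): Composed object retrieval combines visual and textual inputs, which places it under vision-language fusion.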

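AI summary: Proposes the Composed Object Retrieval (COR) task, which performs object-level retrieval by composing a reference object, its mask, and a retrieval text. Builds the COR125K benchmark and the CORE model, which significantly outperforms existing methods.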

Abstract

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.
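
The abstract mentions region-level contrastive learning without giving the loss. One common form of contrastive objective is InfoNCE, sketched below as an assumption rather than as CORE's actual loss: a composed query embedding is pulled toward the target region and pushed away from distractor regions in the same image. The function names, the cosine similarity, and the temperature value are illustrative choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, regions, target_idx, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the target region's similarity.

    `query` is the composed (reference object + text) embedding, and `regions`
    holds one embedding per candidate object in the target image.
    """
    logits = [cosine(query, r) / temperature for r in regions]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_idx]


query = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0]]  # index 0 matches the query, index 1 is a distractor
loss_match = info_nce(query, regions, target_idx=0)
loss_mismatch = info_nce(query, regions, target_idx=1)
```

The loss is small when the target region is the one most similar to the query and large when a distractor is more similar, which is the behaviour the abstract describes as aligning composed representations with target objects while suppressing distractors.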

2509.10416 2026-06-19 cs.RO Version update 75%

TASC: Task-Aware Shared Control for Relational Telemanipulation

Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry

Affiliations: KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics; KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images

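Topic match (audio-video / vision-language fusion): Uses a vision-language model to predict the spatial constraints that guide assistance, which places it under vision-language fusion.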

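AI summary: Proposes the TASC framework, which builds an open-vocabulary interaction graph from visual input to infer task-level user intent and provides shared-control assistance based on spatial constraints, improving the efficiency and generalization of relational telemanipulation.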

Comments: Accepted to IROS 2026

Abstract

We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.

2606.05833 2026-06-19 cs.CV cs.AI Version update 70%

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

Affiliations: University of California, Davis

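Topic match (audio-video / vision-language fusion): Learns 3D geometric representations from video to strengthen the spatial intelligence of multimodal large language models.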

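AI summary: Proposes the GeoVR framework, which distills 3D geometric knowledge (camera poses, depth maps, a metric scale factor, and multi-scale 3D features) from pre-trained 3D foundation models using only 2D video sequences. This reshapes the internal representations of multimodal large language models to give them spatial intelligence, and the paper reports state-of-the-art performance on spatial reasoning benchmarks.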

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
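
The four geometric targets listed in the abstract suggest a weighted multi-objective loss. The sketch below shows only that combination step and rests on assumptions: mean squared error is a placeholder for every term, the equal default weights are arbitrary, and `geometric_loss` and the dictionary keys are hypothetical names. The abstract does not state the actual loss functions or weights.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


TERMS = ("pose", "depth", "scale", "feat")


def geometric_loss(pred, target, weights=None):
    """Weighted sum of four placeholder terms, one per geometric target.

    pose  : inter-frame camera pose parameters
    depth : dense depth values
    scale : metric scale factor
    feat  : intermediate features matched to a 3D teacher model
    """
    if weights is None:
        weights = {name: 1.0 for name in TERMS}
    terms = {name: mse(pred[name], target[name]) for name in TERMS}
    total = sum(weights[name] * terms[name] for name in TERMS)
    return total, terms


pred = {"pose": [0.0, 0.0], "depth": [1.0, 1.0], "scale": [2.0], "feat": [0.0, 0.0, 0.0]}
target = {"pose": [0.0, 0.0], "depth": [0.0, 0.0], "scale": [1.0], "feat": [0.0, 0.0, 0.0]}
total, terms = geometric_loss(pred, target)
```

Returning the individual terms alongside the total makes it easy to monitor each objective during training and to tune the weights.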
