arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.15179 2026-06-16 cs.AI New submission

CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

Xuedong Hu, Zhiqing Tang, Zhi Yao, Tian Wang, Weijia Jia

Affiliations: Beijing Normal University; BNU-HKBU United International College; University of Macau; Shenzhen Research Institute of Big Data; Guangdong Key Laboratory of Artificial Intelligence and Multi-Modal Data Processing

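AI summary: To address the low throughput caused by frequent synchronization and dense evidence transfer in document-isolated dual-end RAG, the paper proposes CONCORD, an asynchronous sparse aggregation framework. Using waiting debt control and a certificate-guided minimal supplementation mechanism, it substantially raises throughput and cuts communication while preserving answer quality.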

Comments: To be published in IEEE ICWS 2026

Abstract

Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud. Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation. CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision. Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by $1.66\times$ and $2.15\times$, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.
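
The idea of requesting only the remote evidence needed to settle the current greedy decision can be illustrated with a small sketch. This is not CONCORD's actual certificate, which the abstract does not specify. It assumes that a token's aggregated score is the sum of a local score and a remote score, and that every remote score not yet received lies in a known interval [lo, hi]. The names `certified_argmax`, `sparse_greedy`, and `fetch_remote` are hypothetical.

```python
def certified_argmax(local, remote_known, lo, hi):
    """Return the greedy token if the score bounds already determine it, else None."""
    lower, upper = {}, {}
    for tok, score in local.items():
        if tok in remote_known:
            # Remote evidence received: the aggregated score is exact.
            lower[tok] = upper[tok] = score + remote_known[tok]
        else:
            # Assumption: the missing remote score lies in [lo, hi].
            lower[tok], upper[tok] = score + lo, score + hi
    best = max(lower, key=lower.get)
    if all(lower[best] > upper[t] for t in local if t != best):
        return best
    return None


def sparse_greedy(local, fetch_remote, lo, hi):
    """Fetch remote scores one at a time until the greedy token is certified."""
    known = {}
    while True:
        tok = certified_argmax(local, known, lo, hi)
        if tok is not None:
            return tok, known
        pending = [t for t in local if t not in known]
        if not pending:
            # Everything fetched and an exact tie remains: plain argmax.
            return max(local, key=lambda t: local[t] + known[t]), known
        # Ask for the candidate that could still score highest.
        nxt = max(pending, key=lambda t: local[t] + hi)
        known[nxt] = fetch_remote(nxt)
```

Under these assumptions the returned token always equals the argmax of the fully aggregated scores, while tokens whose upper bound cannot reach the leader are never requested.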

2606.15176 2026-06-16 cs.CV cs.AI New submission

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

Weihao Gao

Affiliations: School of Computer Science and Artificial Intelligence, Guangdong University of Education

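AI summary: Presents UltraSeg, an ultra-lightweight architecture that performs real-time ultrasound image segmentation on CPUs and mobile devices with accuracy comparable to large models, removing the GPU dependency and lowering the cost of AI.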

Comments: 15 pages, 4 figures

Abstract

Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.
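
The Dice score used to compare the models above can be computed as in the following minimal sketch for binary masks. The function name `dice_score` and the smoothing term `eps` are illustrative choices, not details taken from the paper.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient 2|P∩T| / (|P| + |T|) for two binary masks (lists of rows)."""
    intersection = sum(
        p * t
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # eps keeps the score defined (and equal to 1) when both masks are empty.
    return (2 * intersection + eps) / (total + eps)
```

A score of 1 means the predicted and reference masks coincide, and 0 means they do not overlap at all.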

2606.15171 2026-06-16 cs.RO New submission

Seam-to-Graph Reconstruction for Garment Configuration Alignment

Xuzhao Huang, Kai Tang, Fuyuki Tokuda, Norman C. Tien, Kazuhiro Kosuge

Affiliations: JC STEM Lab of Robotics for Soft Materials, Department of Electrical and Electronic Engineering, Faculty of Engineering, The University of Hong Kong; Unprecedented-scale Data Analytics Center, Tohoku University; Graduate School of Information Sciences, Tohoku University; Department of Electrical and Electronic Engineering, Faculty of Engineering, The University of Hong Kong

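AI summary: Proposes a Seam-to-Graph network, built on graph neural networks and attention mechanisms, that maps partially observable seam information to a topology-encoded structural skeleton graph for real-time garment state estimation. It also designs a deformation-aware hierarchical visual servoing controller for garment configuration alignment, and validates human-level accuracy and robustness on a bimanual robot system.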

Comments: 11 pages, 9 figures

Abstract

Seams encode rich structural information about garments but are frequently partially observable in robotic manipulation scenarios. To robustly leverage seam information, we propose a Seam-to-Graph network based on graph neural networks and attention mechanisms. This network maps unstructured seam observations to a topology-encoded structural skeleton graph for real-time garment state estimation. Using this skeleton-graph-based state estimation, we design a deformation-aware, hierarchical visual servoing controller for garment configuration alignment. We implement this controller on a bimanual robot system to load a garment onto a screen printing platen and to align it to the desired configuration precisely. Real-robot experiments demonstrate that the robot using the proposed method not only achieves human-level alignment accuracy with reduced variance in alignment error but is also robust to different garments. These results demonstrate that the use of seam information is effective for garment manipulation.

2606.15169 2026-06-16 cs.CV New submission

Label Shift Aware Adaptation for Online Zero-shot Learning with Contrastive Language-Image Pre-Training (CLIP)

Pengxiao Han, Changkun Ye, Yanshuo Wang, Jinguang Tong, Miaohua Zhang, Xuesong Li, Jie Hong, Lars Petersson

Affiliations: Australian National University; China North Vehicle Research Institute; The Hong Kong Polytechnic University; Griffith University; CSIRO (Commonwealth Scientific and Industrial Research Organisation); The University of Hong Kong

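AI summary: To address the mismatch between the test data distribution and CLIP's training data distribution in online zero-shot learning, the paper proposes Label Shift Aware (LSA), which improves classification performance through domain adaptation and label shift correction.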

Abstract

Vision-language models like Contrastive Language-Image Pre-Training (CLIP) have been extensively studied in data-scarce scenarios. A particularly challenging and realistic task in this area is online zero-shot learning with CLIP, where unknown test samples are predicted sequentially in random order by CLIP while keeping the feature extraction and model parameters fixed during the sequential inference phase. Most existing approaches in this setting address the problem by adapting representations online using incoming test samples, while neglecting the distribution of the data on which CLIP was initially trained. This mismatch can lead to degraded performance when the label distribution in the test data differs from that of the training domain. To address this gap, we propose Label Shift Aware (LSA), which formulates the online zero-shot classification task as a domain adaptation problem. Specifically, LSA adapts the predictions computed by CLIP, which was trained on an unknown source distribution, to a target distribution using only unlabeled test data, and applies label shift correction to mitigate the mismatch between the source and target domains. Extensive experiments across multiple datasets demonstrate that the proposed LSA consistently outperforms state-of-the-art online zero-shot learning methods based on CLIP.
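
Label shift correction in general can be sketched as re-weighting each predicted class probability by the ratio of target to source class priors. The code below is the classic prior re-weighting scheme with a running estimate of the target prior, not the LSA algorithm itself. Because the abstract states that CLIP's source distribution is unknown, the uniform default for `source_prior` is purely an assumption, and `OnlinePriorEstimator` and `pseudo_count` are hypothetical names.

```python
def correct(probs, source_prior, target_prior):
    """Re-weight class probabilities by target/source prior ratios and renormalize."""
    weighted = [p * t / s for p, s, t in zip(probs, source_prior, target_prior)]
    z = sum(weighted)
    return [w / z for w in weighted]


class OnlinePriorEstimator:
    """Running estimate of the target label prior from unlabeled test samples."""

    def __init__(self, n_classes, source_prior=None, pseudo_count=10.0):
        # Assumption: a uniform source prior when nothing better is known.
        self.source = source_prior or [1.0 / n_classes] * n_classes
        self.target = list(self.source)
        # The pseudo-count damps the influence of the first few samples.
        self.count = pseudo_count

    def update(self, probs):
        """Correct one prediction, then fold it into the target prior estimate."""
        posterior = correct(probs, self.source, self.target)
        self.count += 1
        self.target = [
            t + (p - t) / self.count for t, p in zip(self.target, posterior)
        ]
        return posterior
```

Each test sample is corrected with the current prior estimate before that estimate is updated, so the procedure needs no labels and touches no model parameters.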

2606.15167 2026-06-16 cs.CV New submission

Variational Network with Wavelet-based UNET in Accelerated MRI Reconstruction from Under Sampled K-space Data

Yasir Arafat Prodhan, Shaikh Anowarul Fattah

Affiliations: Bangladesh University of Engineering and Technology

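AI summary: Proposes a variational network combined with a wavelet-based U-Net that uses learnable multi-scale frequency representations and physics-guided iterative reconstruction to suppress artifacts and preserve high-frequency detail in both single-coil and multi-coil settings, reaching state-of-the-art performance on the fastMRI and M4Raw datasets.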

Comments: 14 pages, 9 figures

Abstract

Fully sampled MRI requires dense k-space acquisition, leading to long scan times, reduced clinical throughput, and increased sensitivity to patient motion. Accelerated MRI addresses this by acquiring undersampled k-space data and reconstructing the missing information computationally. However, reconstruction from undersampled measurements is highly ill-posed and can introduce aliasing artifacts, noise amplification, and loss of anatomical detail. Although conventional parallel imaging and compressed sensing methods mitigate these issues, and deep learning methods have further improved reconstruction quality, preserving high-frequency structures under aggressive undersampling remains challenging. In this work, we propose a Variational Network with a Wavelet-based U-Net (W-UNet) for accelerated MRI reconstruction. The framework combines physics-guided iterative reconstruction with learnable multi-scale frequency representations. Standard pooling operations are replaced with Discrete Wavelet Transform and Inverse Wavelet Transform modules, enabling lossless downsampling while preserving low-frequency structure and high-frequency edge details. Integrated into the refinement and sensitivity map estimation stages, the proposed design improves artifact suppression, feature preservation, and reconstruction fidelity in both single-coil and multi-coil settings. Experiments on fastMRI knee and M4Raw brain datasets show state-of-the-art performance. Ablation studies further confirm the effectiveness of wavelet-based feature decomposition for accelerated MRI reconstruction.
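
Why a wavelet transform gives lossless downsampling, unlike pooling, can be seen in a one-level 2D Haar transform: the four half-resolution sub-bands together contain exactly the information of the input. The sketch below uses the orthonormal Haar wavelet as an assumption, since the abstract does not say which wavelet W-UNet uses, and the function names are illustrative.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of an even-sized image (list of rows).

    Returns four half-resolution sub-bands: one low-pass band (ll) and
    three detail bands (lh, hl, hh).
    """
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(h // 2):
        for j in range(w // 2):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2.0  # local average (low frequency)
            lh[i][j] = (a - b + c - d) / 2.0  # horizontal differences
            hl[i][j] = (a + b - c - d) / 2.0  # vertical differences
            hh[i][j] = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2 exactly, recovering the full-resolution image."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s + x + y + z) / 2.0
            img[2 * i][2 * j + 1] = (s - x + y - z) / 2.0
            img[2 * i + 1][2 * j] = (s + x - y - z) / 2.0
            img[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2.0
    return img
```

Max or average pooling would keep only one value per 2x2 block. Here the three detail bands carry the edge information that pooling discards, which is what allows the inverse transform to restore the input exactly.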

2606.15162 2026-06-16 cs.CV New submission

GeoStream: Toward Precise Camera Controlled Streaming Video Generation

Yizhou Zhao, Yifan Wang, Xiaoyuan Wang, Yushu Wu, Hao Zhang, Moayed Haji-Ali, Rameen Abdal, Ashkan Mirzaei, Yanyu Li, Willi Menapace, Laszlo Jeni, Sergey Tulyakov, Peter Wonka, Chaoyang Wang

Affiliations: CMU (Carnegie Mellon University); Northeastern University; UIUC (University of Illinois Urbana-Champaign); Rice University; Snap Inc.; KAUST (King Abdullah University of Science and Technology)

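AI summary: Proposes GeoStream, a framework that uses a self-refreshing 3D cache and on-policy distillation to achieve precise metric-scale camera control in autoregressive streaming video generation, addressing the control failure under viewpoint movement and the distribution shift seen in existing methods.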

Abstract

Accurate interactive camera control is essential for video-based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out-of-distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non-autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric-scale camera control in autoregressive streaming video generation. Our method maintains a self-refreshing 3D cache that is periodically updated online from the model's own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student's own generated frames, yielding a fully on-policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off-policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second-order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.
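
The unproject-then-reproject step described above follows standard pinhole camera geometry and can be sketched for a single pixel. This illustrates the geometry only, not GeoStream's implementation. The intrinsics tuple `(fx, fy, cx, cy)` and the convention that `rotation` and `translation` map source-camera coordinates into the target camera frame are assumptions.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the source camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(point, rotation, translation):
    """Apply a rigid transform: rotation (3x3 nested lists) then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera: not visible in the target view
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Map a source pixel with known depth to its location in the target view."""
    point = unproject(u, v, depth, *intrinsics)
    return project(transform(point, rotation, translation), *intrinsics)
```

Applying `reproject_pixel` to every pixel of the latest generated frame yields a sparse point rendering in the target view. Pixels that map outside the image or behind the camera leave holes, which is why a cache built from a single initial frame stops helping once the viewpoint leaves the original frustum.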

2606.15161 2026-06-16 cs.CL New submission

Beyond Layer Importance in Layer-wise Sparsity: An Inter-Layer Perturbation-Absorption Perspective

Tao Jing, Ningxin Wu, Chen Kang, Dong Yu, Changliang Li, Pengyuan Liu

Affiliations: University of Science and Technology of China

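AI summary: Through controlled perturbation experiments, the paper finds that layers in large language models respond heterogeneously to pruning perturbations: early layers amplify them, while middle and late layers absorb them. Building on this, it proposes absorption-aware correction, which reduces perplexity by 7.13% and raises zero-shot accuracy by 1.02% at 70% sparsity.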

Comments: 10 pages, 4 figures, 4 tables. Submitted to EMNLP 2026

Abstract

The considerable layer-wise redundancy in large language models (LLMs) has established non-uniform sparsity allocation across layers as the standard pruning approach for efficient compression. Existing layer-wise allocation methods that estimate allocation strategy from local signals such as activation outliers or weight spectra mainly derive from local layer importance, whereas the final post-pruning performance is also influenced by the network's subsequent compensatory capacity. In this paper, we directly characterize this property through controlled perturbation experiments. We make the following empirical findings. First, layers exhibit highly heterogeneous responses to pruning-scale perturbations. In most cases, early layers amplify perturbations, while middle and late layers actively absorb them, with relative L2 drift decreasing monotonically across depth and direction realigning toward the unperturbed hidden-state trajectory. Second, absorption is a large-perturbation phenomenon. Under small perturbations the network exhibits amplification across all layers, and the transition to absorption occurs smoothly as perturbation magnitude grows to pruning scale. This enriches the linearized accumulation theory underlying related works. Building on these findings, we define an absorption coefficient per layer and propose absorption-aware correction, an orthogonal augmentation that improves OWL and AlphaPruning by reducing perplexity by 7.13% and boosting zero-shot accuracy by 1.02% across multiple model families at 70% sparsity.
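
The relative L2 drift and the directional realignment mentioned above can be measured per layer as in the sketch below. The abstract does not define the absorption coefficient, so the layer-to-layer drift ratio returned by `drift_ratios` is only a hypothetical proxy: values above 1 indicate amplification and values below 1 indicate absorption.

```python
import math


def relative_l2_drift(clean, perturbed):
    """||perturbed - clean|| / ||clean|| for two hidden-state vectors."""
    diff = math.sqrt(sum((p - c) ** 2 for c, p in zip(clean, perturbed)))
    return diff / math.sqrt(sum(c * c for c in clean))


def cosine(a, b):
    """Cosine similarity, used to check directional realignment."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drift_ratios(clean_states, perturbed_states):
    """Ratio of relative drift between consecutive layers (hypothetical proxy)."""
    drifts = [
        relative_l2_drift(c, p) for c, p in zip(clean_states, perturbed_states)
    ]
    return [drifts[i + 1] / drifts[i] for i in range(len(drifts) - 1)]
```

Here `clean_states` and `perturbed_states` would hold one hidden-state vector per layer, recorded from an unperturbed forward pass and from a pass with a perturbation injected at an early layer.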

2606.15160 2026-06-16 cs.CV cs.LG New submission

DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

David Huang, Lianlei Shan

Affiliations: University of Toronto; Tsinghua University

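AI summary: Proposes the DLWM framework, which combines latent-space reasoning with reinforcement learning and uses diverse latent hypotheses and a resource-aware policy to make multimodal reasoning more efficient, improving accuracy by 2-5 points and reducing memory usage by 24%.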

Comments: Preprint. 9 pages main text, 15 pages total including appendix, 2 figures

Abstract

Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance multi-step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real-world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent-space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality-based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource-constrained sequential decision problem and introduce a resource-aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2-5 points in accuracy while reducing memory usage by 24%.
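
One common form of an orthogonality-based diversity regularizer penalizes the squared cosine similarity between every pair of hypothesis vectors. The sketch below shows that generic form. Whether DLWM uses exactly this penalty is not stated in the abstract, and `diversity_penalty` is an illustrative name.

```python
import math


def diversity_penalty(hypotheses):
    """Sum of squared cosine similarities over all pairs of hypothesis vectors.

    The penalty is 0 when the hypotheses are mutually orthogonal and grows
    as they collapse toward the same direction.
    """
    units = []
    for h in hypotheses:
        norm = math.sqrt(sum(x * x for x in h))
        units.append([x / norm for x in h])
    penalty = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            dot = sum(a * b for a, b in zip(units[i], units[j]))
            penalty += dot * dot
    return penalty
```

Added to the training loss with a small weight, a term like this pushes the latent hypotheses apart so that each can encode a different interpretation of the same visual input.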

2606.15158 2026-06-16 cs.CV New submission

RefGC-SR$^2$: Reference-guided Generated Content Super-Resolution and Refinement

Jeahun Sung, Dahyeon Kye, Soo Ye Kim, Jihyong Oh

Affiliations: CMLab, Chung-Ang University; Adobe Research

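AI summary: To address the loss of detail and the artifacts caused by downsampling reference images in generation pipelines, the paper introduces the RefGC-SR$^2$ task, which reuses the original high-resolution reference image to recover details, refine artifacts, and raise resolution simultaneously. It builds the first real-world triplet data generation pipeline and designs a frequency-aware diffusion transformer model.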

Comments: The first two authors contributed equally to this work. The last two authors are co-corresponding authors. Please visit our project page at https://cmlab-korea.github.io/RefGC-SR2/

Abstract

Reference-guided generation (e.g., object compositing, customization) has progressed rapidly, yet current pipelines share a fundamental limitation: the object-centric high-resolution reference image (HRRI) provided by users is downsampled to a fixed low-resolution (LR) before being fed into the model, so the fine-grained details are discarded before the output is even produced. In addition, the generation step then introduces its own artifacts (e.g., identity distortion) on top of this loss. Existing reference-guided generated content refinement (RefGCR) methods can correct some of these artifacts but still operate in the LR domain; reference-guided super-resolution (RefSR) methods recover resolution but assume natural-image degradations and ignore the artifact distribution of generative pipelines. To address both gaps in a single formulation, we introduce a new task: reference-guided generated content super-resolution-refinement (RefGC-SR$^2$), where the original HRRI is reused at the post-processing stage to recover lost details, refine generative artifacts, and upscale the output simultaneously. We construct the first real-world triplet data generation pipeline for this RefGC-SR$^2$ task, training a diptych-conditioned generator to synthesize paired low-quality anchors that public pretrained models cannot provide. We further present a frequency-aware diffusion transformer model for RefGC-SR$^2$ that selectively injects fine details from the HRRI while removing generative artifacts. Extensive experiments demonstrate that our RefGC-SR$^2$ model successfully (i) refines the object identity faithfully with respect to the reference, and (ii) recovers high-resolution details, so that the final result is significantly higher quality and practically more usable compared to existing RefGCR and RefSR baselines.

2606.15157 2026-06-16 cs.LG cs.AI New submission

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

Chao Fei, Panos Kalnis

Affiliations: King Abdullah University of Science and Technology

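AI summary: For KV cache compression in long-context large language model inference, the paper proposes PolyKV, a framework that uses layer-level signals to choose a suitable compression policy for each layer and to allocate non-uniform cache budgets. Experiments show that it recovers a substantial share of the performance gap under a fixed budget.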

Abstract

KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transformer layers. This uniform design ignores the fact that different layers can play different roles during prefill and decoding, and may therefore require different eviction strategies and cache capacities. We present PolyKV, a layer-wise KV cache optimization framework that treats method selection and budget allocation as a joint design space. PolyKV routes each layer to a suitable KV compression policy based on layer-level signals, while assigning non-uniform budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods. Experiments on LLaMA-3.1-8B and Qwen3-8B show that, under the same 512-token average KV budget, PolyKV recovers 54.5% and 25.7% of the LongBench performance gap between the strongest single-policy baseline and FullKV, respectively. Across a 128-1024 budget sweep, PolyKV consistently improves over the strongest baseline by 1.7%-6.4%, corresponding to 40.0%-54.5% recovery of the FullKV gap.
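
Assigning non-uniform per-layer budgets under a fixed total can be sketched as a proportional split with largest-remainder rounding. The per-layer `scores` below stand in for whatever layer-level signals PolyKV actually uses, which the abstract does not specify, so both the scores and the function name `allocate_budgets` are hypothetical.

```python
def allocate_budgets(scores, avg_budget):
    """Split a fixed total KV budget across layers in proportion to positive scores.

    The total is avg_budget * number_of_layers, so the average per-layer
    budget is unchanged. Largest-remainder rounding keeps the integer
    budgets summing exactly to that total.
    """
    total = avg_budget * len(scores)
    norm = sum(scores)
    raw = [total * s / norm for s in scores]
    budgets = [int(r) for r in raw]  # floor, since all values are positive
    leftover = total - sum(budgets)
    # Hand the remaining tokens to the layers with the largest fractional parts.
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets
```

With three layers scored 1, 1, and 2 and a 512-token average budget, the third layer receives twice the cache of the other two while the total stays at 1,536 tokens.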

2606.15155 2026-06-16 cs.LG New submission

Semantic Reasoning in Medicine: The Role of Knowledge Graphs Across Five Key Domains

Haniye Sherafatmandjoo, Mohammad Akbari, Zahed Rahmati

Affiliations: Amirkabir University of Technology

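AI summary: Surveys the applications of knowledge graphs in medicine, covering clinical decision support, disease prediction, health recommendation, precision medicine, and medical question answering, and discusses construction methods, challenges, and future directions.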

Abstract

Knowledge graphs (KGs) have emerged as a promising solution for integrating and reasoning over complex biomedical and clinical data in healthcare. By representing structured relationships among entities such as diseases, drugs, symptoms, and patient records, KGs provide a semantic backbone for decision-making, prediction, recommendation, and personalized care. Recent advances have demonstrated their utility across diverse medical applications--including clinical decision support systems, disease and treatment outcome prediction, health recommender systems, precision medicine, and medical question answering--where KGs often enhance interpretability, semantic coherence, and patient-specific reasoning. In parallel, a growing body of work focuses on medical KG generation itself, proposing frameworks that construct graphs from EHRs, clinical narratives, biomedical literature, and web resources using ontologies, semantic web technologies, deep-learning-based information extraction, and hybrid neuro-symbolic pipelines. Despite this progress, significant challenges remain, including limited and fragmented knowledge coverage, difficulties in aligning heterogeneous data sources, the fragility of current reasoning and representation-learning methods on dense multi-relational graphs, and unresolved issues related to privacy, bias, and accountability. This survey reviews and categorizes current research on KGs in medicine along both application-oriented and methodology-oriented dimensions, discusses their benefits and technical foundations, and outlines key limitations and open research directions. By analyzing trends, architectures, and evaluation practices, this work aims to guide future developments in KG-driven medical AI systems and support their safe and effective integration into healthcare environments.

2606.15154 2026-06-16 cs.RO New submission

Task-Aware Environment Augmentation for Reliable Navigation via Shielded Conditional Diffusion

Bharawee Phoompho, Gokul Puthumanaillam, Yan Miao, Ruben Hernandez, Tim Bretl, Sayan Mitra, Melkior Ornik

Affiliations: University of Illinois Urbana-Champaign

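AI summary: To make trajectory execution reliable in partially observable environments, the paper proposes SCoDA, a task-aware environment augmentation method that uses a conditional diffusion model to learn high-performing visual marker layouts and guides marker placement through shielded sampling, improving trajectory execution reliability and completion time.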

Abstract

Reliable trajectory planning under partial observability depends not only on computing a feasible geometric path, but also on whether the robot receives informative observations while executing that trajectory. Existing approaches usually keep the environment fixed and adapt the robot through belief-space planning, active localization, or added sensing, often incurring costly uncertainty propagation and brittle behavior in observation-poor regions. We flip this perspective and address the largely open problem of task-aware environment augmentation: given a mapped environment, a planned task trajectory, and a small budget of visual fiducial markers, where should the environment be augmented so that the planned trajectory can be executed reliably under uncertainty? Our key observation is that useful marker layouts are defined by the localization support they provide along the task trajectory: a small number of well-timed observations can be sufficient to prevent uncertainty from accumulating in regions where state-estimation error would otherwise compromise control. Building on this observation, we present SCoDA, Shielded Conditional Diffusion for Environment Augmentation. SCoDA learns a conditional distribution over high-performing fiducial layouts from data, using the environment, planned trajectory, disturbance context, and desired execution profile as conditioning. Its shielded sampler reasons over where along the planned execution pose corrections should occur, and steers this distribution toward task-relevant, finite-budget augmentations. Across simulated benchmarks and hardware deployments, we show that SCoDA improves trajectory execution reliability and completion time over strong baselines. Code, models, and dataset are available at: https://scoda-diffusion.github.io/

2606.15152 2026-06-16 cs.CL New submission

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

Shijun Wan, Xuehai Wu, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

Affiliations: Fudan University; Shanghai Innovation Institute; The Chinese University of Hong Kong

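AI summary: Introduces the AgentViSS benchmark, which evaluates the visual social intelligence of multimodal large language models with 240 scenarios and four role-level tasks, and finds that local role enactment is near saturation while interaction regulation remains difficult.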

Abstract

Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce AgentViSS, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision conditions reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.

2606.15151 2026-06-16 cs.CV cs.LG 新提交

HiRo: A Compact Four-Directional Hierarchical Reservoir Token-Mixer for Efficient Image Classification

HiRo:一种用于高效图像分类的紧凑型四方向分层储层令牌混合器

Md Farhadul Islam, Ishan Thakkar, J. Todd Hastings

发表机构 * University of Kentucky(肯塔基大学)

AI总结 提出HiRo模型,通过四方向扫描和两级切片混合储层模块实现局部与跨窗口令牌混合,在MNIST、CIFAR-10/100上以不足1M参数达到高精度。

Comments Accepted at ICONS 2026

详情
AI中文摘要

最近的图像分类模型必须在局部特征建模、跨窗口交互和参数效率之间取得平衡。许多高性能架构依赖于完全可训练的令牌混合器,这改善了表示学习但增加了参数数量、优化复杂性和计算成本。我们提出了一种参数高效的图像分类模型HiRo,它将移位窗口分区与多方向分层储层计算相结合。图像被划分为非重叠块(视为令牌),线性投影、归一化,并添加二维正弦位置编码,然后在局部窗口内处理。在每个窗口内,令牌沿四个方向扫描,并通过两级切片混合储层模块。在第一阶段,方向序列被分割成连续的切片,每个切片由具有可训练闭环读出的固定储层处理。得到的切片输出使用开始、结束和均值表示进行汇总,然后由每个方向的第二阶段固定储层混合。混合后的切片表示被扩展回令牌级别并与第一阶段输出融合,之后四个方向的输出重新对齐并平均。连续块在常规窗口和移位窗口之间交替以实现跨窗口交互,随后是层归一化、残差前馈网络和用于分类的全局池化。该设计将常规和移位窗口分区与分层多方向储层相结合,构建了一个高效的局部到跨窗口令牌混合框架用于图像分类。尽管使用的可训练参数少于1M,且内存和时间显著低于基于Transformer的基线,HiRo在MNIST、CIFAR-10和CIFAR-100上分别达到了99.46%、85.57%和59.10%的准确率。

英文摘要

Recent image classification models must balance local feature modeling, cross-window interaction, and parameter efficiency. Many high-performing architectures rely on fully trainable token-mixers, which improve representation learning but increase parameter count, optimization complexity, and computational cost. We propose a parameter-efficient image classification model called HiRo that integrates shifted-window partitioning with multi-directional hierarchical reservoir computing. Images are divided into non-overlapping patches (treated as tokens), linearly projected, normalized, and enriched with 2D sinusoidal positional encodings, then processed within local windows. Inside each window, tokens are scanned in four directions and passed through a two-stage slice-and-mix reservoir module. In the first stage, directional sequences are split into contiguous slices, each processed by its own fixed reservoir with a trainable closed-loop readout. The resulting slice outputs are summarized using the start, end, and mean representations, and then mixed by a second-stage fixed reservoir for each direction. The mixed slice representations are expanded back to the token level and fused with the first-stage outputs, after which the four directional outputs are realigned and averaged. Consecutive blocks alternate between regular and shifted windows to enable cross-window interaction, followed by layer normalization, a residual feed-forward network, and global pooling for classification. This design combines regular and shifted window partitioning with hierarchical multi-directional reservoirs to form an efficient local-to-cross-window token-mixing framework for image classification. While using under 1M trainable parameters and significantly less memory and time than transformer-style baselines, HiRo achieves 99.46%, 85.57%, and 59.10% accuracy on MNIST, CIFAR-10, and CIFAR-100, respectively.
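The scan-and-realign step in the HiRo abstract can be illustrated with a small sketch. The abstract does not say which four directions are used, so the choice here (row-major, reversed row-major, column-major, reversed column-major) is an assumption, and `identity_mixer` is a hypothetical placeholder for the slice-and-mix reservoir module rather than the authors' implementation.

```python
def scan_orders(h, w):
    """Return four token orderings over an h-by-w window (assumed directions)."""
    row_major = [r * w + c for r in range(h) for c in range(w)]
    col_major = [r * w + c for c in range(w) for r in range(h)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def realign_and_average(outputs, orders):
    """Map each directional output back to token positions, then average."""
    merged = [0.0] * len(orders[0])
    for out, order in zip(outputs, orders):
        for step, token_idx in enumerate(order):
            merged[token_idx] += out[step] / len(orders)
    return merged


def identity_mixer(seq):
    # Hypothetical stand-in for the reservoir module: returns the sequence unchanged.
    return list(seq)


tokens = [float(i) for i in range(6)]  # a 2x3 window with scalar "tokens"
orders = scan_orders(2, 3)
outputs = [identity_mixer([tokens[i] for i in order]) for order in orders]
merged = realign_and_average(outputs, orders)
```

With the identity placeholder, realigning and averaging recovers the original tokens, which is a quick check that the index bookkeeping is right before a real sequence mixer is plugged in.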

2606.15149 2026-06-16 cs.SD eess.AS 新提交

AUDEDIT: Inversion-Free Text-Guided Editing with Pretrained Audio Flow Models

AUDEDIT: 基于预训练音频流模型的无反演文本引导编辑

Zhongyuan Fu

发表机构 * College of Computer Science, Nankai University(南开大学计算机科学学院)

AI总结 提出AUDEdit,一种无需反演的方法,利用预训练整流流音频生成器实现真实音频的文本引导编辑,通过直接源到目标常微分方程改善文本对齐与音频保真度。

详情
AI中文摘要

我们提出AudEdit,一种无需反演的方法,用于对真实音频进行文本引导编辑,使用预训练的整流流音频生成器。文本到音频系统(如Stable Audio 3)已经通过向输入录音添加噪声并在新提示下进行去噪来实现音频到音频编辑,但这种反演式路线必须在提示遵循与节奏、瞬态、音色和长程音乐结构保持之间进行权衡。受计算机视觉中最近的无反演流编辑启发,我们为一维Stable Audio 3潜变量开发了一种音频特定的直接源到目标常微分方程:在每个流步骤中,我们在共享的随机源边际下比较目标和源条件速度场,并通过它们的差异更新编辑后的潜变量。由此产生的编辑器无需训练、无需成对编辑数据、无需优化,也无需访问内部注意力图。在基于FSD50K和Song Describer Dataset构建的音效和音乐编辑集上,AudEdit在CLAP文本对齐和音频保持方面优于SDEdit、ODE反演和FireFlow;例如,在音效上,它将目标文本CLAP相似度从0.42提高到0.52(相对于最强基线),同时将FAD从65.70降低到50.37。

英文摘要

We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new prompt, but this inversion-style route must trade prompt adherence against preservation of rhythm, transients, timbre, and long-range musical structure. Motivated by recent inversion-free flow editing in computer vision, we develop an audio-specific direct source-to-target ordinary differential equation for one-dimensional Stable Audio 3 latents: at each flow step, we compare the target- and source-conditioned velocity fields under a shared stochastic source marginal, and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. Across sound-effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target-text CLAP similarity from 0.42 to 0.52 over the strongest baseline while reducing FAD from 65.70 to 50.37.
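The update described in the abstract (noise the source under a shared marginal, compare target- and source-conditioned velocities, move the edited latent by their difference) can be sketched on a toy problem. This is not the AudEdit code: the time convention (t = 1 is noise, t = 0 is data), the Euler discretization, and `toy_velocity` are all assumptions. The toy velocity is the exact rectified-flow field for a data distribution that is a point mass at `center`, which stands in for a text-conditioned network.

```python
import random


def toy_velocity(z, t, center):
    # Hypothetical stand-in for a pretrained velocity network: exact field
    # for x_t = (1 - t) * center + t * noise, i.e. v = (z - center) / t.
    return [(zi - center) / t for zi in z]


def edit_latent(x_src, src_center, tgt_center, steps=20, seed=0):
    rng = random.Random(seed)
    z_edit = list(x_src)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt  # runs from 1.0 down to dt, never reaching 0
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        # Noisy source sample drawn from the forward marginal at time t
        z_src_t = [(1.0 - t) * x + t * n for x, n in zip(x_src, noise)]
        # Target-side sample: same noise, shifted by the edit accumulated so far
        z_tgt_t = [zs + (ze - x) for zs, ze, x in zip(z_src_t, z_edit, x_src)]
        v_src = toy_velocity(z_src_t, t, src_center)
        v_tgt = toy_velocity(z_tgt_t, t, tgt_center)
        # Euler step toward t = 0 on the velocity difference only
        z_edit = [ze - dt * (vt - vs) for ze, vt, vs in zip(z_edit, v_tgt, v_src)]
    return z_edit


x = [0.5, -1.0, 2.0]
unchanged = edit_latent(x, 0.0, 0.0)  # identical conditions: no edit
shifted = edit_latent(x, 0.0, 1.0)    # target center moved by +1
```

Because only the velocity difference is integrated, identical source and target conditions leave the input untouched, which is the preservation property that noising-and-denoising routes have to trade away.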

2606.15146 2026-06-16 cs.LG 新提交

Contextual Bandits for Maximizing Stimulated Word-of-Mouth Rewards

最大化激励口碑奖励的上下文赌博机

Ahmed Sayeed Faruk, Elena Zheleva

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出上下文多臂赌博机框架,通过学习个体溢出概率并排序连接用户,以最大化激励口碑奖励,实验证明考虑溢出异质性可提升目标定位精度。

Comments Presented at the AAAI 2025 Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL)

详情
AI中文摘要

激励口碑是一种通过提示或激励促进信息分享的策略。通过社交网络优化激励口碑需要识别并定位最易受溢出效应影响的连接用户,即推荐的影响力超出直接受众并波及与其相连的用户。溢出概率因个体及其连接而异,导致异质性。理解并准确估计社交网络中用户间的溢出概率对于提高激励口碑的效果至关重要。为此,我们提出了一种新颖的上下文多臂赌博机框架,该框架学习个体溢出概率并对连接用户进行排序,以最大化激励口碑的奖励。在真实网络数据集上的实验表明,考虑溢出异质性可提升对 top-$k$ 连接用户的目标定位精度,从而增加奖励,并优于不学习个体溢出效应的基线方法。

英文摘要

Stimulated word-of-mouth is a strategy that promotes information sharing through prompts or incentives. Optimizing stimulated word-of-mouth through social networks requires identifying and targeting connected users who are most susceptible to spillover, a phenomenon where the influence of recommendations extends beyond the immediate audience to impact their connected users. The probability of spillover varies across individuals and their connections, leading to heterogeneity. Understanding and accurately estimating the spillover probabilities among users in social networks is crucial for improving the effectiveness of stimulated word-of-mouth. To address this, we present a novel contextual multi-armed bandit framework that learns individual spillover probabilities and ranks connected users to maximize rewards from stimulated word-of-mouth. Experiments on real-world network datasets demonstrate that accounting for spillover heterogeneity enhances the targeting precision of top-$k$ connected users, boosting rewards and outperforming baseline methods that do not learn individual spillover effects.

2606.15144 2026-06-16 cs.CL cs.AI 新提交

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

PACUTE: 面向菲律宾语的音韵、词缀和字符级词元理解

Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa

发表机构 * AI Singapore(AI新加坡) Nanyang Technological University(南洋理工大学) UK AI Security Institute(英国人工智能安全研究所) Ateneo de Manila University(马尼拉雅典耀大学) University of Birmingham(伯明翰大学)

AI总结 提出PACUTE基准,包含4600个任务,通过六层诊断框架评估大语言模型在菲律宾语中的形态理解,发现开放权重模型在语素分解上接近随机,前沿模型在组合任务上远低于字符级上限。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

大型语言模型(LLMs)将文本处理为子词词元序列,这掩盖了构成词形成的字符级和形态结构。对于具有非连接形态的语言,这种限制最为严重,标准分词器系统性地使词元边界与语素边界错位。我们引入PACUTE,一个包含4600个任务的诊断基准,旨在评估菲律宾语中的形态理解,菲律宾语以能产的中缀、重叠和变音符号驱动的词汇区分(通常不在书面文本中出现)为特征。PACUTE包括一个六层组合诊断框架,用于定位形态理解在何处崩溃。评估开放权重LLMs和前沿商业模型,我们发现开放权重模型在语素分解上无论规模大小都接近随机。前沿模型表现更好,通常在包含匹配评分下能恢复单个词缀,但在语素变换和音节划分的组合任务上仍远低于其字符级上限。这些结果表明,能产的形态组合(而非仅字符访问)是菲律宾语词汇结构理解的持续瓶颈。

英文摘要

Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

2606.15142 2026-06-16 cs.CV cs.RO 新提交

MotionVLA: Vision-Language-Action Model for Humanoid Motion

MotionVLA:面向人形运动的视觉-语言-动作模型

Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) AI 2 Robotics

AI总结 针对人形运动生成中低频姿态与高频物理信号量化不匹配的问题,提出双流频率分词器DSFT和基于Qwen3.5的MotionVLA模型,在HumanML3D和MBench上显著提升多样性一致性和运动条件一致性。

详情
AI中文摘要

从场景图像和文本生成逼真的人形运动涉及低频姿态语义和高频物理动力学。然而,许多现有方法使用单个共享码本对运动进行分词,将异质运动信号强制映射到相同的量化空间。我们对人体运动数据的频域分析揭示了单码本量化与运动统计之间的明显不匹配:五个DCT系数捕获了93%的关节位置能量,但仅捕获了37%的关节速度能量,这可能导致量化偏向姿态统计,而低估高频速度分量。第二个挑战在于使标准自回归模型有效建模运动序列中的高频物理信号。因此,我们提出了DSFT,一种双流频率分词器,将运动分离为基础流和物理流,并使用DCT截断和BPE独立压缩它们。此外,我们提出了MotionVLA,一个基于Qwen3.5的模型,将基础令牌和物理令牌排列在统一序列中,其中物理令牌在基础令牌之后预测。在HumanML3D和MBench上的实验表明,尽管使用轻量级2B骨干网络,MotionVLA在HumanML3D上将与真实数据的多样性差距减少了50%以上,并在MBench上将运动条件一致性提高了3.8%,支持频率感知的双流解耦作为自回归运动生成的有效公式。代码:https://github.com/AIGeeksGroup/MotionVLA。网站:https://aigeeksgroup.github.io/MotionVLA。

英文摘要

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges base and physical tokens in a unified sequence, where physical tokens are predicted after base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
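The position-versus-velocity energy gap reported above can be reproduced qualitatively on synthetic data. The sketch below is not the authors' analysis: the signal is made up, velocity is a plain finite difference, and the orthonormal DCT-II is written out by hand to avoid dependencies. It only shows why differentiation pushes energy out of the first few DCT coefficients.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a 1D sequence (O(N^2), fine for short signals)."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)))
    return coeffs


def low_freq_energy_fraction(x, keep=5):
    """Share of signal energy held by the first `keep` DCT coefficients."""
    coeffs = dct_ortho(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total


n = 64
# Synthetic "joint position": one slow component plus a small fast component
position = [
    math.cos(math.pi * (i + 0.5) * 1 / n)
    + 0.1 * math.cos(math.pi * (i + 0.5) * 20 / n)
    for i in range(n)
]
# Finite-difference "joint velocity"; differencing amplifies the fast component
velocity = [b - a for a, b in zip(position, position[1:])]

pos_frac = low_freq_energy_fraction(position)
vel_frac = low_freq_energy_fraction(velocity)
```

On this toy signal, almost all position energy sits in the first five coefficients while most velocity energy does not, which is the kind of mismatch that motivates separate streams for pose and dynamics.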

2606.15134 2026-06-16 cs.CV cs.AI cs.LG 新提交

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

超越标量距离:来自冻结MLLM的语义属性梯度用于视觉嵌入

Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出SAGA框架,利用冻结的多模态大语言模型(MLLM)通过GRPO奖励机制为视觉编码器提供属性级监督,替代传统标量距离,提升零样本图像检索性能。

详情
AI中文摘要

用于检索的视觉编码器通常通过类标签监督进行训练：每个训练对简化为一个标量，均匀地将嵌入推远或拉近，就好像每个视觉属性要么不同要么匹配。一个多模态大语言模型（MLLM），在展示相同的一对图像时，能够阐述这些属性并利用它们预测图像是否共享一个类别。我们提出SAGA，一个框架，将这种基于语言、属性感知的感知转化为编码器本身的训练信号。具体来说，我们使用组相对策略优化（GRPO）来奖励MLLM对视觉编码器令牌的正确预测。由于正确的预测要求这些令牌暴露该对之间不同或匹配的具体属性，梯度推动编码器编码这些属性，用属性解析的监督取代统一的成对标量。一个辅助的注意力蒸馏损失将编码器的嵌入锚定到MLLM关注的令牌上，一个标准的度量学习损失塑造嵌入几何结构以进行最近邻检索。MLLM在整个过程中被冻结，在推理时被丢弃，与度量学习基线的部署成本相匹配。在CUB-200-2011、Cars-196、FGVC-Aircraft和iNaturalist Aves上的零样本图像检索中，SAGA在Recall@1上比最先进的基线提高了3到6个百分点。

英文摘要

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the pair's embeddings apart or pulls them together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. On zero-shot image retrieval, SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves.
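For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses in its group. The sketch below shows only that normalization. The binary rewards (1 for a correct same-class or different-class prediction) are an assumption, and the abstract does not specify how SAGA routes the resulting gradient to the encoder, so nothing here should be read as the authors' training loop.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps guards against division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled predictions on one image pair
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct predictions receive positive advantages and incorrect ones negative, so the signal depends on how a response compares with its group rather than on the raw reward scale.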

2606.15133 2026-06-16 cs.RO cs.CV 新提交

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

DragMesh-2: 与铰接物体的物理合理灵巧手-物体交互

Tianshan Zhang, Yijia Duan, Yanjun Li, Zeyu Zhang, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院)

AI总结 提出DragMesh-2框架,通过接触驱动的灵巧手-铰接物体交互,结合物理信息感知训练机制PICA,在无触觉反馈下提升变接触负载的鲁棒性。

Comments Code: https://github.com/AIGeeksGroup/DragMesh-2. Website: https://aigeeksgroup.github.io/DragMesh-2

详情
AI中文摘要

与铰接物体的灵巧交互对于家庭、辅助和人形操作至关重要,其中多指手可以提供超越平行爪抓取的顺应接触模式。然而,铰接物体操作不同于静态物体操作:目标部件无法直接驱动,其运动必须通过持续的物理手-手柄接触来实现。这使得从以物体为中心的铰接生成到手驱动的灵巧手-物体交互的转变变得非平凡,因为几何轨迹重放或开环执行无法模拟移动铰接部件所需的接触动力学。此外,仅在固定动力学下为任务完成训练的策略可能会过拟合标称接触负载,尤其是在没有触觉或力反馈的情况下,并且当接触负载变化时性能可能会下降。为了应对这些挑战,我们提出了DragMesh-2,一个用于与铰接物体灵巧交互的接触驱动框架,它将铰接交互从以物体为中心的生成扩展到手驱动的灵巧手-物体交互,其中铰接运动必须通过物理接触产生。我们进一步提出了PICA,一种物理信息感知的训练机制,它在没有触觉或力反馈的情况下将物理信号注入策略学习,提高了在变化接触负载下的鲁棒性和任务成功率。最后,我们在多个阻尼条件和铰接物体类别上进行了系统评估,以研究接触负载变化下的鲁棒性,并提供了一个纯几何的灵巧交互资源,以支持未来的移动操作和人形手-物体交互研究。在七个GAPartNet物体上,DragMesh-2在接触负载变化下比对比方法实现了更强的鲁棒性,同时在各种阻尼条件下保持了高任务成功率。

英文摘要

Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand-handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand-object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand-object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand-object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.

2606.15129 2026-06-16 cs.CV cs.AI 新提交

EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP-OCT Pretraining

EyeMVP: 通过配对CFP-OCT预训练实现OCT启发的眼底表征学习

Zhuo Deng, Ruiheng Zhang, Ziheng Zhang, Weihao Gao, Yitong Li, Qian Wang, Lei Shao, Jiaoyue Dong, Zhixi Zeng, Lijian Fang, Haibo Wang, Xiaobin Lin, Tao Liu, Zhicheng Du, Zhengwei Zhang, Lin Yang, Zheng Gong, Xinyu Zhao, Zhenquan Wu, Fang Li, Zhiguang Zhou, Guoming Zhang, Sun Jing, Han Lv, Wenbin We, Lan Ma

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University(首都医科大学附属北京同仁医院北京同仁眼科中心) Liangxiang Hospital of Beijing Fangshan District, Capital Medical University(首都医科大学北京市房山区良乡医院) The Third People's Hospital of Dalian(大连市第三人民医院) National Clinical Research Center for Endocrine and Metabolic Diseases, The Second Xiangya Hospital of Central South University(中南大学湘雅二医院国家内分泌代谢病临床医学研究中心) The Central Hospital of Baoji City(宝鸡市中心医院) Wuxi No.2 People's Hospital, Affiliated Wuxi Clinical College of Nantong University(南通大学附属无锡临床学院无锡市第二人民医院) Shenzhen Eye Hospital, Southern Medical University(南方医科大学深圳眼科医院) Beijing Friendship Hospital, Capital Medical University(首都医科大学附属北京友谊医院)

AI总结 提出跨模态视网膜基础模型EyeMVP,利用配对CFP-OCT预训练,通过跨模态掩码重建将OCT结构信息注入CFP表征,在16项下游任务中优于现有模型,尤其对黄斑疾病诊断有显著提升。

详情
AI中文摘要

彩色眼底摄影(CFP)是大规模视网膜筛查的主要手段,但其诊断能力受限于缺乏深度分辨的结构信息。光学相干断层扫描(OCT)提供横截面视网膜解剖结构,但在人群筛查中可及性较低。本文提出EyeMVP,一种跨模态视网膜基础模型,通过配对CFP-OCT预训练学习OCT启发的CFP表征。EyeMVP在来自中国八家医院112,642名患者的674,893个严格同眼同天配对CFP-OCT图像三元组上预训练。该模型使用跨模态掩码重建,以OCT相关监督丰富CFP表征,同时在推理时仅需CFP图像。为适应正面CFP与横截面OCT之间的非对齐成像几何,EyeMVP将源约束交叉注意力与CFP导出的结构掩码相结合。在16项下游任务中,包括分类、分割、少样本适应和跨模态检索,EyeMVP优于代表性视网膜基础模型,并在涉及黄斑和视神经结构的任务上表现出一致提升。对于CFP具有挑战性的黄斑疾病,EyeMVP在黄斑水肿上达到0.948的AUROC(对比EyeCLIP的0.852),在近视性黄斑劈裂上达到0.825。在一项探索性读者研究中,EyeMVP在黄斑水肿上超过初级和中级眼科医生组,但未达到高级眼科医生水平,而在近视性黄斑劈裂上,其平衡准确性数值上高于所有读者组。这些结果表明,像素级跨模态重建可以用OCT相关监督丰富CFP表征,为筛查环境中基于CFP的更强视网膜分析提供了一条实用途径。

英文摘要

Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP-OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP-OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs. 0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.

2606.15127 2026-06-16 cs.LG 新提交

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

超越准确率:在负责任AI评估中衡量思维链推理中的偏见承认

Xian Sun, Wei Gao, Yingshuo Wang, Lingdong Kong, Yanhang Li, Zhichao Fan, Zexin Zhuang, Wenlong Dong, Zhiyuan Zheng, Hrishikesh Paranjape, Abhishek Mandal, Johnny R. Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对仅用准确率评估忽略推理链中偏见承认的问题,提出包含易感性(susceptibility)和承认(acknowledgment)两个维度的诊断方法,实验发现不同模型在准确率相近时承认率差异显著。

Comments ICML 2026 Workshop on Trustworthy AI for Good

详情
AI中文摘要

推理模型越来越多地用于最终答案并非唯一审查对象的场景：教育工具可能向学生展示中间步骤，决策支持系统可能需要人工监督，审计工作流可能检查痕迹是否存在误导性或偏见输入。在这些场景中，两个响应可能获得相同的最终答案分数，但在痕迹是否明确标记注入的偏见内容方面存在差异。仅用准确率评估会忽略这些情况。我们将这一差距视为负责任评估中的测量盲点，并引入一个最小痕迹级诊断，包含两个维度：易感性（偏见是否破坏先前正确的答案）和承认（痕迹是否包含由规则定义的提及注入内容的表面引用）。在数千个有偏见的GSM8K试验中，GPT-4o和Claude Sonnet 4的易感性率相似(1.3% vs. 1.2%)，但在相同规则下承认率差异显著(13.0% vs. 75.0%)。

英文摘要

Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit workflows may inspect traces for misleading or biased input. In such settings, two responses can receive the same final-answer score while differing in whether the trace explicitly flags injected biasing content. Accuracy-only evaluation collapses these cases. We study this gap as a measurement blind spot for responsible evaluation and introduce a minimal trace-level diagnostic with two axes: susceptibility (whether the bias breaks a previously correct answer) and acknowledgment (whether the trace contains a rubric-defined surface reference to the injected content). Across thousands of biased GSM8K trials, GPT-4o and Claude Sonnet 4 have similar susceptibility rates (1.3% vs. 1.2%) but substantially different acknowledgment rates (13.0% vs. 75.0%) under the same rubric.
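The two axes can be computed from per-trial records along the following lines. This is a minimal sketch under stated assumptions, not the paper's rubric: the `Trial` fields, the phrase list in `ACK_PHRASES`, and the choice of denominators (susceptibility over trials that were correct without the bias, acknowledgment over all biased trials) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    correct_without_bias: bool  # answer on the clean prompt
    correct_with_bias: bool     # answer after the biasing content is injected
    trace: str                  # reasoning trace from the biased run


# Hypothetical rubric: surface phrases that count as referencing the injection
ACK_PHRASES = ("the hint", "suggested answer")


def susceptibility_rate(trials):
    """Fraction of initially correct trials that the bias turns incorrect."""
    eligible = [t for t in trials if t.correct_without_bias]
    broken = [t for t in eligible if not t.correct_with_bias]
    return len(broken) / len(eligible) if eligible else 0.0


def acknowledgment_rate(trials):
    """Fraction of biased traces containing a rubric phrase."""
    flagged = [t for t in trials
               if any(p in t.trace.lower() for p in ACK_PHRASES)]
    return len(flagged) / len(trials) if trials else 0.0


trials = [
    Trial(True, True, "The hint says 12, but 3 * 5 = 15."),
    Trial(True, False, "So the total is 12."),
    Trial(True, True, "3 * 5 = 15."),
    Trial(False, False, "Ignoring the suggested answer, I get 14."),
]
susceptibility = susceptibility_rate(trials)
acknowledgment = acknowledgment_rate(trials)
```

Two models with the same susceptibility can differ widely on the second number, which is the blind spot the abstract attributes to accuracy-only evaluation.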

2606.15118 2026-06-16 cs.CV 新提交

Multi-view feature High-order Fusion for Space Weak Object Detection and Segmentation

多视角特征高阶融合用于空间弱目标检测与分割

Weilong Guo, Yuhan Sun, Shengyang Li

发表机构 * Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences(中国科学院空间应用工程与技术中心) Key Laboratory of Space Utilization, Chinese Academy of Sciences(中国科学院空间应用重点实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 针对空间弱目标检测与分割,提出多视角特征高阶融合方法(MHF),通过高阶特征感知和递归任务贡献门控选择,有效聚合弱目标的准确丰富特征,作为即插即用模块显著提升多种视觉模型性能。

详情
AI中文摘要

弱目标在空间应用的图像和视频中很常见。然而,从它们有限的外观信息中学习合适的表示是困难的。受多视角学习的启发,我们开发了简单的多视角注意力机制,将其输出视为多视角特征。我们还提出了一种多视角特征高阶融合方法(MHF),以聚合更准确和丰富的弱目标特征。我们的MHF将常用的低阶特征融合方法扩展到高阶。它增强了模型捕获弱目标相关和互补信息的能力。这是通过引入高阶多视角特征感知和递归任务贡献门控选择多视角特征来实现的。新操作高度灵活且可定制,与多视角特征表示的各种变体兼容。我们在两个新构建的空间科学数据集和一个开放的大规模卫星视频数据集上进行了大量实验。我们的MHF作为一个即插即用模块,显著改进了各种基于视觉Transformer和卷积的检测与分割模型。我们在三个数据集上的两个任务上都取得了最先进的精度。我们的MHF可以成为视觉建模的新基础模块,有效地从多视角学习角度表示弱目标。代码将在https://github.com/Kingdroper/MHF 提供。

英文摘要

Weak objects are common in images and videos of space applications. However, it is hard to learn proper representations from their limited appearance information. Inspired by multi-view learning, we develop simple multi-view attention mechanisms and treat their outputs as multi-view features. We also propose a multi-view feature high-order fusion method (MHF) to aggregate more accurate and richer features of weak objects. MHF extends the commonly used low-order feature fusion to higher orders, enhancing the model's capacity to capture relevant and complementary information about weak objects. This is achieved by introducing high-order multi-view feature perception and a recursive, task-contribution-gated selection of multi-view features. The new operation is highly flexible and customizable, and it is compatible with many variants of multi-view feature representations. We conduct extensive experiments on two newly constructed space science datasets and an open, large-scale satellite video dataset. MHF serves as a plug-and-play module and significantly improves various vision-transformer-based and convolution-based detection and segmentation models. We achieve state-of-the-art accuracy on both tasks across all three datasets. MHF can be a new basic module for visual modeling that effectively represents weak objects from a multi-view learning perspective. The code will be available at https://github.com/Kingdroper/MHF.

2606.15112 2026-06-16 cs.CV 新提交

Learn Temporal Consistency For Robust Satellite Video Detector

学习时间一致性以实现鲁棒的卫星视频检测器

Weilong Guo, Shengyang Li, Yanfeng Gu

发表机构 * Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences(中国科学院空间应用工程与技术中心) Key Laboratory of Space Utilization, Chinese Academy of Sciences(中国科学院空间应用重点实验室) University of Chinese Academy of Sciences(中国科学院大学) School of Electronics and Information Engineering, Harbin Institute of Technology(哈尔滨工业大学电子与信息工程学院)

AI总结 提出基于时间一致性学习(TCL)的卫星视频目标检测框架,通过时间特征聚合、结构编码和时间一致性约束模块,实现定向细粒度目标检测,在SAT-MTB数据集上达到47.7% mAP,较基线提升4.8%。

Comments 11 pages, 8 figures

详情
AI中文摘要

卫星视频目标检测(SVOD)对于定向和细粒度目标在卫星应用中扮演重要角色。现有大多数SVOD方法仅关注一个或几个粗粒度类别的移动目标,并用水平边界框表示目标。它们难以提取整个卫星视频中关于目标的完整、准确和一致的信息。在本文中,我们提出了一种基于时间一致性学习(TCL)的卫星视频目标检测框架。TCL通过利用卫星视频中丰富的时间上下文,灵活地检测定向和细粒度目标。该框架集成了三个关键模块:时间和细粒度特征聚合(TFA)、结构编码(SE)和时间一致性约束(TCC)。TFA和TCC模块促进跨帧的一致表示学习,而SE模块编码外观和结构信息以实现精确的细粒度识别。在SAT-MTB基准数据集上的实验结果表明,TCL具有优越的性能,实现了47.7% mAP的定向和细粒度检测精度,较基线提升4.8%。此外,我们的TCL框架易于适应现有的基于图像的检测器,从而提高了检测精度。

英文摘要

Satellite video object detection (SVOD) for oriented and fine-grained objects plays an important role in satellite applications. Most existing SVOD methods focus on only one or a few coarse-grained categories of moving objects and represent objects with horizontal bounding boxes. They have difficulty extracting complete, accurate, and consistent information about objects across an entire satellite video. In this paper, we propose a satellite video object detection framework based on Temporal Consistency Learning (TCL). TCL detects oriented and fine-grained objects by leveraging the rich temporal context within satellite videos. The framework integrates three key modules: temporal and fine-grained feature aggregation (TFA), structure encoding (SE), and temporal consistency constraint (TCC). The TFA and TCC modules facilitate consistent representation learning across frames, while the SE module encodes both appearance and structural information for precise fine-grained recognition. Experimental results on the SAT-MTB benchmark dataset demonstrate TCL's superior performance, achieving a new state-of-the-art oriented and fine-grained detection accuracy of 47.7% mAP, a 4.8% improvement over the baseline. Furthermore, the TCL framework readily accommodates existing image-based detectors and improves their detection accuracy.

2606.15110 2026-06-16 cs.CV 新提交

Physics-Driven Zero-Shot MRI Reconstruction with Non-local Image Priors

基于非局部图像先验的物理驱动零样本MRI重建

Lingtong Zhang, Wenlei Li, Mu He, Li Xiao, Yang Ji

发表机构 * School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院)

AI总结 提出一种物理驱动的零样本自监督学习框架,通过线圈灵敏度图引导的动态存储库、SPIRiT正则化和非局部自相似像素银行,解决欠采样MRI重建中的监督不足和过拟合问题,在FastMRI数据集上达到最优性能。

详情
AI中文摘要

零样本自监督学习(ZS-SSL)已成为加速磁共振成像(MRI)重建的一种有前景的范式,消除了对全采样外部数据集的依赖。然而,仅从单个欠采样扫描中学习存在监督稀缺和优化不稳定的问题,常常导致过拟合或伪影。为了解决这些挑战,我们提出了一种鲁棒的物理驱动ZS-SSL框架,将物理一致性与图像域非局部先验协同结合。我们的方法引入了三项核心创新:(1)线圈灵敏度图(CSM)引导的动态存储库,通过基于线圈灵敏度约束过滤物理不一致的伪影来稳定训练轨迹;(2)基于SPIRiT的正则化,通过学习的相关核和随机掩蔽强制执行k空间自一致性;(3)非局部自相似性(NSS)像素库,利用前两个模块建立的高保真参考显式挖掘非局部解剖相似性,从而增强图像域的监督。在FastMRI数据集上的大量实验表明,我们的方法实现了最先进的性能,特别是在高加速因子下,有效弥合了零样本学习与监督方法之间的差距。代码可在https://github.com/Zolento/NS-SSL获取。

英文摘要

Zero-Shot Self-Supervised Learning (ZS-SSL) has emerged as a promising paradigm for accelerated Magnetic Resonance Imaging (MRI) reconstruction, eliminating the reliance on fully-sampled external datasets. However, learning solely from a single under-sampled scan suffers from supervision scarcity and optimization instability, often leading to overfitting or artifacts. To address these challenges, we propose a robust physics-driven ZS-SSL framework that synergizes physical consistency with image-domain non-local priors. Our method introduces three core innovations: (1) a Coil Sensitivity Map (CSM)-Guided Dynamic Repository, which stabilizes the training trajectory by filtering physically inconsistent artifacts based on coil sensitivity constraints; (2) a SPIRiT-based regularization, which enforces k-space self-consistency via a learned correlation kernel and stochastic masking; (3) a Non-Local Self-Similarity (NSS) Pixel Bank, which leverages the high-fidelity reference established by the former modules to explicitly mine non-local anatomical similarities, thereby augmenting supervision in the image domain. Extensive experiments on the FastMRI dataset demonstrate that our approach achieves state-of-the-art performance, particularly under high acceleration factors, effectively bridging the gap between zero-shot learning and supervised methods. The code is available at https://github.com/Zolento/NS-SSL.

2606.15107 2026-06-16 cs.AI 新提交

Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

迈向可验证的自主数据科学:通过基于工具的推理解决不规则时间序列问答

Sanhorn Chen, Xiaoyang Chen, Boyu Liu, Roy Zhao

发表机构 * University of Illinois Urbana Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对现实世界时间序列数据的不规则性,提出IRTS-ToolBench基准(1700个问题,10种任务类型,13个领域),通过标准化输入和可复现评估协议,研究LLM和AI代理在不规则条件下的表现。

Comments 15 pages

详情
AI中文摘要

实际部署中的时间序列数据绝大多数是不规则的。观测是异步的,缺失值具有信息性而非随机性,采样频率在不同传感器和操作窗口间变化。然而,现有的时间序列问答(TSQA)基准大多假设规则采样的输入,导致在理解大语言模型(LLM)和AI代理在不规则条件下的表现方面存在根本性差距。为弥补这一差距,我们引入了IRTS-ToolBench,一个包含1700个问题、跨越13个领域10种任务类型的基准。IRTS-ToolBench旨在供任何研究基于LLM的不规则时间序列分析的研究人员独立使用,提供标准化输入和可复现的评估协议。代码可在https://github.com/SanhornC/IRTS-ToolBench 获取。

英文摘要

Time series data in real-world deployments is overwhelmingly irregular. Observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows. However, existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, leaving a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions. To bridge this gap, we introduce IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains. IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol. Code can be found at https://github.com/SanhornC/IRTS-ToolBench.

2606.15104 2026-06-16 cs.CV 新提交

Text-Driven Fusion for Infrared and Visible Images: Achieving Image Scene Adaptation on Hyperbolic Space

红外与可见光图像的文本驱动融合:在双曲空间实现图像场景自适应

Huan Kang, Hui Li, Tianyang Xu, Tao Zhou, Xiao-Jun Wu, Josef Kittler

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种文本驱动的红外与可见光图像融合框架,利用双曲流形学习嵌入层次语义,通过BLIP文本提示引导视觉-属性对齐,实现无文本输入的自适应融合,性能优于现有方法。

Comments 14 pages, 8 figures

详情
AI中文摘要

红外与可见光图像融合旨在整合互补模态,而现有的欧几里得方法施加了刚性的距离度量,扭曲了多模态交互和父子语义层次。为了克服这些限制,我们引入了一种由双曲流形学习驱动的文本驱动融合框架。在训练过程中,BLIP提取的文本提示作为双曲空间中的拓扑锚点,通过自然适应不同语义粒度的双曲嵌入引导视觉-属性对齐。通过利用庞加莱球负曲率决定的指数体积增长,该方法无缝嵌入层次树以编码从粗到细的语义而不会出现度量饱和,同时广阔的外围空间防止了跨模态融合期间的纹理失真。在推理时,融合过程利用学习到的文本属性先验自动适应输入内容,完全消除了对文本输入的需求。实验结果表明,我们的方法在基准数据集上优于最先进的方法,代码可在 https://github.com/Shaoyun2023/TEDFusion 获取。

英文摘要

Infrared and visible image fusion aims to integrate complementary modalities, while existing Euclidean methods impose rigid distance metrics that distort multi-modal interactions and parent-to-child semantic hierarchies. To overcome these limitations, we introduce a text-driven fusion framework empowered by hyperbolic manifold learning. During training, BLIP-extracted text prompts serve as topological anchors within the hyperbolic space, guiding vision-attribute alignment through hyperbolic embeddings that naturally accommodate varying semantic granularities. By exploiting the exponential volume growth dictated by the Poincaré ball's negative curvature, this approach seamlessly embeds hierarchical trees to encode coarse-to-fine semantics without metric saturation, while the vast peripheral space prevents texture distortion during cross-modal fusion. At inference, the fusion process autonomously adapts to input content using the learned text-attribute priors, completely eliminating the need for textual input. Experimental results show our method outperforms state-of-the-art approaches on benchmark datasets, with code available at https://github.com/Shaoyun2023/TEDFusion.
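The geometric argument in this abstract rests on how distances behave in the Poincaré ball. The sketch below uses the standard distance formula for the unit ball with curvature -1; it illustrates only the geometry and is not taken from the TEDFusion code, whose curvature and embedding dimensions are not stated in the abstract.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(xi * xi for xi in x)

    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * sq_norm(diff) / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v)))
    return math.acosh(arg)


# Two pairs with the same Euclidean gap (0.05), one near the origin
# and one near the boundary of the ball
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.95, 0.0])
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary. That growth is what leaves room for fine-grained leaves of a hierarchy at the periphery while coarse concepts sit near the origin.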

2606.15092 2026-06-16 cs.LG 新提交

High-Dimensional Random Projection for Activation Steering in Language Models

高维随机投影用于语言模型中的激活引导

Minh-Hieu Pham, Bach Do, Laziz Abdullaev, Tan Minh Nguyen, Khoat Than

发表机构 * Hanoi University of Science and Technology(河内科技大学) National University of Singapore(新加坡国立大学)

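AI summary: To address the limitation that existing activation steering methods capture only mean differences, proposes HiDRA, a training-free high-dimensional random projection method for activation steering that performs activation addition in a projected high-dimensional space to capture discriminative signals in the nonlinear feature subspace; experiments show it outperforms baseline methods.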

AI中文摘要

激活引导已成为控制大型语言模型(LLM)行为的关键方法。然而,现有的基于均值差异的方法存在根本性局限:它们仅捕捉类别激活之间的均值差异,未能恢复在叠加假设下非线性特征子空间中自然存在的判别信号。受此启发,我们提出了高维随机投影激活引导(HiDRA),这是一种无训练的方法,可与现有的激活引导方法无缝集成。通过在投影高维空间中执行激活加法,HiDRA 可证明地能够捕捉线性方法无法触及的更优判别结构。跨不同 LLM 系列和基准的实验表明,HiDRA 始终优于基线方法,在不显著增加计算开销的情况下实现了更强的行为控制。

英文摘要

Activation steering has emerged as a key methodology for controlling the behavior of large language models (LLMs). Existing difference-in-means-based methods, however, are fundamentally limited: they capture only mean differences between class activations and fail to recover discriminative signals that naturally exist in the nonlinear feature subspace under the superposition hypothesis. Motivated by this limitation, we propose High-Dimensional Random-projection for Activation Steering (HiDRA), a training-free approach that integrates seamlessly with existing activation steering methods. By performing activation addition in the projected high-dimensional space, HiDRA can provably capture a better discriminative structure beyond the reach of linear methods. Experiments across diverse LLM families and benchmarks demonstrate that HiDRA consistently outperforms baseline counterparts, achieving stronger behavioral control without significant computational overhead.
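The limitation described above can be seen in a toy case: when two classes have identical means, the difference-in-means direction vanishes, yet a random nonlinear lift to a higher dimension can still separate them. The sketch below is not HiDRA's procedure, which the abstract does not specify; the ReLU random-feature lift, the helper names, and the XOR-style toy "activations" are all assumptions made for illustration.

```python
import random

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_in_means(pos, neg):
    """Classic steering direction: mean(pos) - mean(neg)."""
    return [a - b for a, b in zip(mean_vector(pos), mean_vector(neg))]

def make_random_lift(in_dim, out_dim, seed=0):
    """Hypothetical lift: x -> relu(W x) with a fixed Gaussian random matrix W."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    def lift(x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]
    return lift

def l2_norm(v):
    return sum(x * x for x in v) ** 0.5

# XOR-style toy classes: both have mean (0, 0), so the raw direction is zero.
pos = [[1.0, 1.0], [-1.0, -1.0]]
neg = [[1.0, -1.0], [-1.0, 1.0]]

raw_direction = difference_in_means(pos, neg)
lift = make_random_lift(in_dim=2, out_dim=64)
lifted_direction = difference_in_means([lift(x) for x in pos], [lift(x) for x in neg])

print(l2_norm(raw_direction), l2_norm(lifted_direction))
```

In the raw space there is nothing to add to an activation, while the lifted space yields a nonzero direction. How HiDRA maps such a direction back into the model is not covered by this sketch.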

2606.15085 2026-06-16 cs.LG 新提交

An Integrable Token Mixing Layer from the Generalized Yang Baxter Equation

来自广义杨-巴克斯特方程的可积令牌混合层

Snigdha Chandan Khilar

发表机构 * Independent Researcher(独立研究员)

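AI summary: Proposes YB Mixer, a sequence token mixing layer based on free fermion and generalized Yang-Baxter structures, which uses the local algebraic constraints of integrable systems to guarantee global computational stability and realizes a norm-preserving orthogonal map, commuting transfer matrices, and a spectral circulant generator to support inference on variable-length sequences.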

AI中文摘要

YB Mixer是一种源自自由费米子和广义杨-巴克斯特结构的序列令牌混合层。它应用了可积系统的核心原理,其中局部代数约束保证了全局计算稳定性。通过使用伊辛交换代数,该混合器创建了一个自由费米子结构,作为一个精确保范的正交映射。该代数还产生了可交换的传输矩阵,使得推理可以无顺序限制并适应任意可变预算。为了确保模型能够泛化到更长的序列长度,它使用了一个谱循环生成器。该生成器保持了系统关键的正交和可交换性质。结果是一个高度稳定且数学基础扎实的序列处理架构。

英文摘要

The YB Mixer is a sequence token mixing layer derived from free fermion and generalized Yang-Baxter structures. It applies a core principle from integrable systems, in which a local algebraic constraint guarantees global computational stability. By using the Ising exchange algebra, the mixer creates a free fermionic structure that acts as an exactly norm-preserving orthogonal map. This algebra also produces commuting transfer matrices, which allow inference to be order-free and adaptable to any variable budget. To ensure the model can generalize to longer sequence lengths, it uses a spectral circulant generator. This generator maintains the crucial orthogonal and commuting properties of the system. The result is a highly stable and mathematically grounded architecture for sequence processing.
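Two properties named in the abstract, norm preservation and commutation, hold for any real circulant map whose Fourier eigenvalues have unit modulus. The sketch below demonstrates only that generic fact; it does not reproduce the YB Mixer's construction from the Ising exchange algebra. The function `circulant_mix` and the phase rule `angle * sin(2*pi*k/n)` are assumptions chosen because they are defined for every sequence length.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a demo)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circulant_mix(x, angle):
    """Apply a real orthogonal circulant map defined by unit-modulus eigenvalues.

    The phases are odd in the frequency index, so the eigenvalues come in
    conjugate pairs and a real input yields a real output.
    """
    n = len(x)
    spectrum = dft(x)
    mixed = [spectrum[k] * cmath.exp(1j * angle * math.sin(2 * math.pi * k / n))
             for k in range(n)]
    return [value.real for value in idft(mixed)]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

tokens = [0.5, -1.0, 2.0, 0.25, 1.5]
ab = circulant_mix(circulant_mix(tokens, 0.7), 1.3)
ba = circulant_mix(circulant_mix(tokens, 1.3), 0.7)
print(l2_norm(tokens), l2_norm(ab))
print(max(abs(p - q) for p, q in zip(ab, ba)))
```

Because the map is defined through its spectrum, the same rule applies unchanged to a sequence of any length, which is one plausible reading of why a spectral generator helps with length generalization.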

2606.15080 2026-06-16 cs.CL cs.AI 新提交

AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

AdaMame: 一种自适应多语言推理的训练方案

Dayeon Ki, Kevin Duh, Marine Carpuat

发表机构 * University of Maryland(马里兰大学) Johns Hopkins University(约翰霍普金斯大学)

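AI summary: To address language collapse in multilingual reasoning, proposes AdaMame, a two-stage training recipe that first establishes multilingual capability through SFT and then optimizes reasoning-language alignment with an adaptive GRPO, achieving Pareto-optimal accuracy, language fidelity, and token efficiency.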

Comments 20 pages, 5 figures

AI中文摘要

尽管大型推理模型(LRMs)在英语中表现出色,但它们往往无法以查询语言进行推理,这种现象称为语言崩溃。现有的基于强化学习的修复方法通常在准确性目标上添加一个二元语言保真度奖励,但仍然会在准确性、中间轨迹代码切换和过度令牌使用方面产生权衡。在这项工作中,我们提出了AdaMame,一种用于多语言数学推理的两阶段训练方案,通过自适应地将推理语言与查询语言对齐来解决这些限制,同时不损害准确性。第一阶段的SFT在五种语言的自然推理轨迹上进行微调,以建立多语言推理能力。在随后的RL阶段,我们引入了AdaMame-GRPO,这是组相对策略优化(GRPO)的一种改编,其中查询条件的对齐因子在训练过程中逐渐增长,引导模型首先探索多样的推理语言,然后利用查询语言进行推理。在两个基准、两个LRM和12种语言上的评估表明,AdaMame-GRPO在所有基线上实现了推理准确性、语言保真度和令牌效率的帕累托最优性能,在领域外、低资源语言上取得了最强的提升。

英文摘要

While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur trade-offs among accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages.
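The explore-then-exploit behavior described in the abstract can be illustrated with a group-relative advantage whose language term is weighted by a factor that grows over training. This is a minimal sketch under stated assumptions, not the AdaMame-GRPO objective: the linear schedule, the additive reward, and the names `alignment_factor` and `combined_reward` are hypothetical, and the query-conditioning of the factor is not modeled.

```python
import statistics

def alignment_factor(step, total_steps, max_weight=1.0):
    """Hypothetical schedule: weight rises linearly from 0 to max_weight."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * frac

def combined_reward(correct, in_query_language, step, total_steps):
    """Accuracy reward plus a language-alignment term scaled by the schedule."""
    return float(correct) + alignment_factor(step, total_steps) * float(in_query_language)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of four sampled responses: (answer correct?, reasoning in query language?)
samples = [(True, False), (True, True), (False, True), (False, False)]

early = group_relative_advantages([combined_reward(c, q, 0, 100) for c, q in samples])
late = group_relative_advantages([combined_reward(c, q, 100, 100) for c, q in samples])
print(early)
print(late)
```

Early in training the two correct responses receive the same advantage regardless of reasoning language; late in training the correct response that reasons in the query language is preferred over the one that does not.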