arXivDaily: daily arXiv digest, updated Monday through Friday
2605.07193 2026-05-11 cs.LG

Coupling Models for One-Step Discrete Generation

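Fred Zhangzhi Peng, Avishek Joey Bose, Anru R. Zhang, Alexander Tong

Institutions: Duke University; Imperial College London; AITHYRA

AI summary: This paper proposes a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents, and reports substantial improvements on text generation, enhancer design, and image generation.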

Comments Code is available at https://github.com/pengzhangzhi/Coupling-Models

Abstract

Generative modeling over discrete structures underpins applications across deep learning, from biological sequence design and code generation to large language models, yet generation often remains sequential, relying on autoregressive decoding or iterative refinement. In this work, we introduce Coupling Models, a one-step discrete generative model that learns a direct coupling between discrete sequences and Gaussian latents. Unlike recent distillation methods that compress a pretrained multi-step sampler into a few steps, a Coupling Model trains a purpose-built decoder to invert this coupling and generate samples in a single step. The model also avoids complex continuous flows over the simplex and hand-specified data-to-noise couplings. Empirically, the Coupling Model improves on the strongest one-step baselines in each domain: it reduces LM1B text-generation perplexity by 33% at its lowest-perplexity operating point, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%. These results suggest that effective one-step discrete generation depends strongly on how data and noise are coupled before decoding. Code is available at https://github.com/pengzhangzhi/Coupling-Models.

2605.07192 2026-05-11 cs.CV

AsyncEvGS: Asynchronous Event-Assisted Gaussian Splatting for Handheld Motion-Blurred Scenes

Jun Dai, Renbiao Jin, Bo Xu, Yutian Chen, Linning Xu, Mulin Yu, Tianfan Xue, Shi Guo

Institutions: Shanghai AI Laboratory; Shanghai Jiao Tong University; CUHK MMLab; CPII under InnoHK

AI summary: This paper proposes asynchronous event-assisted Gaussian splatting, which uses a high-resolution dual-camera system and a cross-domain pose estimation module for 3D reconstruction of motion-blurred scenes. It also introduces the AsyncEv-Deblur dataset and improves reconstruction robustness.

Abstract

3D reconstruction methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) achieve impressive photorealism but fail when input images suffer from severe motion blur. While event cameras provide high-temporal-resolution motion cues, existing event-assisted approaches rely on low-resolution sensors and strict synchronization, limiting their practicality for handheld 3D capture on common devices, such as smartphones. We introduce a flexible, high-resolution asynchronous RGB-Event dual-camera system and a corresponding reconstruction framework. Our approach first reconstructs sharp images from the event data and then employs a cross-domain pose estimation module based on the Visual Geometry Transformer (VGGT) to obtain robust initialization for 3DGS. During optimization, we employ a structure-driven event loss and view-specific consistency regularizers to mitigate the ill-posed behavior of traditional event losses and deblurring losses, ensuring both stable and high-fidelity reconstruction. We further contribute AsyncEv-Deblur, a new high-resolution RGB-Event dataset captured with our asynchronous system. Experiments demonstrate that our method achieves state-of-the-art performance on both our challenging dataset and existing benchmarks, substantially improving reconstruction robustness under severe motion blur. Project page: https://openimaginglab.github.io/AsyncEvGS/

2605.07191 2026-05-11 cs.CV cs.LG

Attention Transfer Is Not Universally Effective for Vision Transformers

Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Peng Hu, Chen Gong, Xi Peng, Hongyuan Zhu

Institutions: Institute for Infocomm Research (I2R), A*STAR, Singapore; Sichuan University; Shanghai Jiao Tong University

AI summary: The study finds that although attention transfer has been reported to recover the full benefit of pre-trained weights, it does not succeed for every Vision Transformer family. Some families fail consistently across training conditions, and the cause is an architectural mismatch between student and teacher.

Abstract

A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher's native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient only when the student architecture matches the teacher.

2605.07186 2026-05-11 cs.CL cs.AI

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

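Zekai Tong, Ruiyao Xu, Aryan Shrivastava, Chenhao Tan, Ari Holtzman

Institutions: University of Chicago; Northwestern University

AI summary: The study examines how text fragmentation affects LLM information retrieval, finds that performance follows a U-shaped curve, and proposes a mode transition hypothesis to explain how LLMs change operating mode across text conditions.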

Comments 18 pages, 9 figures

Abstract

Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects how LLMs detect targeted information. By inserting whitespace characters within words to break them into fragments, LLMs' detection accuracy follows a U-shaped curve with the increase in insertion rate. We refer to this curve as the Text Uncanny Valley. To explain such observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.
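The word-boundary corruption described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, assuming that each character boundary strictly inside a word receives a space independently with probability `rate`; the function name `fragment_words`, the per-boundary sampling scheme, and the seeding are assumptions made for illustration, not the authors' implementation.

```python
import random


def fragment_words(text: str, rate: float, seed: int = 0) -> str:
    """Insert spaces inside words (hypothetical perturbation, for illustration).

    Every boundary between two non-whitespace characters gets a space
    with probability `rate`; existing whitespace is left untouched.
    """
    rng = random.Random(seed)
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else " "
        # Only boundaries strictly inside a word are eligible for insertion.
        if not ch.isspace() and not nxt.isspace() and rng.random() < rate:
            out.append(" ")
    return "".join(out)


if __name__ == "__main__":
    for r in (0.0, 0.3, 1.0):
        print(r, repr(fragment_words("find the hidden key", r)))
```

At `rate=0.0` the text is unchanged, and at `rate=1.0` every word is split into single characters, so the original word boundaries can no longer be told apart from the inserted ones. Intermediate rates mix intact and broken words, which is the regime the abstract associates with the valley.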

2605.07182 2026-05-11 cs.LG

Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

Ali Taghibakhshi, Ruisi Cai, Saurav Muralidharan, Sharath Turuvekere Sreenivas, Aditya Vavre, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Sheldon Liang, Marcin Chochowski, Zijia Chen, Akhiad Bercovich, Ran Zilberstein, Ran El-Yaniv, Yonatan Geifman, Daniel Korzekwa, Yoshi Suhara, Oluwatobi Olabiyi, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Institutions: NVIDIA

AI summary: This paper proposes Star Elastic, which adds multiple nested submodels in a single training job, reducing training cost and addressing the rigidity of static architectures. Its elastic budget control improves inference accuracy and efficiency.

Abstract

Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring separate training runs for each model in the family. In this paper, we introduce Star Elastic, a novel LLM post-training method that adds N nested submodels to a given parent reasoning model using the compute of one run (N-fold savings) via a single post-training job. Beyond reducing training costs, Star Elastic also addresses a fundamental limitation of efficient reasoning: the rigidity of static architectures, which forces the allocation of constant resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel inference scheme that uses different submodels for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE, and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. Building on the Nemotron Elastic framework, we apply Star Elastic to the NVIDIA Nemotron Nano models, with a particular focus on hybrid Mixture-of-Experts (MoE) architectures: from Nemotron Nano v3 (30B/3.6A), we generate 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. All nested models match or outperform independently trained baselines of comparable size and achieve a 360x reduction versus pretraining from scratch and a 7x reduction over state-of-the-art compression. Crucially, elastic budget control advances the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection. We further extend Star Elastic to quantized regimes via Quantization-Aware Distillation (QAD), producing nested NVFP4 and FP8 elastic checkpoints that preserve zero-shot slicing while delivering smaller deployment footprints.

2605.07181 2026-05-11 cs.CV

SatSurfGS: Generalizable 2D Gaussian Splatting for Sparse-View Satellite Surface Reconstruction

Min Chen, Wei Guo, Bin Wang, Wen Li, Tong Fang, Jinbo Zhang, Junqi Zhao, Hong Kuang, Han Hu, Xuming Ge, Qing Zhu, Bo Xu

Institutions: Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu, P.R. China

AI summary: This paper proposes SatSurfGS, which builds a coarse-to-fine Gaussian attribute prediction framework to address unreliable multi-view matching in sparse-view satellite surface reconstruction, improving reconstruction accuracy and generalization.

Abstract

Sparse-view satellite image surface reconstruction remains highly challenging, fundamentally because the reliability of multi-view matching under satellite imaging conditions is strongly spatially heterogeneous. Affected by large photometric differences, weak textures, and repetitive textures, multi-view geometric constraints are often sparse, unevenly distributed, and locally unreliable. Although 2D Gaussian Splatting (2DGS) is more suitable than 3D Gaussian Splatting (3DGS) for the explicit representation of continuous surfaces, research on generalizable feed-forward 2DGS frameworks for sparse-view satellite surface reconstruction is still lacking. To address this issue, we propose SatSurfGS, a generalizable sparse-view surface reconstruction method for satellite imagery based on 2DGS. The proposed method builds a coarse-to-fine Gaussian attribute prediction framework and explicitly models local geometric reliability at three levels: feature learning, Gaussian parameter estimation, and training optimization. Specifically, we propose a confidence-aware monocular multi-view feature fusion module to adaptively integrate monocular priors and multi-view matching features according to local confidence; a cross-stage self-consistency residual guidance module to stabilize stage-wise Gaussian parameter refinement using the residual between the rendered height map from the previous stage and the current-stage MVS height map, together with confidence information; and a confidence bidirectional routing loss to achieve differentiated allocation of geometric and appearance supervision. Experiments on satellite datasets show that the proposed method achieves improved rendering quality, surface reconstruction accuracy, cross-dataset generalization, and inference efficiency compared with representative generalizable baselines and competitive per-scene optimization methods.

2605.07180 2026-05-11 cs.CL

Learning Agent Routing From Early Experience

Yimin Wang, Jiahao Qiu, Xuan Qi, Xinzhe Juan, Jingzhe Shi, Zelin Zhao, Hongru Wang, Shilong Liu, Mengdi Wang

Institutions: AI Lab, Princeton University; University of Michigan; Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; Shanghai Jiao Tong University; University of Edinburgh; King’s College London

AI summary: This paper studies how to route between lightweight LLM inference and full agent execution under cold-start conditions. It proposes BoundaryRouter, a training-free framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer with the LLM directly or escalate to an agent. Experiments show that it outperforms other routing methods in both inference time and performance.

Comments 17 pages

Abstract

LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain route settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
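The retrieval step of such a router can be illustrated with a toy sketch. Everything here is an assumption for illustration: the memory schema (`query`, `llm_correct`), the token-level Jaccard similarity, and the majority-vote rule are hypothetical stand-ins, and the rubric-guided reasoning mentioned in the abstract is omitted entirely. It is not BoundaryRouter's actual procedure.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two queries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def route(query: str, memory: list, k: int = 2) -> str:
    """Return "llm" if the direct LLM succeeded on most of the k nearest seed cases."""
    neighbors = sorted(
        memory, key=lambda case: jaccard(query, case["query"]), reverse=True
    )[:k]
    llm_wins = sum(1 for case in neighbors if case["llm_correct"])
    return "llm" if 2 * llm_wins > len(neighbors) else "agent"


# Hypothetical experience memory: each seed query was run through both
# systems, and we record whether the direct LLM answer was correct.
MEMORY = [
    {"query": "what is 2 plus 2", "llm_correct": True},
    {"query": "what is 7 plus 1", "llm_correct": True},
    {"query": "book a flight and summarize the refund policy", "llm_correct": False},
    {"query": "search the web and compile a report on refund policy", "llm_correct": False},
]

if __name__ == "__main__":
    print(route("what is 5 plus 9", MEMORY))
    print(route("compile a report on the refund policy", MEMORY))
```

A real system would likely use dense embeddings rather than token overlap, and would also store agent outcomes and cost; the sketch only shows how a small seed set can drive a training-free routing decision.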

2605.07178 2026-05-11 cs.CV

Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

Kai Zheng, Hang-Cheng Dong, Jiatong Pan, Zhenkai Wu, Fupeng Wei, Wei Zhang

Institutions: Zhejiang University; Harbin Institute of Technology; Harbin Institute of Technology Suzhou Research Institute; North China University of Water Resources and Electric Power; University of Auckland

AI summary: This paper proposes the S2M framework, which extracts structured textual features directly from change labels at zero additional annotation cost to strengthen multimodal supervision for remote sensing change detection.

Abstract

Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopt a two-stage training strategy: the model is first fine-tuned on remote sensing imagery to obtain a robust domain-specific representation, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80% and an F_scd of 66.14%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.

2605.07175 2026-05-11 cs.LG cs.AI

Learning Multi-Relational Graph Representations for DNA Methylation-Based Biological Age Estimation

Qing Qing, Xikun Zhang, Zhongyuan Zhang, Jiarui Liu, Xingtong Yu, Xiaotao Shen, Ziqi Xu, Qixin Zhang, Zhe Wang, Renqiang Luo

Institutions: Jilin University; RMIT University; The Chinese University of Hong Kong; Nanyang Technological University

AI summary: This paper proposes the RelAge-GNN framework, which builds three complementary graphs capturing co-methylation, genomic co-localization, and gene-level associations among CpG sites, improving the accuracy and correlation of biological age prediction from DNA methylation data.

Abstract

Aging clocks aim to estimate biological age, a measure of physiological state distinct from chronological age, from observable biomarkers, and are widely used for health assessment and disease analysis. DNA methylation is a particularly informative biomarker due to its stability and strong association with aging, and recent learning-based approaches have improved predictive performance. However, most existing methods treat CpG sites as independent features, overlooking the complex and heterogeneous biological relationships among them. We propose RelAge-GNN, a multi-relational graph neural network framework for DNA methylation-based age prediction. Our method constructs three complementary graphs capturing co-methylation patterns, genomic co-localization, and gene-level associations among CpG sites. Each graph is modeled by an independent GNN branch, and a learnable gating mechanism adaptively fuses the resulting representations. Experiments on large-scale datasets show that RelAge-GNN achieves competitive accuracy and stronger correlation with chronological age compared to state-of-the-art methods. Moreover, the model exhibits improved sensitivity in detecting age acceleration across diverse disease cohorts, highlighting its potential utility for disease characterization. Finally, through post hoc interpretability analyses, we quantify the contributions of different relational structures and CpG sites, providing biologically meaningful insights and suggesting potential directions for aging-related research. Our code is available at: https://anonymous.4open.science/r/RelAge-GNN-F1E3/.
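One common way to realize a gating mechanism that fuses several branch representations is a softmax-weighted sum. The sketch below assumes that form; the function name `gated_fusion` and the use of fixed gate logits are illustrative assumptions. In a trained model the logits would come from learned parameters, and the abstract does not specify the exact gate used in RelAge-GNN.

```python
import math


def gated_fusion(branch_embeddings, gate_logits):
    """Fuse per-branch embeddings with softmax gate weights (illustrative form).

    branch_embeddings: list of equal-length vectors, one per graph branch.
    gate_logits: one scalar per branch; assumed here to be given, whereas a
    trained model would produce them from learned parameters.
    """
    m = max(gate_logits)  # subtract the max for numerical stability
    exps = [math.exp(g - m) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(branch_embeddings[0])
    fused = [
        sum(w * emb[d] for w, emb in zip(weights, branch_embeddings))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # Three hypothetical branches: co-methylation, co-localization, gene-level.
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused, weights = gated_fusion(embeddings, [0.0, 0.0, 0.0])
    print(fused, weights)
```

With equal logits the fusion reduces to a plain average of the branches. Inspecting the weights after training is one simple way to compare how much each relation type contributes.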

2605.07174 2026-05-11 cs.AI

Repeated Deceptive Path Planning against Learnable Observer

Shiyue Cao, Pei Xu, Likun Yang, Lei Cui, Shizhao Yu, Shiyu Zhang, Yongjian Ren, Xiaotang Chen, Kaiqi Huang

Institutions: School of Artificial Intelligence, University of Chinese Academy of Sciences; National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences

AI summary: This paper proposes the DeMP framework, which uses two-level optimization to cope with the adaptivity of a learnable observer and sustain deception. Experiments show strong performance in repeated deceptive path planning.

Comments Full version of the extended abstract accepted at AAMAS 2026

Abstract

We study the problem of deceptive path planning (DPP), where an agent aims to conceal its true destination from external observers. While existing work assumes static, non-learning observers, real-world adversaries (for example, in critical goods transportation or military operations) can adapt by learning from historical trajectories. To address this gap, we introduce Repeated Deceptive Path Planning (RDPP), a new formulation that explicitly models learnable observers. We show that existing DPP methods fail under this setting, as they cannot adapt to evolving adversarial predictions. While incorporating the observer's previous predictions into updates enables some adaptation, such incremental updates cause accumulated lag that degrades deception. To this end, we propose Deceptive Meta Planning (DeMP), a two-level optimization framework that combines episode-level adaptation, which enables short-term policy adjustment to counter the updated observer, and meta-level updates, which leverage cross-episode feedback to capture how observers update their models and accelerate adaptation in future episodes. In this way, DeMP mitigates the accumulation of adaptation lag, enabling sustained deception against a learning observer. Experiments across environments demonstrate that DeMP significantly outperforms existing approaches in RDPP while maintaining competitive path cost. Our results highlight the importance of modeling repeated interactions with learnable adversaries, providing new insights into deception and privacy in multi-agent systems.

2605.07172 2026-05-11 cs.CL

Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

Yurui Pan, Ke Xu, Bo Peng

Institutions: School of Computing and Intelligent Innovation, Fudan University; School of Economics and Management, Tongji University; College of Information Technology, Shanghai Ocean University

AI summary: This paper proposes a topology-enhanced alignment framework that improves LLM alignment through Trajectory Topology Loss and Topological Preference Optimization. Experiments show that it outperforms conventional methods on automatic preference metrics and LLM-judge evaluations.

Comments Accepted to ACL 2026. 15 pages

Abstract

Alignment of large language models (LLMs) via SFT and RLHF/DPO typically ignores the global geometry of the representation space, relying instead on local token likelihoods or scalar scores. We view generation as tracing a semantic trajectory in hidden space and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, for SFT, we introduce Trajectory Topology Loss (TTL). Treating prompt and gold-answer embeddings as a mixed point cloud, we use a 0D persistent homology algorithm to extract "prompt-answer bridges." TTL aligns the model's actual update direction with these topological bridges rather than arbitrary directions. Second, for DPO, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors and aligns the improvement direction between rejected and chosen responses with these vectors in an intermediate hidden layer. We also introduce a dynamic weighting scheme to balance DPO and TPO losses. Evaluating on Qwen2.5-7B-Instruct using UltraChat and Anthropic HH-RLHF, our topology-enhanced objectives consistently outperform strong non-topological baselines (e.g., per-example, nearest-neighbor, random regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or improving toxicity. Results show persistent homology and trajectory geometry offer a promising direction for controllable alignment.
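For a point cloud under the Vietoris-Rips filtration, 0-dimensional persistent homology can be computed with a union-find pass over edges sorted by length: every component is born at scale 0, and the merge events are exactly the minimum-spanning-tree edges. The sketch below shows that computation. Treating a merge edge that joins a prompt point to an answer point as a "bridge" is an assumption made for illustration; the abstract does not define the term precisely, and the names `zero_dim_merges` and `prompt_answer_bridges` are hypothetical.

```python
import math
from itertools import combinations


def zero_dim_merges(points):
    """Return the merge events (distance, i, j) of 0D persistent homology.

    Each returned edge kills one connected component; the edges together
    form a minimum spanning tree of the point cloud.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge connects two components: one of them dies at d
            parent[ri] = rj
            merges.append((d, i, j))
    return merges


def prompt_answer_bridges(points, labels):
    """Merge edges whose endpoints carry different labels (assumed reading)."""
    return [(d, i, j) for d, i, j in zero_dim_merges(points) if labels[i] != labels[j]]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(zero_dim_merges(pts))
    print(prompt_answer_bridges(pts, ["prompt", "prompt", "answer"]))
```

The all-pairs edge list costs O(n^2 log n), which is fine for the small point clouds formed by one prompt and one answer but would need a sparser neighbor graph at larger scale.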

2605.07171 2026-05-11 cs.LG cs.SY eess.SY stat.ML

Cost-Ordered Feasibility for Multi-Armed Bandits with Cost Subsidy

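Ishank Juneja, Carlee Joe-Wong, Osman Yağan

Institutions: Department of Electrical and Computer Engineering

AI summary: This paper studies multi-armed bandits with cost subsidy, where the reward constraint is defined relative to the unknown best reward. It proposes the COF algorithm, improves the theoretical regret upper bounds through instance-dependent analysis, and validates the algorithm's empirical performance.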

Abstract

The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications, often the goal is to minimize cost subject to a constraint on the minimum permissible reward, an objective captured by multi-armed bandits with cost-subsidy (MAB-CS). Of interest to this paper is the setting where the quality (reward) constraint is specified relative to the unknown best reward and the cost of each arm is known. We characterize the expected sub-optimal samples required by any policy by proving instance-dependent lower bounds that offer new insight into the problem and are a strict generalization of prior bounds. Then, we propose an algorithm called Cost-Ordered Feasibility (COF) that leverages our insight and intelligently combines samples from all arms to gauge the feasibility of a cheap arm. Thereafter, we analyze COF to establish instance-dependent upper bounds on its expected cumulative cost and quality regret, i.e., relative to the cheapest feasible arm. Finally, we empirically validate the merits of COF, comparing it to baselines from the literature through extensive simulation experiments on the MovieLens and Goodreads datasets as well as representative synthetic instances. Not only does our paper develop qualitatively better theoretical regret upper bounds, but COF also convincingly demonstrates improved empirical performance.
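The benchmark that the regret is measured against, the cheapest feasible arm, can be made concrete with an oracle that knows the true mean rewards. The sketch assumes a multiplicative constraint: an arm is feasible if its mean reward is at least `(1 - alpha)` times the best mean, for a subsidy factor `alpha` in [0, 1] and non-negative rewards. That form, the per-pull regret definitions, and all names below are assumptions for illustration, since the abstract does not spell them out. This is an oracle, not the COF algorithm.

```python
def cheapest_feasible_arm(means, costs, alpha):
    """Index of the lowest-cost arm whose mean reward meets the assumed constraint."""
    threshold = (1.0 - alpha) * max(means)
    feasible = [i for i, mu in enumerate(means) if mu >= threshold]
    return min(feasible, key=lambda i: costs[i])


def per_pull_regret(arm, means, costs, alpha):
    """Assumed per-pull (cost regret, quality regret) of pulling `arm`."""
    best = cheapest_feasible_arm(means, costs, alpha)
    threshold = (1.0 - alpha) * max(means)
    cost_regret = max(0.0, costs[arm] - costs[best])
    quality_regret = max(0.0, threshold - means[arm])
    return cost_regret, quality_regret


if __name__ == "__main__":
    means = [0.5, 0.8, 0.9]   # hypothetical true mean rewards
    costs = [1.0, 2.0, 3.0]   # known per-pull costs
    print(cheapest_feasible_arm(means, costs, alpha=0.2))
    print(per_pull_regret(0, means, costs, alpha=0.2))
    print(per_pull_regret(2, means, costs, alpha=0.2))
```

In this instance the cheapest arm violates the quality constraint and the best arm overpays, so the two regret components pull in opposite directions. A learning algorithm has to estimate the threshold itself because the best mean is unknown.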

2605.07170 2026-05-11 cs.CL

A Reproducible Multi-Architecture Baseline for Token-Level Chinese Metaphor Identification under the MIPVU Framework

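Yufeng Wu

Institutions: City University of Hong Kong

AI summary: This paper presents a reproducible multi-architecture baseline for Chinese metaphor identification under the MIPVU framework. It systematically compares three model families and finds that MelBERT performs best on Chinese, while the Qwen generative model trails the encoder baselines by about 11 F1 points, reflecting the discrete-commitment limitation of generative output.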

Abstract

Metaphor is pervasive in everyday language, yet token-level computational identification of metaphor-related words in Chinese under the MIPVU framework remains under-explored relative to English. This paper presents a reproducible multi-architecture baseline for token-level metaphor identification on the PSU Chinese Metaphor Corpus (PSU CMC), the only widely available MIPVU-annotated Chinese corpus. We systematically compare three model families: (i) encoder fine-tuning with Chinese RoBERTa-wwm-ext-large; (ii) MelBERT adapted to Chinese using a newly constructed basic-meaning resource derived from the Modern Chinese Dictionary, 7th edition (MCD7), comprising 74,823 entries with 71.51% PSU CMC vocabulary coverage; and (iii) Qwen3.5-9B fine-tuned with QLoRA as an instruction-tuned generative baseline. Across five fixed seeds, MelBERT MIP-only achieves the strongest performance at 0.7281 +/- 0.0050 test positive F1, marginally above MelBERT Full (0.7270 +/- 0.0069) and clearly above plain RoBERTa (0.7142 +/- 0.0121). The Qwen QLoRA generative configuration trails encoder baselines by approximately 11 F1 points (0.6157 +/- 0.0113). Three findings merit attention: (1) the SPV channel of MelBERT does not contribute reliable positive signal in Chinese, consistent with the dominance of conventional metaphor; (2) the Qwen-encoder gap is concentrated in recall, reflecting the discrete-commitment limitation of generative output; (3) several Qwen task formulations fail due to format design rather than model capacity. We release all split manifests, per-seed outputs, the MCD7 basic-meaning embedding pipeline, and training scripts to serve as a common reference for future Chinese metaphor identification research.

2605.07166 2026-05-11 cs.LG

Neurosymbolic Imitation Learning with Human Guidance: A Privileged Information Approach

基于人类指导的神经符号模仿学习:一种特权信息方法

Nikhilesh Prabhakar, Varun Balaji, Athresh Karanam, Kristian Kersting, Sriraam Natarajan

发表机构 * Department of Computer Science, The University of Texas at Dallas, Richardson TX 75080, USA(德克萨斯大学达拉斯分校计算机科学系) Computer Science Department and Centre for Cognitive Science, TU Darmstadt, Germany(达姆施塔特工业大学计算机科学系和认知科学中心)

AI总结 本文提出一种神经符号模仿学习方法,结合高维数据处理与泛化能力,利用训练期间可用的特权信息(如眼动数据)提升学习效果。

Comments Under Review for ECML-PKDD 2026

详情
AI中文摘要

模仿学习广泛用于复杂环境中的行为学习。虽然纯神经方法能有效处理高维数据,但需要大量样本且易过拟合;纯符号方法虽泛化良好,但无法有效处理高维数据。本文提出一种神经符号方法,兼具高维数据处理与泛化能力。其关键优势在于能有效利用训练期间可用的特权信息(如眼动数据)。实验证明了所提方法的有效性、效率和泛化能力。

英文摘要

Imitation learning is widely used for learning to act in complex environments. While purely neural methods handle high-dimensional data effectively, they require a large number of samples and are prone to overfitting. Purely symbolic approaches, while they generalize well, do not handle high-dimensional data effectively. We propose a neurosymbolic approach that achieves the best of both worlds, i.e., handling high-dimensional data while achieving generalization. The key advantage of our approach is that it can effectively exploit additional privileged information that is available only during training (in our case, gaze data). Our empirical evaluations demonstrate the effectiveness, efficiency, and generalization capability of our proposed approach.

2605.07164 2026-05-11 cs.CL

Rethinking Experience Utilization in Self-Evolving Language Model Agents

重新思考自演化语言模型代理中的经验利用

Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yanyan Zhao, Yu Zhang, Yang Wu, Dandan Tu, Bing Qin, Ting Liu

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文研究自演化代理中经验利用的关键设计维度,提出ExpWeaver方法,通过在决策过程中交织经验使用,使代理在需要额外指导时才调用经验,提升性能。

Comments 30 pages, 20 figures, 7 tables

详情
AI中文摘要

自演化代理通过积累和重用过去交互的经验来提升。现有研究主要关注经验的构建、表示和更新,而较少关注运行时决策制定中经验的使用方式。本文将经验利用作为自演化代理的关键设计维度,探讨是否在决策过程中交织经验使用,使经验仅在需要额外指导时被调用。通过引入ExpWeaver方法,在推理过程中将经验作为可选资源暴露出来,实验表明其在不同框架、LLM基础模型和环境类型中均表现最佳。强化学习实验进一步表明,这种行为可通过训练放大。使用模式、因果消融和熵基分析揭示ExpWeaver使代理能选择性地在有益决策点和更高推理不确定性下调用经验。整体而言,本文发现应从单纯研究存储经验的'是什么'转向理解经验何时应进入决策制定。

英文摘要

Self-evolving agents improve by accumulating and reusing experience from past interactions. Existing work has largely focused on how experience is constructed, represented, and updated, while paying less attention to how experience should be used during runtime decision-making. As a result, most agents rely on rigid usage strategies, either injecting experience once at initialization or at every step, without considering whether it is needed for the current decision. This paper studies experience utilization as a critical design dimension of self-evolving agents. We ask whether agents benefit from interweaving experience use with decision-making, so that experience is invoked only when additional guidance is needed. To examine this question, we introduce ExpWeaver, a lightweight instantiation that leaves experience construction unchanged and modifies only runtime utilization by exposing experience as an optional resource during reasoning. Across four representative frameworks, seven LLM backbones, and three types of environments, ExpWeaver consistently achieves the best performance among different utilization strategies. Reinforcement learning experiments further show that this behavior can be amplified through training. Usage-pattern, causal ablation, and entropy-based analyses reveal that ExpWeaver enables agents to invoke experience selectively, at beneficial decision points, and under higher reasoning uncertainty. Overall, our findings call for a shift from merely studying what experience to store toward understanding how and when experience should enter decision-making.

2605.07162 2026-05-11 cs.CL

CLIPer: Tailoring Diverse User Preference via Classifier-Guided Inference-Time Personalization

CLIPer:通过分类器引导的推理时个性化实现多样化用户偏好

Jinyan Su, Jinpeng Zhou, Claire Cardie, Wen Sun

发表机构 * Cornell University(康奈尔大学)

AI总结 CLIPer通过分类器引导在推理时个性化,实现多样化用户偏好,无需大量微调,具有高效可控的个性化能力。

详情
AI中文摘要

个性化大语言模型可以通过调整响应以适应用户偏好如有用性、简洁性和幽默感来显著提升用户体验。然而,对所有可能的用户偏好组合进行微调计算成本高且不现实。本文介绍CLIPer(分类器引导的推理时个性化),一种轻量级的个性化方法,利用分类器模型在推理时动态引导LLM生成以适应不同用户偏好。我们的方法消除了大量微调的需求,诱导极小的额外计算开销,同时在单维和多维偏好上实现更可控和细致的个性化。全面的实证分析展示了我们的方法在交付个性化语言生成方面的可扩展性和有效性。

英文摘要

Personalized LLMs can significantly enhance user experiences by tailoring responses to preferences such as helpfulness, conciseness, and humor. However, fine-tuning models to address all possible combinations of user preferences is computationally expensive and impractical. In this paper, we introduce CLIPer (Classifier-guided Inference-time Personalization), a lightweight personalization approach that leverages a classifier model to steer LLM generation dynamically to different user preferences at inference time. Our method eliminates the need for extensive fine-tuning, inducing negligible additional computational overhead while enabling more controllable and nuanced personalization across single and multi-dimensional preferences. Comprehensive empirical analyses demonstrate the scalability and effectiveness of our approach in delivering personalized language generation.

2605.07157 2026-05-11 cs.LG

Learned Lagrangian Models of PDEs via Euler-Lagrange Residual Minimization

通过欧拉-拉格朗日残差最小化学习偏微分方程的拉格朗日模型

Lyra Zhornyak, Eric Forgoston, M. Ani Hsieh

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Montclair State University(蒙特克莱尔州立大学)

AI总结 本文提出了一种直接利用学习连续拉格朗日量预测偏微分方程系统动力学的方法,通过内在守恒结构实现稳定长程预测。方法基于优化的积分器,通过无网格近辛构造局部空间-时间块最小化欧拉-拉格朗日残差,实现模型误差与积分误差解耦,具有线性扩展性和无需结构要求的网络耦合能力。

Comments 9 pages, 8 figures, 2 tables, 7 pages of appendices

详情
AI中文摘要

我们提出了首个直接利用学习到的连续拉格朗日量来预测由偏微分方程(PDE)系统所支配的动力学的方法,利用其内在的守恒结构以实现稳定的长程预测。我们开发了一种基于优化的积分器,通过无网格近辛构造在局部空间-时间块上最小化平方欧拉-拉格朗日残差。与针对解析模型的积分器不同,针对学习模型的积分器应将模型误差(相位误差)与积分误差(守恒误差)解耦。通过依赖优化而非时间推进,我们绕过了固定离散化固有的全局耦合,这种耦合会减慢时间和空间推进并使学习复杂化。我们的方法通过雅可比迭代实现随域大小的线性扩展,并且不对学习网络施加结构要求,允许其与现有的物理引导机器学习(ML)方法耦合。我们在双摆的学习表示、一维波动方程和二维波动方程上验证了我们的方法。我们的方法在误差上与经典辛方法相当,且能够泛化到空间变化的动力学和任意边界条件,而无需重新训练。

英文摘要

We present the first method to directly use a learned continuous Lagrangian to forecast the dynamics of systems governed by partial differential equations, exploiting the inherent conservative structure to achieve stable long-range predictions. We develop an optimization-based integrator that minimizes the squared Euler--Lagrange residual via a mesh-free near-symplectic construction on local space-time patches. Different from integrators for analytical models, integrators for learned models should decouple model error (phase error) from integration error (conservation error). By relying on optimization rather than time-stepping, we bypass the global coupling inherent to fixed discretizations, which slows time- and space-stepping and complicates learning. Our method scales linearly with domain size via Jacobi iteration, and places no structural requirements on the learned network, allowing it to be coupled with existing physics-guided machine learning (ML) methods. We validate our approach on a learned representation of a double pendulum, a one-dimensional wave equation, and a two-dimensional wave equation. Our method achieves error comparable to classical symplectic methods while generalizing to spatially varying dynamics and arbitrary boundary conditions without retraining.

2605.07156 2026-05-11 cs.CV

Hierarchical Perfusion Graphs for Tumor Heterogeneity Modeling in Glioma Molecular Subtyping

分层灌注图谱用于胶质瘤分子分型中的肿瘤异质性建模

Han Jang, Junhyeok Lee, Heeseong Eum, Joon Jang, Yoseob Han, Seung Hong Choi, Kyu Sung Choi

发表机构 * Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, South Korea(生物工程跨学科项目,首尔国立大学,首尔,韩国) Interdisciplinary Program in Cancer Biology, Seoul National University College of Medicine, Seoul, South Korea(癌症生物学跨学科项目,首尔国立大学医学院,首尔,韩国) Dept. of Biomedical Sciences, Seoul National University, Seoul, South Korea(生物医学科学系,首尔国立大学,首尔,韩国) Dept. of Electronic Engineering, Soongsil University, Seoul, South Korea(电子工程系,崇实大学,首尔,韩国) Dept. of Radiology, Seoul National University Hospital, Seoul, South Korea(放射科,首尔国立大学医院,首尔,韩国) Dept. of Radiology, Seoul National University College of Medicine, Seoul, South Korea(放射科,首尔国立大学医学院,首尔,韩国) Healthcare AI Research Institute, Seoul National University Hospital, Seoul, South Korea(医疗人工智能研究院,首尔国立大学医院,首尔,韩国)

AI总结 本文提出HiPerfGNN框架,通过学习灌注动态数据构建分层图谱,用于胶质瘤分子分型,实现了高准确率的分子亚型预测。

Comments Accepted at MICCAI 2026. 11 pages, 2 figures, 2 tables

详情
AI中文摘要

精确的胶质瘤分子分型(包括异柠檬酸脱氢酶(IDH)突变和1p/19q共缺失)直接指导手术和治疗决策,但目前依赖侵入性组织采样。基于结构MRI的深度学习已成为一种非侵入性替代方案,但仅依赖解剖信息的方法无法捕捉区分分子亚型的血流动力学特征。基于动态磁敏感对比(DSC)MRI的放射基因组学在非侵入性表征胶质瘤分子亚型方面潜力巨大,但其临床部署受到站点间差异和体素级分析局限的阻碍。我们提出HiPerfGNN框架,首先使用向量量化变分自编码器(VQ-VAE)从原始时间-强度曲线中学习离散的血流动力学表示。这些量化的灌注编码定义了代表功能性肿瘤生境的粗粒度图节点,每个节点再在结构MRI的引导下被分层细分为细粒度子区域。随后,分层图神经网络跨尺度传播信息以进行分子预测。在内部队列(n=475)上,模型的AUC分别为0.96(IDH)、0.89(1p/19q)和0.84(WHO分级),并在独立外部队列(n=397)上无需重新校准即保持了稳健的IDH性能(AUC 0.89)。基于梯度的显著性分析确认了与已知胶质瘤病理生理学一致、具有生物学依据的注意力模式。我们的结果证明了将灌注动力学整合到放射基因组学流程中对胶质瘤分子分型的附加价值。代码可在https://github.com/janghana/HiPerfGNN获取。

英文摘要

Precise molecular subtyping of gliomas, including isocitrate dehydrogenase (IDH) mutation and 1p/19q codeletion, directly guides surgical and therapeutic decisions, yet currently relies on invasive tissue sampling. Deep learning on structural MRI has emerged as a non-invasive alternative, but anatomy-only approaches cannot capture the hemodynamic signatures that distinguish molecular subtypes. Radiogenomics based on dynamic susceptibility contrast (DSC) MRI holds immense potential for non-invasively characterizing glioma molecular subtypes, yet clinical deployment has been hindered by inter-site variability and the limitations of voxel-wise analysis. We introduce HiPerfGNN, a framework that first learns discrete hemodynamic representations from raw time-intensity curves using a vector-quantized variational autoencoder (VQ-VAE). These quantized perfusion codes define coarse-level graph nodes representing functional tumor habitats, each of which is hierarchically subdivided into fine-level subregions guided by structural MRI. A hierarchical graph neural network then propagates information across scales for molecular prediction. On an internal cohort (n=475), the model achieved AUCs of 0.96 (IDH), 0.89 (1p/19q), and 0.84 (WHO grade), and maintained robust IDH performance (AUC 0.89) on an independent external cohort (n=397) without recalibration. Gradient-based saliency analysis confirms biologically grounded attention patterns aligned with known glioma pathophysiology. Our results demonstrate the added value of integrating perfusion dynamics into radiogenomic pipelines for glioma molecular subtyping. Code is available at https://github.com/janghana/HiPerfGNN.

2605.07155 2026-05-11 cs.LG

Regret-Oracle Complexity Tradeoffs in Agnostic Online Learning

不可知在线学习中的遗憾-Oracle复杂度权衡

Idan Attias, Steve Hanneke, Arvind Ramaswami

发表机构 * Institute for Data, Econometrics, Algorithms, and Learning (IDEAL)(数据、计量经济学、算法和学习研究所) Department of Computer Science, Purdue University(计算机科学系,普渡大学)

AI总结 本文提出一种自适应、动态的不可知到可实现归约方法,仅依赖弱一致性 oracle,将 oracle 查询复杂度降低至 O(T^{d_VC+1}),同时保持接近最优的期望遗憾。

详情
AI中文摘要

不可知(agnostic)在线学习的经典解法是归约到可实现设置,并以 Littlestone 的标准最优算法 (SOA) 作为基础学习器。然而,SOA 即使只执行一轮在计算上也是不可行的。为克服这一障碍,近期的 oracle 高效在线学习工作用一个可实现的基础学习器取代 SOA,该学习器仅通过离线经验风险最小化 (ERM) oracle 访问概念类。这类不可知学习器虽然能达到接近最优的期望遗憾,但其 oracle 复杂度是双重指数的 O(T^{2^{O(d_LD)}}),其中 d_LD 是 Littlestone 维度,T 是轮数。在本文中,我们显著改进了这一 oracle 复杂度,同时只依赖一个更弱的原语:弱一致性 oracle,它仅判定给定的标记数据集是否可实现。我们方法的核心是一种自适应的动态不可知到可实现归约,它在运行过程中主动剪除不可实现的标签序列。通过使用 VC 维度 (d_VC) 来限制动态维护的活跃路径数量,我们的算法将总查询复杂度降低到 O(T^{d_VC+1}),同时保持接近最优的期望遗憾。关键的是,与标准归约相比,这种动态剪枝还降低了内存占用。此外,我们形式化地量化了遗憾与 oracle 复杂度之间的权衡,给出的上界可在受限查询预算与可达到的期望遗憾之间平滑插值。我们还给出了相应的下界,证明任何查询次数受限于 Q = o(√T) 的学习器都必须承受 Ω(T/Q) 的期望遗憾。

英文摘要

Agnostic online learning is classically solved via a reduction to the realizable setting, utilizing Littlestone's Standard Optimal Algorithm (SOA) as a base learner. However, the SOA is computationally intractable to execute even for a single round. To overcome this barrier, recent work in oracle-efficient online learning replaces the SOA with a realizable base learner that accesses the concept class exclusively through an offline empirical risk minimization (ERM) oracle. While such agnostic learners achieve near-optimal expected regret, they suffer from a doubly-exponential oracle complexity of $O\big(T^{2^{O(d_\mathrm{LD})}}\big)$, where $d_\mathrm{LD}$ is the Littlestone dimension and $T$ is the number of rounds. In this work, we significantly improve this oracle complexity while relying on an even weaker primitive: a weak-consistency oracle, which merely decides whether a given labeled dataset is realizable. At the core of our approach is an adaptive and dynamic agnostic-to-realizable reduction that actively prunes non-realizable label sequences on the fly. By using the VC dimension ($d_\mathrm{VC}$) to bound the number of dynamically maintained active paths, our algorithm reduces the total query complexity down to $O(T^{d_\mathrm{VC}+1})$ while perfectly preserving near-optimal expected regret. Crucially, this dynamic pruning also yields a memory reduction over the standard reduction. Furthermore, we formally quantify the regret--oracle complexity tradeoff, providing upper bounds that smoothly interpolate between restricted query budgets and attainable expected regret. We complement these with lower bounds proving that any learner restricted to $Q = o(\sqrt{T})$ queries must suffer an expected regret of $\Omega(T/Q)$.
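
As a back-of-the-envelope reading of the lower bound stated above (ignoring constants and any dependence on the dimension), a few query budgets can be plugged in. The chosen values of $Q$ are illustrative and do not come from the abstract.

```latex
\[
\mathbb{E}[\mathrm{Regret}_T] = \Omega\!\left(\frac{T}{Q}\right):
\qquad
Q = 1 \;\Rightarrow\; \Omega(T),
\qquad
Q = T^{1/4} \;\Rightarrow\; \Omega\!\left(T^{3/4}\right),
\qquad
Q = T^{1/2 - \epsilon} \;\Rightarrow\; \Omega\!\left(T^{1/2 + \epsilon}\right).
\]
```

Since $T/Q > \sqrt{T}$ exactly when $Q < \sqrt{T}$, the regime $Q = o(\sqrt{T})$ is the one in which this bound forces regret above order $\sqrt{T}$.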

2605.07154 2026-05-11 cs.CV

PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

PRIMED: 基于偏置竞争的自适应模态抑制指代视听分割

Yuchen He, Jing Zhang

发表机构 * School of Information Science and Engineering, East China University of Science and Technology(华东理工大学信息科学与工程学院)

AI总结 PRIMED受偏置竞争理论启发,通过自适应模态抑制提升指代视听分割的准确性,并利用模态先验解码器和token蒸馏器增强多模态融合,实验表明其在Ref-AVS基准上取得最先进的整体性能。

Comments 11 pages, 8 figures

详情
AI中文摘要

指代视听分割(Ref-AVS)旨在基于视觉、听觉和文本指代线索,在视频帧中定位并分割目标对象。该任务具有挑战性,因为不同模态的相关性随指代表达和场景而变化,而现有方法通常将多模态线索视为同质输入进行融合、提示或推理,因而容易受到无关或误导性模态的影响。为解决此问题,我们提出PRIMED,它受认知神经科学中的偏置竞争理论启发,显式建模视觉感知和语言驱动的先验调制,并通过自适应模态抑制实现更精确的Ref-AVS。具体而言,模态先验解码器首先估计指代表达主要依赖音频、视觉还是二者的联合交互,并生成模态先验以自适应地引导高层注意力。Token蒸馏器进一步从高层特征中提取紧凑的全局视觉token,并在多个竞争感知的跨模态融合模块之间共享,以提供层次化的全局上下文。此外,我们引入了空间感知的语义对齐损失,通过对比学习进一步增强前景-背景区分。在Ref-AVS基准上的大量实验表明,PRIMED取得了最先进的整体性能。

英文摘要

Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.

2605.07153 2026-05-11 cs.CL

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

超越推理:强化学习解锁大语言模型中的参数知识

Wanli Yang, Hongyu Zang, Junwei Zhang, Wenjie Shi, Du Su, Jingang Wang, Xueqi Cheng, Fei Sun

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS University of Chinese Academy of Sciences(人工智能安全国家重点实验室,计算技术研究所,中国科学院,中国科学院大学)

AI总结 本文研究强化学习在零样本、单跳、封闭书 QA 任务中提升参数知识直接回忆的能力,发现其平均相对提升达27%,揭示 RL 通过重新分配概率质量解锁潜在知识而非获取新事实。

详情
AI中文摘要

强化学习(RL)在大语言模型(LLM)推理中取得显著成功,但其是否能提升参数知识的直接回忆仍存疑问。本文在无链式思维、零样本、单跳、封闭书 QA 环境中,仅训练二元正确性奖励,并应用事实级训练-测试去重,确保收益反映提升的回忆而非推理或记忆。在三个模型家族和多个事实性 QA 评估基准上,RL 平均相对提升达27%,超越训练和推理时间基线。机制上,RL 主要重新分配现有知识的概率质量,而非获取新事实,将正确答案从低概率尾部移动到可靠的贪婪生成中。数据归因研究揭示最困难的例子最具信息量:那些答案从未出现在128个预RL样本中的例子(仅约18%的训练数据)驱动了83%的收益,因为稀有正确生成仍会在训练中出现并被强化。这些发现将 RL 的作用扩展到推理之外,将其重新定位为解锁而非获取潜在参数知识的工具。

英文摘要

Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training-time and inference-time baselines. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.

2605.07151 2026-05-11 cs.CV cs.AI

DPG-CD: Depth-Prior-Guided Cross-Modal Joint 2D-3D Change Detection

DPG-CD: 基于深度先验的跨模态联合2D-3D变化检测

Luqi Zhang, Zhen Dong, Bisheng Yang

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University(武汉大学测绘遥感信息工程国家重点实验室)

AI总结 本文提出DPG-CD框架,通过引入深度先验和多阶段跨时序跨模态融合,有效解决影像与DSM间模态差异导致的3D变化检测难题,提升2D语义和3D高度变化检测性能。

详情
AI中文摘要

城市空间演变不仅表现为水平扩张,还通过垂直结构变化体现。因此,联合捕捉2D语义变化和3D高度变化对城市形态分析和应急管理至关重要。在实际场景中,获取3D观测常受限于高采集成本和无法支持频繁更新。多时序跨模态输入包括事前数字表面模型(DSM)和事后影像,为高频城市监测、灾害评估和应急响应提供了可行方案。然而,影像和DSM数据存在显著的光谱-几何表示差距,且模态差异可能被误认为实际变化,稳健的变化检测需要有效融合多时序数据的语义和几何特征。本文提出DPG-CD,一种基于深度先验的多时序跨模态融合框架,用于联合2D语义和3D高度变化检测。具体而言,将估计的深度先验引入影像以缓解与DSM的模态差距。随后,门控融合机制选择性地注入来自深度先验的几何线索,同时保留判别性的光谱表示。接着,采用多阶段跨时序跨模态特征融合架构提取变化感知特征。最后,多任务解码器联合预测2D语义变化和3D高度变化,并辅以辅助DSM预测任务以提升结构一致性和高度估计精度。在Hi-BCD和3DCD两个公开数据集以及新数据集NYC-MMCD上的实验表明,DPG-CD在2D和3D变化检测任务上均优于现有最先进方法。

英文摘要

Urban spatial evolution is manifested not only through horizontal expansion but also through vertical structural changes. Consequently, jointly capturing 2D semantic changes and 3D height changes is essential for urban morphology analysis and emergency management. In practical scenarios, collecting 3D observations is often constrained by high acquisition costs and the inability to support frequent updates. The multi-temporal cross-modal input consisting of pre-event Digital Surface Model (DSM) and post-event imagery provides a practical solution for 3D change detection in high-frequency urban monitoring, disaster assessment, and emergency response scenarios. However, this setting remains challenging as imagery and DSM data exhibit significant spectral-geometric representation gaps. Moreover, modality differences may be confused with actual changes, and robust change detection requires effective fusion of semantic and geometric features from multi-temporal data. In this paper, we propose DPG-CD, a depth-prior-guided multi-temporal cross-modal fusion framework for joint 2D semantic and 3D height change detection. Specifically, an estimated depth prior is introduced into the imagery to mitigate the modality gap with DSM. A gated fusion mechanism then selectively injects geometric cues from depth prior while preserving discriminative spectral representations. Subsequently, a multi-stage cross-temporal cross-modal feature fusion architecture is employed to extract change-aware features. Finally, a multi-task decoder jointly predicts 2D semantic changes and 3D height changes, complemented by an auxiliary DSM prediction task to improve structural consistency and height estimation accuracy. Experiments on two public datasets, Hi-BCD and 3DCD, and a new dataset, NYC-MMCD, demonstrate that DPG-CD outperforms state-of-the-art methods on both 2D and 3D change detection tasks.

2605.07149 2026-05-11 cs.CV

Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection

Real-IAD MVN:一种多视角法向量数据集和基准,用于高保真工业异常检测

Wenbing Zhu, Jianing Liang, Linjie Cheng, Yurui Pan, Zhuhao Chen, Qingwang Yan, Yudong Cheng, Jianghui Zhang, Mingmin Chi, Bo Peng

发表机构 * Fudan University(复旦大学) Shanghai Ocean University(上海海洋大学) Donghua University(东华大学) Rongcheer Co., Ltd.(荣驰科技有限公司)

AI总结 本文提出Real-IAD-MVN数据集,通过多视角法向量捕捉微细几何缺陷,验证密集多视角伪3D数据在工业异常检测中的优越性能。

Comments Accepted to CVPR 2025. 15 pages

详情
AI中文摘要

工业异常检测(IAD)对质量控制至关重要,但现有方法难以检测细微几何缺陷。标准2D(RGB)图像对纹理和光照敏感,常遗漏细小几何异常。3D点云能捕捉宏观形状,但通常过于稀疏,无法检测如划痕或凹坑等微缺陷。我们通过升级采集系统,引入Real-IAD-MVN(多视角法向量)大规模工业数据集,捕获五种不同视角的高保真表面法向量图,取代稀疏3D数据。这提供了微细级别的全面几何表示,使之前不可见的侧壁和遮挡缺陷得以显式检测。实验表明,结合密集多视角伪3D(法向量)的数据比使用稀疏3D点云数据检测性能显著更好。为进一步验证数据集并提供强基准,我们引入基于重建的基线方法,学习从图像和法向量图流中提取跨模态统一原型。证明统一原型方法超越现有多模态融合方法,凸显新数据集在推进几何异常检测中的潜力。

英文摘要

Industrial Anomaly Detection (IAD) is critical for quality control, but existing methods struggle with subtle, geometric defects. Standard 2D (RGB) images are sensitive to texture and lighting but often miss fine geometric anomalies. While 3D point clouds capture macro-shape, they are typically too sparse to detect micro-defects like scratches or pits. We address this fundamental data limitation by introducing Real-IAD-MVN (Multi-View Normal), a large-scale industrial dataset. By upgrading our acquisition system, Real-IAD-MVN captures high-fidelity surface normal maps from five distinct viewpoints, replacing sparse 3D data entirely. This provides a comprehensive geometric representation at a micro-detail level, making previously invisible side-wall and occluded defects explicitly detectable. Our experiments, conducted on this new dataset, first provide evidence that incorporating dense, multi-view pseudo-3D (surface normals) yields significantly better detection performance than using sparse 3D point cloud data. To further validate the dataset and provide a strong benchmark, we introduce a baseline method based on reconstruction, which learns to extract cross-modal unified prototypes from the image and normal map streams. We demonstrate that this unified prototype approach surpasses existing state-of-the-art multimodal fusion methods, highlighting the rich potential of our new dataset for advancing geometric anomaly detection.

2605.07148 2026-05-11 cs.CV

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

揭示和塑造视觉-语言模型中3D场景拓扑的潜在表示

Haoming Wang, Wei Gao

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 本文研究了视觉-语言模型是否能构建3D环境的拓扑表示,通过隔离空间子空间并数学塑造潜在表示,证明其与场景3D高斯核图的拉普拉斯特征映射一致,并在空间任务中提升性能。

详情
AI中文摘要

几十年的认知科学表明,人类通过形成认知图来导航环境,认知图被定义为3D空间的 allocentric 且拓扑保持的表示。尽管现代视觉-语言模型(VLMs)从2D egocentric输入中表现出涌现的空间推理能力,但尚不清楚它们是否构建了类似的3D内部表示。本文证明当前VLMs确实具有3D场景的潜在拓扑地图,但该地图被非几何视觉语义(如颜色和形状)严重掩盖。通过跨场景线性特征提取隔离空间子空间,我们提取出一个干净的空间子空间,该子空间因果控制模型的空间输出。我们数学上塑造了该潜在表示,并证明其与场景3D高斯核图的拉普拉斯特征映射一致,在连续极限下收敛到物理3D空间。受此几何识别的启发,我们进一步引入基于狄利克雷能量的数学原理性潜在正则化方法用于VLMs。将此单一项正则化器应用于最小的500步监督VLM微调(SFT)在简单合成数据上,显著提升了现实空间基准性能,在涉及场景拓扑理解的空间任务中,优于标准SFT和竞争基线,最高提升12.1%。源代码可在https://github.com/pittisl/vlm-latent-shaping获取。

英文摘要

Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. Through cross-scene linear feature extraction, we isolate a clean spatial subspace that causally controls the model's spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised VLM fine-tuning (SFT) on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1% in spatial tasks involving scene topology understanding. Source code is available at https://github.com/pittisl/vlm-latent-shaping.
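
To make the Dirichlet-energy idea concrete, here is a minimal NumPy sketch of the standard graph Dirichlet energy on a Gaussian-kernel graph, the quantity that Laplacian eigenmaps minimize. It is not the authors' regularizer: the abstract does not specify how the graph is built or which latents are penalized, so the function names, the dense graph, and the `sigma` bandwidth are all illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_graph(points, sigma=1.0):
    """Dense adjacency W_ij = exp(-||p_i - p_j||^2 / (2 sigma^2)) with a zero diagonal.

    `points` is an (n, 3) array of 3D positions; `sigma` is an assumed bandwidth.
    """
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)
    return weights

def dirichlet_energy(latents, weights):
    """Return 0.5 * sum_ij W_ij ||z_i - z_j||^2, computed as trace(Z^T L Z), L = D - W.

    `latents` is an (n, d) array holding one latent vector per graph node.
    """
    laplacian = np.diag(weights.sum(axis=1)) - weights
    return float(np.trace(latents.T @ laplacian @ latents))
```

The energy is small when strongly connected (spatially close) nodes have similar latent vectors, and it is zero when all latents are identical. A training regularizer would add a weighted version of this term to the fine-tuning loss.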

2605.07146 2026-05-11 cs.CV

UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection

UniV2D:连接视觉修复与语义感知以实现水下显著目标检测

Laibin Chang, Shaodong Wang, Yunke Wang, Xu Zhang, Kui Jiang, Chang Xu, Bo Du

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) School of Computer Science, The University of Sydney(悉尼大学计算机学院) School of Computer Science and Technology, Harbin Institute of Technology(哈尔滨工业大学计算机科学与技术学院)

AI总结 本文提出UniV2D,通过联合优化视觉修复与显著目标检测,解决水下视觉退化问题,提升语义一致性与检测性能。

详情
AI中文摘要

水下显著目标检测(USOD)在海洋视觉任务中至关重要,但因严重的视觉退化(如选择性吸收和介质散射)而极具挑战性。传统方法通常采用“增强后检测”的顺序流程,但将低级视觉修复与高级语义感知分离常导致语义不一致,恢复图像可能不利于检测甚至引入无关噪声。为打破这一顺序瓶颈,我们提出UniV2D,一个统一的视觉到检测网络,通过互惠框架联合优化视觉修复与显著目标检测。与传统方法依赖离散流程或刚性物理先验不同,UniV2D引入语义驱动学习范式:高层显著语义主动引导修复过程,而修复的视觉线索反过来增强显著性感知。具体而言,UniV2D具有层次化的双分支架构。首先,它使用自校准解码器预测初始显著性掩膜,并结合掩码感知的修复模块重建图像内容。随后,使用显著性引导的细化模块,配备跨层调制,以对齐结构保真度与语义一致性。在多个基准上的广泛实验表明,UniV2D在定量和定性评估中均显著优于现有最先进方法,为联合水下感知树立了新标准。

英文摘要

Underwater salient object detection (USOD) plays a vital role in marine vision tasks but remains fundamentally challenging due to severe visual degradation, such as selective absorption and medium scattering. Conventional pipelines typically adopt a sequential "enhance-then-detect" paradigm. However, isolating low-level visual restoration from high-level semantic perception often leads to semantic inconsistency, where the restored images may not be optimal for detection and can even introduce task-irrelevant noise. To break this sequential bottleneck, we propose UniV2D, a Unified Vision-to-Detection Network that jointly optimizes visual restoration and salient object detection within a mutually beneficial framework. Unlike traditional methods that rely on disjointed pipelines or rigid physical priors, UniV2D introduces a semantic-driven learning paradigm: high-level saliency semantics actively guide the restoration process, while the restored visual cues reciprocally enhance saliency perception. Specifically, UniV2D features a hierarchical dual-branch architecture. It first employs a self-calibrated decoder to predict initial saliency masks alongside a mask-aware restoration module to reconstruct image content. Subsequently, a saliency-guided refinement module equipped with cross-level modulation is utilized to align structural fidelity with semantic consistency. Extensive experiments across multiple benchmarks demonstrate that UniV2D significantly outperforms state-of-the-art methods in both quantitative and qualitative evaluations, establishing a new standard for joint underwater perception.

2605.07143 2026-05-11 cs.CV cs.NA cs.RO math.NA

TriP: A Triangle Puzzle Approach to Robust Translation Averaging

TriP:基于三角拼图的鲁棒平移平均方法

Zhekai Fan, Wanze Li, Jinxin Wang, Yunpeng Shi

发表机构 * UC Davis(加州大学戴维斯分校) University of Chicago(芝加哥大学)

AI总结 TriP通过三角几何推断局部相对边尺度,并在对数域同步重叠三角形尺度以恢复一致的边长和相机位置,有效应对对抗性、循环一致等结构性干扰,具有强理论保障和高效计算性能。

详情
AI中文摘要

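平移平均旨在从成对的相对平移方向中恢复相机位置,是全局运动恢复结构(Structure-from-Motion)流程的基础组件。由于方向测量不包含距离信息,该估计问题高度病态,且对受污染的观测极为敏感。本文提出TriP,一个基于三角形的鲁棒平移平均框架。TriP首先根据三角形几何推断局部相对边尺度,然后在对数域中同步重叠三角形的尺度,以恢复全局一致的边长和相机位置。借助跨三角形的高阶一致性,该方法对对抗性、循环一致及其他结构化污染具有鲁棒性。此外,由于对数尺度同步在构造上排除了退化的零尺度解,TriP无需任何额外的防坍缩约束即可避免坍缩问题。这些结构优势使精确位置恢复具备很强的理论保证。在实践方面,TriP可完全并行化,计算高效,并可自然扩展到包含数百万相机的图。在合成和真实数据集上,它均大幅优于此前所有平移平均方法。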

英文摘要

Translation averaging aims to recover camera locations from pairwise relative translation directions and is a fundamental component of global Structure-from-Motion pipelines. The problem is challenging because direction measurements contain no distance information, making the estimation problem highly ill-conditioned and highly sensitive to corrupted observations. In this paper, we propose TriP, a triangle-based framework for robust translation averaging. TriP first infers local relative edge scales from triangle geometry, and then synchronizes the scales of overlapping triangles in the logarithmic domain to recover globally consistent edge lengths and camera locations. By leveraging higher-order consistency across triangles, the proposed method is robust to adversarial, cycle-consistent, and other structured corruptions. In addition, TriP avoids the collapse issue without requiring any extra anti-collapse constraints, since log-scale synchronization excludes the degenerate zero-scale solution by construction. These structural advantages enable a particularly strong theory for exact location recovery. On the practical side, TriP is fully parallelizable, computationally efficient, and naturally scalable to graphs with millions of cameras. Moreover, it outperforms all previous translation averaging methods by a large margin on both synthetic and real datasets.
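
The two stages described above (local edge ratios from triangle geometry, then scale synchronization in the log domain) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes noise-free unit directions, uses the law of sines for the per-triangle ratios, and solves an ordinary least-squares system with a hypothetical gauge constraint; all function names are illustrative.

```python
import numpy as np

def triangle_edge_ratios(d_ij, d_jk, d_ki):
    """Relative lengths of edges (ij, jk, ki) from 3D unit directions, via the law of sines."""
    sin_i = np.linalg.norm(np.cross(d_ij, -d_ki))  # sine of the angle at vertex i
    sin_j = np.linalg.norm(np.cross(d_jk, -d_ij))  # sine of the angle at vertex j
    sin_k = np.linalg.norm(np.cross(d_ki, -d_jk))  # sine of the angle at vertex k
    # Each edge length is proportional to the sine of the opposite angle:
    # ij is opposite k, jk is opposite i, ki is opposite j.
    ratios = np.array([sin_k, sin_i, sin_j])
    return ratios / ratios.sum()

def synchronize_log_scales(triangles, ratios, num_edges):
    """Recover edge lengths (up to one global scale) from per-triangle ratios.

    triangles: list of 3-tuples of edge indices, ordered like the ratios.
    ratios:    list of length-3 arrays from `triangle_edge_ratios`.
    Unknowns are log(length_e) for each edge and log(s_t) for each triangle,
    tied together by log(length_e) - log(s_t) = log(ratio_te).
    """
    num_tri = len(triangles)
    rows, rhs = [], []
    for t, (edges, tri_ratios) in enumerate(zip(triangles, ratios)):
        for e, r in zip(edges, tri_ratios):
            row = np.zeros(num_edges + num_tri)
            row[e] = 1.0
            row[num_edges + t] = -1.0
            rows.append(row)
            rhs.append(np.log(r))
    # Gauge constraint (an assumption of this sketch): log-lengths sum to zero,
    # which fixes the global scale so the geometric mean of the lengths is 1.
    gauge = np.zeros(num_edges + num_tri)
    gauge[:num_edges] = 1.0
    rows.append(gauge)
    rhs.append(0.0)
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return np.exp(solution[:num_edges])
```

Because the unknowns are logarithms, every recovered length is strictly positive. This is one way to see why a log-domain formulation rules out the degenerate zero-scale solution mentioned in the abstract.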

2605.07142 2026-05-11 cs.CV

AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification

AGA3DNet:基于解剖的高斯先验与多视角xLSTM用于3D脑MRI亚型分类

Peiyu Duan, Xueqi Guo, Sepehr Farhand, Mehmet Berk Sahin, Xinyuan Zheng, James S. Duncan, Gerardo Hermosillo Valadez, Yoshihisa Shinagawa

发表机构 * Yale University(耶鲁大学) Siemens Healthineers(西门子医疗) Purdue University(普渡大学)

AI总结 AGA3DNet结合解剖短语和轻量3D CNN与多视角xLSTM,通过高斯加权提供可解释的解剖引导,提升3D脑MRI亚型分类性能。

Comments CVPR CV4CLINIC 2026

详情
AI中文摘要

准确的3D脑MRI亚型分类受益于局部解剖线索和长程上下文推理。我们提出AGA3DNet,一种以放射科报告为依据的框架,将从报告中提取的简短解剖短语作为软解剖先验通道,并将其与轻量3D CNN和多视角xLSTM聚合相融合。具体而言,提取的解剖短语被映射到图谱定义的区域,并通过符号距离变换和随后的高斯加权转换为平滑的空间先验,在无需密集体素标注的情况下提供可解释、有解剖依据的引导。我们在一个回顾性机构脑MRI队列上评估AGA3DNet的异常亚型判别能力,并与可复现的3D分类基线进行比较。AGA3DNet在各项性能指标上取得了更好的整体平衡,并通过先验通道支持临床可解释的定位。我们讨论了单一队列评估带来的局限,以及缺乏在广泛可用条款下与放射科报告配对的大规模公共脑MRI数据集的问题。

英文摘要

Accurate 3D brain MRI subtype classification benefits from both localized anatomical cues and long-range contextual reasoning. We present AGA3DNet, a report-grounded framework that incorporates brief anatomical phrases extracted from radiology reports as a soft anatomical prior channel and fuses it with a lightweight 3D CNN and multi-view xLSTM aggregation. Specifically, extracted anatomical phrases are mapped to atlas-defined regions and converted into smooth spatial priors using a signed-distance transform followed by Gaussian weighting, providing interpretable, anatomy-grounded guidance without requiring dense voxel annotations. We evaluate AGA3DNet on a retrospective institutional brain MRI cohort for abnormal subtype discrimination and compare against reproducible 3D classification baselines. AGA3DNet achieves improved overall balance across performance metrics and supports clinically interpretable localization through the prior channel. We discuss limitations related to single-cohort evaluation and the lack of large-scale public brain MRI datasets paired with radiology reports under broadly usable terms.
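
The conversion from an atlas region to a smooth spatial prior (signed-distance transform followed by Gaussian weighting) can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: the brute-force distance computation only suits tiny volumes, the prior is assumed to be 1 inside the region and to decay with the outside distance, and `sigma` and the function names are hypothetical.

```python
import numpy as np

def signed_distance(mask):
    """Brute-force signed Euclidean distance (in voxels) to a boolean region mask.

    Positive outside the region, negative inside. Requires a mask that is
    neither empty nor full. Only practical for small volumes.
    """
    coords = np.argwhere(np.ones(mask.shape, dtype=bool)).astype(float)
    inside = np.argwhere(mask).astype(float)
    outside = np.argwhere(~mask).astype(float)

    def nearest(targets):
        # Distance from every voxel to its closest voxel in `targets`.
        diff = coords[:, None, :] - targets[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1).reshape(mask.shape)

    return nearest(inside) - nearest(outside)

def gaussian_prior(mask, sigma=2.0):
    """Soft prior channel: 1 inside the region, Gaussian fall-off with outside distance."""
    outside_dist = np.clip(signed_distance(mask), 0.0, None)
    return np.exp(-(outside_dist ** 2) / (2.0 * sigma ** 2))
```

For realistic volume sizes, a dedicated Euclidean distance transform such as SciPy's `scipy.ndimage.distance_transform_edt` would replace the brute-force search. The resulting array could then be stacked with the MRI volume as an extra input channel.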

2605.07141 2026-05-11 cs.CV cs.AI

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

Yuan Yao, Qiushi Yang, Humen Zhong, Jiangning Wei, Yifang Men, Shuai Bai, Miaomiao Cui, Zhibo Yang

Affiliations Tongyi Lab, Alibaba Group

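AI summary Qwen3-VL-Seg performs open-world referring segmentation with a vision-language grounding framework, using a lightweight box-guided mask decoder with multi-scale spatial feature injection and iterative mask-aware query refinement for parameter-efficient, precise pixel-level segmentation.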

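Details
AI abstract

Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) are strong at open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates, which is insufficient for dense visual prediction. Recent MLLM-based segmentation methods either predict sparse contour coordinates directly, which makes continuous object boundaries hard to reconstruct, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), which adds significant architectural and deployment overhead. We propose Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. Its core is a lightweight box-guided mask decoder that combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, adding only 17M parameters (about 0.4% of the base model). To enable scalable open-world training, we build SA1B-ORS, a dataset derived from SA-1B with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs well in both closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly retains its multimodal capabilities after segmentation-oriented adaptation.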

English abstract

Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.

2605.07140 2026-05-11 cs.CV cs.AI

Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition

Talha Ilyas, Deval Mehta, Zongyuan Ge

Affiliations Department of ECSE, Faculty of Engineering, Monash University, Australia; AIM for Health Lab, Faculty of Information Technology, Monash University, Australia; Department of DSAI, Faculty of Information Technology, Monash University, Australia

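AI summary This paper proposes a neurosymbolic framework that recasts skeleton-based action recognition as concept-driven first-order logical reasoning, combining representation learning with symbolic inference to improve interpretability.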

Comments Accepted in Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

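Details
AI abstract

This paper proposes a neurosymbolic framework for skeleton-based human action recognition that recasts recognition as concept-driven first-order logical reasoning, combining representation learning with symbolic inference to make action recognition more interpretable.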

English abstract

Skeleton-based human activity recognition (HAR) has achieved strong empirical performance, yet most existing models remain black boxes that are difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton-based HAR that reframes action recognition as concept-driven first-order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first-order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio-temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio-temporal concept decoder that explicitly separates pose-centric and dynamics-centric abstractions. These concept predicates are composed through differentiable first-order logic layers, enabling the model to learn human-readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM-derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW-UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio-temporal action understanding. Code: https://github.com/Mr-TalhaIlyas/REASON
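To give a feel for how concept predicates might be composed by differentiable logic, here is a minimal sketch using product fuzzy logic, a common stand-in and not necessarily what the paper's logic layers do. The concept names (`hand_near_mouth`, `arm_raised`, `legs_moving`), the per-frame scores, and the "drinking" rule are all hypothetical; only the general idea of soft AND/OR/NOT over predicate scores in [0, 1] is illustrated.

```python
from math import prod

def f_not(a):
    """Soft negation of a truth value in [0, 1]."""
    return 1.0 - a

def f_and(*xs):
    """Soft conjunction (product t-norm)."""
    return prod(xs)

def f_or(*xs):
    """Soft disjunction (probabilistic sum)."""
    return 1.0 - prod(1.0 - x for x in xs)

def exists(values):
    """Soft existential quantifier: true if the predicate holds for some element."""
    return f_or(*values)

# Hypothetical concept scores per frame, as a concept decoder might emit them.
frames = [
    {"hand_near_mouth": 0.1, "arm_raised": 0.2, "legs_moving": 0.1},
    {"hand_near_mouth": 0.9, "arm_raised": 0.8, "legs_moving": 0.1},
]

def drinking(frames):
    """Hypothetical rule: some frame has the hand near the mouth,
    the arm raised, and the legs not moving."""
    per_frame = [
        f_and(f["hand_near_mouth"], f["arm_raised"], f_not(f["legs_moving"]))
        for f in frames
    ]
    return exists(per_frame)

score = drinking(frames)
```

Because every operator is built from products and complements, the rule score is a smooth function of the concept scores, so the same composition written in an autodiff framework could pass gradients back to the encoder.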

2605.07139 2026-05-11 cs.CL cs.AI cs.LG

Structural Rationale Distillation via Reasoning Space Compression

Jialin Yang, Jiankun Wang, Jiajun Wu, Henry Leung, Jiayu Zhou, Steve Drew

Affiliations University of Calgary; University of Michigan

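AI summary This paper proposes D-RPC, which compresses the reasoning space to constrain the teacher model to generate consistent yet diverse rationales, improving student performance on math and commonsense reasoning tasks over several baselines.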

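Details
AI abstract

When distilling reasoning from large language models (LLMs) into smaller models, the teacher's rationales for similar problems often differ markedly in structure and strategy. This paper proposes Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet cover different problem types. A PAC-Bayes analysis formalizes the trade-off between bank size and coverage: smaller banks reduce supervision entropy but can leave coverage gaps, and the generalization bound identifies an optimal intermediate size, which ablations confirm. On five math and commonsense reasoning benchmarks with two student models, D-RPC outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines while using fewer tokens than template-heavy alternatives.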

English abstract

When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.
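The retrieve-then-condition step can be sketched as follows. The abstract does not say how relevance is scored or how the teacher is prompted, so this sketch assumes a bag-of-words cosine similarity and a plain text prompt. The bank contents, `PATH_BANK`, `retrieve_path`, and `teacher_prompt` are hypothetical names introduced for illustration only.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words token counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical bank of reusable high-level reasoning paths.
PATH_BANK = {
    "rate_problem": "identify rate and time then multiply to get total distance or work",
    "percent_change": "find the difference then divide by the original value to get percent change",
    "commonsense_cause": "list plausible causes then eliminate those inconsistent with the context",
}

def retrieve_path(question, bank):
    """Return the name of the bank entry most similar to the question."""
    q = bow(question)
    return max(bank, key=lambda name: cosine(q, bow(bank[name])))

def teacher_prompt(question, bank):
    """Build a prompt that conditions the teacher on the retrieved path."""
    name = retrieve_path(question, bank)
    return f"Follow this reasoning path: {bank[name]}\nQuestion: {question}"

question = "a car travels at a constant rate for some time what total distance does it cover"
chosen = retrieve_path(question, PATH_BANK)
prompt = teacher_prompt(question, PATH_BANK)
```

A smaller bank makes questions share paths more often (more consistent supervision), while a larger one covers more problem types; this toy bank is fixed, whereas the abstract describes one that is maintained dynamically.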