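arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.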

2510.22067 2026-06-01 cs.CV

Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas

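Affiliations: AWS AI Labs

AI summary: Proposes GIFT, which pre-computes a visual saliency map by tracking gaze shifts and, at decoding time, amplifies attention to salient visual information and the user query to mitigate hallucination in vision-language models.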

Comments ICML 2026

Abstract

Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
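
The "gaze shift" idea in the abstract can be illustrated with a toy sketch. This is only one possible reading of the abstract, not the authors' implementation: the function name `gaze_shift_saliency`, the input layout (one attention vector over visual tokens per query-reading step), and the sum-then-normalize aggregation are all assumptions made for illustration. The later step of amplifying attention during decoding is not shown.

```python
def gaze_shift_saliency(attn_steps):
    """Toy saliency map from positive attention changes ("gaze shifts").

    attn_steps[t][i] is the (hypothetical) attention weight on visual
    token i while the model reads the t-th part of the user query.
    """
    num_tokens = len(attn_steps[0])
    saliency = [0.0] * num_tokens
    # Compare each step with the previous one and keep only increases.
    for prev, cur in zip(attn_steps, attn_steps[1:]):
        for i in range(num_tokens):
            saliency[i] += max(0.0, cur[i] - prev[i])
    total = sum(saliency)
    # Normalize so the map sums to 1 (left as zeros if nothing increased).
    return [s / total for s in saliency] if total > 0 else saliency


# Token 0 behaves like an attention sink: high but constant attention.
# Token 2 gains attention as the query is read.
example = [
    [0.5, 0.3, 0.2],
    [0.5, 0.1, 0.4],
    [0.5, 0.0, 0.5],
]
print(gaze_shift_saliency(example))  # [0.0, 0.0, 1.0]
```

In this toy case the sink-like token receives zero saliency because its attention never changes, which matches the abstract's remark that irrelevant tokens exhibit minimal shifts.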

2511.04393 2026-06-01 cs.AI

Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang

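Affiliations: Massachusetts Institute of Technology; University of Maryland, College Park

AI summary: Proposes Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), which post-trains LLMs by repeatedly distilling low-regret decision trajectories, improving performance on online decision-making tasks without relying on known algorithms or hand-crafted templates.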

Comments Camera ready version of ICML 2026

Abstract

Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach leverages the regret metric to elicit the model's own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs' DM performance across diverse models - from Transformers with numerical input/output, to open-weight LLMs, and advanced closed-weight models like GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs' decision-making capabilities.
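
The rollout-and-select step described in the abstract can be sketched on a toy Bernoulli bandit. This is a minimal illustration under stated assumptions, not the paper's code: an epsilon-greedy policy stands in for the LLM, pseudo-regret is computed from the known arm means, and the names `rollout`, `select_k_lowest_regret`, and `one_iteration` are hypothetical. The fine-tuning step itself is omitted.

```python
import random


def rollout(means, horizon, eps, rng):
    """One trajectory of an epsilon-greedy stand-in policy.

    Returns (actions, pseudo_regret), where pseudo-regret sums the gap
    between the best arm's mean and the chosen arm's mean.
    """
    n = len(means)
    counts, sums = [0] * n, [0.0] * n
    best = max(means)
    actions, regret = [], 0.0
    for _ in range(horizon):
        if rng.random() < eps or min(counts) == 0:
            a = rng.randrange(n)  # explore (also until every arm was tried)
        else:
            a = max(range(n), key=lambda i: sums[i] / counts[i])  # exploit
        reward = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        actions.append(a)
        regret += best - means[a]
    return actions, regret


def select_k_lowest_regret(trajectories, k):
    """Keep the k trajectories with the lowest regret."""
    return sorted(trajectories, key=lambda traj: traj[1])[:k]


def one_iteration(means, horizon=50, num_rollouts=16, k=4, seed=0):
    rng = random.Random(seed)
    trajs = [rollout(means, horizon, 0.3, rng) for _ in range(num_rollouts)]
    kept = select_k_lowest_regret(trajs, k)
    # In the procedure described above, the model would now be
    # fine-tuned on `kept`; that step is not shown here.
    return trajs, kept


trajs, kept = one_iteration([0.2, 0.5, 0.8])
print(len(trajs), len(kept))  # 16 4
```

Repeating `one_iteration` with the updated model in place of the stand-in policy would give the iterative loop the abstract describes.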

2511.03100 2026-06-01 cs.LG cs.AI cs.MA

Scaling Multi-Agent Environment Co-Design with Diffusion Models

Hao Xiang Li, Michael Amir, Amanda Prorok

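Affiliations: Department of Computer Science, University of Cambridge, Cambridge, United Kingdom

AI summary: Proposes the Diffusion Co-Design (DiCoDe) framework, which uses Projected Universal Guidance (PUG) and a critic distillation mechanism to achieve scalable, sample-efficient agent-environment co-optimization in high-dimensional environment design spaces.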

Abstract

The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to windfarm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale. They collapse under high-dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework pushing co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state-of-the-art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design, and is a stepping stone towards reaping the rewards of co-design in real world domains.

2510.15710 2026-06-01 cs.CV

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Zhongying Deng, Lihao Liu, Ming Hu, Junjun He

Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; Shanghai Jiao Tong University; Shanghai Institute of Optics; Fudan University; University of Cambridge; Monash University; DAMO Academy, Alibaba Group; Imperial College London; The University of Hong Kong; The Hong Kong University of Science; Hupan Lab; The Chinese University of Hong Kong

AI summary: Proposes UniMedVL, the first unified medical model, which combines multimodal understanding and generation through a progressive training pipeline and is trained on UniMedVL-5M, a dataset of over 5.6M instances across 8 medical imaging modalities.

Comments This submission has been converted to the ICML template

Abstract

Medical workflows routinely combine reading images with producing visual and textual outputs, making both image understanding and generation central to medical AI. Most existing systems, however, address these abilities in isolated models, losing the shared knowledge that a unified architecture could exploit. To bridge this gap, we present UniMedVL, the first unified medical model that seamlessly integrates multimodal understanding and generation capabilities within a single model without switching weights. We achieve this via a tailored progressive training pipeline where understanding and generation mutually reinforce each other. To effectively train UniMedVL, we curate UniMedVL-5M, the first large-scale medical dataset comprising over 5.6M instances across 8 medical imaging modalities, tailored for multimodal input-output tasks in unified medical understanding and generation. Experimental results demonstrate that UniMedVL achieves competitive performance on five medical understanding benchmarks. Crucially, UniMedVL natively supports diverse interleaved generation tasks, e.g., virtual staining, super-resolution, cross-modal synthesis, essential for complex medical workflows. Our code and dataset are publicly available.

2503.22996 2026-06-01 cs.CL

Rethinking Sparse Mixture of Experts from a Unified Perspective

Giang Do, Hung Le, Truyen Tran

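Affiliations: Applied Artificial Intelligence Initiative (A2I2), Deakin University, Victoria, Australia

AI summary: Addresses the problem that fixed budgets in sparse mixture-of-experts models lead to irrelevant selections or missed critical assignments, proposing USMoE, a unified framework grounded in linear programming whose unified mechanism and unified score enable flexible expert selection, improving performance and reducing inference cost.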

Comments 35 pages

Journal ref ICML 2026

Abstract

Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. SMoE methods fall into two categories: Token Choice, which routes each token to a fixed number of experts, and Expert Choice, which assigns a fixed number of tokens to each expert. However, the use of fixed budgets for tokens or experts causes both approaches to select irrelevant token-expert pairs or overlook critical assignments, which degrades overall performance. To fill that gap, we rethink SMoE from a unified perspective through the lens of linear programming, which provides a general formulation for SMoE models. Furthermore, we introduce Unified Sparse Mixture of Experts (USMoE), a novel framework comprising a unified mechanism and a unified score to overcome these limitations. We provide both theoretical justification and empirical evidence demonstrating USMoE's effectiveness. Extensive evaluations across diverse data settings (clean and corrupted), multiple domains (including texts and vision tasks), and different learning approaches (training-free and training-based) show that USMoE not only delivers significant performance improvements over existing SMoE methods, but also enables more flexible expert selection budgets, reducing inference costs without compromising model performance. Our implementation is publicly available at https://github.com/giangdip2410/USMoE.
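
The two routing schemes contrasted in the abstract can be compared on a toy score matrix. The sketch below covers only the two baselines as the abstract defines them, not USMoE itself; the function names and the example scores are assumptions chosen for illustration.

```python
def token_choice(scores, k):
    """Each token (row) is routed to its k highest-scoring experts."""
    pairs = set()
    for t, row in enumerate(scores):
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        pairs.update((t, e) for e in top)
    return pairs


def expert_choice(scores, capacity):
    """Each expert (column) picks its `capacity` highest-scoring tokens."""
    pairs = set()
    for e in range(len(scores[0])):
        top = sorted(range(len(scores)), key=lambda t: scores[t][e],
                     reverse=True)[:capacity]
        pairs.update((t, e) for t in top)
    return pairs


# scores[t][e]: affinity between token t and expert e (made-up values).
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]

print(sorted(token_choice(scores, k=1)))
# [(0, 0), (1, 0), (2, 0), (3, 1)]
print(sorted(expert_choice(scores, capacity=2)))
# [(0, 0), (1, 0), (2, 1), (3, 1)]
```

With a fixed capacity of 2, Expert Choice sends token 2 to expert 1 (score 0.3) and drops its stronger match with expert 0 (score 0.7), a small instance of the irrelevant-pair and missed-assignment problem the abstract attributes to fixed budgets.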

2510.17111 2026-06-01 cs.RO cs.LG

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

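Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; AiRiA; Nanjing University of Information Science and Technology

AI summary: Systematically surveys methods for reducing the latency, memory footprint, and compute cost of vision-language-action models along four dimensions: model architecture, perception feature, action generation, and training/inference strategies.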

Abstract

Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.

2506.22304 2026-06-01 cs.LG cs.CV

Unfolding Generative Flows with Koopman Operators: Trajectory-Preserving Linearization

Erkan Turan, Aristotelis Siozopoulos, Louis Martinez, Julien Gaubil, Emery Pierson, Maks Ovsjanikov

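Affiliations: University of Athens, Greece

AI summary: Proposes a global linearization method based on Koopman theory that lifts a pre-trained Conditional Flow Matching model into a higher-dimensional Koopman space, achieving trajectory-preserving linearization that supports one-step parallel sampling and spectral analysis of generative trajectories.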

Abstract

Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by their iterative nature requiring costly sampling and lacking interpretability of the intermediate states. Recent approaches accelerate sampling by straightening trajectories or distilling endpoints, yet they treat the original generative process as a black box, discarding the teacher's intermediate dynamics. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory to achieve trajectory-preserving linearization. By lifting a pre-trained Conditional Flow Matching (CFM) model into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. Crucially, unlike boundary-only distillation, our method enforces infinitesimal consistency with the teacher's vector field along the full generative path. We derive a practical, simulation-free training objective that ensures this global alignment and yields two key benefits. First, sampling becomes one-step and parallelizable. Second, because the linearization is faithful to the dynamics, the Koopman operator provides unique insights on the generation. We demonstrate that this structure enables novel applications unavailable in prior approaches, including discovery of semantically coherent editing directions, inversion with a teacher-aligned linear operator and class-conditional spectral signatures. Empirically, our approach achieves competitive sample quality, while enabling spectral analysis and control of the entire trajectories of generative flows.

2510.17700 2026-06-01 cs.CV

Elastic ViTs from Pretrained Models without Retraining

Walter Simoncini, Michael Dorkenwald, Tijmen Blankevoort, Cees G. M. Snoek, Yuki M. Asano

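Affiliations: University of Technology Nuremberg; University of Amsterdam; NVIDIA

AI summary: Proposes SnapViT, which combines gradient information with cross-network structure correlations approximated by an evolutionary algorithm to perform retraining-free structured pruning, enabling elastic inference across a continuum of compute budgets.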

Comments Accepted at NeurIPS 2025

Abstract

Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/

2510.16138 2026-06-01 cs.LG stat.ML

Expert Merging in Sparse Mixture of Experts with Nash Bargaining

Dung V. Nguyen, Anh T. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Shiqi Jiang, Ethan Fetaya, Linh Duy Tran, Gal Chechik, Tan M. Nguyen

Affiliations: Department of Mathematics, National University of Singapore; Viettel AI, Viettel Group; Faculty of Mathematics and Informatics, Hanoi University of Science and Technology; Bar Ilan University, Israel; AI Imaging Team, Data Solution Department, FPT Software Japan

AI summary: Addresses the lack of a principled weighting mechanism in expert merging for sparse mixture-of-experts models by proposing NAMEx, a framework based on Nash bargaining that enables more balanced and efficient collaboration among experts and outperforms existing methods on multiple tasks.

Comments 10 pages in the main text. ICLR 2026 Poster

Abstract

Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, we reinterpret expert merging through the lens of game theory, revealing cooperative and competitive dynamics among experts. Based on this perspective, we introduce Nash Merging of Experts (NAMEx), a novel framework that incorporates Nash Bargaining into the merging process, enabling more balanced and efficient collaboration among experts. Additionally, we incorporate complex momentum into NAMEx to accelerate expert propagation with theoretical guarantees for convergence. Extensive experiments across language modelling, text classification, image classification, and zero-shot robustness under data corruption show that NAMEx consistently outperforms competing methods while integrating seamlessly with popular MoE architectures. Finally, we demonstrate NAMEx's scalability by applying it to large-scale systems, including Qwen1.5-MoE (14B) and DeepSeek-MoE (16B), where it proves effective in both zero-shot and fine-tuning settings. The code is publicly available at: https://github.com/anh147/NAMEx.

2503.05846 2026-06-01 cs.CL cs.AI

EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

Hamin Koo, Jaehyung Kim

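Affiliations: Yonsei University

AI summary: Proposes the EMCEE framework, which extracts language-specific knowledge from the LLM itself and merges it with reasoning-oriented outputs, improving multilingual task performance, with an average relative improvement of 31.7% in low-resource languages.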

Comments ACL 2026 Main

Abstract

Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. While existing multilingual prompting methods emphasize reformulating queries into English or enhancing reasoning capabilities, they often fail to incorporate the language- and culture-specific grounding that is essential for some queries. To address this limitation, we propose EMCEE (Extracting synthetic Multilingual Context and merging), a simple yet effective framework that enhances the multilingual capabilities of LLMs by explicitly extracting and utilizing query-relevant knowledge from the LLM itself. In particular, EMCEE first extracts synthetic context to uncover latent, language-specific knowledge encoded within the LLM, and then dynamically merges this contextual insight with reasoning-oriented outputs through a judgment-based selection mechanism. Extensive experiments on four multilingual benchmarks covering diverse languages and tasks demonstrate that EMCEE consistently outperforms prior approaches, achieving an average relative improvement of 16.4% overall and 31.7% in low-resource languages.

2510.11683 2026-06-01 cs.LG cs.AI cs.CL

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

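Affiliations: Tsinghua University

AI summary: To address the large memory overhead of reinforcement learning for diffusion large language models caused by intractable likelihoods, proposes Boundary-Guided Policy Optimization (BGPO), which constructs a lower bound satisfying linearity and equivalence to enable memory-efficient training and significantly outperforms existing methods on math problem solving, code generation, and planning tasks.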

Abstract

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation during training. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, they incur significant memory overhead due to the need to retain all MC samples for the gradient computation of non-linear terms in the RL objective, and thus restrict feasible sample sizes, leading to imprecise likelihood approximations and distorted RL objective. To address this, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, improving likelihood approximations and RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks. Our codes and models are available at https://github.com/THU-KEG/BGPO.

2510.11711 2026-06-01 cs.LG stat.ML

Reinforced sequential Monte Carlo for amortised sampling

Sanghyeok Choi, Sarthak Mittal, Víctor Elvira, Jinkyoo Park, Esmeralda S. Whitammer

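Affiliations: University of Edinburgh; Mila - Québec AI Institute; CIFAR Fellow

AI summary: Proposes a sampling framework that combines amortised and particle-based methods, training sequential Monte Carlo samplers with maximum-entropy reinforcement learning and using off-policy learning to explore the target distribution more efficiently; improved approximation accuracy and training stability are shown on synthetic multi-modal targets and the Boltzmann distribution of alanine dipeptide conformations.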

Comments ICML 2026. Code: https://github.com/hyeok9855/ReinforcedSMC

Abstract

This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler that uses samples from SMC -- using the learnt sampler as a proposal -- as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution as well as training stability compared to both amortised and Monte Carlo methods.

2510.09364 2026-06-01 cs.CV

VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

Yikang Zhang, Rui Fan

Affiliations: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; College of Electronic and Information Engineering, Tongji University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University

AI summary: Proposes the VAD-GS framework, which recovers missing geometry in dynamic urban scenes through voxel-based visibility reasoning, diversity-aware view selection, and multi-view stereo reconstruction, improving the reconstruction quality of 3D Gaussian splatting.

Abstract

3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing geometry from missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Our project webpage is at mias.group/VAD-GS.

2501.12500 2026-06-01 cs.LG stat.ME

Learning General Causal Structures with Hidden Dynamic Process for Climate Analysis

Minghao Fu, Biwei Huang, Zijian Li, Yujia Zheng, Ignavier Ng, Guangyi Chen, Yingyao Hu, Kun Zhang

Affiliations: Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Carnegie Mellon University, Pittsburgh, PA, USA; University of California San Diego, La Jolla, CA, USA; Johns Hopkins University, Baltimore, MD, USA

AI summary: Proposes CaDRe, a unified framework that jointly discovers causal relations among observed variables and the hidden dynamic process, is identifiable in the nonparametric setting, and is validated for effectiveness and interpretability on climate data.

Comments Accepted by ICML 2026

Abstract

Understanding climate dynamics requires going beyond correlations in observational data to uncover the underlying causal process. Latent drivers such as atmospheric processes play a central role in temporal dynamics, while direct causal influences also exist among geographically proximate observed variables. Traditional Causal Representation Learning (CRL) typically focuses on latent factors but overlooks such observable-to-observable causal relations, which limits its applicability to climate analysis. In this paper, we introduce a unified framework that jointly uncovers (i) causal relations among observed variables and (ii) latent driving forces together with their interactions. We establish conditions under which both the hidden dynamic process and the causal structure among observed variables are simultaneously identifiable from time-series data, and our guarantees continue to hold in the nonparametric setting through contextual information that recovers latent variables and causal relations. Building on these insights, we propose CaDRe (Causal Discovery and Representation learning), a time-series generative model with structural constraints that integrates CRL and causal discovery. Experiments on synthetic datasets validate our theoretical results. On real-world climate datasets, CaDRe delivers competitive forecasting accuracy and recovers visualized causal graphs aligned with domain expertise, thereby offering interpretable insights into climate systems. Code is available at https://github.com/MinghaoFu/CaDRe.

2510.07135 2026-06-01 cs.CV

Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

遥感视觉语言模型的少样本适应基准

Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq

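Affiliations: UCLouvain; UMons; Fonds de la Recherche Scientifique

AI summary: Presents the first few-shot adaptation benchmark for remote sensing vision-language models, evaluating three models with five strategies on ten datasets; models with similar zero-shot performance differ markedly under few-shot adaptation, which calls for more robust methods.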

AI中文摘要

遥感视觉语言模型（RSVLMs）得益于大规模预训练，在各种任务上展现出强大的零样本性能。然而，它们在低数据场景（如少样本学习）中的泛化能力尚未得到充分探索。在这项工作中，我们提出了第一个用于评估RSVLMs少样本适应方法的结构化基准。我们在十个遥感场景分类数据集上进行了全面实验，将五种广泛使用的少样本适应策略应用于三个具有不同骨干网络的最先进RSVLMs。我们的发现表明，零样本性能相似的模型在少样本适应下可能表现出显著不同的行为，一些RSVLMs天生比其他模型更适合这种适应。性能的变异性以及现有方法中缺乏明确的优胜者，凸显了为遥感定制更鲁棒的少样本适应方法的必要性。为了促进未来研究，我们提供了一个可复现的基准框架和开源代码，以系统评估RSVLMs在少样本条件下的表现。源代码已在Github上公开：https://github.com/elkhouryk/fewshot_RSVLMs

英文摘要

Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs

2505.17607 2026-06-01 cs.AI cs.CL

Symbolic Intermediaries as a Linguistic-Numerical Interface for LLM-Driven Geometric Reasoning

符号中介作为LLM驱动几何推理的语言-数值接口

João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne; Honda Research Institute Europe; Department of Computer Science, University of Manchester; National Biomarker Centre, CRUK-MI, University of Manchester

AI summary: Proposes symbolic intermediaries as an interface between the numerical outputs of physics simulators and language-model reasoning, using symbolic regression to turn continuous values into symbolic expressions and improving geometric reasoning within a collaborative refinement loop.

Comments 33 pages, 18 figures

AI中文摘要

大型语言模型（LLM）在语言和符号对象上展示出推理能力，但直接解释物理模拟器的连续数值输出（例如距离、曲率和轨迹）的能力有限，这些输出难以进行离散分词。在从机构设计到运动规划等空间基础的工程推理任务中，这定义了一个根本性的差距，限制了LLM在更广泛几何领域（例如与物理模拟器接口）的应用。我们提出符号中介，即通过符号回归发现的紧凑解析表达式，作为一种结构化接口，将模拟器的数值轨迹转换为符号形式，语言模型可以解释、比较和批评，同时保留原始几何语义。围绕这个接口，我们构建了一个智能体协调与优化循环：设计智能体将自然语言规范映射为可执行模拟代码，批评智能体基于共享符号词汇进行推理，修订步骤将此反馈转化为基于基础的优化决策，从而实现无需参数更新的推理时泛化。在平面机构综合的MSynth基准上，所有三个评估的LLM智能体比预算匹配的遗传算法基线高出19-53%（带反馈时中位误差降低高达63%），对三种模型架构的批评条目分析表明，该接口将推理从通用结构评论转向基于基础的几何验证。将连续模拟输出转换为符号形式的原理可推广到任何需要以语言方式解释模拟器行为的领域。

英文摘要

Large Language Models (LLMs) display reasoning capabilities over linguistic and symbolic objects but have limited capabilities to directly interpret the continuous numerical outputs of physics simulators, e.g., distances, curvatures, and trajectories that resist discrete tokenisation. Across spatially grounded engineering reasoning tasks, from mechanism design to motion planning, this defines a fundamental gap, which limits the wider application of LLMs within broader geometrical domains, for example interfacing with physics simulators. We propose symbolic intermediaries, compact analytical expressions discovered via symbolic regression, as a structured interface that translates a simulator's numerical traces into a symbolic form, which language models can interpret, compare, and critique while preserving the original geometric semantics. Around this interface we build an agentic coordination-and-refinement loop: a design agent maps natural-language specifications to executable simulation code, a critique agent reasons over the shared symbolic vocabulary, and a revision step turns this feedback into grounded refinement decisions, enabling inference-time generalization without parameter updates. On the MSynth benchmark for planar mechanism synthesis, all three evaluated LLM agents outperform a budget-matched genetic-algorithm baseline by 19-53% (up to 63% lower median error with feedback), and analysis of the critique entries across three model architectures shows that the interface shifts reasoning from generic structural commentary to grounded geometric verification. The principle of translating continuous simulation outputs into symbolic forms generalises to any domain where simulator behaviour must be interpreted linguistically.

2510.03876 2026-06-01 cs.CV

Skin Lesion Classification Based on ResNet-50 Enhanced With Adaptive Spatial Feature Fusion

基于自适应空间特征融合增强的ResNet-50皮肤病变分类

Runhao Liu, Fengyi Zha, Fei Ding, Guangzhen Yao, Peng Zhang

Affiliations: Polytechnic Institute, Zhejiang University, Hangzhou, China; Chu Kochen Honors College, Zhejiang University, Hangzhou, China; Alibaba Group, Chaoyang District, Beijing, China; School of Information Science and Technology, Northeast Normal University, Changchun, China; School of Mathematical Sciences, Zhejiang University, Hangzhou, China

AI summary: Proposes an improved ResNet-50 with Adaptive Spatial Feature Fusion (ASFF) whose dual-branch structure fuses multi-scale semantic and detail features, reaching 93.182% accuracy on an ISIC 2020 subset and generalizing to ISIC 2019 external validation.

AI中文摘要

皮肤癌分类因皮肤镜图像中类间相似度高、类内变异大以及伪影的存在而具有挑战性。为解决这些问题，我们提出了一种改进的ResNet-50模型，结合自适应空间特征融合（ASFF），该机制自适应地整合多尺度语义和表面特征以细化表示并减少过拟合。ResNet-50模型通过自适应特征融合机制增强，以实现更有效的多尺度特征提取并提升整体性能。具体而言，双分支设计融合了高层语义特征和中间层细节特征，利用全局平均池化和全连接层生成空间权重，并强调病变相关区域。在ISIC 2020平衡子集（从原始数据集中随机选取的3,297张图像）上评估，基于ASFF的ResNet-50优于多个CNN基线，达到93.182%的准确率，并具有优越的精确率、召回率、特异性和F1分数。其AUC（P-R）达到0.9670，AUC（ROC）达到0.9717。Grad-CAM可视化显示对病变区域的聚焦更加准确。所提模型在ISIC 2019外部验证集上也表现出良好的泛化能力，优于ResNet-50基线。这些发现表明，所提方法为计算机辅助皮肤癌诊断提供了更有效且高效的解决方案。生成代码、权重和混淆矩阵已在https://github.com/Grapesea/ASFF-ResNet50-enhanced开源。

英文摘要

Skin cancer classification is challenging due to high inter-class similarity, intra-class variability, and artifacts in dermoscopic images. To address these issues, we propose an improved ResNet-50 with Adaptive Spatial Feature Fusion (ASFF), which adaptively integrates multi-scale semantic and surface features to refine representations and reduce overfitting. The ResNet-50 model is enhanced with an adaptive feature fusion mechanism to achieve more effective multi-scale feature extraction and improve overall performance. Specifically, a dual-branch design fuses high-level semantic and mid-level detail features, using global average pooling and fully connected layers to produce spatial weights that emphasize lesion-relevant regions. Evaluated on a balanced subset of ISIC 2020 (3,297 images, randomly selected from the original dataset), the ASFF-based ResNet-50 outperforms multiple CNN baselines, achieving 93.182% accuracy with superior precision, recall, specificity, and F1. It also reaches 0.9670 AUC (P-R) and 0.9717 AUC (ROC). Grad-CAM visualizations show more accurate focus on lesion areas. The proposed model also generalizes well to ISIC 2019 external validation, outperforming the ResNet-50 baseline. These findings demonstrate that the proposed approach provides a more effective and efficient solution for computer-aided skin cancer diagnosis. The generation code, weights, and confusion matrices are open-sourced at https://github.com/Grapesea/ASFF-ResNet50-enhanced.

2510.03839 2026-06-01 cs.LG stat.ML

Technical note on Sequential Test-Time Adaptation via Martingale-Driven Fisher Prompting

关于通过鞅驱动的Fisher提示进行顺序测试时间自适应的技术说明

Behraj Khan, Tahir Qasim Syed

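Affiliations: Institute of Business Administration

AI summary: Proposes the M-FISHER framework, which detects distribution shift with an exponential martingale and adapts stably through Fisher-preconditioned updates, with time-uniform false-alarm guarantees and a bound on the expected detection delay.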

AI中文摘要

我们提出了M-FISHER的理论框架，这是一种用于流数据中顺序分布漂移检测和稳定自适应的方法。对于检测，我们从非一致性分数构建指数鞅，并应用Ville不等式获得关于误报控制的时间一致保证，确保在任何停止时间下的统计有效性。在持续漂移下，我们进一步将期望检测延迟界定为$\mathcal{O}(\log(1/δ)/Γ)$，其中$Γ$反映了漂移后的信息增益，从而将检测效率与分布散度联系起来。对于自适应，我们展示了提示参数的Fisher预条件更新实现了在分布流形上的自然梯度下降，产生局部最优更新，最小化KL散度同时保持稳定性和参数化不变性。总之，这些结果确立了M-FISHER作为一种在协变量漂移下的顺序决策中实现鲁棒、任意时间有效检测和几何稳定自适应的原则性方法。

英文摘要

We present a theoretical framework for M-FISHER, a method for sequential distribution shift detection and stable adaptation in streaming data. For detection, we construct an exponential martingale from non-conformity scores and apply Ville's inequality to obtain time-uniform guarantees on false alarm control, ensuring statistical validity at any stopping time. Under sustained shifts, we further bound the expected detection delay as $\mathcal{O}(\log(1/δ)/Γ)$, where $Γ$ reflects the post-shift information gain, thereby linking detection efficiency to distributional divergence. For adaptation, we show that Fisher-preconditioned updates of prompt parameters implement natural gradient descent on the distributional manifold, yielding locally optimal updates that minimize KL divergence while preserving stability and parameterization invariance. Together, these results establish M-FISHER as a principled approach for robust, anytime-valid detection and geometrically stable adaptation in sequential decision-making under covariate shift.
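
The detection rule described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the pre-shift non-conformity scores are centred and sub-Gaussian with variance proxy `sigma**2`, so that `exp(lam * S_t - t * lam**2 * sigma**2 / 2)` is a nonnegative supermartingale; by Ville's inequality it exceeds `1/delta` at any time with probability at most `delta`. The function name `detect_shift` and the parameters `lam`, `sigma`, and `delta` are hypothetical and are not taken from the paper.

```python
import math


def detect_shift(scores, lam=0.5, sigma=1.0, delta=0.05):
    """Return the first 1-based time at which the alarm fires, or None.

    Assumption (not from the paper): pre-shift scores are centred and
    sub-Gaussian with variance proxy sigma**2, so the running product
    exp(lam * s_t - lam**2 * sigma**2 / 2) is a supermartingale.
    """
    threshold = math.log(1.0 / delta)  # Ville: P(sup_t M_t >= 1/delta) <= delta
    log_m = 0.0                        # log of the martingale, M_0 = 1
    for t, s in enumerate(scores, start=1):
        log_m += lam * s - 0.5 * (lam * sigma) ** 2
        if log_m >= threshold:
            return t
    return None


# Ten in-distribution scores followed by a sustained upward shift.
stream = [0.0] * 10 + [2.0] * 20
print(detect_shift(stream))
```

Under a sustained shift with mean `mu`, the log-martingale in this toy setting grows by about `lam * mu - lam**2 * sigma**2 / 2` per step, so the crossing time scales like `log(1/delta)` divided by that drift, which plays the role of Γ in the delay bound.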

2510.02919 2026-06-01 cs.CL

Self-Reflective Generation at Test Time

测试时的自反生成

Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu

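Affiliations: Hong Kong University of Science and Technology (Guangzhou); Nanyang Technological University; University of Edinburgh; City University of Hong Kong; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes SRGen, a framework that identifies high-uncertainty tokens with dynamic entropy thresholding and trains corrective vectors, performing self-reflective generation at test time to correct the token probability distribution and make LLM reasoning more reliable.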

AI中文摘要

大型语言模型（LLM）越来越多地通过长思维链解决复杂推理任务，但其仅前向的自回归生成过程是脆弱的；早期token错误可能级联，这明确需要自反机制。然而，现有的自反要么对完整草稿进行修订，要么通过昂贵的训练学习自我修正，两者本质上都是反应性的且低效。为解决此问题，我们提出测试时的自反生成（SRGen），一种轻量级测试时框架，在不确定点生成前进行反思。在token生成过程中，SRGen利用动态熵阈值识别高不确定性token。对于每个识别的token，它训练一个特定的校正向量，充分利用已生成的上下文进行自反生成以纠正token概率分布。通过回顾性分析部分输出，这种自反使得决策更可靠，从而显著降低高不确定点错误的概率。在具有挑战性的数学推理基准和多种LLM上的评估表明，SRGen能显著增强模型推理能力。此外，我们的发现将SRGen定位为一种即插即用的方法，将反思融入生成过程以实现可靠的LLM推理，在有限开销下取得一致收益，并可与其他训练时（如RLHF）和测试时（如SLOT）技术结合。

英文摘要

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can significantly strengthen model reasoning. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning; it achieves consistent gains with bounded overhead and can be combined with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
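
As a rough illustration of dynamic entropy thresholding, the sketch below flags a token when the entropy of its next-token distribution exceeds the mean plus `k` standard deviations of a sliding window of recent entropies. This particular rule, the window size, and the names `token_entropy` and `flag_uncertain` are assumptions made for the example; the abstract does not specify how SRGen sets its threshold.

```python
import math
from collections import deque


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain(entropies, window=4, k=2.0):
    """Flag positions whose entropy exceeds mean + k * std of the last `window` values.

    Hypothetical rule for illustration; no position is flagged until the
    window is full.
    """
    history = deque(maxlen=window)
    flags = []
    for h in entropies:
        if len(history) == window:
            mean = sum(history) / window
            var = sum((x - mean) ** 2 for x in history) / window
            flags.append(h > mean + k * math.sqrt(var))
        else:
            flags.append(False)
        history.append(h)
    return flags


entropies = [0.1, 0.2, 0.1, 0.2, 1.5, 0.15]
print(flag_uncertain(entropies))
```

In a full pipeline, a flagged position would be where the corrective step runs before the token is emitted; that step is not sketched here.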

2510.02060 2026-06-01 cs.AI cs.LG

ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection

ReTabAD: 恢复表格异常检测中语义上下文的基准

Sanghyu Yoon, Dongmin Kim, Suhee Yoon, Ye Seul Sim, Seungdong Yoa, Hye-Seung Cho, Soonyoung Lee, Hankook Lee, Woohyung Lim

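Affiliations: LG AI Research, Seoul, South Korea; Sungkyunkwan University, Suwon, South Korea

AI summary: Addresses the lack of semantic context in existing tabular anomaly detection benchmarks with ReTabAD, which enriches datasets with structured textual metadata and integrates a zero-shot LLM framework, showing that semantic context improves detection performance and interpretability.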

Comments Accepted to ICLR 2026

AI中文摘要

在表格异常检测（AD）中，文本语义通常承载关键信号，因为异常的定义与特定领域的上下文紧密相关。然而，现有基准仅提供原始数据点，缺乏语义上下文，忽略了专家在实践中依赖的丰富文本元数据，如特征描述和领域知识。这一限制阻碍了研究灵活性，并阻止模型充分利用领域知识进行检测。ReTabAD通过恢复文本语义来解决这一差距，以实现上下文感知的表格AD研究。我们提供（1）20个精心策划的表格数据集，这些数据集丰富了结构化的文本元数据，以及最先进的AD算法的实现，包括经典方法、深度学习和基于LLM的方法，以及（2）一个零样本LLM框架，该框架利用语义上下文而无需特定任务训练，为未来研究建立了强大的基线。此外，本工作通过实验和分析提供了关于文本元数据在AD中的作用和实用性的见解。结果表明，语义上下文通过支持领域感知推理提高了检测性能并增强了可解释性。这些发现将ReTabAD确立为系统探索上下文感知AD的基准。

英文摘要

In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domain-specific context. However, existing benchmarks provide only raw data points without semantic context, overlooking rich textual metadata such as feature descriptions and domain knowledge that experts rely on in practice. This limitation restricts research flexibility and prevents models from fully leveraging domain knowledge for detection. ReTabAD addresses this gap by restoring textual semantics to enable context-aware tabular AD research. We provide (1) 20 carefully curated tabular datasets enriched with structured textual metadata, together with implementations of state-of-the-art AD algorithms including classical, deep learning, and LLM-based approaches, and (2) a zero-shot LLM framework that leverages semantic context without task-specific training, establishing a strong baseline for future research. Furthermore, this work provides insights into the role and utility of textual metadata in AD through experiments and analysis. Results show that semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning. These findings establish ReTabAD as a benchmark for systematic exploration of context-aware AD.

2510.00419 2026-06-01 cs.LG

Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs

学习零阶优化器以微调大语言模型

Kairun Zhang, Haoyu Li, Yanjun Zhao, Yifan Sun, Huan Zhang

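Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes ZO-Finetuner, a learning-based zeroth-order optimizer with a compact, memory-efficient design that automatically learns efficient perturbation strategies, avoiding backpropagation and reducing memory overhead when fine-tuning LLMs; it outperforms prior zeroth-order baselines in 82.1% of task-model combinations across 4 LLMs and 7 datasets.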

Comments ICML 2026

AI中文摘要

零阶优化器最近成为微调大语言模型（LLM）的一种有吸引力的方法，因为它们避免了反向传播，并且相对于标准一阶训练可以大幅减少内存开销。然而，现有的零阶方法依赖于手工设计的静态采样策略，无法适应模型特定的结构。为了解决这个问题，我们提出了ZO-Finetuner，一种基于学习的零阶优化器，通过紧凑且内存高效的设计自动学习高效的扰动策略。基于少量基础LLM在多个任务上被重复微调这一事实，ZO-Finetuner支持一次性每模型训练，并在下游任务中以最小开销重用。因此，为给定LLM学习一次优化器并在不同下游任务中重用既是可行的也是高度可取的。相应地，ZO-Finetuner旨在通过支持一次性每模型训练且开销最小，将学习优化（L2L）扩展到基础模型时代。在4个LLM和7个数据集上的实验表明，ZO-Finetuner在82.1%的任务-模型组合中优于先前的零阶基线方法，从而展示了其在高效LLM微调中的强大性能和可扩展性。代码可在https://github.com/ASTRAL-Group/ZO_Fine_tuner找到。

英文摘要

Zeroth-order optimizers have recently emerged as an attractive approach for fine-tuning large language models (LLMs), as they avoid backpropagation and can substantially reduce memory overhead relative to standard first-order training. However, existing zeroth-order methods rely on hand-crafted, static sampling strategies that are not adaptable to model-specific structures. To address this, we propose ZO-Finetuner, a learning-based zeroth-order optimizer for LLMs that automatically learns efficient perturbation strategies through a compact and memory-efficient design. Motivated by the fact that a small set of base LLMs is repeatedly fine-tuned across tasks, ZO-Finetuner supports one-time per-model training and reuse across downstream tasks with minimal overhead. Therefore, learning the optimizer once for a given LLM and reusing it across diverse downstream tasks is both feasible and highly desirable. Accordingly, ZO-Finetuner is designed to scale learning to learn (L2L) to the foundation-model era by supporting one-time per-model training with minimal overhead. Experiments on 4 LLMs and 7 datasets show that ZO-Finetuner outperforms prior zeroth-order baselines in 82.1% of task-model combinations, thereby demonstrating strong performance and scalability for efficient LLM fine-tuning. The code can be found at https://github.com/ASTRAL-Group/ZO_Fine_tuner.
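
For context, the sketch below shows a generic two-point zeroth-order gradient estimate with a fixed Gaussian perturbation, i.e. the kind of static, hand-crafted sampling that the abstract contrasts with a learned strategy. It is not ZO-Finetuner itself, and the names `zo_gradient` and `zo_sgd` as well as all hyperparameters are assumptions chosen for the example.

```python
import random


def zo_gradient(loss, theta, eps=1e-3, rng=None):
    """Two-point estimate: ((L(theta + eps*z) - L(theta - eps*z)) / (2*eps)) * z.

    Only two forward evaluations of `loss` are needed; no backpropagation.
    The perturbation z is drawn from a fixed standard Gaussian (static sampling).
    """
    if rng is None:
        rng = random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    scale = (plus - minus) / (2.0 * eps)
    return [scale * zi for zi in z]


def zo_sgd(loss, theta, lr=0.05, steps=500, seed=0):
    """Plain SGD driven by the zeroth-order estimate (illustrative settings)."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zo_gradient(loss, theta, rng=rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def quadratic(theta):
    """Toy objective with its minimum at (1, 1, 1)."""
    return sum((t - 1.0) ** 2 for t in theta)


final = zo_sgd(quadratic, [0.0, 0.0, 0.0])
print(quadratic(final))
```

A learned optimizer in the spirit of the abstract would replace the fixed `rng.gauss(0.0, 1.0)` draw with a perturbation distribution adapted to the model; how that distribution is parameterized is not described in the abstract and is left out here.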

2509.25906 2026-06-01 cs.LG

Federated Learning with Enhanced Privacy via Model Splitting and Random Client Participation

通过模型拆分和随机客户端参与增强隐私的联邦学习

Yiwei Li, Shuai Wang, Zhuojun Tian, Xiuhua Wang, Shijian Su

Affiliations: School of Optoelectronic & Communication Engineering, Xiamen University of Technology; National Key Laboratory of Wireless Communications, University of Electronic Science and Technology of China; Division of Information Science and Engineering, KTH Royal Institute of Technology; School of Cyber Science and Engineering, Huazhong University of Science and Technology; School of Engineering, Huaqiao University

AI summary: Proposes the MS-PAFL framework, which splits each model into private and public submodels and injects noise only into the public submodel; combined with a privacy-amplification analysis of random client participation and local data subsampling, it achieves a better privacy-utility trade-off under strong privacy guarantees.

Comments Accepted for publication in IEEE Transactions on Cognitive Communications and Networking

DOI: 10.1109/TCCN.2026.3694029

AI中文摘要

联邦学习（FL）通常采用差分隐私（DP）来保护客户端数据，但隐私保证所需的附加噪声会显著降低模型精度。为解决这一挑战，我们提出了模型拆分隐私放大联邦学习（MS-PAFL），一种结合结构模型拆分与统计隐私放大的新颖框架。在该框架中，每个客户端的模型被划分为保留在本地私有子模型和用于全局聚合的公共子模型。校准的高斯噪声仅注入公共子模型，从而限制其不利影响，同时保留本地模型的效用。我们进一步提供了严格的理论分析，刻画了在该架构下通过随机客户端参与和本地数据子采样实现的联合隐私放大。分析给出了单轮和总隐私损失的紧界，表明MS-PAFL显著减少了满足目标隐私保护水平所需的噪声。大量实验验证了我们的理论发现，表明MS-PAFL始终获得更优的隐私-效用权衡，并能在强隐私保证下训练高精度模型。

英文摘要

Federated Learning (FL) often adopts differential privacy (DP) to protect client data, but the added noise required for privacy guarantees can substantially degrade model accuracy. To resolve this challenge, we propose model-splitting privacy-amplified federated learning (MS-PAFL), a novel framework that combines structural model splitting with statistical privacy amplification. In this framework, each client's model is partitioned into a private submodel, retained locally, and a public submodel, shared for global aggregation. The calibrated Gaussian noise is injected only into the public submodel, thereby confining its adverse impact while preserving the utility of the local model. We further present a rigorous theoretical analysis that characterizes the joint privacy amplification achieved through random client participation and local data subsampling under this architecture. The analysis provides tight bounds on both single-round and total privacy loss, demonstrating that MS-PAFL significantly reduces the noise necessary to satisfy a target privacy protection level. Extensive experiments validate our theoretical findings, showing that MS-PAFL consistently attains a superior privacy-utility trade-off and enables the training of highly accurate models under strong privacy guarantees.
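
The split described above can be sketched in a few lines: each client keeps a private parameter vector untouched, clips and perturbs only its public vector, and the server averages the noisy public vectors of a randomly participating subset. The clipping step, the noise scale `sigma * max_norm`, the participation probability `q`, and every function name are assumptions in the style of a standard Gaussian mechanism; they are not taken from the paper, and the sketch says nothing about the resulting privacy budget.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in vec]


def privatize_public(public, sigma, max_norm, rng):
    """Clip the public submodel and add Gaussian noise with std sigma * max_norm."""
    return [v + rng.gauss(0.0, sigma * max_norm) for v in clip(public, max_norm)]


def aggregate_round(clients, q, sigma, max_norm, rng):
    """One round: each client joins with probability q; only public parts are shared."""
    chosen = [c for c in clients if rng.random() < q]
    if not chosen:
        return None  # nobody participated this round
    noisy = [privatize_public(c["public"], sigma, max_norm, rng) for c in chosen]
    dim = len(noisy[0])
    return [sum(vec[i] for vec in noisy) / len(noisy) for i in range(dim)]


clients = [
    {"public": [3.0, 4.0], "private": [9.0]},
    {"public": [0.0, 0.5], "private": [7.0]},
]
print(aggregate_round(clients, q=0.5, sigma=1.0, max_norm=1.0, rng=random.Random(0)))
```

The `private` entries never enter `aggregate_round`, which is the structural point: noise degrades only the shared part of the model.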

2506.01467 2026-06-01 cs.LG cs.DM

Feature-Aware (Hyper)graph Generation via Next-Scale Prediction

特征感知的（超）图生成：基于下一尺度预测

Dorian Gailhard, Enzo Tartaglione, Lirida Naviner, Jhony H. Giraldo

发表机构 * GitHub

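AI summary: Proposes FAHNES, a framework that jointly generates the topology and features of graphs and hypergraphs through hierarchical next-scale prediction, enabling efficient generation of large featured graphs and hypergraphs.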

AI中文摘要

图生成模型在小型结构化数据上表现良好，但难以扩展到大型复杂结构。层次化方法提高了可扩展性，但通常忽略节点和边特征，而这些特征在实际应用中至关重要，特别是对于建模高阶关系的超图。在本文中，我们提出FAHNES（通过下一尺度预测进行特征感知的（超）图生成），这是一个层次化框架，可联合生成图和超图的拓扑与特征。FAHNES通过节点粗化和局部扩展构建多尺度表示，并由一种新颖的层次化尺度编码引导，该编码控制粒度并确保跨尺度一致性。在合成数据集、3D网格和图点云数据集上的实验表明，该方法在独特扩展到带特征的大规模图和超图的同时，实现了具有竞争力或最先进的性能。我们的代码是开源的。

英文摘要

Graph generative models perform well on small structured data but struggle to scale to large, complex structures. Hierarchical approaches improve scalability but often ignore node and edge features, which are critical in real-world applications, particularly for hypergraphs that model higher-order relationships. In this paper, we propose FAHNES (feature-aware (hyper)graph generation via next-scale prediction), a hierarchical framework that jointly generates topology and features for graphs and hypergraphs. FAHNES builds multi-scale representations through node coarsening and localized expansion, guided by a novel hierarchical scale encoding that controls granularity and ensures cross-scale consistency. Experiments on synthetic, 3D mesh, and graph point cloud datasets demonstrate competitive or state-of-the-art performance while uniquely scaling to featured large-scale graphs and hypergraphs. Our code is open source.

2509.22335 2026-06-01 cs.LG cs.AI

Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning

深度持续学习中的谱坍缩导致塑性丧失

Arjun Prakash, Naicheng He, Kaicheng Guo, Saket Tiwari, Ruo Yu Tao, Tyrone Serapio, Amy Greenwald, George Konidaris

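Affiliations: Department of Computer Science, Brown University

AI summary: Studies why deep neural networks lose plasticity in continual learning, finds that Hessian spectral collapse at new-task initialization is a main factor, and proposes two regularizers motivated by a Kronecker-factored approximation to preserve plasticity.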

AI中文摘要

我们研究为什么深度神经网络在持续学习中会丧失塑性，从而在不重新初始化参数的情况下无法学习新任务。我们表明，这种失败之前在新任务初始化时会出现Hessian谱坍缩，其中有意义的曲率方向消失，梯度下降变得无效。通过分析线性化ReLU网络，我们推导出成功训练的显式$ε$-秩条件，并证明损失加权Gram矩阵在谱上与广义高斯-牛顿近似等价，从而将NTK动力学与Hessian曲率联系起来。直接针对谱坍缩，我们讨论了Hessian的Kronecker因子近似，这激发了两种正则化增强：保持高有效特征秩和应用L2惩罚。在持续监督学习和强化学习任务上的实验证实，结合这两种正则化器可以有效保持塑性。

英文摘要

We investigate why deep neural networks suffer from loss of plasticity in continual learning, and thus fail to learn new tasks without reinitializing parameters. We show that this failure is preceded by Hessian spectral collapse at new-task initialization, where meaningful curvature directions vanish and gradient descent becomes ineffective. Analyzing a linearized ReLU network, we derive explicit $ε$-rank conditions for successful training and prove that the loss-weighted Gram matrix is spectrally equivalent to the Generalized Gauss-Newton approximation, thereby relating NTK dynamics to Hessian curvature. Targeting spectral collapse directly, we then discuss the Kronecker factored approximation of the Hessian, which motivates two regularization enhancements: maintaining high effective feature rank and applying L2 penalties. Experiments on continual supervised and reinforcement learning tasks confirm that combining these two regularizers effectively preserves plasticity.
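
Two simple spectrum summaries make the notion of collapse concrete. The sketch below takes a list of eigenvalues or singular values as input (obtaining them, for example via an SVD of a feature matrix, is outside the sketch) and computes a thresholded rank and an entropy-based effective rank. Both definitions are common conventions assumed here for illustration; the paper's exact $ε$-rank condition and its effective feature rank may be defined differently, and the function names are hypothetical.

```python
import math


def epsilon_rank(eigenvalues, eps):
    """Number of eigenvalues whose magnitude exceeds eps (assumed definition)."""
    return sum(1 for lam in eigenvalues if abs(lam) > eps)


def effective_rank(singular_values):
    """exp of the Shannon entropy of the normalized singular values.

    Equals n for n equal singular values and 1 when a single value dominates.
    """
    total = sum(singular_values)
    if total <= 0:
        return 0.0
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


healthy = [1.0, 0.9, 0.8, 0.7]        # curvature spread over many directions
collapsed = [1.0, 1e-6, 1e-7, 1e-8]   # almost all mass in one direction
print(effective_rank(healthy), effective_rank(collapsed))
print(epsilon_rank(healthy, 1e-3), epsilon_rank(collapsed, 1e-3))
```

A regularizer that maintains high effective feature rank would push a quantity like `effective_rank` upward during training; the form of that penalty is not given in the abstract.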

2509.19452 2026-06-01 cs.RO cs.CV cs.LG

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

HUNT：通过瞬时相对帧在非结构化环境中进行高速无人机导航与跟踪

Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

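Affiliations: New York University; University of California Berkeley

AI summary: Proposes HUNT, a framework that uses instantaneous relative frames to unify search and tracking, achieving high-speed flight and robust autonomy.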

AI中文摘要

搜索与救援任务要求无人机既能高速穿越未知的非结构化环境，又能在检测到目标后跟踪目标。在感知退化且无全局定位的情况下实现这两种能力仍是一个开放挑战。最近的相对导航工作通过将规划和控制锚定到可见的检测目标上展示了鲁棒跟踪，但在视野中没有目标时无法进行导航。我们提出了HUNT（高速无人机导航与跟踪），一个实时框架，在单一相对公式中统一了穿越、获取和跟踪。HUNT直接从机载瞬时观测量（如姿态、高度和速度）定义导航目标，从而在搜索过程中实现反应式高速飞行。一旦检测到目标，相同的感知-控制管道无缝过渡到跟踪。在茂密森林、集装箱场地以及使用车辆和人体模型的搜索与救援任务中的户外实验表明，在全局方法失败的情况下，该框架实现了鲁棒自主性。

英文摘要

Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

2509.23195 2026-06-01 cs.CL q-bio.NC

The relative strength of hierarchical structure and statistics differs across the measures in naturalistic reading

自然阅读中层级结构与统计的相对强度因测量指标而异

Nan Wang, Hanlin Wu, Jiaxuan Li

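Affiliations: Department of Brain and Cognitive Sciences, University of Rochester; Department of Linguistics and Modern Languages, the Chinese University of Hong Kong; Department of Language Science, University of California Irvine

AI summary: Using co-registered EEG and eye-tracking with Bayesian network modeling and regression analysis, this study examines the relative strength of hierarchical syntactic structure and statistical factors in online comprehension; it finds that hierarchical structure can influence comprehension before reading begins, with a strength that differs between the behavioral and neural levels.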

AI中文摘要

层级句法结构与非层级、统计或序列因素长期以来被视为解释在线理解的竞争理论。大量证据表明，层级和非层级因素都能塑造理解，而更开放的问题是层级何时以及以多强的影响力作用于理解。我们通过同步脑电图和眼动追踪来探讨这一问题，将句法深度作为操作化层级结构的变量。关于时间问题，层级句法结构在阅读句子之前就已影响阅读，并且最早可在阅读前108毫秒出现。这一点得到了转移概率分析和注视相关电位回归的支持。注视转移分析表明，读者更倾向于在句法核心词之间移动，而非按照序列词序，这表明扫描路径是由深层句法结构而非纯统计驱动的。关于强度问题，我们结合贝叶斯网络建模和回归分析，表明变量的强度取决于待解释的现象。贝叶斯网络分析显示，层级句法结构比统计特征携带更多的预测权重。注视相关电位回归表明，在回归分析中，层级句法结构显著预测了右前脑区的词级神经活动，但与词汇惊奇度相比普遍较弱。综合证据，我们的分析表明，层级结构可以在行为和神经层面预期性地引导受试者的在线理解，其强度随阅读行为的不同方面而变化。

英文摘要

Hierarchical syntactic structure and non-hierarchical (statistical or sequential) factors have long been framed as rival accounts of online comprehension. Considerable evidence shows that both hierarchical and non-hierarchical factors can shape comprehension; the more open question is when, and how strongly, hierarchy exerts its influence. We addressed this question with co-registered EEG and eye-tracking, treating syntactic depth as the variable that operationalizes hierarchical structure. On the timing question, hierarchical syntactic structure is shown to influence reading before a sentence is read, with effects emerging as early as 108 ms before reading. This is supported by both transitional probability analysis and regression on fixation-related potentials. Analyses of fixation transitions showed that readers preferentially moved between syntactically central words rather than according to serial word order, suggesting that scanpaths are driven by deep syntactic structure rather than by pure statistics. On the strength question, we combined Bayesian network modeling and regression analysis to show that the strength of a variable depends on the phenomenon to be explained. Bayesian network analysis showed that hierarchical syntactic structure carried more predictive weight than statistical features. Regression on fixation-related potentials showed that hierarchical syntactic structure significantly predicted word-level neural activity in the front-right region, but that its effect was generally weaker than that of lexical surprisal. Taken together, our analyses suggest that hierarchical structure can anticipatorily guide subjects' online comprehension at both the behavioral and neural levels, with its strength varying across different facets of reading behavior.

2509.21561 2026-06-01 cs.CV

Unsupervised Defect Detection for Surgical Instruments

手术器械的无监督缺陷检测

Joseph Huang, Yichi Zhang, Jingxi Yu, Wei Chen, Seunghyun Hwang, Qiang Qiu, Amy R. Reibman, Edward J. Delp, Fengqing Zhu

发表机构 * Purdue University School of Electrical

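AI summary: Targets false detections caused by textured backgrounds, low sensitivity to small defects, and domain shift in surgical instrument defect detection, proposing an unsupervised method that combines background masking, patch-based analysis, and efficient domain adaptation.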

2509.20941 2026-06-01 cs.CV

Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery

解码手术场景：手术中场景图的范围综述

Angelo Henriques, Korab Hoxha, Daniel Zapp, Peter C. Issa, Nassir Navab, M. Ali Nasseri

Affiliations: School of Computation, Information and Technology, Technical University of Munich; Klinik und Poliklinik für Augenheilkunde, TUM University Hospital; Computer Aided Medical Procedures, Technical University of Munich; Department of Biomedical Engineering, University of Alberta

AI summary: Through a PRISMA-ScR-guided scoping review, this paper maps the state of scene graph (SG) research in surgery, analyzes 52 studies, identifies a methodological shift from graph neural networks to foundation models and generative AI, and proposes the "Validation Trinity" evaluation framework to bridge the clinical translation gap.

Comments Submitted and accepted to Medical Image Analysis (DOI: 10.1016/j.media.2026.104083). An interactive version of the summary tables is available at: osf.io/fruq8

Journal ref Medical Image Analysis (2026)

DOI: 10.1016/j.media.2026.104083

AI中文摘要

随着手术人工智能从像素级检测向复杂推理过渡，场景图（SG）提供了解码动态手术环境所需的结构化关系表示。本项遵循PRISMA-ScR指南的范围综述系统性地绘制了手术中SG研究的发展格局，分析了52项主要研究，以描绘应用和方法论转变。我们的分析揭示了快速增长，但也发现了一个关键的“数据鸿沟”：内部视角研究（例如，从内窥镜视频中识别三元组）占研究的81%，且几乎完全使用真实世界的2D视频，而外部视角的手术室建模则严重依赖模拟数据。在方法论上，我们识别出从基础图神经网络向专门基础模型和生成式AI的决定性转变，这些模型在2025年合计约占研究的50%。至关重要的是，我们的综合表明，场景图正从简单的描述符演变为必要的“神经符号护栏”，提供结构化、可验证的中间表示，以防止日益自主的手术基础模型产生幻觉。尽管前景广阔，但仍存在一个主要的转化差距：所审查的研究均未进入前瞻性临床验证。我们得出结论，弥合这一差距需要超越标准的计算机视觉指标；因此，我们提出“验证三位一体”——优先考虑语义查询成功率、延迟感知准确率和安全关键召回率——作为将基于图的手术人工智能引入临床实践的必要评估框架。

英文摘要

As surgical AI transitions from pixel-level detection to complex reasoning, Scene Graphs (SGs) offer the structured, relational representations necessary to decode dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, analyzing 52 primary studies to chart applications and methodological shifts. Our analysis reveals rapid growth, yet uncovers a critical 'data divide': internal-view research (e.g., triplet recognition from endoscopic video) accounts for 81% of studies and almost exclusively uses real-world 2D video, while external-view operating room modeling relies heavily on simulated data. Methodologically, we identify a decisive shift from foundational graph neural networks to specialized foundation models and generative AI, which together now account for approximately 50% of research in 2025. Crucially, our synthesis suggests that Scene Graphs are evolving from simple descriptors into essential 'neuro-symbolic guardrails', providing the structured, verifiable intermediate representation needed to prevent hallucinations in increasingly autonomous Surgical Foundation Models. Despite this promise, a major translational gap remains: none of the reviewed studies have proceeded to prospective clinical validation. We conclude that bridging this gap requires moving beyond standard computer vision metrics; we therefore propose the 'Validation Trinity' -- prioritizing Semantic Query Success, Latency-Aware Accuracy, and Safety-Critical Recall -- as the necessary evaluation framework to bring graph-based surgical AI into clinical practice.

2509.18898 2026-06-01 cs.CV

DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

DeblurSplat：基于事件相机的无SfM三维高斯泼溅鲁棒去模糊方法

Pengteng Li, Yunfan Lu, Pinhao Song, Weiyu Guo, Huizai Yao, F. Richard Yu, Hui Xiong

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); KU Leuven; Carleton University

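AI summary: Proposes the first Structure-from-Motion-free deblurring 3D Gaussian Splatting method, using a dense stereo module and an event stream to achieve high-quality novel view synthesis and efficient rendering.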

Comments Accepted by TMM 2026

AI Chinese abstract

本文提出首个无需运动恢复结构（SfM）的基于事件相机的去模糊三维高斯泼溅方法，称为DeblurSplat。我们从两个方面解决运动去模糊问题。首先，利用密集立体模块（DUSt3R）的预训练能力，直接从模糊图像中获取准确的初始点云。无需计算相机位姿作为中间结果，避免了不准确相机位姿到初始点云位置的累积误差传递。其次，将事件流引入去模糊流水线，利用其对动态变化的高敏感性。通过从事件流和模糊图像中解码潜在清晰图像，我们可以为场景重建优化提供细粒度监督信号。在多种场景上的大量实验表明，与去模糊3D-GS的最新方法相比，DeblurSplat不仅在新视图生成中表现出高保真度，而且实现了显著的渲染效率。

English abstract

In this paper, we propose the first Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method that uses an event camera, dubbed DeblurSplat. We address the motion-deblurring problem in two ways. First, we leverage the pretrained capability of the dense stereo module (DUSt3R) to directly obtain accurate initial point clouds from blurred images. Because camera poses are not computed as an intermediate result, we avoid propagating cumulative errors from inaccurate camera poses to the positions of the initial point clouds. Second, we introduce the event stream into the deblurring pipeline for its high sensitivity to dynamic change. By decoding the latent sharp images from the event stream and blurred images, we can provide a fine-grained supervision signal for scene reconstruction optimization. Extensive experiments across a range of scenes demonstrate that DeblurSplat not only excels in generating high-fidelity novel views but also achieves significant rendering efficiency compared to state-of-the-art deblurring 3D-GS methods.

2506.11653 2026-06-01 cs.CV cs.AI cs.LG

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

DISCO: 使用条件距离相关性减轻深度学习中的偏差

Emre Kavak, Tom Nuno Wolf, Christian Wachinger

Affiliations: Technical University of Munich, Germany; Konrad Zuse School of Excellence in Reliable AI, Germany; Munich Center for Machine Learning (MCML), Germany

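AI summary: Proposes a conditional independence criterion grounded in an anti-causal model, and designs efficient estimators of conditional distance correlation, DISCO$_m$ and sDISCO, which mitigate bias in gradient-based models through regularization and outperform or match existing methods on multiple datasets.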

Comments Accepted to ICML 2026 (oral)
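
For readers unfamiliar with the statistic behind the method's name, here is a minimal sketch of the classical (unconditional) sample distance correlation for one-dimensional samples. It is background only: it is not the paper's DISCO$_m$ or sDISCO estimator, it does not condition on a third variable, and the function names are chosen here for illustration.

```python
import math
from typing import List, Sequence


def _centered_distances(values: Sequence[float]) -> List[List[float]]:
    """Pairwise absolute distances, double-centered by row, column, and grand mean."""
    n = len(values)
    dist = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    # The matrix is symmetric, so row means double as column means.
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a: List[List[float]], b: List[List[float]]) -> float:
    """Mean of the elementwise product of two equally sized square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample distance correlation (V-statistic form) of two 1-D samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)  # squared distance covariance
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # one sample is constant, so dependence is undefined; use 0
    return math.sqrt(max(dcov2, 0.0) / denom)
```

In the population limit, distance correlation is zero only when the two variables are independent, which is why a penalty of this kind can serve as a regularizer. How the paper conditions on the label and makes the estimate efficient is not shown here.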

Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

Scaling Multi-Agent Environment Co-Design with Diffusion Models

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

Rethinking Sparse Mixture of Experts from a Unified Perspective

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

Unfolding Generative Flows with Koopman Operators: Trajectory-Preserving Linearization

Elastic ViTs from Pretrained Models without Retraining

Expert Merging in Sparse Mixture of Experts with Nash Bargaining

EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Reinforced sequential Monte Carlo for amortised sampling

VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

Learning General Causal Structures with Hidden Dynamic Process for Climate Analysis

Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

Symbolic Intermediaries as a Linguistic-Numerical Interface for LLM-Driven Geometric Reasoning

Skin Lesion Classification Based on ResNet-50 Enhanced With Adaptive Spatial Feature Fusion

Technical note on Sequential Test-Time Adaptation via Martingale-Driven Fisher Prompting

Self-Reflective Generation at Test Time

ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection

Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs

Federated Learning with Enhanced Privacy via Model Splitting and Random Client Participation

Feature-Aware (Hyper)graph Generation via Next-Scale Prediction

Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

The relative strength of hierarchical structure and statistics differs across the measures in naturalistic reading

Unsupervised Defect Detection for Surgical Instruments

Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery

DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation