arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.27859 2026-05-18 cs.AI cs.ET

Rethinking Agentic Reinforcement Learning In Large Language Models

Fangming Cui, Ruixiao Zhu, Cheng Fang, Sunan Li, Jiahong Li

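Affiliations Beijing Beijing China; Shanghai Beijing China

AI summary This paper rethinks agentic reinforcement learning (Agentic RL) in the context of large language models (LLMs). It examines how LLM cognitive capabilities such as goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning can be incorporated into the RL framework to handle complex, open-ended real-world tasks. It analyzes the paradigm's core concepts, methodological innovations, and design principles, and identifies current challenges and future directions.

English abstract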

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we take an in-depth look at the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.

2604.26139 2026-05-18 cs.CL

HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models

Guoshenghui Zhao, Tan Yu, Weijie Zhao

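Affiliations Rochester Institute of Technology; NVIDIA Corporation

AI summary This paper proposes HIVE, a hidden-evidence verification framework for detecting hallucinations in the generation process of diffusion large language models (D-LLMs). HIVE extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on it through prefix embeddings, enabling finer-grained hallucination detection with both a continuous hallucination score and structured verification outputs. Experiments show that HIVE outperforms existing methods on multiple benchmarks, confirming the value of hidden evidence for hallucination detection.

Comments 5 figures, appendix included

English abstract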

Diffusion large language models generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather than only in the final output. Existing detectors mainly rely on output uncertainty or coarse trace statistics, which often fail to capture the richer hidden dynamics of D-LLMs. We propose HIVE, a hidden-evidence verification framework that extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on the selected evidence through prefix embeddings. HIVE produces both a continuous hallucination score from verifier decision logits and structured verification outputs, including hallucination types, evidence pairs, and short rationales. Across two D-LLMs and three QA benchmarks, HIVE consistently outperforms eight strong baselines and achieves up to 0.9236 AUROC and 0.9537 AUPRC. Ablation studies further confirm the importance of hidden-evidence conditioning, learned evidence selection, two-stream evidence representation, and step-layer embeddings. These results suggest that selected hidden evidence from denoising trajectories provides a stronger and more usable hallucination signal than output-only uncertainty or coarse trace statistics.

2604.17669 2026-05-18 cs.CV

Low Light Image Enhancement Challenge at NTIRE 2026

George Ciubotariu, Sharif S M A, Abdur Rehman, Fayaz Ali Dharejo, Rizwan Ali Naqvi, Marcos V. Conde, Radu Timofte, Zhi Jin, Hongjun Wu, Wenjian Zhang, Chang Ye, Xunpeng Yi, Qinglong Yan, Yibing Zhang, Zaynab Ali, Saiprasad Meesiyawar, Varda I Pattanshetty, Varsha I Pattanshetty, Nikhil Akalwadi, Padmashree Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Hao Yang, Ruikun Zhang, Liyuan Pan, Furkan Kınlı, Donghun Ryou, Inju Ha, Junoh Kang, Bohyung Han, Wei Zhou, Yuval Haitman, Ariel Lapid, Reuven Peretz, Idit Diamant, Leilei Cao, Shuo Zhang, Praful Hambarde, Prateek Shaily, Jayant Kumar, Hardik Sharma, Aashish Negi, Sachin Chaudhary, Akshay Dudhane, Amit Shukla, MoHao Wu, Lin Wang, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi, Raul Balmez, Alexandru Brateanu, Ciprian Orhei, Cosmin Ancuti, Codruta O. Ancuti, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Kaifan Qiao, Bofei Chen, Jingyi Xu, Duo Zhang, Xin Deng, Mai Xu, Shengxi Li, Lai Jiang, Harini A, Ananya N, Lakshanya K, Ying Xu, Xinyi Zhu, Shijun Shi, Jiangning Zhang, Yong Liu, Kai Hu, Jing Xu, Xianfang Zeng, Jinao Song, Guangsheng Tang, Cheng Li, Yuqiang Yang, Ziyi Wang, Yan Chen, Long Bao, Heng Sun, Mohab Kishawy, Jun Chen, Wan-Chi Siu, Yihao Cheng, Hon Man Hammond Lee, Chun-Chuen Hui

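Affiliations NTIRE 2026

AI summary This paper reviews the NTIRE 2026 Low Light Image Enhancement Challenge, describing the solutions proposed by participants and the final results. The challenge seeks networks that effectively improve the clarity and visual appeal of low-contrast, noisy images. A total of 22 teams submitted valid entries. The paper evaluates current state-of-the-art methods in (joint denoising and) low-light image enhancement, showcases significant progress in the field, and draws on a new dataset for its analysis.

English abstract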

This paper presents a comprehensive review of the NTIRE 2026 Low Light Image Enhancement Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions by learning representative visual cues with the purpose of restoring information loss due to low-contrast and noisy images. A total of 195 participants registered for the first track and 153 for the second track of the competition, and 22 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in (joint denoising and) low-light image enhancement, showcasing the significant progress in the field, while leveraging samples of our novel dataset.

2604.16925 2026-05-18 cs.CV

Rethinking Cross-Dose PET Denoising: Mitigating Averaging Effects via Residual Noise Learning

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

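Affiliations IWR, Heidelberg University; Silicon Austria Labs; College of Medicine and Biological Information Engineering, Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University; Department of Epidemiology & Global Health, Umeå University

AI summary This paper studies cross-dose denoising of low-dose positron emission tomography (LDPET) images, noting that conventional models generalize poorly across dose conditions, mainly because of differences in noise level and statistical properties. The authors show that existing training formulations implicitly optimize an expectation over heterogeneous noise distributions, so the network learns a denoising mapping averaged across doses and cannot accurately model dose-specific noise. They propose a unified residual noise learning framework that estimates the noise directly from the low-dose image rather than predicting the full-dose image. Experiments on large-scale datasets from two medical centers show that the method outperforms existing approaches and significantly improves cross-dose denoising.

English abstract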

Cross-dose denoising for low-dose positron emission tomography (LDPET) has been proposed to address the limited generalization of models trained at a single noise level. However, neural networks trained on a specific dose level often fail to generalize to other dose conditions due to variations in noise magnitude and statistical properties. Conventional "one-size-for-all" models attempt to mitigate this variability but tend to learn averaged representations across noise levels, resulting in degraded performance. In this work, we analyze this limitation and show that standard training formulations implicitly optimize an expectation over heterogeneous noise distributions, causing the network to learn an averaged denoising mapping that cannot accurately model dose-specific noise characteristics. We propose a unified residual noise learning framework that estimates noise directly from low-dose PET images rather than predicting full-dose images. Experiments on large-scale multi-dose PET datasets from two medical centers demonstrate that the proposed method outperforms the "one-size-for-all" model, individual dose-specific U-Net models, and dose-conditioned approaches, achieving improved denoising performance. These results indicate that residual noise learning effectively mitigates the averaging effect and enhances generalization for cross-dose PET denoising.

2604.15221 2026-05-18 cs.RO cs.CV

Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone

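Affiliations Department of Aeronautics and Astronautics, Stanford University; Chair of Imaging and Computer Vision, RWTH Aachen University; School of Artificial Intelligence, Shanghai Jiao Tong University; Department of Computer Engineering, Technical University of Munich

AI summary This paper proposes a framework for vision-based human pose estimation and motion prediction that provides verifiable uncertainty guarantees for safe collaboration. The method combines aleatoric (noise) uncertainty estimation with out-of-distribution detection to raise prediction confidence, and introduces conformal prediction sets so that predictions remain highly reliable in real human-robot collaboration. The approach is validated on recorded human motion data and in a real-world human-robot collaboration setting.

English abstract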

We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
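The conformal prediction guarantee mentioned in this abstract can be illustrated with a generic split-conformal sketch. This is not the authors' pipeline: the function names, the calibration errors, and the miscoverage level `alpha` are hypothetical, and the nonconformity score is assumed to be the Euclidean distance between predicted and observed positions.

```python
import math

def conformal_radius(calibration_errors, alpha):
    """Split-conformal radius for miscoverage level alpha.

    calibration_errors: nonconformity scores on held-out calibration samples
    (assumed here to be prediction errors in metres; hypothetical data).
    """
    n = len(calibration_errors)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: only the trivial set is valid.
        return math.inf
    return sorted(calibration_errors)[rank - 1]

def in_prediction_set(prediction, candidate, radius):
    """A candidate position is in the set if it lies within `radius` of the prediction."""
    return math.dist(prediction, candidate) <= radius

# Made-up calibration errors for a single joint.
errors = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06, 0.01]
radius = conformal_radius(errors, alpha=0.25)
```

Under the usual exchangeability assumption, a ball of this radius around a new prediction contains the true position with probability at least 1 - alpha; a safety framework could then treat the whole ball as potentially occupied.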

2604.08302 2026-05-18 cs.LG cs.AI

DMax: Aggressive Parallel Decoding for dLLMs

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

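Affiliations National University of Singapore

AI summary This paper proposes DMax, a new approach to efficient generation with diffusion language models (dLLMs). By introducing progressive self-refinement and a soft parallel decoding strategy, it mitigates error accumulation in parallel decoding, enabling more efficient parallel generation while preserving quality. DMax also introduces On-Policy Uniform Training, a training strategy that unifies masked and uniform dLLM training and substantially improves generation efficiency and performance on multiple benchmarks.

Comments Work in progress. Code is available at: https://github.com/czg1225/DMax

English abstract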

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
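The interpolation idea behind Soft Parallel Decoding can be sketched in a few lines. This is a minimal illustration, not the DMax implementation: the function name, the toy two-dimensional embeddings, and the use of a single scalar `weight` are assumptions, and the abstract does not say how the interpolation weight is chosen.

```python
def soft_state(token_emb, mask_emb, weight):
    """Interpolate between a predicted token embedding and the mask embedding.

    weight = 0.0 keeps the position fully masked; weight = 1.0 commits to the
    predicted token. How `weight` is set is an assumption of this sketch.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * t + (1.0 - weight) * m for t, m in zip(token_emb, mask_emb)]

# Toy embeddings (hypothetical values).
predicted = [1.0, 3.0]
mask = [0.0, 1.0]
halfway = soft_state(predicted, mask, 0.5)

# Speedups implied by the TPF figures quoted in the abstract.
gsm8k_speedup = round(5.47 / 2.04, 2)
mbpp_speedup = round(5.86 / 2.71, 2)
```

Because an intermediate state is never a hard token, a later refinement step can still move it back toward the mask embedding or toward a different token, which is the self-revising behaviour the abstract describes.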

2604.04539 2026-05-18 cs.LG cs.RO

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee

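Affiliations KTH Royal Institute of Technology; German Research Center for AI (DFKI); Robotics Institute Germany (RIG)

AI summary This paper proposes FlashSAC, a fast and stable off-policy reinforcement learning algorithm for high-dimensional robot control. Built on Soft Actor-Critic, it reduces the number of gradient updates by scaling up model size and data throughput, and maintains stability by explicitly bounding weight, feature, and gradient norms. Experiments show that FlashSAC outperforms PPO and other strong off-policy methods on more than 60 tasks across multiple simulators, with especially large gains on high-dimensional tasks, and sharply shortens training time for sim-to-real humanoid locomotion.

Comments RSS'26

English abstract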

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
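The abstract does not say how FlashSAC enforces its norm bounds, so the following is only a generic sketch of one common way to bound a norm: rescaling a vector whenever its L2 norm exceeds a threshold. The function name, the threshold, and the example gradient are hypothetical.

```python
import math

def clip_norm(vector, max_norm):
    """Rescale `vector` so that its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm <= max_norm:
        # Already within the bound: return an unchanged copy.
        return list(vector)
    scale = max_norm / norm
    return [v * scale for v in vector]

# Hypothetical critic gradient with norm 5, clipped to norm 1.
clipped = clip_norm([3.0, 4.0], max_norm=1.0)
```

The direction of the vector is preserved and only its magnitude is capped, so a single oversized update cannot dominate training. The same operation could in principle be applied to weights or feature vectors, though whether FlashSAC does so this way is not stated in the source.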

2604.04310 2026-05-18 cs.RO

frax: Fast Robot Kinematics and Dynamics in JAX

Daniel Morton, Marco Pavone

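Affiliations Departments of Mechanical Engineering and Aeronautics & Astronautics, Stanford University

AI summary This paper introduces frax, a JAX-based library for robot kinematics and dynamics that aims to be high-performance, easy to use, and compatible with CPUs and accelerators. The library takes a fully vectorized approach, supports real-time control and parallel computation, and is compatible with automatic differentiation for optimization-based methods. Experiments show that on CPU frax achieves microsecond-level computation times suitable for kilohertz control rates, outperforming common Python libraries and approaching optimized C++ implementations; on GPU it scales to thousands of instances, reaching upwards of 100 million dynamics evaluations per second.

Comments ICRA 2026 Workshop on Frontiers of Optimization for Robotics

English abstract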

In robot control, planning, and learning, there is a need for rigid-body dynamics libraries that are highly performant, easy to use, and compatible with CPUs and accelerators. While existing libraries often excel at either low-latency CPU execution or high-throughput GPU workloads, few provide a unified framework that targets multiple architectures without compromising performance or ease-of-use. To address this, we introduce frax, a JAX-based library for robot kinematics and dynamics, providing a high-performance, pure-Python interface across CPU, GPU, and TPU. Via a fully-vectorized approach to robot dynamics, frax enables efficient real-time control and parallelization, while supporting automatic differentiation for optimization-based methods. On CPU, frax achieves low-microsecond computation times suitable for kilohertz control rates, outperforming common libraries in Python and approaching optimized C++ implementations. On GPU, the same code scales to thousands of instances, reaching upwards of 100 million dynamics evaluations per second. We validate performance on a Franka Panda manipulator and a Unitree G1 humanoid, and release frax as an open-source library.

2604.02268 2026-05-18 cs.LG

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

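Affiliations Zhejiang University; Meituan; Tsinghua University

AI summary This study examines how skills can be internalized into model parameters to enable zero-shot autonomous behavior without runtime retrieval. It proposes SKILL0, an in-context reinforcement learning framework that progressively reduces skill context during training, guiding the model to learn tool invocation and multi-turn task completion. Experiments show that SKILL0 substantially outperforms the standard RL baseline on several agentic tasks while keeping context usage highly efficient.

English abstract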

Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld, +6.6% for Search-QA, and +10.1% for WebShop), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
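The curriculum described in this abstract, a linearly decaying budget that retains only skills the policy still benefits from, can be sketched as follows. This is an illustrative guess at the mechanics, not the SKILL0 code: the function names, the helpfulness scores, and the rule "keep the most helpful skills with positive gain" are assumptions.

```python
def skill_budget(step, total_steps, initial_budget):
    """Number of skill files allowed in context, decaying linearly to zero."""
    remaining = max(0, total_steps - step)
    # Integer ceiling of initial_budget * remaining / total_steps.
    return (initial_budget * remaining + total_steps - 1) // total_steps

def select_skills(helpfulness, budget):
    """Keep skills with positive measured gain, most helpful first, within budget."""
    useful = [(gain, name) for name, gain in helpfulness.items() if gain > 0]
    useful.sort(reverse=True)
    return [name for gain, name in useful[:budget]]

# Hypothetical on-policy helpfulness: reward with the skill minus reward without it.
helpfulness = {"search": 0.12, "click": 0.03, "buy": -0.01, "navigate": 0.07}
schedule = [skill_budget(s, total_steps=4, initial_budget=8) for s in range(5)]
kept = select_skills(helpfulness, budget=2)
```

Once the budget reaches zero no skill file is injected at all, which corresponds to the fully zero-shot setting the abstract describes.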

2603.27043 2026-05-18 cs.CL

Introducing MELI: the Mandarin-English Language Interview Corpus

Suyuan Liu, Molly Babel

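Affiliations Department of Linguistics, University of British Columbia

AI summary This paper introduces the MELI corpus, an open-source corpus of 29.8 hours of speech from 51 Mandarin-English bilingual speakers, covering two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. The corpus provides transcriptions force-aligned at the word and phone levels and records metadata such as language attitudes, supporting cross-language and cross-speaker acoustic comparison and both quantitative and qualitative research.

Comments Accepted at LREC 2026 (14th International Conference on Language Resources and Evaluation), to appear in the conference proceedings

Journal ref In Proceedings of the Fifteenth Language Resources and Evaluation Conference (pp. 5896-5904). European Language Resources Association (ELRA) 2026

English abstract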

We introduce the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilingual speakers. MELI combines matched sessions in Mandarin and English with two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. Audio was recorded at 44.1 kHz (16-bit, stereo). Interviews were fully transcribed, force-aligned at word and phone levels, and anonymized. Descriptively, the Mandarin component totals ~14.7 hours (mean duration 17.3 minutes) and the English component ~15.1 hours (mean duration 17.8 minutes). We report token/type statistics for each language and document code-switching patterns (frequent in Mandarin sessions; more limited in English sessions). The corpus design supports within-/cross-speaker, within/cross-language acoustic comparison and links acoustics to speakers' stated language attitudes, enabling both quantitative and qualitative analyses. The MELI Corpus will be released with transcriptions, alignments, metadata, scans of labelled maps and documentation under a CC BY-NC 4.0 license.

2603.23433 2026-05-18 cs.AI

Mecha-nudges for Machines

Giulio Frey, Kawin Ethayarajh

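Affiliations University of Chicago

AI summary This paper studies how the decisions of AI agents acting as decision-makers on the Internet can be systematically influenced by changes to their environment, a phenomenon called "mecha-nudging". Combining Bayesian persuasion from economics with usable information theory from computer science, the authors propose a unified way to quantify how environmental changes affect AI. Analyzing more than six million Etsy listings, they find that after ChatGPT's release the machine-usable information in listings for predicting agent curation decisions increased significantly, while human-usable information barely changed. The study provides the first large-scale empirical evidence that systematic mecha-nudging is already occurring in the wild but going unnoticed.

English abstract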

AI agents are becoming active decision-makers on the Internet. As they make decisions in the same environments as humans, the environments themselves can change to influence them. We call this mecha-nudging: changes to how choices are presented that systematically influence AI agents without materially degrading the decision environment for humans. To measure this phenomenon, we combine two frameworks, Bayesian persuasion from economics and $\mathcal{V}$-usable information from computer science, to get a common unit (bits) for quantifying how environments change across a wide range of interventions, contexts, and models. We apply this framework to over six million Etsy listings and find that, after ChatGPT's release, listings contain significantly more machine-usable information for predicting agent curation decisions, increasing by 0.143 bits out of a maximum possible increase of 0.355. This shift is robust across prompts, token choices, labeling models, and fine-tuning architectures; absent in a regulated-text placebo; and far larger than the effect of generic LLM rewriting. In contrast, a human study finds little to no change in human-usable information. Our results provide the first large-scale evidence that systematic mecha-nudging is already occurring in the wild, but going unnoticed.
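To make the "bits" unit concrete, usable information is commonly estimated as the drop in held-out log-loss (base 2) when a model is allowed to see the input, compared with a model that sees no input. The sketch below follows that standard definition; it is not the paper's measurement code, and the function names and probabilities are made up for illustration.

```python
import math

def mean_neg_log2(probs_of_true_label):
    """Average negative log2-likelihood of the true labels, in bits."""
    return -sum(math.log2(p) for p in probs_of_true_label) / len(probs_of_true_label)

def usable_information(p_true_without_input, p_true_with_input):
    """Estimated usable information: H(Y) without input minus H(Y | X) with input."""
    return mean_neg_log2(p_true_without_input) - mean_neg_log2(p_true_with_input)

# Hypothetical probabilities a model assigns to the correct curation decision.
without_listing = [0.5, 0.5, 0.5, 0.5]   # label-only baseline: 1 bit of uncertainty
with_listing = [1.0, 1.0, 0.5, 0.25]     # model that reads the listing text
bits = usable_information(without_listing, with_listing)
```

Read this way, the reported increase of 0.143 bits is roughly 40% of the stated maximum possible increase of 0.355 bits.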

2603.14764 2026-05-18 cs.CV cs.AI cs.LG

Topology-Preserving Polygon Augmentation for Segmentation in Structured Visual Domains

Sudip Laudari, Sang Hun Baek

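Affiliations Independent Researcher

AI summary This paper studies image augmentation that preserves the topology of polygon annotations in structured visual domains such as architectural floorplan analysis. Because conventional geometric augmentation can split a polygon region and break its semantic connectivity, it proposes a lightweight topology-preserving augmentation strategy that repairs adjacency relations in index space without changing the vertex order. Experiments show near-perfect Cyclic Adjacency Preservation (CAP) under common geometric transformations and improved annotation consistency for polygon-based segmentation.

Comments 10 pages, 6 figures

English abstract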

Geometric data augmentation is widely used in segmentation workflows, but polygon annotations are often assumed to remain valid after transformation. This assumption can fail in structured domains such as architectural floorplan analysis, where a region may contain an interior void encoded as part of a single ordered polygon chain. Cropping or clipping can remove bridge vertices in this chain, causing one semantic region to split into disconnected components. We propose a lightweight topology-preserving augmentation strategy that repairs missing adjacency relations in index space while preserving the original vertex order. The method adds minimal overhead and can be integrated into existing preprocessing workflows. Experiments show that the proposed approach achieves near-perfect Cyclic Adjacency Preservation (CAP) across common geometric transformations and improves annotation consistency in polygon-based segmentation.

2603.10881 2026-05-18 cs.LG

LAtte: Hyperbolic Lorentz Attention for Cross-Subject EEG Classification

Ahmad Bdeir, Johannes Burchert, Tom Hanika, Lars Schmidt-Thieme, Niels Landwehr

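Affiliations Data Science Group; ISMLL Universität Hildesheim

AI summary This paper proposes LAtte, a framework that addresses generalization in cross-subject electroencephalogram (EEG) classification. It combines Lorentz attention with a hyperbolic InceptionTime-based encoder and decomposes EEG signals into a baseline and task-relevant deviations to obtain more structured representations. The model also adds subject-specific low-rank adaptation modules, together with a Lorentz boost and hyperbolic projection, to improve robustness and adaptability. Across several datasets it improves on state-of-the-art methods for smaller datasets and maintains performance on larger ones.

English abstract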

Electroencephalogram (EEG) classification plays a key role in medical diagnosis and brain-computer interfaces, but remains challenging due to low signal-to-noise ratios and high inter-subject variability. As a result, many existing approaches rely on subject-specific models, which fail to exploit shared structure in neural signals and do not generalize to unseen subjects. To address these limitations, we propose LAtte, a framework that combines Lorentz attention with a hyperbolic InceptionTime-based encoder to improve cross-subject generalization in EEG classification. The model explicitly decomposes EEG signals into a learned baseline component and task-relevant deviations, enabling more structured representation learning. To further improve robustness and adaptability, we incorporate subject-specific low-rank adaptation (LoRA) modules at both encoder and decoder levels, augmented with a Lorentz boost-based LoRA mechanism and hyperbolic projection layers to reduce overfitting in geometric representations. We evaluate LAtte with and without finetuning in three settings: subject-specific, subject-conditional, and leave-one-subject-out (LOSO) on five established EEG datasets, achieving a consistent improvement in performance over current state-of-the-art methods for smaller datasets and maintaining performance for larger datasets.

2603.08063 2026-05-18 cs.CV

SkyLink: A Large Vision-Language Model Driven Re-ranking Framework for Cross-View UAV geolocalization

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao

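Affiliations Department of Data Science, City University of Hong Kong, Hong Kong; Information Systems, City University of Hong Kong, Hong Kong; College of Computer Science and Technology, Zhejiang University of Technology, Zhejiang

AI summary SkyLink is a re-ranking framework for cross-view UAV geolocalization driven by a large vision-language model (LVLM), designed to improve the matching accuracy between UAV and satellite images. It models the visual-semantic relationships between views for more effective cross-view matching and introduces a relational-aware loss to strengthen the model's discriminative capacity and training stability. Experiments show that SkyLink significantly improves the re-ranking performance of existing models on multiple benchmark datasets, particularly in challenging scenarios.

English abstract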

Cross-view UAV geolocalization is fundamentally a challenging large-scale image retrieval task, aiming to determine the geographic coordinates of Unmanned Aerial Vehicle (UAV) queries by matching them against an extensive geo-tagged satellite image database. Most existing methods learn separate feature representations for each view and determine the final prediction using naive heuristics to assess feature similarity, thereby neglecting to model the crucial cross-view relationships. In this paper, we propose SkyLink, a novel plug-and-play ranking framework that pioneers joint relational modeling of inter-view relationships to enhance cross-view UAV geolocalization. SkyLink leverages a Large Vision-Language Model (LVLM) to model the intricate visual-semantic relationships between UAV and satellite views, facilitating effective cross-view matching. To further refine the learning process, we introduce a relational-aware loss. It leverages soft labels to provide a more nuanced supervision signal, mitigating the harsh penalty on near-positive pairs. This approach enhances both training stability and the model's discriminative capacity. Extensive experiments conducted across multiple base retrieval architectures and benchmark datasets demonstrate that SkyLink significantly boosts the ranking effectiveness of existing models, consistently achieving superior performance in various challenging scenarios.

2603.07514 2026-05-18 cs.LG cs.AI cs.CV

A Unified View of Score-Based and Drifting Models

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

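Affiliations Sony AI; Sony Group Corporation; Stanford University; Georgia Tech

AI summary This paper examines the intrinsic connection between drifting models and score-based generative models, showing that drifting is in essence equivalent to a score-matching objective on smoothed distributions. With Gaussian kernels, the mean-shift field exactly equals the difference between the scores of the Gaussian-smoothed data and model distributions, a result that follows from Tweedie's formula. For the Laplace kernel commonly used in practice, theory and experiments both show that the residual term is negligible in high dimension, so drifting as used in practice is approximately score-based. The work offers a unified view of these generative models and identifies the structural similarities and differences between drifting models and diffusion models in their transport directions.

English abstract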

Drifting models train one-step generators by optimizing a kernel-induced mean-shift discrepancy between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, thereby defining a transport direction for generated samples. In this paper, we show that drifting is more closely connected to score-based generative modeling than it may first appear, establishing a precise link to the score-matching principle underlying diffusion models. For Gaussian kernels, the population mean-shift field exactly equals the difference between the scores (i.e., the gradient-log-densities) of the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to its conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching objective on smoothed distributions. More generally, we derive an exact decomposition for radial kernels in which mean shift equals a score-based field plus a residual term. For the practical Laplace kernel, we further show theoretically and empirically that this residual is negligible in high dimension, implying that the transport field used in practice is nearly score-based. Our results reveal a structural connection to diffusion models: both methods use score-mismatch transport directions, but drifting realizes the score nonparametrically through kernel-based estimates, whereas diffusion models learn it parametrically with neural networks.
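The Gaussian-kernel identity stated in this abstract can be written out in a few lines. The notation below ($p$, $q$, $\sigma$, $m_p$) is introduced here for illustration and may differ from the paper's, in particular in how the kernel bandwidth and any constant factors are normalized.

```latex
% Gaussian smoothing: x = x_0 + \sigma\varepsilon, with x_0 \sim p and \varepsilon \sim \mathcal{N}(0, I).
% Smoothed density: p_\sigma = p * \mathcal{N}(0, \sigma^2 I).

% Tweedie's formula links the score of p_\sigma to the posterior mean:
\nabla_x \log p_\sigma(x) = \frac{\mathbb{E}[x_0 \mid x] - x}{\sigma^2}.

% With the Gaussian kernel k_\sigma(x, y) = \exp\!\left(-\|x - y\|^2 / (2\sigma^2)\right),
% the kernel-weighted mean of samples from p is exactly that posterior mean:
m_p(x) = \frac{\mathbb{E}_{y \sim p}[k_\sigma(x, y)\, y]}{\mathbb{E}_{y \sim p}[k_\sigma(x, y)]}
       = \mathbb{E}[x_0 \mid x].

% Hence the mean-shift displacement is a rescaled score,
m_p(x) - x = \sigma^2 \nabla_x \log p_\sigma(x),

% and the discrepancy between data (p) and model (q) is a score difference:
m_p(x) - m_q(x) = \sigma^2 \left( \nabla_x \log p_\sigma(x) - \nabla_x \log q_\sigma(x) \right).
```

The second line holds because the posterior of $x_0$ given $x$ is proportional to $p(x_0)\,k_\sigma(x, x_0)$, so the kernel weights are exactly the posterior weights.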

2603.03243 2026-05-18 cs.RO

HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

Xiaomeng Xu, Jisang Park, Han Zhang, Eric Cousineau, Aditya Bhat, Jose Barreiros, Dian Wang, Jeannette Bohg, Shuran Song

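Affiliations Stanford University; Toyota Research Institute

AI summary This paper proposes HoMMI, a framework for learning whole-body mobile manipulation directly from robot-free human demonstrations. It augments UMI interfaces with egocentric sensing for portable, scalable data collection, which in turn widens the human-to-robot embodiment gap. To bridge that gap, the authors design a cross-embodiment hand-eye policy comprising an embodiment-agnostic visual representation, a relaxed head action representation, and a controller that coordinates whole-body motion, enabling policy transfer for complex mobile manipulation tasks.

English abstract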

We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design, including an embodiment agnostic visual representation; a relaxed head action representation; and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io

2603.01283 2026-05-18 cs.AI cs.LG

The Informational Cost of Agency: A Bounded Measure of Interaction Efficiency for Deployed Reinforcement Learning

Wael Hafez, Cameron Reid, Amit Nazeri

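Affiliations Semarx Research LLC

AI summary This paper proposes Bipredictability (denoted P), an information-theoretic metric that quantifies how efficiently the closed-loop interaction between an agent and its environment removes uncertainty and builds shared predictability. The metric has a theoretical upper bound of 0.5, and the paper shows that responsive agency suppresses P below this ceiling, a phenomenon termed the "informational cost of agency". Experiments show that P applies not only to reinforcement learning systems but also to language models, vision systems, and other domains. An Information Digital Twin (IDT) architecture built on P detects system degradation with higher accuracy and lower latency, offering a new means of assessing the reliability of deployed autonomous systems.

Comments 12 pages, 2 figures

English abstract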

Deployed reinforcement learning systems lack a principled runtime reliability theory. We close this gap by introducing Bipredictability, P, a closed-form information-theoretic metric that quantifies how efficiently a closed-loop interaction between agent and environment converts uncertainty into shared predictability. P admits a provable classical bound, P ≤ 0.5, derived from Shannon entropy subadditivity, and responsive agency necessarily suppresses P below this ceiling, a structural prediction we term the informational cost of agency. Across 21 trained continuous control agents, we confirm this prediction empirically at P = 0.33 ± 0.02. The same suppression signature reproduces in language model dialogue, convolutional vision systems, and classical mechanical baselines, indicating that P captures a substrate-independent property of agentic interaction rather than an algorithm-specific artifact. The Information Digital Twin, IDT, a model-agnostic architecture that computes P from the external interaction stream, detects 89.3% of coupling degradations against 44.0% for reward-based monitoring, with 4.4 times lower latency. P provides the missing measurement layer for runtime reliability and closed-loop self-regulation in deployed autonomous systems.

2602.23409 2026-05-18 cs.LG cs.AI cs.ET quant-ph

Long Range Frequency Tuning for QML

Michael Poppel, Markus Baumann, Sebastian Wölckert, Claudia Linnhoff-Popien, Jonas Stein

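Affiliations LMU Munich; Aqarios GmbH

AI summary This study addresses frequency encoding in variational quantum circuits and proposes a new initialization method that improves their ability to fit high-frequency functions. Fixed encodings require many gates, while trainable-frequency circuits are promising but limited under gradient descent by a spectral gap. The proposed ternary grid initialization sets the frequency prefactors so as to remove the effect of the spectral gap, substantially improving model performance. Experiments show that the method outperforms existing approaches on both synthetic and real-world datasets.

English abstract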

Angle-encoded variational quantum circuits admit a truncated Fourier series representation of their output, but approximating functions with maximum frequency $\omega_{\max}$ using fixed unary encoding requires $\mathcal{O}(\omega_{\max})$ encoding gates. Trainable-frequency (TF) circuits promise a reduction by learning the data-encoding prefactors alongside the ansatz parameters, adapting the accessible frequency spectrum to the target during training. We identify a practical barrier that prevents this promise from being realized: the prefactor gradient is suppressed by the spectral gap between the circuit's accessible frequencies and the target spectrum, independently of the ansatz parameters, confining gradient-driven prefactor movement to a narrow neighborhood of initialization. We propose ternary grid initialization, which sets the prefactors to $\{1, 3, 9, \ldots, 3^{k-1}\}$ and resolves this limitation by ensuring every target frequency within $[-\omega_{\max}, \omega_{\max}]$ lies within $\tfrac{1}{2}$ unit of a grid point at initialization, removing the spectral gap suppression by construction. On a synthetic benchmark with target frequencies shifted well beyond the standard initialization range, ternary initialization achieves median $R^2 = 0.997$ versus $0.18$ for unary initialization, with $100\%$ of runs achieving $R^2 > 0.95$ against $0\%$. CMA-ES with $20\times$ the evaluation budget reaches only $25\%$ success, confirming the limitation is a property of the optimization landscape rather than of gradient-based optimization specifically. Real-world validation on two benchmark datasets demonstrates consistent advantages over both fixed and trainable unary baselines.
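The coverage claim for the ternary grid can be checked by brute force. The sketch below assumes that each encoding gate with prefactor p contributes a frequency of -p, 0, or +p to the output spectrum, which is the usual picture for single-qubit rotation encodings but is not spelled out in the abstract; the function names are hypothetical.

```python
from itertools import product

def reachable_frequencies(prefactors):
    """All frequencies obtained by summing -p, 0, or +p over the prefactors."""
    return {
        sum(c * p for c, p in zip(coeffs, prefactors))
        for coeffs in product((-1, 0, 1), repeat=len(prefactors))
    }

def ternary_grid(k):
    """Prefactors 1, 3, 9, ..., 3^(k-1)."""
    return [3 ** i for i in range(k)]

def covers_all_integers(k):
    """True if the ternary grid reaches every integer up to (3^k - 1) / 2 in magnitude."""
    bound = (3 ** k - 1) // 2
    return reachable_frequencies(ternary_grid(k)) == set(range(-bound, bound + 1))

ternary_max = max(reachable_frequencies(ternary_grid(4)))
unary_max = max(reachable_frequencies([1, 1, 1, 1]))
```

This is the balanced-ternary representation of integers: with four prefactors the ternary grid reaches every integer frequency up to 40, while four unit prefactors reach only up to 4. Since every integer in range is a grid point, any real-valued target frequency in range lies within half a unit of one.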

2602.22918 2026-05-18 cs.CL

Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg, Oren Gal

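Affiliations * Swarms & AI Lab (SAIL), University of Haifa

AI Summary This study examines how optical character recognition (OCR) information enters the language-processing stream of vision-language models and locates the key bottleneck in the OCR routing mechanism. Using causal interventions and activation-difference analysis, it finds that the position of OCR-sensitive layers differs across architectures, that the OCR signal is highly low-dimensional, and that principal component analysis directions transfer across datasets. It also shows that, in models with modular OCR circuits, removing OCR information can improve counting performance, suggesting that OCR may interfere with other visual processing.

Details
English Abstract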

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures up to 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
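
To make the "share of variance on PC1" statistic concrete, here is a minimal sketch on synthetic data. The vectors stand in for activation differences between original and text-inpainted images; the data, dimensions, and function name are assumptions for illustration, and no real model activations are involved.

```python
import math
import random

def pc1_variance_share(rows, iters=200):
    """Return (fraction of total variance on PC1, PC1 direction) via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    top = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total = sum(cov[a][a] for a in range(d))
    return top / total, v

# Synthetic "activation differences": one dominant direction plus small noise.
random.seed(0)
d = 8
direction = [float(j + 1) for j in range(d)]
scale = math.sqrt(sum(x * x for x in direction))
direction = [x / scale for x in direction]
diffs = []
for _ in range(200):
    a = random.gauss(0.0, 3.0)
    diffs.append([a * direction[j] + random.gauss(0.0, 0.1) for j in range(d)])

share, pc1 = pc1_variance_share(diffs)
cosine = sum(p * q for p, q in zip(pc1, direction))
```

A high `share` means a single direction explains most of the difference between the two conditions, which is the sense in which the abstract calls the OCR signal low-dimensional.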

2602.20630 2026-05-18 cs.CV

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu

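Affiliations * School of Computer Science, Wuhan University; Xiaomi EV

AI Summary This paper reframes keypoint detection as a sequential decision-making process and proposes TraqPoint, an end-to-end reinforcement learning framework that directly optimizes the long-term trackability of keypoints across image sequences. Its core innovation is a track-quality-aware reward mechanism that uses a policy gradient method to jointly improve the consistency and distinctiveness of keypoints across multiple views. Experiments show that TraqPoint significantly outperforms current state-of-the-art keypoint detection and description methods on sparse matching tasks.

Comments Accepted by CVPR 2026 (Oral)

Details
English Abstract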

Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods. The code will be available at https://github.com/xiaomi-research/traqpoint.

2602.20207 2026-05-18 cs.LG cs.AI

Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

Shrestha Datta, Hongfu Liu, Anshuman Chhabra

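Affiliations * University of South Florida; Brandeis University

AI Summary This paper studies efficient knowledge editing in large language models, that is, updating the model's output for a specific query without degrading its overall behavior. The authors propose Layer Gradient Analysis (LGA), which uses per-layer gradient information to efficiently identify the "golden layers" that give the best editing performance, avoiding the laborious trial-and-error of conventional approaches. Experiments show that the method is effective and robust across a range of large language models and knowledge editing tasks.

Details
English Abstract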

Knowledge editing in Large Language Models (LLMs) aims to update the model's prediction for a specific query to a desired target while preserving its behavior on all other inputs. This process typically involves two stages: identifying the layer to edit and performing the parameter update. Intuitively, different queries may localize knowledge at different depths of the model, resulting in different sample-wise editing performance for a fixed editing layer. In this work, we hypothesize the existence of fixed golden layers that can achieve near-optimal editing performance similar to sample-wise optimal layers. To validate this hypothesis, we provide empirical evidence by comparing golden layers against ground-truth sample-wise optimal layers. Furthermore, we show that golden layers can be reliably identified using a proxy dataset and generalize effectively to unseen test set queries across datasets. Finally, we propose a novel method, Layer Gradient Analysis (LGA), that estimates golden layers efficiently via gradient attribution, avoiding extensive trial-and-error across multiple editing runs. Extensive experiments on several benchmark datasets demonstrate the effectiveness and robustness of our LGA approach across different LLM types and various knowledge editing methods.

2602.19069 2026-05-18 cs.AI

Asking the Right Questions: Improving Reasoning with Generated Stepping Stones

Hengyuan Hu, Tingchen Fu, Minqi Jiang, Alexander H Miller, Yoram Bachrach, Jakob Nicolaus Foerster

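Affiliations * FAIR at Meta; Stanford University; University of Oxford

AI Summary This study investigates how generating intermediate "stepping stone" questions can improve the performance of large language models on complex reasoning tasks. It proposes ARQ, a framework that adds a question generator to the default reasoning pipeline to help the model decompose tasks step by step and construct useful intermediate steps. Experiments show that the generated stepping stone questions are transferable and help models of varying capability solve the target tasks, and that generation quality can be improved further through post-training.

Details
English Abstract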

Recent years have witnessed tremendous progress in enabling LLMs to solve complex reasoning tasks such as math and coding. As we start to apply LLMs to harder tasks that they may not be able to solve in one shot, it is worth paying attention to their ability to construct intermediate stepping stones that prepare them to better solve the tasks. Examples of stepping stones include simplifications, alternative framings, or subproblems. We study properties and benefits of stepping stones in the context of modern reasoning LLMs via ARQ (Asking the Right Questions), a simple framework that introduces a question generator to the default reasoning pipeline. We first show that good stepping stone questions exist and are transferable, meaning that good questions can be generated, and they substantially help LLMs of various capabilities in solving the target tasks. We next frame stepping stone generation as a post-training task and show that we can fine-tune LLMs to generate more useful stepping stones by SFT and RL on synthetic data.

2602.17363 2026-05-18 cs.LG

2Mamba2Furious: Linear in Complexity, Competitive in Accuracy

Gabriel Mongaras, Eric C. Larson

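Affiliations * Lyle School of Engineering, Southern Methodist University

AI Summary This paper proposes 2Mamba, a linear attention model that aims to close the accuracy gap between linear attention and softmax attention. By simplifying and improving the core components of Mamba-2, 2Mamba reaches accuracy close to that of softmax attention while remaining highly memory efficient, particularly on long-context tasks. The study also examines the key factors that improve linear attention performance and provides the experiment code.

Details
English Abstract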

Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and results in reduced accuracy compared to softmax attention. To bridge the accuracy gap between softmax attention and linear attention, we manipulate Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. From this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention, yet much more memory efficient for long context lengths. We also investigate elements of Mamba-2 that help surpass softmax attention accuracy. Code is provided for all our experiments.

2602.17050 2026-05-18 cs.LG

Multi-Probe Zero Collision Hash (MPZCH): Mitigating Embedding Collisions and Enhancing Model Freshness in Large-Scale Recommenders

Ziliang Zhao, Bi Xue, Emma Lin, Tianqi Lu, Mengjiao Zhou, Kaustubh Vartak, Shakhzod Ali-Zade, Tao Li, Bin Kuang, Rui Jian, Bin Wen, Dennis van der Staay, Yixin Bao, Eddy Li, Chao Deng, Henry Wei, Songbin Liu, Qifan Wang, Kai Ren

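Affiliations * Meta Platforms, Inc.; OpenAI

AI Summary In large-scale recommendation systems, embedding tables are a key component for handling high-cardinality categorical features, but traditional hash-based indexing is prone to collisions when the number of unique IDs is large, which hurts model performance and personalization quality. This paper proposes Multi-Probe Zero Collision Hash (MPZCH), a new indexing mechanism based on linear probing that effectively mitigates embedding collisions and, with reasonable table sizing, achieves nearly zero collisions. MPZCH introduces auxiliary tensors and high-performance CUDA kernels, supports configurable probing and active eviction policies, prevents the inheritance of stale embeddings, and improves learning for new features. Experiments show that it significantly improves embedding freshness and quality while maintaining training throughput and inference latency.

Comments 9 pages, 6 figures

Details
English Abstract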

Embedding tables are critical components of large-scale recommendation systems, facilitating the efficient mapping of high-cardinality categorical features into dense vector representations. However, as the volume of unique IDs expands, traditional hash-based indexing methods suffer from collisions that degrade model performance and personalization quality. We present Multi-Probe Zero Collision Hash (MPZCH), a novel indexing mechanism based on linear probing that effectively mitigates embedding collisions. With reasonable table sizing, it often eliminates these collisions entirely while maintaining production-scale efficiency. MPZCH utilizes auxiliary tensors and high-performance CUDA kernels to implement configurable probing and active eviction policies. By retiring obsolete IDs and resetting reassigned slots, MPZCH prevents the stale embedding inheritance typical of hash-based methods, ensuring new features learn effectively from scratch. Despite its collision-mitigation overhead, the system maintains training QPS and inference latency comparable to existing methods. Rigorous online experiments demonstrate that MPZCH achieves zero collisions for user embeddings and significantly improves item embedding freshness and quality. The solution has been released within the open-source TorchRec library for the broader community.
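
The combination of linear probing, an identity check, and eviction with slot reset can be illustrated with a toy table. This is a sketch under stated assumptions, not the TorchRec implementation: the class name, the `ttl` staleness rule, the modulo stand-in for the hash function, and the fallback when the probe window is full are all hypothetical.

```python
class MultiProbeTable:
    """Toy linear-probing ID table; the returned slot is the embedding row index."""

    def __init__(self, size, max_probe, ttl):
        self.size = size
        self.max_probe = max_probe    # number of consecutive slots to probe
        self.ttl = ttl                # steps after which an unseen ID counts as stale
        self.ids = [None] * size      # auxiliary array: which raw ID owns each slot
        self.last_seen = [0] * size   # auxiliary array: last step each slot was hit
        self.reset_rows = []          # rows whose embeddings should be re-initialized

    def lookup(self, raw_id, step):
        start = raw_id % self.size    # stand-in for a real hash function
        free = None
        for offset in range(self.max_probe):
            slot = (start + offset) % self.size
            if self.ids[slot] == raw_id:          # exact match: no collision
                self.last_seen[slot] = step
                return slot
            stale = (self.ids[slot] is not None
                     and step - self.last_seen[slot] > self.ttl)
            if free is None and (self.ids[slot] is None or stale):
                free = slot                       # remember first usable slot
        if free is None:
            return start                          # window full: fall back to a collision
        if self.ids[free] is not None:
            self.reset_rows.append(free)          # evicted ID: do not inherit its embedding
        self.ids[free] = raw_id
        self.last_seen[free] = step
        return free

table = MultiProbeTable(size=8, max_probe=4, ttl=10)
row_a = table.lookup(3, step=0)       # home slot 3
row_b = table.lookup(11, step=0)      # also hashes to 3, probes to 4
row_c = table.lookup(19, step=0)      # also hashes to 3, probes to 5
row_again = table.lookup(11, step=1)  # finds its own slot again
row_new = table.lookup(27, step=20)   # ID 3 is stale by now and gets evicted
```

Three IDs that share a home slot end up in three different rows, and the row taken over by the new ID is queued for reset so the newcomer does not start from the evicted ID's embedding.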

2602.16363 2026-05-18 cs.LG

Improved Bounds for Reward-Agnostic and Reward-Free Exploration

Oran Ridel, Alon Cohen

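Affiliations * Department of Engineering, Tel Aviv University, Tel Aviv, Israel; Google Research, Tel Aviv, Israel

AI Summary This paper studies reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without an external reward signal. For the reward-agnostic setting, the authors propose a new algorithm that significantly relaxes the restriction on the accuracy parameter $ε$. It runs online learning with carefully designed reward functions to construct an exploration policy for data collection, enabling accurate estimation of the dynamics and subsequent computation of an $ε$-optimal policy. The authors also establish a tight lower bound for reward-free exploration, closing the gap between the known upper and lower bounds.

Details
English Abstract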

We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

2602.16274 2026-05-18 cs.LG stat.ML

Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos

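Affiliations * MBZUAI, UAE; Stanford University, USA; EPITA, France; Indian Institute of Technology Bombay, India

AI Summary This paper gives the first regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. It analyzes Boltzmann Q-learning with decaying temperature and proposes a smoothed exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, proving a regret bound of near-$\tilde{O}(N^{9/10})$ that is robust to the suboptimality gap. The authors also give high-probability sample complexity guarantees and develop a high-probability concentration bound for contractive Markovian stochastic approximation, a result of independent interest.

Details
English Abstract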

We present the first regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes (MDPs), without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ε_n$-Greedy exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. We also obtain sample complexity guarantees, with both regret and sample complexity bounds holding with high probability. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our framework is allowed to converge to one asymptotically.
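
One plausible way to combine the two exploration rules is a mixture of a uniform distribution and a Boltzmann distribution over the Q-values. The abstract does not give the exact definition of the Smoothed $ε_n$-Greedy scheme, so the mixture form below and both function names are assumptions made purely for illustration.

```python
import math

def boltzmann(q_values, temperature):
    """Softmax over Q-values at the given temperature (numerically stabilized)."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

def smoothed_eps_greedy(q_values, epsilon, temperature):
    """Assumed form: epsilon * uniform + (1 - epsilon) * Boltzmann."""
    k = len(q_values)
    soft = boltzmann(q_values, temperature)
    return [epsilon / k + (1.0 - epsilon) * p for p in soft]

q = [1.0, 0.0, 0.0]
probs = smoothed_eps_greedy(q, epsilon=0.1, temperature=0.5)
```

In this form every action keeps probability at least `epsilon / k` regardless of the gap between Q-values, while the Boltzmann term replaces the hard argmax of plain ε-greedy with a smooth preference.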

2602.14896 2026-05-18 cs.LG

Algorithmic Simplification of Neural Networks with Mosaic-of-Motifs

Pedram Bakhtiarifard, Tong Chen, Jonathan Wenshøj, Erik B Dam, Raghavendra Selvan

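Affiliations * Department of Computer Science, University of Copenhagen

AI Summary This paper addresses the central question of why deep neural networks are well suited to compression and offers an explanation from the perspective of algorithmic complexity. It hypothesizes that the parameters of trained models have more structure and therefore lower algorithmic complexity, and introduces a parameterization based on reusable blocks (motifs) that constrains the choice of parameter blocks to steer optimization toward simpler solutions. Experiments show that the method lowers the algorithmic complexity of the network while preserving model performance, providing a theoretical basis and a new direction for model compression.

Details
English Abstract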

Large-scale deep learning models are well-suited for compression. Across a variety of tasks, methods like pruning, quantization, and knowledge distillation have been used to achieve massive reductions in model parameters with only marginal performance drops. This raises the central question: *Why are deep neural networks suited for compression?* In this work, we take up the perspective of algorithmic complexity to explain this behavior. We hypothesize that the parameters of trained models have more structure and, hence, exhibit lower algorithmic complexity compared to the weights at (random) initialization. Furthermore, model compression methods harness this reduced algorithmic complexity to compress models. Although an unconstrained parameterization of model weights, $\mathbf{w} \in \mathbb{R}^n$, can represent arbitrary weight assignments, the solutions found during training exhibit repeatability and structure, making them simpler to implement than a trivial program. To this end, we formalize the Kolmogorov complexity of $\mathbf{w}$ by $\mathcal{K}(\mathbf{w})$. We introduce a constrained parameterization $\widehat{\mathbf{w}}$ that partitions parameters into blocks of size $s$ and restricts each block to be selected from a set of $k$ reusable motifs, specified by a reuse pattern (or mosaic). The resulting method, $\mathit{Mosaic\text{-}of\text{-}Motifs}$ (MoMos), provides a theoretically justified parameterization that biases optimization toward algorithmically simpler solutions. Empirical evidence from multiple experiments shows that MoMos consistently lowers the algorithmic complexity of neural networks during training while preserving the performance of unconstrained models. These results suggest that parameter compressibility is not only observed after training, but can be induced from the optimization domain.
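
The constrained parameterization can be pictured as a small dictionary of motifs plus a list of indices (the mosaic) saying which motif fills each block. The sketch below shows only that data structure; the nearest-motif assignment is an assumption chosen for illustration, since the abstract does not say how the mosaic is selected, and the function names are hypothetical.

```python
def expand_mosaic(motifs, mosaic):
    """Build the flat weight vector by tiling motifs in the order given by the mosaic."""
    return [x for idx in mosaic for x in motifs[idx]]

def project_to_motifs(weights, motifs):
    """Assign each block of `weights` to its nearest motif (squared distance, assumed rule)."""
    s = len(motifs[0])
    assert len(weights) % s == 0, "weight count must be a multiple of the block size"
    mosaic = []
    for start in range(0, len(weights), s):
        block = weights[start:start + s]
        dists = [sum((b - m) ** 2 for b, m in zip(block, motif)) for motif in motifs]
        mosaic.append(dists.index(min(dists)))
    return mosaic

motifs = [[0.0, 0.0], [1.0, -1.0]]                    # k = 2 motifs of block size s = 2
weights = [0.9, -1.1, 0.1, 0.0, 1.0, -1.0, -0.1, 0.2]  # n = 8 unconstrained weights
mosaic = project_to_motifs(weights, motifs)
w_hat = expand_mosaic(motifs, mosaic)
```

Here eight free weights are described by four motif entries and four small indices, which is the kind of shorter description that the algorithmic-complexity argument relies on.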

2602.12262 2026-05-18 cs.CL cs.LG

Few-Step Diffusion Language Models via Trajectory Self-Distillation

Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Chengzhi Mao, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas

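Affiliations * Rutgers University; Red Hat AI Innovation; MIT-IBM Watson AI Lab

AI Summary This paper studies how to achieve efficient, high-quality few-step decoding in diffusion language models. To counter the drop in generation quality caused by few-step decoding, the authors propose a trajectory self-distillation framework in which a few-step student model learns the generative trajectory of a full-step teacher model, mitigating the loss caused by token factorization error. They further introduce Direct Discriminative Optimization, which improves performance on challenging reasoning tasks and substantially narrows the gap between few-step and full-step decoding.

Details
English Abstract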

Diffusion large language models (DLLMs) have emerged as powerful generative models with the promise of fast text generation through parallel decoding. However, realizing this potential in practice remains challenging: reducing the number of decoding steps typically causes a substantial degradation in output quality due to token factorization error. To alleviate this, we propose a self-distillation framework that trains a few-step student to match the generative trajectory of a full-step teacher. We theoretically and empirically show that trajectory-level supervision mitigates this factorization error, thereby enabling effective few-step decoding. We further incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that encourages mode-seeking toward the teacher's modes, yielding stronger performance on challenging reasoning tasks. Across reasoning and code-generation benchmarks, our method substantially narrows the gap between few-step and full-step decoding. The source code is available at https://github.com/Tyrion58/T3D.

2602.10687 2026-05-18 cs.CV cs.AI

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

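Affiliations * School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Wuhan University, Wuhan, China; Lab for Intelligence and visiON (LION)

AI Summary Existing forgery detection methods are mostly limited to uni-modal or bi-modal settings and struggle with real-world multimodal misinformation. This paper proposes OmniVL-Guard, a unified vision-language forgery detection and grounding framework based on balanced reinforcement learning, which targets the bias that arises from multimodal interaction and multi-task optimization. It comprises two core designs, Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization, which improve combined detection and grounding performance and show strong zero-shot generalization across multiple datasets.

Comments Accepted by ICML 2026

Details
English Abstract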

Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios. The dataset and code are publicly available at https://github.com/shen8424/OmniVL-Guard.

2602.09297 2026-05-18 cs.LG

Laplacian Heads Improve Transformers by Smoothing Token Representations

Yuchong Zhang, Vardan Papyan

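Affiliations * University of Toronto; Vector Institute

AI Summary This paper proposes improving Transformers by introducing Laplacian heads that smooth token representations. The method replaces the softmax attention matrix of a subset of attention heads with the corresponding Laplacian, so that the update also controls within-sequence variance and, from a graph perspective, can be interpreted as heat diffusion. Experiments show performance gains on supervised learning, language modeling, and self-supervised learning tasks, along with better separability and structural alignment of token representations, challenging the conventional view that token oversmoothing is harmful.

Details
English Abstract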

Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_{i} P^{(i)}XW_{V_i}W_{o_i}$, where $P^{(i)}$ is the softmax attention matrix in head $i$. We propose replacing a subset of $P^{(i)}$'s with the Laplacian $I - P^{(i)}$, giving $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$. Our proposal has two motivations. First, it allows attention heads to update the mean of token representations, while Laplacian heads can directly control within-sequence variance. Second, if tokens are viewed as nodes in a graph with edge weights $P^{(i)}$, then $I - P^{(i)}$ is the corresponding graph Laplacian, and the update can be interpreted as one step of heat diffusion on the graph. We show that this simple modification improves performance across supervised learning, language modeling, and self-supervised learning tasks. To investigate why, we examine the token representations learned with and without Laplacian heads. In supervised learning, Laplacian heads collapse token representations within the same sequence and align the sequence means with the geometry of Neural Collapse. In language modeling, they increase the separability of token representations that share the same next-token prediction. In self-supervised learning, they produce token representations whose principal components are better suited for segmentation. Across modalities, they also lead to faster-decaying spectra, indicating stronger token smoothing. Overall, our findings challenge the prevailing view that token oversmoothing is inherently harmful, showing instead that certain forms of smoothing can be beneficial.
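
The first motivation can be checked numerically: because every row of $P^{(i)}$ sums to one, $I - P^{(i)}$ maps any sequence of identical token vectors to zero, so a Laplacian head acts only on within-sequence deviations. The sketch below uses arbitrary scores and omits the value and output projections (treated as identity), both of which are simplifying assumptions.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax, giving a row-stochastic attention matrix P."""
    out = []
    for row in scores:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def laplacian(p):
    """Graph Laplacian I - P for edge weights P."""
    n = len(p)
    return [[(1.0 if i == j else 0.0) - p[i][j] for j in range(n)] for i in range(n)]

scores = [[0.2, 1.5, -0.3], [0.0, 0.1, 2.0], [1.0, 1.0, 1.0]]  # arbitrary attention logits
P = softmax_rows(scores)
L = laplacian(P)

X_const = [[2.0, -1.0] for _ in range(3)]   # three identical tokens
attn_out = matmul(P, X_const)               # attention head returns the shared vector
lap_out = matmul(L, X_const)                # Laplacian head returns (numerically) zero
```

The attention head passes the shared token vector through unchanged, while the Laplacian head outputs zero, which matches the abstract's split between updating the mean and controlling within-sequence variance.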